Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/12004#discussion_r114056158 --- Diff: docs/cloud-integration.md --- @@ -0,0 +1,512 @@ +--- +layout: global +displayTitle: Integration with Cloud Infrastructures +title: Integration with Cloud Infrastructures +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT +--- +<!--- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +* This will become a table of contents (this text will be scraped). +{:toc} + +## <a name="introduction"></a>Introduction + + +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer +persistent data storage systems, "object stores". These are not quite the same as classic file +systems: in order to scale to hundreds of Petabytes, without any single points of failure +or size limits, object stores, "blobstores", have a simpler model of `name => data`. + +Apache Spark can read or write data in object stores for data access. +through filesystem connectors implemented in Apache Hadoop or provided by third-parties. +These libraries make the object stores look *almost* like filesystems, with directories and +operations on files (rename) and directories (create, rename, delete) which mimic +those of a classic filesystem. Because of this, Spark and Spark-based applications +can work with object stores, generally treating them as as if they were slower-but-larger filesystems. + +With these connectors, Apache Spark supports object stores as the source +of data for analysis, including Spark Streaming and DataFrames. + + +## <a name="quick_start"></a>Quick Start + +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials, +objects in an object store can be can be read or written through URLs which uses the name of the +object store client as the schema and the bucket/container as the hostname. + + +### Dependencies + +The Spark application neeeds the relevant Hadoop libraries, which can +be done by including the `spark-hadoop-cloud` module for the specific version of spark used. + +The Spark application should include <code>hadoop-openstack</code> dependency, which can +be done by including the `spark-hadoop-cloud` module for the specific version of spark used. +For example, for Maven support, add the following to the <code>pom.xml</code> file: + +{% highlight xml %} +<dependencyManagement> + ... + <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-hadoop-cloud_2.11</artifactId> + <version>${spark.version}</version> + </dependency> + ... +</dependencyManagement> +{% endhighlight %} + +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`. + +### Basic Use + +You can refer to data in an object store just as you would data in a filesystem, by +using a URL to the data in methods like `SparkContext.textFile()` to read data, +`saveAsTextFile()` to write it back. + + +Because object stores are viewed by Spark as filesystems, object stores can +be used as the source or destination of any spark work âbe it batch, SQL, DataFrame, +Streaming or something else. + +The steps to do so are as follows + +1. Use the full URI to refer to a bucket, including the prefix for the client-side library +to use. Example: `s3a://landsat-pds/scene_list.gz` +1. Have the Spark context configured with the authentication details of the object store. +In a YARN cluster, this may also be done in the `core-site.xml` file. + + +## <a name="output"></a>Object Stores as a substitute for HDFS + +As the examples show, you can write data to object stores. However, that does not mean +That they can be used as replacements for a cluster-wide filesystem. + +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems). + +The brief summary is: + +| Object Store Connector | Replace HDFS? | +|-----------------------------|--------------------| +| `s3a://` `s3n://` from the ASF | No | +| Amazon EMR `s3://` | Yes | +| Microsoft Azure `wasb://` | Yes | +| OpenStack `swift://` | No | + +It is possible to use any of the object stores as a destination of work, i.e. use +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow +and, unreliable in the presence of failures. + +It is faster and safer to use the cluster filesystem as the destination of Spark jobs, +using that data as the data for follow-on work. The final results can +be persisted in to the object store using `distcp`. + +#### <a name="checkpointing"></a>Spark Streaming and object stores + +Spark Streaming can monitor files added to object stores, by +creating a `FileInputDStream` DStream monitoring a path under a bucket through +`StreamingContext.textFileStream()`. + + +1. The time to scan for new files is proportional to the number of files +under the path ânot the number of *new* files, and that it can become a slow operation. +The size of the window needs to be set to handle this. + +1. Files only appear in an object store once they are completely written; there +is no need for a worklow of write-then-rename to ensure that files aren't picked up +while they are still being written. Applications can write straight to the monitored directory. + +1. Streams should only be checkpointed to an object store considered compatible with +HDFS. Otherwise the checkpointing will be slow and potentially unreliable. + +### Recommended settings for writing to object stores + +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm +for committing files âwhich does less renaming than the v1 algorithm. Speculative execution is +disabled to avoid multiple writers corrupting the output. + +``` +spark.speculation false +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true +``` + +There's also the option of skipping the cleanup of temporary files in the output directory. +Enabling this option eliminates a small delay caused by listing and deleting any such files. + +``` +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true +``` + +Bear in mind that storing temporary files can run up charges; Delete +directories called `"_temporary"` on a regular basis to avoid this. + + +### YARN Scheduler settings + +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays --- End diff -- Followup: * Hive now [treats localhost as "anywhere"](https://issues.apache.org/jira/browse/HIVE-14060) * As [does Tez](https://issues.apache.org/jira/browse/TEZ-3291) That's a recent change in both projects; someone would need to test the Spark placement code to see what decisions it made here. That YARN scheduling is essentially a workaround.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org