Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89352877
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,953 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google Cloud and others, offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of petabytes without any single point of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark can use object stores as a source
    +of data for analysis, including in Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop client libraries on its classpath;
    +the simplest way to get them is to include the `spark-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the `pom.xml` file:
    +
    +{% highlight xml %}
    +<dependencies>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencies>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
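    +
    +For an sbt build, a roughly equivalent declaration would be something like the following;
    +the version string is a placeholder to match against the Spark release actually in use:
    +
    +{% highlight scala %}
    +// build.sbt sketch: "%%" appends the Scala binary version (2.10 or 2.11) to the artifact name
    +libraryDependencies += "org.apache.spark" %% "spark-cloud" % "2.1.0"
    +{% endhighlight %}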
    +
    +### Basic Use
    +
    +
    +
    +To refer to a path in Amazon S3, use `s3a://` as the scheme (Hadoop 2.7+) or `s3n://` on older versions.
    +
    +{% highlight scala %}
    +sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()
    +{% endhighlight %}
    +
    +Similarly, an RDD can be saved to an object store via `saveAsTextFile()`:
    +
    +
    +{% highlight scala %}
    +val numbers = sparkContext.parallelize(1 to 1000)
    +
    +// save to Amazon S3 (or compatible implementation)
    +numbers.saveAsTextFile("s3a://testbucket/counts")
    +
    +// Save to Azure Object store
    +numbers.saveAsTextFile("wasb://testbucket@example.blob.core.windows.net/counts")
    +
    +// save to an OpenStack Swift implementation
    +numbers.saveAsTextFile("swift://testbucket.openstack1/counts")
    +{% endhighlight %}
    +
    +That's essentially it: object stores can act as a source and destination of data, using exactly
    +the same APIs to load and save data as one uses to work with data in HDFS or other filesystems.
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
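    +
    +As a small illustration of the SQL case, a table can be defined directly over data held in a
    +bucket and then queried; the bucket, path and file format here are hypothetical, and `spark` is
    +assumed to be an existing `SparkSession`.
    +
    +{% highlight scala %}
    +// register a temporary view over CSV data stored in an object store, then query it with SQL
    +val events = spark.read.option("header", "true").csv("s3a://testbucket/events.csv")
    +events.createOrReplaceTempView("events")
    +spark.sql("SELECT COUNT(*) FROM events").show()
    +{% endhighlight %}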
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store
    +(a minimal sketch follows this list). In a YARN cluster, this may also be done in the `core-site.xml` file.
    +1. Have the JAR containing the filesystem classes on the classpath, along with all of its dependencies.
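    +
    +As a sketch of step 2: credentials can be passed down to the Hadoop connectors through the Spark
    +configuration, using the `spark.hadoop.` prefix. The S3A property names below apply to the `s3a://`
    +connector; adapt them to whichever store you use, and avoid hard-coding real secrets.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +
    +// options prefixed with "spark.hadoop." are passed down to the Hadoop filesystem connectors
    +val sparkConf = new SparkConf()
    +  .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")   // placeholder, not a real key
    +  .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")   // placeholder, not a real key
    +{% endhighlight %}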
    +
    +### <a name="dataframes"></a>Example: DataFrames
    +
    +DataFrames can be created from and saved to object stores through the `read()` and `write()` methods.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.types.StringType
    +
    +val sparkConf = new SparkConf()
    +val spark = SparkSession
    +    .builder
    +    .appName("DataFrames")
    +    .config(sparkConf)
    +    .getOrCreate()
    +import spark.implicits._
    +val numRows = 1000
    +
    +// generate test data
    +val sourceData = spark.range(0, numRows).select($"id".as("l"), $"id".cast(StringType).as("s"))
    +
    +// define the destination
    +val dest = "wasb://yourcontainer@youraccount.blob.core.windows.net/dataframes"
    +
    +// write the data
    +val orcFile = dest + "/data.orc"
    +sourceData.write.format("orc").save(orcFile)
    +
    +// now read it back
    +val orcData = spark.read.format("orc").load(orcFile)
    +
    +// finally, write the data as Parquet
    +orcData.write.format("parquet").save(dest + "/data.parquet")
    +spark.stop()
    +{% endhighlight %}
    +
    +### <a name="streaming"></a>Example: Spark Streaming and Cloud Storage
    +
    +Spark Streaming can monitor files added to object stores by
    +creating a `FileInputDStream` to watch a path under a bucket.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.streaming._
    +
    +val sparkConf = new SparkConf()
    +val ssc = new StreamingContext(sparkConf, Milliseconds(5000))
    +try {
    +  val lines = ssc.textFileStream("s3a://bucket/incoming")
    +  val matches = lines.filter(_.endsWith("3"))
    +  matches.print()
    +  ssc.start()
    +  ssc.awaitTermination()
    +} finally {
    +  ssc.stop(true)
    +}
    +{% endhighlight %}
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so scanning can become a slow operation.
    +The size of the window needs to be set to handle this; see the sketch after this list.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
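    +
    +As a sketch of the window-size point above: if directory scans are slow, use a longer batch
    +interval so each scan has time to complete. The five-minute interval here is an arbitrary
    +example, not a recommendation.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.streaming._
    +
    +// a longer batch interval gives each scan of the (possibly large) directory time to finish
    +val sparkConf = new SparkConf()
    +val ssc = new StreamingContext(sparkConf, Minutes(5))
    +val lines = ssc.textFileStream("s3a://bucket/incoming")
    +{% endhighlight %}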
    +
    +#### <a name="checkpointing"></a>Checkpointing Streams to object stores
    +
    +Streams should only be checkpointed to an object store considered compatible with
    +HDFS. As the checkpoint operation includes a `rename()` operation, checkpointing to
    +an object store can be so slow that streaming throughput collapses.
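    +
    +For example, a stream reading from an object store can still keep its checkpoints on an
    +HDFS-compatible filesystem; the paths below are illustrative only, and `ssc` is the
    +`StreamingContext` from the earlier example.
    +
    +{% highlight scala %}
    +// read the data from the object store, but checkpoint to a real (HDFS-compatible) filesystem
    +ssc.checkpoint("hdfs://namenode:8020/checkpoints/incoming")
    +val lines = ssc.textFileStream("s3a://bucket/incoming")
    +{% endhighlight %}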
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    --- End diff --
    
    I should add that you ought you point this out [to your doc 
team](https://www.cloudera.com/documentation/enterprise/5-8-x/topics/spark_s3.html)
 —especially the bit about speculation. Our docs are (based on those in this 
PR)(http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-spark/index.html),
 including all the warnings. S3 works great as a source of data, the S3A phase 
II work benefits the column formats (ORC, Spark) a lot, other tuning coming 
along. It's the rename-in-commit which is the enemy. 
    
    Eventual consistency? not much of an issue for static/infrequently updated 
data, though it does surface in tests

