Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3 <http://wiki.apache.org/hadoop/AmazonS3>, etc. Spark supports text files, SequenceFiles <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html>, and any other Hadoop InputFormat <http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html>.

Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or an `hdfs://`, `s3n://`, etc. URI) and reads it as a collection of lines. Here is an example invocation:
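A minimal sketch of such an invocation, along the lines of the official Spark guide (the file name `data.txt` is just an illustrative placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumes a local Spark installation; "local[*]" runs on all local cores.
val conf = new SparkConf().setAppName("TextFileExample").setMaster("local[*]")
val sc = new SparkContext(conf)

// Create an RDD of lines from a text file. The path may also be an
// hdfs:// or s3n:// URI, a directory, or a glob like "data/*.txt".
val distFile = sc.textFile("data.txt")

// Once created, the RDD supports the usual transformations and actions,
// e.g. summing the line lengths:
val totalLength = distFile.map(line => line.length).reduce(_ + _)
```

Note that `textFile` also accepts an optional second argument controlling the minimum number of partitions, which is what governs how the read is split across the cluster.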


I could not find a concrete statement saying whether a read of more than one file is distributed or not.

On 26.04.2016 18:00, Hyukjin Kwon wrote:
then this would not be distributed
