Spark can create distributed datasets from any storage source supported
by Hadoop, including your local file system, HDFS, Cassandra, HBase,
Amazon S3 <http://wiki.apache.org/hadoop/AmazonS3>, etc. Spark supports
text files, SequenceFiles
<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html>,
and any other Hadoop InputFormat
<http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html>.
Text file RDDs can be created using |SparkContext|’s |textFile| method.
This method takes a URI for the file (either a local path on the
machine, or a |hdfs://|, |s3n://|, etc. URI) and reads it as a collection
of lines. Here is an example invocation:
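A minimal sketch in spark-shell, assuming an already-created SparkContext |sc| and a hypothetical file |data.txt| reachable from the workers:

```scala
// Read the file as an RDD of lines (path is a placeholder):
val distFile = sc.textFile("data.txt")

// Each element of the RDD is one line; e.g. sum up all line lengths:
val totalChars = distFile.map(line => line.length).reduce((a, b) => a + b)
```

The returned RDD is lazy: no data is read until an action such as |reduce| runs.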
I could not find a concrete statement saying whether reading more than
one file this way is distributed or not.
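One way to probe this, sketched below assuming a SparkContext |sc| and a hypothetical directory of text files: |textFile| also accepts directories, globs, and comma-separated paths, and the partition count of the resulting RDD shows how the input was split up for parallel reads.

```scala
// Glob over several files (path is a placeholder):
val lines = sc.textFile("hdfs://namenode/data/*.txt")

// Number of partitions the input was split into; each partition can be
// read by a different executor, so more than one partition implies the
// read itself is parallelized:
println(lines.getNumPartitions)
```

An optional second argument to |textFile| requests a minimum number of partitions, e.g. |sc.textFile(path, 8)|.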
On 26.04.2016 18:00, Hyukjin Kwon wrote:
then this would not be distributed