Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-28 Thread Xiangrui Meng
It extends Hadoop's CombineFileInputFormat. isSplitable=false means that an individual file cannot be split. If you see only one partition even with a large minPartitions, perhaps the total size of the files is not big enough. The split-size settings are configurable in the Hadoop conf. -Xiangrui On Tue, Apr 26, 2016,
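
To make the point about unsplittable files concrete, here is a minimal sketch in plain Java (no Spark or Hadoop on the classpath). It is a simplified model, not the actual CombineFileInputFormat algorithm: it ignores block, node and rack locality, and it assumes the split-size limit is the total input size divided by minPartitions, rounded up. The class name WholeFileSplitSketch and its methods are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of packing whole (unsplittable) files into splits.
final class WholeFileSplitSketch {

    // Assumed limit: total bytes divided by minPartitions, rounded up.
    static long maxSplitSize(long[] fileSizes, int minPartitions) {
        long total = 0L;
        for (long size : fileSizes) {
            total += size;
        }
        int parts = Math.max(minPartitions, 1);
        return (total + parts - 1) / parts; // integer ceiling
    }

    // Greedily packs files into splits; a file is never cut in two.
    // Returns, for each split, the indices of the files it contains.
    static List<List<Integer>> pack(long[] fileSizes, int minPartitions) {
        long limit = maxSplitSize(fileSizes, minPartitions);
        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0L;
        for (int i = 0; i < fileSizes.length; i++) {
            current.add(i);
            currentSize += fileSizes[i];
            if (currentSize >= limit) { // close the split once the limit is reached
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0L;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current); // leftover files form the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] oneBigFile = {1_000_000L};
        long[] manySmallFiles = {100L, 100L, 100L, 100L, 100L, 100L};
        System.out.println(pack(oneBigFile, 8).size());     // prints 1
        System.out.println(pack(manySmallFiles, 3).size()); // prints 3
    }
}
```

In this model a single file always ends up in one split, however large minPartitions is, while several files can be spread over several splits. On a real cluster the actual number can be checked by calling partitions().size() on the JavaPairRDD returned by wholeTextFiles(path, minPartitions).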

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
EDIT: not a mapper but a task for HadoopRDD, as far as I know. I think the clearest way is just to run a job on multiple files with the API and check the number of tasks in the job. On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon" wrote: wholeTextFiles() API uses

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
The wholeTextFiles() API uses WholeTextFileInputFormat, https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala, which returns false for isSplitable. In this case, only a single mapper appears for the entire

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Vadim Vararu
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
And also https://spark.apache.org/docs/1.6.0/programming-guide.html If the input is a single file, then the read would not be distributed. On 26 Apr 2016 11:52 p.m., "Ted Yu" wrote: > Please take a look at: > core/src/main/scala/org/apache/spark/SparkContext.scala > >* Do `val

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Ted Yu
Please take a look at: core/src/main/scala/org/apache/spark/SparkContext.scala * Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`, * * then `rdd` contains * {{{ * (a-hdfs-path/part-0, its content) * (a-hdfs-path/part-1, its content) * ... *

Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Vadim Vararu
Hi guys, I'm trying to read many files from S3 using JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed manner? Please give me a link to the place in the documentation where it's specified. Thanks, Vadim.