WholeTextFileInputFormat extends CombineFileInputFormat from Hadoop. isSplittable = false means each individual file cannot be split, but many small files can still be combined into a single partition. If you only see one partition even with a large minPartitions, the total size of the files is probably not big enough to produce more splits. The split sizes are configurable in the Hadoop conf. -Xiangrui
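For illustration, a minimal sketch of checking how minPartitions affects the
partition count; the path and values here are hypothetical:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("whole-text-files-check").setMaster("local[4]")
  val sc = new SparkContext(conf)

  // wholeTextFiles returns an RDD[(path, content)]; minPartitions is only a
  // suggestion -- with too little total input, fewer splits are produced.
  val rdd = sc.wholeTextFiles("hdfs://a-hdfs-path", minPartitions = 32)
  println(s"partitions = ${rdd.partitions.length}")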
On Tue, Apr 26, 2016, 8:32 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> EDIT: not a mapper but a task for HadoopRDD, maybe, as far as I know.
>
> I think the clearest way is just to run a job on multiple files with the
> API and check the number of tasks in the job.
>
> On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon" <gurwls...@gmail.com> wrote:
>
> The wholeTextFiles() API uses WholeTextFileInputFormat,
> https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala,
> which returns false for isSplittable. In that case, only a single mapper
> is created for each entire file, as far as I know.
>
> See also https://spark.apache.org/docs/1.6.0/programming-guide.html
>
> If the input is a single file, then reading it would not be distributed.
>
> On 26 Apr 2016 11:52 p.m., "Ted Yu" <yuzhih...@gmail.com> wrote:
>
>> Please take a look at
>> core/src/main/scala/org/apache/spark/SparkContext.scala:
>>
>>  * Do `val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")`,
>>  *
>>  * <p> then `rdd` contains
>>  * {{{
>>  *   (a-hdfs-path/part-00000, its content)
>>  *   (a-hdfs-path/part-00001, its content)
>>  *   ...
>>  *   (a-hdfs-path/part-nnnnn, its content)
>>  * }}}
>> ...
>>  * @param minPartitions A suggestion value of the minimal splitting
>>  *   number for input data.
>>
>>   def wholeTextFiles(
>>       path: String,
>>       minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
>>     withScope {
>>
>> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu <vadim.var...@adswizz.com>
>> wrote:
>>
>>> Hi guys,
>>>
>>> I'm trying to read many files from S3 using
>>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>>> manner? Please give me a link to the place in the documentation where
>>> it's specified.
>>>
>>> Thanks, Vadim.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
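As Hyukjin suggests above, the simplest check is empirical: read several
files and compare the number of files to the number of partitions (and
hence tasks). A minimal sketch, assuming a hypothetical S3 bucket holding
many small files:

  // Hypothetical path; any directory containing many small files works.
  val rdd = sc.wholeTextFiles("s3a://some-bucket/many-small-files/")

  // Each (path, content) pair comes from one whole file; the job below
  // runs one task per partition, which you can confirm in the Spark UI.
  val sizes = rdd.mapValues(_.length).collect()
  println(s"files = ${sizes.length}, partitions/tasks = ${rdd.partitions.length}")

Seeing the tasks spread across executors confirms the read is distributed
over the set of files, even though each individual file is read whole by a
single task.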