It extends CombineFileInputFormat from Hadoop. isSplittable=false means each
individual file cannot be split, but several small files can still be combined
into a single partition. If you only see one partition even with a large
minPartitions, the total size of the files is probably not big enough to fill
more than one combine split. Those split sizes are configurable in the Hadoop
conf. -Xiangrui
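
A minimal sketch of how this plays out in practice (the S3 path is made up, and
the commented Hadoop property is an assumption to verify against your
Hadoop/Spark versions, since Spark may also set the combine split size itself):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wholeTextFiles-check"))

// Optionally tune the combine split size through the Hadoop conf (assumption:
// the key name and whether it takes effect depend on the Hadoop/Spark versions):
// sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", (64 << 20).toString)

// Each (path, content) pair holds one whole file. Files are never split, but small
// files can be combined into the same partition, so minPartitions is only a hint.
val rdd = sc.wholeTextFiles("s3a://my-bucket/many-small-files/", minPartitions = 32)
println(s"partitions: ${rdd.getNumPartitions}")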

On Tue, Apr 26, 2016, 8:32 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> EDIT: not a mapper but a task for the underlying HadoopRDD, as far as I know.
>
> I think the clearest way is just to run a job on multiple files with the
> API and check the number of tasks in the job.
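>
> For example, a quick check along those lines (the path is hypothetical; run it
> in spark-shell or any driver where `sc` is a live SparkContext):
>
> val rdd = sc.wholeTextFiles("s3a://my-bucket/a-few-files/")
> rdd.count()  // triggers a job; the stage's task count shows up in the Spark UI
> println(rdd.getNumPartitions)  // one task per partition for this stage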
> On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon" <gurwls...@gmail.com> wrote:
>
> The wholeTextFiles() API uses WholeTextFileInputFormat,
> https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala,
> which returns false for isSplittable. In this case, only a single mapper
> handles each entire file, as far as I know.
>
> See also https://spark.apache.org/docs/1.6.0/programming-guide.html
>
> If the input is a single file, then reading it would not be distributed.
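>
> As a small illustration of that point (path made up; since a single file cannot
> be split, it should come back as exactly one partition even with a larger
> minPartitions hint):
>
> val single = sc.wholeTextFiles("s3a://my-bucket/one-big-file.txt", minPartitions = 8)
> println(single.getNumPartitions)  // expected: 1, so no distribution across tasks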
> On 26 Apr 2016 11:52 p.m., "Ted Yu" <yuzhih...@gmail.com> wrote:
>
>> Please take a look at:
>> core/src/main/scala/org/apache/spark/SparkContext.scala
>>
>>    * Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
>>    *
>>    * <p> then `rdd` contains
>>    * {{{
>>    *   (a-hdfs-path/part-00000, its content)
>>    *   (a-hdfs-path/part-00001, its content)
>>    *   ...
>>    *   (a-hdfs-path/part-nnnnn, its content)
>>    * }}}
>> ...
>>   * @param minPartitions A suggestion value of the minimal splitting
>> number for input data.
>>
>>   def wholeTextFiles(
>>       path: String,
>>       minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
>>
>> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu <vadim.var...@adswizz.com>
>> wrote:
>>
>>> Hi guys,
>>>
>>> I'm trying to read many files from s3 using
>>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>>> manner? Please give me a link to the place in the documentation where it's
>>> specified.
>>>
>>> Thanks, Vadim.
>>>
>>
