Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

Cheolsoo Park Tue, 12 Jan 2016 15:49:23 -0800

Alex, see this JIRA:
https://issues.apache.org/jira/browse/SPARK-9926


On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:

> Ran into this need myself. Does Spark have an equivalent of
> "mapreduce.input.fileinputformat.list-status.num-threads"?
>
> Thanks.
>
> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <piaozhe...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am wondering if anyone has successfully enabled
>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>> and Pig). But for some reason, this property does not take effect in Spark's
>> HadoopRDD, resulting in a serious delay in file listing.
>>
>> I verified that the property is indeed set in HadoopRDD by logging the
>> value of the property in the getPartitions() function. I also attached
>> VisualVM to the Spark and Pig clients, which show the following:
>>
>> In Pig, I can see 25 threads running in parallel for file listing:
>> [image: Inline image 1]
>>
>> In Spark, I see only 2 threads running in parallel for file listing:
>> [image: Inline image 2]
>>
>> What's strange is that the number of concurrent threads in Spark is throttled
>> no matter how high I set
>> "mapreduce.input.fileinputformat.list-status.num-threads".
>>
>> Is anyone using Spark with this property enabled? If so, can you please
>> share how you do it?
>>
>> Thanks!
>> Cheolsoo
>>
>
>
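For readers unfamiliar with what a list-status thread count controls, here is a
minimal sketch of the general idea: listing many input directories through a
fixed-size thread pool instead of one at a time. This is not Hadoop's or
Spark's actual code. The class and method names (ParallelListing, listOne,
listAll) are hypothetical, and it lists local directories with java.nio
rather than going through the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not taken from Hadoop or Spark.
public class ParallelListing {

    // List the direct children of a single directory.
    static List<Path> listOne(Path dir) {
        try (Stream<Path> children = Files.list(dir)) {
            return children.collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // List every directory in dirs using at most numThreads concurrent workers.
    static List<Path> listAll(List<Path> dirs, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Submit one listing task per directory.
            List<Future<List<Path>>> pending = new ArrayList<>();
            for (Path dir : dirs) {
                Callable<List<Path>> task = () -> listOne(dir);
                pending.add(pool.submit(task));
            }
            // Collect the results in submission order.
            List<Path> result = new ArrayList<>();
            for (Future<List<Path>> future : pending) {
                result.addAll(future.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a few temporary directories that stand in for input paths.
        List<Path> dirs = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            Path dir = Files.createTempDirectory("part-" + i);
            Files.createFile(dir.resolve("data-0"));
            Files.createFile(dir.resolve("data-1"));
            dirs.add(dir);
        }
        // 25 mirrors the value mentioned in the thread.
        System.out.println(listAll(dirs, 25).size() + " files listed");
    }
}
```

With a pool like this, the number of listings in flight is bounded by the pool
size, which is why seeing only 2 active threads in VisualVM suggests the
configured value is not reaching the code that does the listing.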
