Thanks. I was actually able to get
mapreduce.input.fileinputformat.list-status.num-threads working in Spark
against a regular fileset in S3, in Spark 1.5.2 ... looks like the issue is
isolated to Hive.
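
For anyone who hits this later, here is a rough sketch of the kind of setup I
mean (the bucket path is a placeholder, and setting the property through a
"spark.hadoop."-prefixed conf entry is just one way to pass it along):

import org.apache.spark.{SparkConf, SparkContext}

// "spark.hadoop."-prefixed entries are copied (minus the prefix) into the
// Hadoop Configuration that Spark hands to HadoopRDD, so the listing-threads
// setting should reach FileInputFormat.
val conf = new SparkConf()
  .setAppName("s3-list-status-test")
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "25")
val sc = new SparkContext(conf)

// sc.textFile builds a HadoopRDD; computing its partitions is where the
// (hopefully parallel) file listing happens.
val files = sc.textFile("s3n://some-bucket/some/fileset/*")
println(files.partitions.length)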

On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park <piaozhe...@gmail.com> wrote:

> Alex, see this jira-
> https://issues.apache.org/jira/browse/SPARK-9926
>
> On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
>> Ran into this need myself. Does Spark have an equivalent of
>> "mapreduce.input.fileinputformat.list-status.num-threads"?
>>
>> Thanks.
>>
>> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <piaozhe...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am wondering if anyone has successfully enabled
>>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>>> and Pig). But for some reason, the property does not take effect in Spark's
>>> HadoopRDD, resulting in a serious delay in file listing.
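>>>
>>> To make this concrete, below is a minimal sketch of the kind of Spark job I
>>> mean (the input path is a placeholder):
>>>
>>> import org.apache.hadoop.io.{LongWritable, Text}
>>> import org.apache.hadoop.mapred.TextInputFormat
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> val sc = new SparkContext(new SparkConf().setAppName("hadooprdd-listing"))
>>> // Put the property on the Hadoop Configuration that HadoopRDD sees when
>>> // it computes its partitions.
>>> sc.hadoopConfiguration.set(
>>>   "mapreduce.input.fileinputformat.list-status.num-threads", "25")
>>>
>>> // HadoopRDD.getPartitions (triggered here) goes through
>>> // FileInputFormat.getSplits, which is where the input files get listed.
>>> val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](
>>>   "hdfs:///warehouse/some/large/partitioned/table")
>>> println(rdd.partitions.length)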
>>>
>>> I verified that the property is indeed set in HadoopRDD by logging its
>>> value inside the getPartitions() function. I also attached VisualVM to the
>>> Spark and Pig clients; the thread views look as follows-
>>>
>>> In Pig, I can see 25 threads running in parallel for file listing-
>>> [image: Inline image 1]
>>>
>>> In Spark, I only see 2 threads running in parallel for file listing-
>>> [image: Inline image 2]
>>>
>>> What's strange is that the number of concurrent listing threads in Spark
>>> stays capped no matter how high I set
>>> "mapreduce.input.fileinputformat.list-status.num-threads".
>>>
>>> Is anyone using Spark with this property enabled? If so, can you please
>>> share how you do it?
>>>
>>> Thanks!
>>> Cheolsoo
>>>
>>
>>
>
