I ran into this need myself. Does Spark have an equivalent of
"mapreduce.input.fileinputformat.list-status.num-threads"?

Thanks.

On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <piaozhe...@gmail.com> wrote:

> Hi,
>
> I am wondering if anyone has successfully enabled
> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
> usually set this property to 25 to speed up file listing in MR jobs (Hive
> and Pig). But for some reason, this property does not take effect in Spark's
> HadoopRDD, resulting in a serious delay in file listing.
>
> I verified that the property is indeed set in HadoopRDD by logging the
> value of the property in the getPartitions() function. I also tried to
> attach VisualVM to the Spark and Pig clients, which look as follows:
>
> In Pig, I can see 25 threads running in parallel for file listing:
> [image: Inline image 1]
>
> In Spark, I only see 2 threads running in parallel for file listing:
> [image: Inline image 2]
>
> What's strange is that the number of concurrent threads in Spark is
> throttled no matter how high I set
> "mapreduce.input.fileinputformat.list-status.num-threads".
>
> Is anyone using Spark with this property enabled? If so, can you please
> share how you do it?
>
> Thanks!
> Cheolsoo
>
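
For readers unfamiliar with the property discussed above: in Hadoop it caps
the number of threads FileInputFormat uses to list the input paths. The
sketch below only illustrates that idea in plain Python with a thread pool.
It is not Hadoop's or Spark's implementation, and the names
list_files_parallel and num_threads are hypothetical, chosen for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_files_parallel(dirs, num_threads=1):
    """List regular files under each directory with at most num_threads workers.

    Hypothetical helper for illustration only; it mimics the idea of a
    configurable listing thread count, not any Hadoop or Spark API.
    """
    def list_one(path):
        # One listing call per input directory.
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if os.path.isfile(os.path.join(path, name))
        )

    # max_workers caps the concurrency; remaining directories wait their turn.
    with ThreadPoolExecutor(max_workers=max(1, num_threads)) as pool:
        per_dir = list(pool.map(list_one, dirs))  # results keep input order

    # Flatten the per-directory results into a single list of file paths.
    return [path for files in per_dir for path in files]
```

With num_threads=1 the directories are listed one at a time; with a larger
value, up to that many listings can be in flight at once, which is the
behavior the VisualVM comparison above is looking for. The result is the
same either way; only the wall-clock time should differ.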
