Ran into this need myself. Does Spark have an equivalent of "mapreduce.input.fileinputformat.list-status.num-threads"?
Thanks.

On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <piaozhe...@gmail.com> wrote:

> Hi,
>
> I am wondering if anyone has successfully enabled
> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
> usually set this property to 25 to speed up file listing in MR jobs (Hive
> and Pig). But for some reason, this property does not take effect in Spark
> HadoopRDD resulting in serious delay in file listing.
>
> I verified that the property is indeed set in HadoopRDD by logging the
> value of the property in the getPartitions() function. I also tried to
> attach VisualVM to Spark and Pig clients, which look as follows-
>
> In Pig, I can see 25 threads running in parallel for file listing-
> [image: Inline image 1]
>
> In Spark, I only see 2 threads running in parallel for file listing-
> [image: Inline image 2]
>
> What's strange is that the # of concurrent threads in Spark is throttled
> no matter how high I
> set "mapreduce.input.fileinputformat.list-status.num-threads".
>
> Is anyone using Spark with this property enabled? If so, can you please
> share how you do it?
>
> Thanks!
> Cheolsoo
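
For anyone trying to reproduce this, here is a minimal sketch of one way the property could be handed to a Spark job from the command line. It relies on Spark copying any `spark.hadoop.*` setting into the job's Hadoop configuration. The helper name `spark_hadoop_conf_args` and the script name `my_job.py` are made up for illustration. Note that this only gets the value into the configuration; whether HadoopRDD then uses it for file listing is exactly the open question in this thread.

```python
def spark_hadoop_conf_args(hadoop_props):
    """Turn Hadoop properties into spark-submit --conf arguments.

    Each key is prefixed with "spark.hadoop." so that Spark forwards it
    to the Hadoop configuration of the job. (Hypothetical helper.)
    """
    args = []
    for key, value in sorted(hadoop_props.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# The property and value discussed in this thread.
props = {"mapreduce.input.fileinputformat.list-status.num-threads": 25}

# my_job.py is a placeholder for the actual application.
command = ["spark-submit"] + spark_hadoop_conf_args(props) + ["my_job.py"]
print(" ".join(command))
```

Logging the value inside the job, as described in the quoted message, is still the way to confirm that it actually arrived.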