Alex, see this JIRA: https://issues.apache.org/jira/browse/SPARK-9926
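
For anyone who wants to try the property from the quoted thread, here is a minimal sketch of one common way to pass a Hadoop property to a Spark job: prefix it with "spark.hadoop." on the spark-submit command line, and Spark copies it into the job's Hadoop Configuration. The helper name hadoop_conf_args and the application file my_job.py are hypothetical, used only for illustration. Note that this only shows how the value is supplied. The original report below says the value was visible in getPartitions() and the listing still did not use more threads, so this is not claimed to be a fix.

```python
import shlex

# Property name taken from the thread; 25 is the value the original poster uses in MR jobs.
LIST_STATUS_THREADS = "mapreduce.input.fileinputformat.list-status.num-threads"


def hadoop_conf_args(props):
    """Turn Hadoop properties into spark-submit --conf arguments.

    Each key gets the "spark.hadoop." prefix, which Spark strips when it
    copies the value into the job's Hadoop Configuration.
    (hadoop_conf_args is a hypothetical helper, not part of Spark.)
    """
    args = []
    for key, value in sorted(props.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# my_job.py is a placeholder application, not something from this thread.
cmd = ["spark-submit"] + hadoop_conf_args({LIST_STATUS_THREADS: 25}) + ["my_job.py"]
print(" ".join(shlex.quote(part) for part in cmd))
```

The same key and value can also be set on the SparkContext's Hadoop configuration inside the job, which is where the original poster confirmed the value was present.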

On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <alex.nastet...@vervemobile.com> wrote:

> Ran into this need myself. Does Spark have an equivalent of
> "mapreduce.input.fileinputformat.list-status.num-threads"?
>
> Thanks.
>
> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <piaozhe...@gmail.com> wrote:
>
>> Hi,
>>
>> I am wondering if anyone has successfully enabled
>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>> and Pig). But for some reason, this property does not take effect in Spark
>> HadoopRDD resulting in serious delay in file listing.
>>
>> I verified that the property is indeed set in HadoopRDD by logging the
>> value of the property in the getPartitions() function. I also tried to
>> attach VisualVM to Spark and Pig clients, which look as follows-
>>
>> In Pig, I can see 25 threads running in parallel for file listing-
>> [image: Inline image 1]
>>
>> In Spark, I only see 2 threads running in parallel for file listing-
>> [image: Inline image 2]
>>
>> What's strange is that the # of concurrent threads in Spark is throttled
>> no matter how high I set
>> "mapreduce.input.fileinputformat.list-status.num-threads".
>>
>> Is anyone using Spark with this property enabled? If so, can you please
>> share how you do it?
>>
>> Thanks!
>> Cheolsoo