Thanks. I was actually able to get mapreduce.input.fileinputformat.list-status.num-threads working in Spark against a regular fileset in S3, in Spark 1.5.2 ... looks like the issue is isolated to Hive.
On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park <piaozhe...@gmail.com> wrote:

> Alex, see this jira-
> https://issues.apache.org/jira/browse/SPARK-9926
>
> On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <alex.nastet...@vervemobile.com> wrote:
>
>> Ran into this need myself. Does Spark have an equivalent of
>> "mapreduce.input.fileinputformat.list-status.num-threads"?
>>
>> Thanks.
>>
>> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <piaozhe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am wondering if anyone has successfully enabled
>>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>>> and Pig). But for some reason, this property does not take effect in Spark's
>>> HadoopRDD, resulting in serious delays in file listing.
>>>
>>> I verified that the property is indeed set in HadoopRDD by logging its
>>> value in the getPartitions() function. I also attached VisualVM to the
>>> Spark and Pig clients, which look as follows:
>>>
>>> In Pig, I can see 25 threads running in parallel for file listing:
>>> [image: Inline image 1]
>>>
>>> In Spark, I only see 2 threads running in parallel for file listing:
>>> [image: Inline image 2]
>>>
>>> What's strange is that the number of concurrent threads in Spark is
>>> throttled no matter how high I set
>>> "mapreduce.input.fileinputformat.list-status.num-threads".
>>>
>>> Is anyone using Spark with this property enabled? If so, can you please
>>> share how you do it?
>>>
>>> Thanks!
>>> Cheolsoo
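
For anyone landing on this thread later, here is a minimal sketch of how the property can be passed to a Spark job. This assumes Spark's standard "spark.hadoop.*" prefix (properties with that prefix are copied into the job's Hadoop Configuration); the jar name and main class are placeholders:

```
# Config fragment (hypothetical jar/class names): inject the Hadoop
# listing-thread property into the Spark job's Hadoop Configuration.
spark-submit \
  --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=25 \
  --class com.example.MyJob \
  my-job.jar
```

The same thing can be done programmatically before creating any RDDs, e.g. sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.list-status.num-threads", "25") on the driver. Note that, per SPARK-9926 linked above and the observation at the top of this thread, this takes effect for plain FileInputFormat-based reads but may not help paths that go through Hive's own split computation.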