[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-21137:
------------------------------
    Affects Version/s:     (was: 2.2.1)
                       2.1.1
             Priority: Minor  (was: Major)
           Issue Type: Improvement  (was: Bug)
              Summary: Spark reads many small files slowly  (was: Spark cannot read many small files (wholeTextFiles))

So just to move this along, I did the thread dump. Yeah, it's spending a huge 
amount of time examining the input files in the Hadoop {{InputFormat}}:

{code}
"main" #1 prio=5 os_prio=31 tid=0x00007fe85f004000 nid=0x1c03 runnable 
[0x0000700009a5e000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
        at java.lang.ProcessImpl.start(ProcessImpl.java:134)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:522)
        at org.apache.hadoop.util.Shell.run(Shell.java:478)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
        at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
        at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:587)
        at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:562)
        at 
org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:47)
        at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1701)
        at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
        at 
org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
        at 
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:49)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        ...
{code}

This is slow because it's single-threaded, and in the case of a local file 
system, actually uses things like {{ls}} to traverse the directories. I suspect 
it can only be worse on S3; not sure about HDFS.
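
For context, this is the sort of call that lands on that code path (a minimal 
sketch, assuming a spark-shell {{sc}} and a placeholder directory of many 
small files):

{code}
// Evaluating partitions triggers WholeTextFileInputFormat.setMinPartitions,
// which lists and stats every input file on a single thread, per the stack
// trace above.
val rdd = sc.wholeTextFiles("file:///tmp/many-small-files")
rdd.partitions.length  // blocks here while the listing runs
{code}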

Does this work need to be done at all? Well, Spark is doing it to figure out 
the max split size to configure, in order to enforce a minimum number of 
partitions. To me, this {{minPartitions}} argument probably should never have 
been there: just repartition as desired. The default is even capped at 2 
({{SparkContext.defaultMinPartitions}} is {{math.min(defaultParallelism, 2)}}). 
But the argument exists, and its default behavior is still relied on by a lot 
of methods.
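
As an illustration of the "just repartition" alternative -- a sketch only, 
with a placeholder path and an arbitrary partition count, again assuming a 
spark-shell {{sc}}:

{code}
// Take whatever partitioning the read produces and set the parallelism the
// job actually needs afterwards, instead of asking the input format to
// enforce a minimum via a computed max split size.
val docs = sc.wholeTextFiles("file:///tmp/many-small-files")
val repartitioned = docs.repartition(64)  // arbitrary example count
{code}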

I found there's a Hadoop option to list the directories in parallel, and that 
sped things up a lot -- it still took a minute or so to crunch through on my 
laptop, but that's much better than about 10. I think it's valid to set this 
listing parallelism to something like {{Runtime.getRuntime.availableProcessors}} 
just for these methods. They're expected to encounter a bunch of files, after 
all, even if a million is probably not a great idea.
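
Concretely, assuming the knob in question is Hadoop's 
{{mapreduce.input.fileinputformat.list-status.num-threads}} (the 
{{FileInputFormat}} list-status thread pool), a sketch of setting it from the 
caller side, with a placeholder path:

{code}
// Let FileInputFormat list directories with a thread pool sized to the number
// of cores rather than a single thread, then read as usual.
sc.hadoopConfiguration.setInt(
  "mapreduce.input.fileinputformat.list-status.num-threads",
  Runtime.getRuntime.availableProcessors)
val docs = sc.wholeTextFiles("file:///tmp/many-small-files")
{code}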

That much I think is an unobtrusive change that makes a big difference. I'd be 
OK with that.

> Spark reads many small files slowly
> -----------------------------------
>
>                 Key: SPARK-21137
>                 URL: https://issues.apache.org/jira/browse/SPARK-21137
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: sam
>            Priority: Minor
>
> A very common use case in big data is to read a large number of small files. 
> For example, the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark, one hits many issues. Firstly, 
> even if the data is small (each file only, say, 1 KB), any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks; I doubt it will ever finish).
> It seems all the code in Spark that manages file listing is single-threaded 
> and not well optimised. When I hand-crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output the line "1,227,645 input paths to process", 
> then another hour to output the same line again. Then it outputs a CSV of 
> all the input paths (creating a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> I've provided full reproduction steps (including code and cluster setup) 
> here: https://github.com/samthebest/scenron -- scroll down to "Bug In 
> Spark". You can just clone the repo and follow the README to reproduce 
> exactly.


