[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053977#comment-16053977 ]
sam commented on SPARK-21137:
-----------------------------

[~srowen] So I've provided full reproduction steps here (including code and cluster setup): https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can easily just clone it and follow the README to reproduce exactly!

> This reads the files into memory.

Yes, I'm aware that `wholeTextFiles` reads the entire files, but all the files are rather small. My hand-cranked code also slurps the entire files.

> Also, slow compared to what?

The link includes two versions of the same code. I killed the Spark version (after 5 hours of running); my hand-cranked version takes 11 minutes.

> Don't reopen this please. Someone will do that if it's appropriate.

Sorry, like I said, I just thought this was a known issue no one had bothered to add to JIRA because most people just hand crank their own workarounds.

> in the Hadoop APIs

Yes, it's likely that the underlying Hadoop APIs have some yucky code that does something silly; I have delved down there before and my stomach cannot handle it. Nevertheless, Spark made the choice to inherit the complexities of the Hadoop APIs, and reading multiple small files seems like a pretty basic use case for Spark (come on Sean, this is Enron data!). It would feel a bit perverse to just close this and blame the layer cake underneath. Spark should use its own extensions of the Hadoop APIs where the Hadoop APIs don't work (and the Hadoop code is easily extensible). A rough sketch of the kind of hand-cranked workaround I mean is appended below the quoted issue.

> Spark cannot read many small files (wholeTextFiles)
> ---------------------------------------------------
>
>                 Key: SPARK-21137
>                 URL: https://issues.apache.org/jira/browse/SPARK-21137
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: sam
>
> A very common use case in big data is to read a large number of small files. For example, the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only, say, 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks; I doubt it will ever finish).
> It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example.
> So it takes 1 hour to output the line "1,227,645 input paths to process", then it takes another hour to output the same line. Then it outputs a CSV of all the input paths (so it creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the newlines and write one file per line).
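For reference, a minimal sketch of the kind of hand-cranked workaround mentioned above. This is not the code from the scenron repo: the object name, paths, and partition count are made up, and it assumes the files sit somewhere reachable through the Hadoop FileSystem API (HDFS in the example). The idea is to list the files once up front, parallelize the list of paths, and slurp each small file inside the tasks, instead of pushing 1.2M paths through the input-format listing that `wholeTextFiles` uses.

{code}
import java.io.ByteArrayOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils
import org.apache.spark.sql.SparkSession

object SmallFilesWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("enron-escape").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical locations -- substitute the real cluster paths.
    val inputDir  = new Path("hdfs:///data/enron")
    val outputDir = "hdfs:///data/enron-escaped"

    // List every file exactly once, up front, rather than re-enumerating
    // ~1.2M paths inside the Hadoop input format.
    val fs = inputDir.getFileSystem(sc.hadoopConfiguration)
    val it = fs.listFiles(inputDir, true) // recursive listing
    val paths = scala.collection.mutable.ArrayBuffer.empty[String]
    while (it.hasNext) paths += it.next().getPath.toString

    // Ship the path list to the executors and read each small file inside a task.
    val escaped = sc.parallelize(paths.toVector, numSlices = 1000).map { p =>
      val path = new Path(p)
      val taskFs = path.getFileSystem(new Configuration())
      val in = taskFs.open(path)
      try {
        val buf = new ByteArrayOutputStream()
        IOUtils.copyBytes(in, buf, 4096, false)
        // Escape backslashes first, then newlines, so each file becomes one line.
        new String(buf.toByteArray, "UTF-8").replace("\\", "\\\\").replace("\n", "\\n")
      } finally {
        in.close()
      }
    }

    escaped.saveAsTextFile(outputDir) // one escaped line per input file
    spark.stop()
  }
}
{code}

The point is simply that the listing happens once on the driver and the per-file work is spread across the executors; whether this is the right long-term fix for Spark itself is exactly what this ticket is about.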