[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054111#comment-16054111 ]
sam edited comment on SPARK-21137 at 6/19/17 2:35 PM:
------------------------------------------------------

[~srowen]

> what stages are executing if any?

None; no tasks are started. The logs do NOT contain any line of the form:

{code}
17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ...
{code}

As I have explained, *it takes 2 hours just to output the text "1,227,645 input paths to process" (twice)*. My hand-cranked, single-threaded code does the job in 11 minutes. It's not rocket science: it's reading some files and writing them back out again.

> If no stages are executing, what is the driver executing (thread dump)?

I don't know what the driver is doing; I'm not trying to debug the issue here, I'm just trying to raise a bug. Once the bug is accepted we can start trying to debug and fix it.

> (This is the kind of thing that should go into a mailing list exchange)

If there were a configuration setting along the lines of "don't spend 2 hours just counting how many files there are to process", then indeed I could see this being a silly user error :) Right now I find it hard to believe such a setting exists, so I still believe this belongs in a bug JIRA. Perhaps we ought to ask some other people to take a look, as we don't seem to be reaching a conclusion.

> Spark cannot read many small files (wholeTextFiles)
> ---------------------------------------------------
>
>                 Key: SPARK-21137
>                 URL: https://issues.apache.org/jira/browse/SPARK-21137
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: sam
>
> A very common use case in big data is reading a large number of small files.
> For example, the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark, one hits many issues. Firstly,
> even if the data is small (say 1 KB per file), any job can take a very long
> time (I have a simple job that has been running for 3 hours and has not yet
> got to the point of starting any tasks; I doubt it will ever finish).
> All the code in Spark that manages file listing appears to be single-threaded
> and not well optimised. When I hand-crank the code and don't use Spark, my
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems rather
> surprising that Spark cannot read the Enron data, given that it's such a
> quintessential example.
> It takes 1 hour to output the line "1,227,645 input paths to process", then
> another hour to output the same line again. Then it outputs a CSV of all the
> input paths (creating a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> I've provided full reproduction steps (including code and cluster setup) at
> https://github.com/samthebest/scenron; scroll down to "Bug In Spark". You can
> simply clone the repo and follow the README to reproduce exactly!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
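Editor's note on the "configuration setting" question raised in the comment: Hadoop's {{FileInputFormat}} does expose a setting, {{mapreduce.input.fileinputformat.list-status.num-threads}} (default 1), that parallelises input-path listing on the driver. Whether the {{wholeTextFiles}} code path honours it is an assumption, not something confirmed in this thread, so this is a hedged suggestion rather than a verified fix. With Spark, Hadoop properties can be passed via the {{spark.hadoop.}} prefix:

```shell
# Hedged suggestion: raise FileInputFormat's listing parallelism via the
# Hadoop configuration. Whether the wholeTextFiles path honours this
# property is an assumption; the value 32 is illustrative, not tuned.
spark-submit \
  --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=32 \
  my-job.jar
```

If the listing bottleneck is elsewhere (e.g. in {{CombineFileInputFormat}} split computation), this setting would not help, which is why a driver thread dump would still be informative.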
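The "hand-cranked" approach the reporter contrasts with Spark is not shown in the thread; as a hypothetical sketch (not the reporter's actual code), listing a large directory tree with a thread pool, one worker per top-level subdirectory, illustrates how listing can be parallelised instead of done single-threaded:

```python
# Hypothetical sketch of parallel input-path listing, the kind of
# "hand-cranked" alternative described in the comment above.
# Function and parameter names are illustrative, not from the source.
import os
from concurrent.futures import ThreadPoolExecutor

def list_files_parallel(root: str, workers: int = 16) -> list[str]:
    """Return all file paths under root, scanning each first-level
    subdirectory in a separate worker thread."""
    top = [os.path.join(root, e) for e in os.listdir(root)]
    dirs = [p for p in top if os.path.isdir(p)]
    files = [p for p in top if os.path.isfile(p)]

    def scan(d: str) -> list[str]:
        # Walk one subtree; each call runs on its own pool thread.
        out = []
        for dirpath, _, names in os.walk(d):
            out.extend(os.path.join(dirpath, n) for n in names)
        return out

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(scan, dirs):
            files.extend(chunk)
    return files
```

Whether threads help depends on the filesystem: on a local disk this is I/O-bound, but against HDFS or S3, where each listing call is a round trip, parallelism can matter a great deal, which is consistent with the two-hour listing times reported here.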