[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054111#comment-16054111 ]
sam edited comment on SPARK-21137 at 6/19/17 2:35 PM:
------------------------------------------------------

[~srowen]

> what stages are executing if any?

None; no tasks are started. The logs do NOT contain any line of the form:

{code}
17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ...
{code}

As I have explained, *it takes 2 hours just to output the text "1,227,645 input paths to process" (twice)*. My hand-cranked, single-threaded code does the job in 11 minutes. It's not rocket science: it's reading some files and writing them back out again.

> If no stages are executing, what is the driver executing (thread dump)?

I don't know what the driver is doing; I'm not trying to debug the issue here, I'm just trying to raise a bug. Once the bug is accepted we can start trying to debug and fix it.

> (This is the kind of thing that should go into a mailing list exchange)

If there were a configuration setting along the lines of "don't spend 2 hours just counting how many files there are to process", then indeed I could see this being a silly user error :) Right now I find it hard to believe such a setting exists, so I still believe this belongs in a bug JIRA. Perhaps we ought to ask some other people to take a look, as we don't seem to be reaching a conclusion.

> Spark cannot read many small files (wholeTextFiles)
> ---------------------------------------------------
>
>                 Key: SPARK-21137
>                 URL: https://issues.apache.org/jira/browse/SPARK-21137
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: sam
>
> A very common use case in big data is reading a large number of small files.
> For example, the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark, one hits many issues. Firstly,
> even if the data is small (say 1 KB per file), any job can take a very long
> time (I have a simple job that has been running for 3 hours and has not yet
> got to the point of starting any tasks; I doubt it will ever finish).
> All the code in Spark that manages file listing appears to be single-threaded
> and not well optimised. When I hand-crank the code and don't use Spark, my
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems rather
> surprising that Spark cannot read the Enron data, given that it's such a
> quintessential example.
> It takes 1 hour to output the line "1,227,645 input paths to process", then
> another hour to output the same line again. Then it outputs a CSV of all the
> input paths (creating a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> I've provided full reproduction steps (including code and cluster setup) at
> https://github.com/samthebest/scenron; scroll down to "Bug In Spark". You can
> simply clone the repo and follow the README to reproduce exactly!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
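Editor's note on the "configuration setting" question raised in the comment: Hadoop's {{FileInputFormat}} does expose a setting, {{mapreduce.input.fileinputformat.list-status.num-threads}} (default 1), that parallelises input-path listing on the driver. Whether the {{wholeTextFiles}} code path honours it is an assumption, not something confirmed in this thread, so this is a hedged suggestion rather than a verified fix. With Spark, Hadoop properties can be passed via the {{spark.hadoop.}} prefix:

```shell
# Hedged suggestion: raise FileInputFormat's listing parallelism via the
# Hadoop configuration. Whether the wholeTextFiles path honours this
# property is an assumption; the value 32 is illustrative, not tuned.
spark-submit \
  --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=32 \
  my-job.jar
```

If the listing bottleneck is elsewhere (e.g. in {{CombineFileInputFormat}} split computation), this setting would not help, which is why a driver thread dump would still be informative.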
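The "hand-cranked" approach the reporter contrasts with Spark is not shown in the thread; as a hypothetical sketch (not the reporter's actual code), listing a large directory tree with a thread pool, one worker per top-level subdirectory, illustrates how listing can be parallelised instead of done single-threaded:

```python
# Hypothetical sketch of parallel input-path listing, the kind of
# "hand-cranked" alternative described in the comment above.
# Function and parameter names are illustrative, not from the source.
import os
from concurrent.futures import ThreadPoolExecutor

def list_files_parallel(root: str, workers: int = 16) -> list[str]:
    """Return all file paths under root, scanning each first-level
    subdirectory in a separate worker thread."""
    top = [os.path.join(root, e) for e in os.listdir(root)]
    dirs = [p for p in top if os.path.isdir(p)]
    files = [p for p in top if os.path.isfile(p)]

    def scan(d: str) -> list[str]:
        # Walk one subtree; each call runs on its own pool thread.
        out = []
        for dirpath, _, names in os.walk(d):
            out.extend(os.path.join(dirpath, n) for n in names)
        return out

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(scan, dirs):
            files.extend(chunk)
    return files
```

Whether threads help depends on the filesystem: on a local disk this is I/O-bound, but against HDFS or S3, where each listing call is a round trip, parallelism can matter a great deal, which is consistent with the two-hour listing times reported here.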