sam created SPARK-21137:
---------------------------

             Summary: Spark cannot read many small files (wholeTextFiles)
                 Key: SPARK-21137
                 URL: https://issues.apache.org/jira/browse/SPARK-21137
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.1
            Reporter: sam


A very common use case in big data is reading a large number of small files.  
For example, the Enron email dataset consists of 1,227,645 small files.

When one tries to read this data with Spark, one hits several issues.  
Firstly, even when the data itself is small (say only 1 KB per file), any job 
can take a very long time: I have a simple job that has been running for 3 
hours and has not yet got to the point of starting any tasks, and I doubt it 
will ever finish.

It seems all the code in Spark that manages file listing is single-threaded 
and not well optimised.  When I hand-crank the listing code and don't use 
Spark, my job runs much faster.
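For reference, a minimal sketch of the hand-cranked approach (no Spark involved): listing and reading a directory of small files with a parallel stream so neither step is bottlenecked on one thread. The directory layout and file names here are invented for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelListing {

    // Walk the tree under `root` and read every regular file, using a
    // parallel stream so listing/reading is spread across worker threads.
    static Map<String, String> readAllParallel(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .parallel()
                        .collect(Collectors.toConcurrentMap(
                            Path::toString,
                            p -> {
                                try {
                                    return Files.readString(p);
                                } catch (IOException e) {
                                    throw new UncheckedIOException(e);
                                }
                            }));
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in corpus: 100 tiny files, mimicking many-small-files data.
        Path root = Files.createTempDirectory("smallfiles");
        for (int i = 0; i < 100; i++) {
            Files.writeString(root.resolve("msg-" + i + ".txt"), "body " + i);
        }
        Map<String, String> contents = readAllParallel(root);
        System.out.println(contents.size()); // prints 100
    }
}
```

This is roughly what `wholeTextFiles` produces (a path-to-content map), but with the enumeration parallelised on the driver side.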

Is it possible that I'm missing some configuration option? It seems quite 
surprising to me that Spark cannot read the Enron data, given that it's such 
a quintessential big-data example.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
