Peter Lee created SPARK-17661:
---------------------------------

             Summary: Consolidate various listLeafFiles implementations
                 Key: SPARK-17661
                 URL: https://issues.apache.org/jira/browse/SPARK-17661
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Peter Lee


There are 3 listLeafFiles-related functions in Spark:

- ListingFileCatalog.listLeafFiles (calls 
HadoopFsRelation.listLeafFilesInParallel when the number of paths passed in 
exceeds a threshold; otherwise it uses its own serial implementation)
- HadoopFsRelation.listLeafFiles (called only by 
HadoopFsRelation.listLeafFilesInParallel)
- HadoopFsRelation.listLeafFilesInParallel (called only by 
ListingFileCatalog.listLeafFiles)

This is confusing and error-prone because there are effectively two distinct 
serial implementations of leaf-file listing: one inside 
ListingFileCatalog.listLeafFiles and one in HadoopFsRelation.listLeafFiles. 
The code can be improved by:

- Moving all file listing code into ListingFileCatalog, since it is the only 
class that needs it.
- Keeping only one function for listing files in serial.
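
As a rough illustration of the target shape, here is a minimal sketch in which 
a single object owns all listing code and there is exactly one serial 
implementation, reused by both the small-input path and the parallel path. 
Everything in it is an assumption made for illustration: the object name 
LeafFileLister, the parallelThreshold value, the use of java.nio.file instead 
of the Hadoop FileSystem API, and Scala Futures standing in for the Spark job 
that the real parallel listing would launch. It assumes Scala 2.13 and is not 
Spark's actual code.

```scala
import java.nio.file.{Files, Path}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for the consolidated listing code; not Spark's API.
object LeafFileLister {
  // Assumed cutoff: above this many root paths, list in parallel.
  val parallelThreshold: Int = 32

  // The one and only serial implementation: recursively collect regular files.
  def listLeafFilesSerial(root: Path): Seq[Path] = {
    if (Files.isDirectory(root)) {
      val children = Files.list(root)
      // Materialize the children before closing the directory stream.
      try children.iterator().asScala.toList.flatMap(listLeafFilesSerial)
      finally children.close()
    } else if (Files.isRegularFile(root)) {
      Seq(root)
    } else {
      Seq.empty
    }
  }

  // Single entry point: choose serial or parallel, but always reuse
  // listLeafFilesSerial for the per-path work.
  def listLeafFiles(roots: Seq[Path]): Seq[Path] = {
    if (roots.size > parallelThreshold) {
      // Futures are only a placeholder for a distributed listing job.
      val perRoot = Future.traverse(roots)(r => Future(listLeafFilesSerial(r)))
      Await.result(perRoot, Duration.Inf).flatten
    } else {
      roots.flatMap(listLeafFilesSerial)
    }
  }
}
```

With this layout, callers only ever see listLeafFiles, and a fix to the 
serial traversal automatically applies to both code paths.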




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
