Rob Russo created SPARK-27801:
---------------------------------

             Summary: InMemoryFileIndex.listLeafFiles should use listLocatedStatus for DistributedFileSystem
                 Key: SPARK-27801
                 URL: https://issues.apache.org/jira/browse/SPARK-27801
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.3
            Reporter: Rob Russo


Currently in InMemoryFileIndex, all directory listings are done using 
FileSystem.listStatus followed by individual calls to 
FileSystem.getFileBlockLocations. This is painfully slow for directories that 
contain large numbers of files, because the process happens serially and 
parallelism is applied only at the folder level, not the file level.
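
For illustration, a minimal sketch of that pattern (simplified, not the actual 
InMemoryFileIndex code; the helper name is made up):

{code:scala}
import org.apache.hadoop.fs.{BlockLocation, FileStatus, FileSystem, Path}

// Sketch of the current pattern: one listStatus call for the directory,
// then one getFileBlockLocations call per file, issued serially.
def listWithSeparateLocationCalls(
    fs: FileSystem,
    dir: Path): Array[(FileStatus, Array[BlockLocation])] = {
  val statuses = fs.listStatus(dir)  // one RPC for the listing itself
  statuses.map { status =>
    // one additional RPC per file just to fetch its block locations
    (status, fs.getFileBlockLocations(status, 0L, status.getLen))
  }
}
{code}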

FileSystem also provides another API, listLocatedStatus, which returns 
LocatedFileStatus objects that already carry their block locations. In the base 
FileSystem class this simply delegates to listStatus and getFileBlockLocations, 
much like Spark does today. However, when HDFS is the backing file system, 
DistributedFileSystem overrides this method and makes a single call to the 
namenode that returns the directory listing together with the block locations. 
This avoids potentially thousands of calls to the namenode, and it is also more 
consistent: a file either exists with its locations or does not exist at all, 
rather than leaving a window for the FileNotFoundException case between listing 
a file and fetching its locations.
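
A rough sketch of consuming listLocatedStatus via Hadoop's standard 
RemoteIterator (hypothetical helper name):

{code:scala}
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

// Sketch: a single listLocatedStatus call returns statuses that already
// include block locations, so no per-file getFileBlockLocations RPCs are
// needed. DistributedFileSystem serves this directly from the namenode.
def listWithLocatedStatus(fs: FileSystem, dir: Path): Seq[LocatedFileStatus] = {
  val result = ArrayBuffer.empty[LocatedFileStatus]
  val iter = fs.listLocatedStatus(dir)
  while (iter.hasNext) {
    val status = iter.next()
    // status.getBlockLocations is already populated here
    result += status
  }
  result.toSeq
}
{code}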

For our example directory with 6500 files, the load time of spark.read.parquet 
was reduced 96x, from 76 seconds to 0.8 seconds. The savings only grows with 
the number of files in the directory.

Rather than using this method unconditionally, which in the default FileSystem 
implementation could surface a FileNotFoundException that is tough to decipher, 
the pull request uses it only when the FileSystem is a DistributedFileSystem; 
otherwise the old logic still applies.
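
A hypothetical sketch of that gating (the helper name and structure are 
illustrative, not the actual patch):

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem
import scala.collection.mutable.ArrayBuffer

// Illustrative gating: take the listLocatedStatus fast path only for HDFS,
// otherwise keep the existing listStatus-based logic.
def listStatuses(fs: FileSystem, dir: Path): Seq[FileStatus] = fs match {
  case _: DistributedFileSystem =>
    // Single namenode listing that already includes block locations.
    val buf = ArrayBuffer.empty[FileStatus]
    val iter = fs.listLocatedStatus(dir)
    while (iter.hasNext) buf += iter.next()
    buf.toSeq
  case _ =>
    // Old behavior: listStatus now, fetch block locations separately later.
    fs.listStatus(dir).toSeq
}
{code}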


