[ https://issues.apache.org/jira/browse/SPARK-27801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27801:
------------------------------------

    Assignee:     (was: Apache Spark)

> InMemoryFileIndex.listLeafFiles should use listLocatedStatus for DistributedFileSystem
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-27801
>                 URL: https://issues.apache.org/jira/browse/SPARK-27801
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: Rob Russo
>            Priority: Major
>
> Currently in InMemoryFileIndex, all directory listings are done using
> FileSystem.listStatus followed by individual calls to
> FileSystem.getFileBlockLocations. This is painfully slow for folders that
> contain large numbers of files, because the per-file calls happen serially:
> parallelism is applied only at the folder level, not the file level.
>
> FileSystem also provides another API, listLocatedStatus, which returns
> LocatedFileStatus objects that already carry the block locations. In the
> base FileSystem class this just delegates to listStatus and
> getFileBlockLocations, much as Spark does. However, when HDFS specifically
> is the backing file system, DistributedFileSystem overrides this method and
> makes a single call to the namenode to retrieve the directory listing
> together with the block locations. This avoids potentially thousands or more
> calls to the namenode. It is also more consistent, because a file either
> exists with its locations or does not exist, so the FileNotFoundException
> case does not arise.
>
> For our example directory with 6500 files, the load time of
> spark.read.parquet was reduced 96x, from 76 seconds to 0.8 seconds. The
> savings only grow with the number of files in the directory.
>
> The pull request does not use this method unconditionally, since in the
> default FileSystem implementation that could lead to a
> FileNotFoundException that is tough to decipher. Instead, the method is used
> only when the FileSystem is a DistributedFileSystem; otherwise the old logic
> still applies.
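
The difference in namenode round trips described above can be illustrated
with a small toy model. This is only a sketch and does not use Spark or
Hadoop: ToyNameNode, list_then_locate, list_located and make_namenode are
hypothetical names invented for this example, and the file names and host
names are made up. Following the description, the model assumes the
DistributedFileSystem override costs exactly one call per directory.

```python
from dataclasses import dataclass


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for an HDFS namenode that counts round trips."""
    files: dict          # file name -> list of block-location host names
    calls: int = 0       # number of simulated namenode calls so far

    def list_status(self, directory):
        # Models FileSystem.listStatus: names only, no block locations.
        self.calls += 1
        return sorted(self.files)

    def get_file_block_locations(self, name):
        # Models FileSystem.getFileBlockLocations: one call per file.
        self.calls += 1
        return self.files[name]

    def list_located_status(self, directory):
        # Models the DistributedFileSystem override of listLocatedStatus:
        # assumed here to return names and locations in a single call.
        self.calls += 1
        return [(name, self.files[name]) for name in sorted(self.files)]


def list_then_locate(nn, directory):
    """Old pattern: one listing call, then one location call per file."""
    return [(name, nn.get_file_block_locations(name))
            for name in nn.list_status(directory)]


def list_located(nn, directory):
    """New pattern: listing and locations fetched together."""
    return nn.list_located_status(directory)


def make_namenode(n_files):
    # Made-up file and host names, purely for illustration.
    return ToyNameNode({f"part-{i:05d}.parquet": [f"host{i % 3}"]
                        for i in range(n_files)})


old = make_namenode(6500)
new = make_namenode(6500)
old_result = list_then_locate(old, "/data")
new_result = list_located(new, "/data")

print("listStatus + getFileBlockLocations calls:", old.calls)  # 6501
print("listLocatedStatus calls:", new.calls)                    # 1
```

Both patterns return the same listing; only the number of simulated calls
differs (N + 1 versus 1 for N files), which is consistent with the reported
savings growing with the number of files in the directory.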