[ https://issues.apache.org/jira/browse/SPARK-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27232:
------------------------------------

    Assignee: Apache Spark

> Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27232
>                 URL: https://issues.apache.org/jira/browse/SPARK-27232
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: EdisonWang
>            Assignee: Apache Spark
>            Priority: Minor
>
> `InMemoryFileIndex` requests file block location information so that
> `TaskSetManager` can do locality-aware scheduling.
> This is usually a time-consuming task. For example, in our production
> environment there are 24 partitions holding 149925 files in total, 83 TB in
> size. It takes about 10 minutes to request the file block locations before a
> Spark job is submitted. Even after I set
> `spark.sql.sources.parallelPartitionDiscovery.threshold` to 24 to
> parallelize the listing, it still takes 2 minutes.
> This is wasted time if we don't care about the locality of files (for
> example, when storage and computation are separate).
> So there should be a conf to control whether we send the
> `getFileBlockLocations` request to the HDFS NameNode. If the user sets
> `spark.locality.wait` to 0, file block location information is meaningless.
> In this PR, if `spark.locality.wait` is set to 0, Spark no longer requests
> file location information, which saves several seconds to minutes.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
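
The behaviour proposed in the description can be sketched roughly as follows. This is only a minimal Python model of the idea, not the code from the PR: `parse_wait_ms`, `should_fetch_block_locations`, `list_files` and the `locate` callback are hypothetical names, the duration parser handles only a simplified subset of suffixes, and treating a bare number as milliseconds is an assumption of the sketch. The `"3s"` fallback mirrors Spark's documented default for `spark.locality.wait`.

```python
import re

# Unit suffixes mapped to milliseconds (a simplified subset, assumed for this sketch).
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "min": 60_000, "h": 3_600_000}


def parse_wait_ms(value: str) -> int:
    """Parse a duration such as "3s", "500ms" or "0" into milliseconds.

    A bare number is treated as milliseconds (assumption of this sketch).
    """
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", value.strip().lower())
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    amount, unit = match.groups()
    if unit == "":
        return int(amount)
    if unit not in _UNIT_MS:
        raise ValueError(f"unknown time unit: {unit!r}")
    return int(amount) * _UNIT_MS[unit]


def should_fetch_block_locations(conf: dict) -> bool:
    """Block locations only matter if the scheduler will wait for locality."""
    return parse_wait_ms(conf.get("spark.locality.wait", "3s")) > 0


def list_files(paths, conf, locate):
    """Return (path, hosts) pairs.

    `locate` stands in for the expensive per-file block location request;
    it is not called at all when the locality wait is zero.
    """
    fetch = should_fetch_block_locations(conf)
    return [(path, locate(path) if fetch else []) for path in paths]
```

With `{"spark.locality.wait": "0"}` the sketch returns every file with an empty host list and never invokes `locate`, which is where the time saving described above would come from.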