[ 
https://issues.apache.org/jira/browse/SPARK-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27232:
------------------------------------

    Assignee:     (was: Apache Spark)

> Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27232
>                 URL: https://issues.apache.org/jira/browse/SPARK-27232
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: EdisonWang
>            Priority: Minor
>
> `InMemoryFileIndex` needs to request file block location information so that 
> `TaskSetManager` can do locality-aware scheduling. 
> This is usually time-consuming. For example, in our production environment 
> there are 24 partitions containing a total of 149925 files, 83TB in size. It 
> takes about 10 minutes to request the file block locations before a Spark job 
> is submitted. Even when I set 
> `spark.sql.sources.parallelPartitionDiscovery.threshold` to 24 to parallelize 
> the listing, it still takes 2 minutes. 
> This time is wasted if we don't care about file locality (for example, when 
> storage and computation are separate).
> So there should be a conf to control whether we send 
> `getFileBlockLocations` requests to the HDFS NN. If the user sets 
> `spark.locality.wait` to 0, file block location information is meaningless. 
> In this PR, if `spark.locality.wait` is set to 0, file location information 
> is no longer requested, which saves several seconds to minutes.
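
The proposal in the description can be illustrated with a minimal sketch in plain Python. It is not Spark's implementation: `parse_time_ms`, `needs_block_locations`, `list_files` and the `locate` callback are hypothetical names made up for this example, and the sketch assumes that a bare number in `spark.locality.wait` means milliseconds and that an unset value falls back to Spark's documented default of `3s`.

```python
import re

# Duration suffixes accepted by this sketch, expressed in milliseconds.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "min": 60_000,
            "h": 3_600_000, "d": 86_400_000}


def parse_time_ms(value):
    """Parse a duration such as "3s", "100ms" or "0" into milliseconds."""
    match = re.fullmatch(r"(\d+)([a-z]*)", value.strip().lower())
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    number, unit = match.groups()
    unit = unit or "ms"  # assumption: a bare number is in milliseconds
    if unit not in _UNIT_MS:
        raise ValueError(f"unknown time unit: {unit!r}")
    return int(number) * _UNIT_MS[unit]


def needs_block_locations(conf):
    """Return False when locality waiting is disabled (wait == 0)."""
    wait = conf.get("spark.locality.wait", "3s")
    return parse_time_ms(wait) > 0


def list_files(paths, conf, locate):
    """Pair each path with its block locations, or None if lookup is skipped.

    `locate` stands in for the expensive per-file request to the NameNode.
    """
    fetch = needs_block_locations(conf)
    return [(path, locate(path) if fetch else None) for path in paths]
```

With `{"spark.locality.wait": "0"}` the `locate` callback is never invoked, so the per-file round trip described above is avoided entirely; with any positive wait the listing behaves as before.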



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
