Xingbo Jiang created SPARK-43043:
------------------------------------

             Summary: Improve the performance of MapOutputTracker.updateMapOutput
                 Key: SPARK-43043
                 URL: https://issues.apache.org/jira/browse/SPARK-43043
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.3.2
            Reporter: Xingbo Jiang

Inside MapOutputTracker, there is a line of code that does a linear search through the mapStatuses collection:

https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167

(plus a similar search a few lines down at https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174).

This scan is necessary because we only know the mapId of the updated status, not its mapPartitionId. We perform the scan once per migrated block, so each call costs O(n) in the number of map statuses, and if a large proportion of all blocks in the map are migrated, the total runtime across all of the calls is O(n^2).

I think we might be able to fix this by extending ShuffleStatus to have an OpenHashMap mapping from mapId to mapPartitionId.
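
To make the proposal concrete, here is a minimal sketch of the secondary index, not the actual Spark code. SimpleMapStatus and IndexedShuffleStatus are hypothetical, simplified stand-ins for MapStatus and ShuffleStatus, the location is reduced to a plain String, and scala.collection.mutable.HashMap stands in for Spark's internal OpenHashMap so that the snippet is self-contained. It covers only the primary lookup, not the similar search a few lines down.

```scala
import scala.collection.mutable

// Hypothetical stand-in for MapStatus: only the fields this sketch needs.
final case class SimpleMapStatus(mapId: Long, var location: String)

// Hypothetical stand-in for ShuffleStatus with the proposed index added.
final class IndexedShuffleStatus(numPartitions: Int) {
  // Slot i holds the status of map partition i, or null if none is registered.
  private val mapStatuses = new Array[SimpleMapStatus](numPartitions)

  // Proposed secondary index: mapId -> position in mapStatuses.
  // (Assumption: a plain HashMap here; the issue suggests OpenHashMap.)
  private val mapIdToMapIndex = mutable.HashMap.empty[Long, Int]

  def addMapOutput(mapIndex: Int, status: SimpleMapStatus): Unit = synchronized {
    // Drop the index entry of any status being overwritten in this slot.
    val previous = mapStatuses(mapIndex)
    if (previous != null) mapIdToMapIndex.remove(previous.mapId)
    mapStatuses(mapIndex) = status
    mapIdToMapIndex(status.mapId) = mapIndex
  }

  def removeMapOutput(mapIndex: Int): Unit = synchronized {
    val previous = mapStatuses(mapIndex)
    if (previous != null) {
      mapIdToMapIndex.remove(previous.mapId)
      mapStatuses(mapIndex) = null
    }
  }

  // Expected O(1) lookup by mapId instead of a linear scan over mapStatuses.
  // Returns true if a status with this mapId was found and updated.
  def updateMapOutput(mapId: Long, newLocation: String): Boolean = synchronized {
    mapIdToMapIndex.get(mapId) match {
      case Some(mapIndex) =>
        mapStatuses(mapIndex).location = newLocation
        true
      case None =>
        false
    }
  }

  def locationOf(mapIndex: Int): Option[String] = synchronized {
    Option(mapStatuses(mapIndex)).map(_.location)
  }
}
```

The main cost of this approach is that every code path that writes to or clears a slot in mapStatuses must also update the index. In this sketch that is addMapOutput and removeMapOutput; a real change would need to cover every mutation site in ShuffleStatus.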