[jira] [Updated] (SPARK-43043) Improve the performance of MapOutputTracker.updateMapOutput
[ https://issues.apache.org/jira/browse/SPARK-43043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-43043:
-----------------------------------
    Labels: pull-request-available  (was: )

> Improve the performance of MapOutputTracker.updateMapOutput
> -----------------------------------------------------------
>
>                 Key: SPARK-43043
>                 URL: https://issues.apache.org/jira/browse/SPARK-43043
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.2
>            Reporter: Xingbo Jiang
>            Assignee: Xingbo Jiang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> Inside of MapOutputTracker, there is a line of code which does a linear find
> through a mapStatuses collection:
> https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167
> (plus a similar search a few lines down at
> https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174)
> This scan is necessary because we only know the mapId of the updated status
> and not its mapPartitionId.
> We perform this scan once per migrated block, so if a large proportion of all
> blocks in the map are migrated then we get O(n^2) total runtime across all of
> the calls.
> I think we might be able to fix this by extending ShuffleStatus to have an
> OpenHashMap mapping from mapId to mapPartitionId.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
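The fix suggested in the issue description above (an index from mapId to mapPartitionId kept alongside mapStatuses) can be sketched roughly as follows. This is a minimal illustration, not the actual Spark patch: SimpleMapStatus and SimpleShuffleStatus are hypothetical stand-ins for MapStatus and ShuffleStatus, and scala.collection.mutable.HashMap is used in place of the OpenHashMap named in the description so that the block runs without Spark on the classpath.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for MapStatus: only the fields needed here.
final case class SimpleMapStatus(mapId: Long, location: String)

// Hypothetical stand-in for ShuffleStatus. The array is indexed by the map
// partition id; mapIdToMapIndex is the extra index proposed in the issue.
final class SimpleShuffleStatus(numPartitions: Int) {
  private val mapStatuses = new Array[SimpleMapStatus](numPartitions)
  private val mapIdToMapIndex = mutable.HashMap.empty[Long, Int]

  // Register an output and keep the index in sync with the array.
  def addMapOutput(mapIndex: Int, status: SimpleMapStatus): Unit = {
    val old = mapStatuses(mapIndex)
    if (old != null) {
      mapIdToMapIndex.remove(old.mapId)
    }
    mapStatuses(mapIndex) = status
    mapIdToMapIndex(status.mapId) = mapIndex
  }

  // Baseline described in the issue: a linear scan, O(n) per call.
  def findByScan(mapId: Long): Option[SimpleMapStatus] =
    mapStatuses.find(s => s != null && s.mapId == mapId)

  // Indexed update: expected O(1) per call instead of a scan.
  // Returns false when the mapId is unknown.
  def updateMapOutput(mapId: Long, newLocation: String): Boolean =
    mapIdToMapIndex.get(mapId) match {
      case Some(idx) =>
        mapStatuses(idx) = mapStatuses(idx).copy(location = newLocation)
        true
      case None =>
        false
    }

  def locationOf(mapId: Long): Option[String] =
    mapIdToMapIndex.get(mapId).map(idx => mapStatuses(idx).location)
}
```

With n map outputs and m migrated blocks, the scan costs O(n) per update and O(n * m) overall, which is the O(n^2) case when most blocks migrate. The index brings each update down to an expected constant-time lookup, at the cost of keeping the map consistent whenever an entry in mapStatuses is added or replaced.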
[jira] [Updated] (SPARK-43043) Improve the performance of MapOutputTracker.updateMapOutput
[ https://issues.apache.org/jira/browse/SPARK-43043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-43043:
---------------------------------
    Fix Version/s: 3.5.0
                       (was: 3.4.1)

> Improve the performance of MapOutputTracker.updateMapOutput
> -----------------------------------------------------------
>
>                 Key: SPARK-43043
>                 URL: https://issues.apache.org/jira/browse/SPARK-43043
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.2
>            Reporter: Xingbo Jiang
>            Assignee: Xingbo Jiang
>            Priority: Major
>             Fix For: 3.5.0