[ https://issues.apache.org/jira/browse/SPARK-33781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422955#comment-17422955 ]
Apache Spark commented on SPARK-33781:
--------------------------------------

User 'rmcyang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34158

> Improve caching of MergeStatus on the executor side to save memory
> ------------------------------------------------------------------
>
>                 Key: SPARK-33781
>                 URL: https://issues.apache.org/jira/browse/SPARK-33781
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Min Shen
>            Priority: Major
>
> In MapOutputTrackerWorker, the MapStatus or MergeStatus array retrieved
> from the driver for a given shuffle is cached in memory, so that all tasks
> performing shuffle fetches for that shuffle can reuse the cached metadata.
> However, unlike the MapStatus array, where each task needs to access every
> single instance in the array, each task needs only one or a few MergeStatus
> objects from the MergeStatus array, depending on which shuffle partitions
> the task is processing.
> For large shuffles with tens or hundreds of thousands of shuffle
> partitions, caching the entire deserialized and decompressed MergeStatus
> array on the executor side, when perhaps only 0.1% of it will be used by
> the tasks running in that executor, is a huge waste of memory.
> We could improve this by caching the serialized and compressed bytes of the
> MergeStatus array instead, and caching only the needed deserialized
> MergeStatus objects on the executor side. In addition to saving memory,
> this also helps reduce GC pressure on the executor side.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
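The caching scheme the description proposes can be sketched roughly as follows. This is a minimal illustration, not Spark's actual implementation: the `MergeStatus` case class, `compress` helper, and `ShuffleMergeStatuses` cache are all hypothetical stand-ins, assuming Java serialization plus GZIP for the compressed bytes. The key idea is that the long-lived cache holds only the compressed byte array, while deserialized `MergeStatus` objects are materialized transiently and retained only for the partitions actually requested.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for Spark's MergeStatus; not the real class.
case class MergeStatus(partitionId: Int, location: String, size: Long)

object MergeStatusCache {
  // Serialize and compress the whole array once, e.g. as received from the driver.
  def compress(statuses: Array[MergeStatus]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new GZIPOutputStream(bytes))
    out.writeObject(statuses)
    out.close()
    bytes.toByteArray
  }

  // Per-shuffle cache: the compressed bytes are kept long-term; deserialized
  // MergeStatus objects are cached only for partitions that were requested.
  final class ShuffleMergeStatuses(compressed: Array[Byte]) {
    private val byPartition = TrieMap.empty[Int, MergeStatus]

    def get(partitionId: Int): Option[MergeStatus] =
      byPartition.get(partitionId).orElse {
        // Deserialize transiently: the full array becomes garbage right after
        // we retain just the one entry this task needs.
        val in = new ObjectInputStream(
          new GZIPInputStream(new ByteArrayInputStream(compressed)))
        val all = in.readObject().asInstanceOf[Array[MergeStatus]]
        in.close()
        all.find(_.partitionId == partitionId).map { s =>
          byPartition.putIfAbsent(partitionId, s)
          s
        }
      }
  }
}
```

Under this sketch, repeated lookups for the same partition hit the small `byPartition` map, and the executor never holds the full decompressed array beyond the scope of a single lookup.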