[ https://issues.apache.org/jira/browse/SPARK-36967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496509#comment-17496509 ]
Apache Spark commented on SPARK-36967:
--------------------------------------

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/35619

> Report accurate shuffle block size if it is skewed
> --------------------------------------------------
>
>                 Key: SPARK-36967
>                 URL: https://issues.apache.org/jira/browse/SPARK-36967
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Wan Kun
>            Assignee: Wan Kun
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: map_status.png, map_status2.png
>
>
> Currently, a map task reports an accurate shuffle block size only when the
> block size is greater than "spark.shuffle.accurateBlockThreshold" (100 MB by
> default). If there are a large number of map tasks and each task's shuffle
> blocks are smaller than "spark.shuffle.accurateBlockThreshold", data skew can
> go unrecognized.
> For example, suppose there are 10000 map tasks and 10000 reduce tasks, and
> each map task writes a 50 MB shuffle block for reducer 0 and 10 KB shuffle
> blocks for the remaining reducers. Reducer 0 is skewed, but the statistics of
> this plan do not show it.
> !map_status2.png!
> I think we need to judge at runtime whether a shuffle block is huge and needs
> to be reported accurately.
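To make the hidden skew concrete, here is a minimal, self-contained Scala sketch of the averaging behaviour described above. It is a simplified model of how map statuses summarize small blocks, not Spark's actual HighlyCompressedMapStatus code; the object and variable names are illustrative, and the sizes follow the example in the description.

{code:scala}
// Simplified model: blocks at or above spark.shuffle.accurateBlockThreshold
// are recorded exactly; everything else is summarized by one average size.
object SkewReportSketch {
  def main(args: Array[String]): Unit = {
    val numMaps = 10000
    val numReducers = 10000
    val accurateBlockThreshold = 100L * 1024 * 1024 // default: 100 MB

    // Per-map block sizes: 50 MB for reducer 0, 10 KB for every other reducer.
    val blockSizes = Array.tabulate(numReducers) { r =>
      if (r == 0) 50L * 1024 * 1024 else 10L * 1024
    }

    // Blocks below the threshold are all reported as the average small size.
    val small = blockSizes.filter(_ < accurateBlockThreshold)
    val avgSmall = small.sum / small.length // ~15 KB here

    // The 50 MB block is below the 100 MB threshold, so it is reported
    // as the ~15 KB average rather than its real size.
    val reportedBlock0 =
      if (blockSizes(0) >= accurateBlockThreshold) blockSizes(0) else avgSmall

    val actualReducer0Bytes = blockSizes(0) * numMaps    // ~524 GB in total
    val reportedReducer0Bytes = reportedBlock0 * numMaps // ~0.15 GB in total

    println(f"actual size of reducer 0:   ${actualReducer0Bytes / 1e9}%.1f GB")
    println(f"reported size of reducer 0: ${reportedReducer0Bytes / 1e9}%.1f GB")
  }
}
{code}

Under these assumptions the sketch prints roughly 524.3 GB actual versus 0.2 GB reported for reducer 0, which is why the plan statistics cannot see the skewed partition even though every map task contributes a 50 MB block to it.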