[ https://issues.apache.org/jira/browse/SPARK-36967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496509#comment-17496509 ]

Apache Spark commented on SPARK-36967:
--------------------------------------

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/35619

> Report accurate shuffle block size if it's skewed
> -------------------------------------------------
>
>                 Key: SPARK-36967
>                 URL: https://issues.apache.org/jira/browse/SPARK-36967
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Wan Kun
>            Assignee: Wan Kun
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: map_status.png, map_status2.png
>
>
> Currently, a map task reports the accurate shuffle block size only if the 
> block size is greater than "spark.shuffle.accurateBlockThreshold" (100M by 
> default). But if there are a large number of map tasks and the shuffle 
> block sizes of these tasks are all smaller than 
> "spark.shuffle.accurateBlockThreshold", data skew may go unrecognized.
> For example, suppose there are 10000 map tasks and 10000 reduce tasks, and 
> each map task creates a 50M shuffle block for reduce 0 and 10K shuffle 
> blocks for the remaining reduce tasks. Reduce 0 is skewed (10000 * 50M is 
> about 500G in total), but the statistics of this plan do not show it, 
> because each individual 50M block is below the threshold and is folded 
> into the average.
>     !map_status2.png!
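> To see why the skew disappears, here is a minimal self-contained sketch in 
> Scala of threshold-based size compression, loosely modeled on the idea 
> behind HighlyCompressedMapStatus. The names here (BlockSizeReport, 
> compress) are hypothetical, not Spark's actual API:
> {code:scala}
> object BlockSizeReport {
>   // Assumed default of spark.shuffle.accurateBlockThreshold: 100M.
>   val accurateBlockThreshold: Long = 100L * 1024 * 1024
>
>   /** Exact sizes for huge blocks plus a single average for everything else. */
>   def compress(blockSizes: Array[Long]): (Map[Int, Long], Long) = {
>     val huge = blockSizes.zipWithIndex.collect {
>       case (size, reduceId) if size >= accurateBlockThreshold => reduceId -> size
>     }.toMap
>     val small = blockSizes.filter(_ < accurateBlockThreshold)
>     val avg = if (small.isEmpty) 0L else small.sum / small.length
>     (huge, avg)
>   }
>
>   def main(args: Array[String]): Unit = {
>     // One map task from the example: 50M for reduce 0, 10K for the rest.
>     val sizes = Array.tabulate(10000)(i => if (i == 0) 50L * 1024 * 1024 else 10L * 1024)
>     val (huge, avg) = compress(sizes)
>     // huge is empty because 50M < 100M, so reduce 0 is reported at the
>     // ~15K average; across 10000 map tasks its ~500G total looks like ~150M.
>     println(s"huge blocks: ${huge.size}, average size: $avg")
>   }
> }
> {code}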
> I think we need to judge at runtime whether a shuffle block is huge 
> relative to the task's other blocks and therefore needs to be reported 
> accurately, as in the sketch below.
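> A hypothetical sketch of that runtime check (an illustration of the idea 
> only, not any actual patch; SkewAwareReport and skewFactor are made-up 
> names): a block is also reported exactly when it is much larger than its 
> sibling blocks, even if it is below the absolute threshold.
> {code:scala}
> object SkewAwareReport {
>   val accurateBlockThreshold: Long = 100L * 1024 * 1024
>   val skewFactor: Double = 5.0  // assumed tuning knob
>
>   def shouldReportExactly(size: Long, allSizes: Array[Long]): Boolean = {
>     val nonEmpty = allSizes.filter(_ > 0)
>     val avg = if (nonEmpty.isEmpty) 0L else nonEmpty.sum / nonEmpty.length
>     // Report exactly if over the absolute threshold, or if this block is
>     // skewFactor times larger than the average block of this map task.
>     size >= accurateBlockThreshold || (avg > 0 && size >= avg * skewFactor)
>   }
> }
> {code}
> With the numbers above, the average block is about 15K, so the 50M blocks 
> for reduce 0 would be recorded exactly and the optimizer could see the 
> ~500G total.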


