[ https://issues.apache.org/jira/browse/SPARK-30751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hövell resolved SPARK-30751.
---------------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

> Combine the skewed readers into one in AQE skew join optimizations
> ------------------------------------------------------------------
>
>                 Key: SPARK-30751
>                 URL: https://issues.apache.org/jira/browse/SPARK-30751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wei Xue
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Assume we have N partitions based on the original join keys. For a specific 
> partition id {{Pi}} (i = 1 to N), we slice the left partition into {{Li}} 
> sub-partitions ({{Li}} = 1 if not skewed; {{Li}} > 1 if skewed) and the right 
> partition into {{Mi}} sub-partitions ({{Mi}} = 1 if not skewed; {{Mi}} > 1 if 
> skewed). With the current approach, we end up with the sum of {{Li}} * {{Mi}} 
> (over all i where {{Li}} > 1 or {{Mi}} > 1) joins, plus one join for the 
> remaining non-skewed partitions. *This can be a serious performance concern, 
> as the size of the query plan now depends on the number and size of the 
> skewed partitions.*
> Instead of generating so many joins, we can create a “repeated” reader for 
> each side of the join so that:
>  # for the left side, with each partition id {{Pi}} and any given slice 
> {{Sj}} in {{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with 
> respective join keys {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTMi}};
>  # for the right side, with each partition id {{Pi}} and any given slice 
> {{Tk}} in {{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with 
> respective join keys {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSLiTk}}.
> That way, we can have a single sort-merge join (SMJ) covering all the 
> partitions, and only one type of special reader.
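
To make the pairing concrete, here is a minimal, hypothetical Scala sketch. The names and structure are illustrative assumptions, not Spark's actual AQE reader implementation; it only shows how the two repeated readers could align the left slices {{Sj}} and right slices {{Tk}} of one skewed partition so that a single sort-merge join covers every slice pair:

{code:scala}
// Hypothetical sketch (not Spark's AQE reader API): align the slices of one
// skewed partition Pi so that one sort-merge join covers every (Sj, Tk) pair.
object SkewPairingSketch extends App {
  final case class Slice(partitionId: Int, sliceId: Int)

  // Left reader: each left slice Sj (j = 0 until Li) is repeated Mi times.
  def leftReaderSlices(pid: Int, li: Int, mi: Int): Seq[Slice] =
    for (j <- 0 until li; _ <- 0 until mi) yield Slice(pid, j)

  // Right reader: the right slices T0..T(Mi-1) are cycled once per left slice.
  def rightReaderSlices(pid: Int, li: Int, mi: Int): Seq[Slice] =
    for (_ <- 0 until li; k <- 0 until mi) yield Slice(pid, k)

  // Example: Li = 2, Mi = 3 gives Li * Mi = 6 aligned slice pairs -- the same
  // work the current approach would spread over 6 separate joins for this
  // partition.
  val left  = leftReaderSlices(pid = 7, li = 2, mi = 3)
  val right = rightReaderSlices(pid = 7, li = 2, mi = 3)
  left.zip(right).foreach { case (s, t) =>
    println(s"P${s.partitionId}: join left slice S${s.sliceId} with right slice T${t.sliceId}")
  }
  // Prints the pairs S0-T0, S0-T1, S0-T2, S1-T0, S1-T1, S1-T2.
}
{code}

In this sketch, the {{Li}} * {{Mi}} slice pairs of a skewed partition become extra input partitions of the one join rather than {{Li}} * {{Mi}} separate joins in the plan, which is the point of the proposed "repeated" reader.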



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
