[ https://issues.apache.org/jira/browse/SPARK-30751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hövell resolved SPARK-30751.
---------------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

> Combine the skewed readers into one in AQE skew join optimizations
> ------------------------------------------------------------------
>
>                 Key: SPARK-30751
>                 URL: https://issues.apache.org/jira/browse/SPARK-30751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wei Xue
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Assume we have N partitions based on the original join keys. For a specific partition id {{Pi}} (i = 1 to N), we slice the left partition into {{Li}} sub-partitions ({{Li}} = 1 if no skew; {{Li}} > 1 if skewed) and the right partition into {{Mi}} sub-partitions ({{Mi}} = 1 if no skew; {{Mi}} > 1 if skewed). With the current approach, we end up with the sum of {{Li}} * {{Mi}} joins over all i (i = 1 to N where {{Li}} > 1 or {{Mi}} > 1), plus one more join for the rest of the partitions without skew. *This can be a serious performance concern, as the size of the query plan now depends on the number and size of skewed partitions.*
> Instead of generating that many joins, we can create a "repeated" reader for either side of the join so that:
> # for the left side, for each partition id {{Pi}} and any given slice {{Sj}} in {{Pi}} (j = 1 to {{Li}}), it generates {{Mi}} repeated partitions with the respective join keys {{Pi Sj T1}}, {{Pi Sj T2}}, …, {{Pi Sj TMi}}
> # for the right side, for each partition id {{Pi}} and any given slice {{Tk}} in {{Pi}} (k = 1 to {{Mi}}), it generates {{Li}} repeated partitions with the respective join keys {{Pi S1 Tk}}, {{Pi S2 Tk}}, …, {{Pi SLi Tk}}
> That way, we can have one SMJ (sort-merge join) covering all the partitions, and only one type of special reader. A sketch of this slice expansion follows below.
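A minimal Scala sketch of the slice expansion described above. The names here ({{SkewedPartition}}, {{SlicePair}}, {{expandSlicePairs}}) are hypothetical illustrations, not Spark's actual AQE classes; the sketch only shows how the repeated readers enumerate the {{Li}} * {{Mi}} slice combinations per partition that a single join then consumes.

{code:scala}
// Hypothetical names for illustration only; not Spark's real AQE classes.

// For partition Pi: li left slices (li = 1 if not skewed), mi right slices.
case class SkewedPartition(id: Int, li: Int, mi: Int)

// One unit of work for the single sort-merge join:
// partition Pi, left slice Sj, right slice Tk.
case class SlicePair(partitionId: Int, leftSlice: Int, rightSlice: Int)

// Enumerate every (Pi, Sj, Tk) combination. The left reader repeats each
// slice Sj exactly mi times and the right reader repeats each slice Tk
// exactly li times, so the two streams line up pairwise and one join
// covers all li * mi slice combinations of each skewed partition.
def expandSlicePairs(parts: Seq[SkewedPartition]): Seq[SlicePair] =
  for {
    p <- parts
    j <- 1 to p.li // left slices S1..SLi
    k <- 1 to p.mi // right slices T1..TMi
  } yield SlicePair(p.id, j, k)

// Example: P1 is skewed on the left only (3 slices), P2 on both sides
// (2 x 2), and P3 is not skewed at all.
val pairs = expandSlicePairs(Seq(
  SkewedPartition(1, li = 3, mi = 1),
  SkewedPartition(2, li = 2, mi = 2),
  SkewedPartition(3, li = 1, mi = 1)
))
// pairs.size == 3*1 + 2*2 + 1*1 = 8 slice pairs, all fed to one SMJ
// instead of planning a separate join per skewed combination.
{code}

In this toy run the plan size no longer grows with the skew: all eight slice pairs are consumed by the one sort-merge join, which is the point of the change.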