Wei Xue created SPARK-30751:
-------------------------------

             Summary: Combine the skewed readers into one in AQE skew join 
optimizations
                 Key: SPARK-30751
                 URL: https://issues.apache.org/jira/browse/SPARK-30751
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Wei Xue


Assume we have N partitions based on the original join keys, and for a specific 
partition id Pi (i = 1 to N), we slice the left partition into L(i) 
sub-partitions (L = 1 if no skew; L > 1 if skewed), the right partition into 
M(i) sub-partitions (M = 1 if no skew; M > 1 if skewed). With the current 
approach, we’ll end up with a sum of L(i) * M(i) (i = 1 to N where L(i) > 1 or 
M(i) > 1) plus one joins. *This can be a serious performance concern as the 
size of the query plan now depends on the number and size of skewed partitions.*

Now instead of generating so many joins we can create a “repeated” reader for 
either side of the join so that:
 # for the left side, with each partition id Pi and any given slice Sj in Pi (j 
= 1 to L(i)), it generates M(i) repeated partitions with respective join keys 
as PiSjT1, PiSjT2, …, PiSjTm

 # for the right side, with each partition id Pi and any given slice Tk in Pi 
(k = 1 to M(i)), it generates L(i) repeated partitions with respective join 
keys as PiS1Tk, PiS2Tk, …, PiSlTk

That way, we can have one SMJ for all the partitions and only one type of 
special reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to