[ https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin closed SPARK-12602.
-------------------------------
    Resolution: Later

Closing as later. Will revisit when we have more review bandwidth.

> Join Reordering: Pushing Inner Join Through Outer Join
> ------------------------------------------------------
>
>                 Key: SPARK-12602
>                 URL: https://issues.apache.org/jira/browse/SPARK-12602
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Priority: Critical
>
> Where applicable, we can push an Inner Join through an Outer Join. The basic
> idea is built on the associativity property of outer and inner joins:
> {code}
> R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
> R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23
>     = (R1 inner R3 on p13) left R2 on p23
> (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
> (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12
>     = (R2 inner R3 on p23) left R1 on p12
> {code}
> The reordering can reduce the number of processed rows, since an Inner Join
> never produces more rows than a Left/Right Outer Join over the same inputs
> and join condition. This change can improve query performance in most cases.
>
> When cost-based optimization is available, we can switch the order of tables
> within each join type based on their costs. The order of the joined tables in
> an inner join does not affect the results, and a right outer join can be
> rewritten as a left outer join. This part is out of scope here.
>
> For example, given the following eligible query:
> {code}
> df.join(df2, $"a.int" === $"b.int", "right")
>   .join(df3, $"c.int" === $"b.int", "inner")
> {code}
> Before the fix, the logical plan is:
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> :  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> :  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}
> After the fix, the logical plan is:
> {code}
> Join LeftOuter, Some((int#3 = int#9))
> :- Join Inner, Some((int#15 = int#9))
> :  :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> :  +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> +- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> {code}
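
The two plans in the example are an instance of the identity
(R1 right R2 on p12) inner R3 on p23 = (R2 inner R3 on p23) left R1 on p12.
As a rough sanity check, the sketch below evaluates both shapes on the sample rows using plain Scala collections rather than Spark. The Row case class, the JoinReorderSketch object, and the innerJoin/leftJoin/rightJoin helpers are hypothetical names introduced only for illustration; treating the str column as a String is also an assumption. It checks one data set and is not a proof of the rule.

```scala
// Plain Scala collections only; this does not use or depend on Spark.
object JoinReorderSketch {
  // Hypothetical row type mirroring the (int, int2, str) columns in the plans.
  final case class Row(int: Int, int2: Int, str: String)

  // Row data copied from the LocalRelation nodes in the plans.
  val r1 = Seq(Row(1, 2, "1"), Row(3, 4, "3")) // int#3
  val r2 = Seq(Row(1, 3, "1"), Row(5, 6, "5")) // int#9
  val r3 = Seq(Row(1, 9, "8"), Row(5, 0, "4")) // int#15

  // Inner join: keep only the pairs that satisfy the predicate.
  def innerJoin[A, B](left: Seq[A], right: Seq[B])(p: (A, B) => Boolean): Seq[(A, B)] =
    for (a <- left; b <- right if p(a, b)) yield (a, b)

  // Left outer join: every left row survives; an unmatched row is padded with None.
  def leftJoin[A, B](left: Seq[A], right: Seq[B])(p: (A, B) => Boolean): Seq[(A, Option[B])] =
    left.flatMap { a =>
      val matches = right.filter(b => p(a, b))
      if (matches.isEmpty) Seq((a, Option.empty[B]))
      else matches.map(b => (a, Option(b)))
    }

  // Right outer join, expressed as a left outer join with the sides swapped.
  def rightJoin[A, B](left: Seq[A], right: Seq[B])(p: (A, B) => Boolean): Seq[(Option[A], B)] =
    leftJoin(right, left)((b, a) => p(a, b)).map { case (b, a) => (a, b) }

  // Before: (R1 right R2 on R1.int = R2.int) inner R3 on R3.int = R2.int
  val before = {
    val r1RightR2 = rightJoin(r1, r2)((a, b) => a.int == b.int)
    innerJoin(r1RightR2, r3)((ab, c) => c.int == ab._2.int)
      .map { case ((a, b), c) => (a, b, c) }
  }

  // After: (R2 inner R3 on R3.int = R2.int) left R1 on R1.int = R2.int
  val after = {
    val r2InnerR3 = innerJoin(r2, r3)((b, c) => c.int == b.int)
    leftJoin(r2InnerR3, r1)((bc, a) => a.int == bc._1.int)
      .map { case ((b, c), a) => (a, b, c) }
  }

  def main(args: Array[String]): Unit = {
    // Compare as sets (plus size) so that row order does not matter.
    println(before.size == after.size && before.toSet == after.toSet) // prints true
  }
}
```

On this data both shapes yield two rows: one where all three relations match on int = 1, and one for int = 5 where the R1 side is padded with None.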