[ 
https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12602.
-------------------------------
    Resolution: Later

Closing as later. Will revisit when we have more review bandwidth.


> Join Reordering: Pushing Inner Join Through Outer Join
> ------------------------------------------------------
>
>                 Key: SPARK-12602
>                 URL: https://issues.apache.org/jira/browse/SPARK-12602
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Priority: Critical
>
> If applicable, we can push Inner Join through Outer Join. The basic idea is 
> built on the associativity property of outer and inner joins:
> {code}
> R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
> R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = 
> (R1 inner R3 on p13) left R2 on p23
> (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
> (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = 
> (R2 inner R3 on p23) left R1 on p12
> {code}
> The reordering can reduce the number of processed rows since the Inner Join 
> always can generate less (or equivalent) rows than Left/Right Outer Join. 
> This change can improve the query performance in most cases.
> When cost-based optimization is available, we can switch the order of tables 
> in each join type based on their costs. The order of joined tables in the 
> inner join does not affect the results and the right outer join can be 
> changed to the left outer join. This part is out of scope here.
> For example, given the following eligible query:
> {code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
> $"b.int", "inner"){code}
> Before the fix, the logical plan is like
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> :  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> :  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}
> After the fix, the logical plan is like
> {code}
> Join LeftOuter, Some((int#3 = int#9))
> :- Join Inner, Some((int#15 = int#9))
> :  :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> :  +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> +- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to