[ 
https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12602:
----------------------------
    Description: 
If applicable, we can push Inner Join through Outer Join. 

The reordering can reduce the number of processed rows since the `Inner Join` 
always can generate less rows than `Left/Right Outer Join`. Thus, it can 
improve the query performance.

For example, given the following eligible query:
{code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
$"b.int", "inner"){code}

Before the fix, the logical plan is like
{code}
Join Inner, Some((int#15 = int#9))
:- Join RightOuter, Some((int#3 = int#9))
:  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
:  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
+- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}
After the fix, the logical plan should be like
{code}
Join RightOuter, Some((int#3 = int#9))
:- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
+- Join Inner, Some((int#15 = int#9))
   :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
   +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}

  was:
If applicable, we can push Inner Join through Outer Join. This can reduce the 
number of rows since the Inner Join always can generate less rows than Outer 
Join. 

This PR is to push `Inner Join` through `Left/Right Outer Join`. The reordering 
can reduce the number of processed rows since the `Inner Join` always can 
generate less rows than `Left/Right Outer Join`. 

This PR can improve the query performance, if applicable.

For example, given the following eligible query:
{code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
$"b.int", "inner"){code}

Before the fix, the logical plan is like
{code}
Join Inner, Some((int#15 = int#9))
:- Join RightOuter, Some((int#3 = int#9))
:  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
:  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
+- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}
After the fix, the logical plan should be like
{code}
Join RightOuter, Some((int#3 = int#9))
:- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
+- Join Inner, Some((int#15 = int#9))
   :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
   +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}


> Join Reordering: Pushing Inner Join Through Outer Join
> ------------------------------------------------------
>
>                 Key: SPARK-12602
>                 URL: https://issues.apache.org/jira/browse/SPARK-12602
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Priority: Critical
>
> If applicable, we can push Inner Join through Outer Join. 
> The reordering can reduce the number of processed rows since the `Inner Join` 
> always can generate less rows than `Left/Right Outer Join`. Thus, it can 
> improve the query performance.
> For example, given the following eligible query:
> {code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
> $"b.int", "inner"){code}
> Before the fix, the logical plan is like
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> :  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> :  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}
> After the fix, the logical plan should be like
> {code}
> Join RightOuter, Some((int#3 = int#9))
> :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> +- Join Inner, Some((int#15 = int#9))
>    :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
>    +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to