GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/14661

    [SPARK-16991][SQL] Fix Outer Join Elimination when Filter's isNotNull 
Constraints Unable to Filter Out All Null-supplying Rows

    ### What changes were proposed in this pull request?
    This PR is to fix an incorrect outer join elimination when filter's 
`isNotNull` constraints is unable to filter out all null-supplying rows. For 
example, `isnotnull(coalesce(b#227, c#238))`.
    
    Users can hit this error when they try to use `using/natural outer join`, 
which is converted to a normal outer join with a `coalesce` expression on the 
`using columns`. For example, 
    ```Scala
        val a = Seq((1, 2), (2, 3)).toDF("a", "b")
        val b = Seq((2, 5), (3, 4)).toDF("a", "c")
        val c = Seq((3, 1)).toDF("a", "d")
        val ab = a.join(b, Seq("a"), "fullouter")
        ab.join(c, "a").explain(true)
    ```
    The dataframe `ab` is doing `using full-outer join`, which is converted to 
a normal outer join with a `coalesce` expression. Constraints inference 
generates a `Filter` with constraints `isnotnull(coalesce(b#227, c#238))`. 
Then, it triggers a wrong outer join elimination and generates a wrong result.
    ```
    Project [a#251, b#227, c#237, d#247]
    +- Join Inner, (a#251 = a#246)
       :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
       :  +- Join FullOuter, (a#226 = a#236)
       :     :- Project [_1#223 AS a#226, _2#224 AS b#227]
       :     :  +- LocalRelation [_1#223, _2#224]
       :     +- Project [_1#233 AS a#236, _2#234 AS c#237]
       :        +- LocalRelation [_1#233, _2#234]
       +- Project [_1#243 AS a#246, _2#244 AS d#247]
          +- LocalRelation [_1#243, _2#244]
    
    == Optimized Logical Plan ==
    Project [a#251, b#227, c#237, d#247]
    +- Join Inner, (a#251 = a#246)
       :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
       :  +- Filter isnotnull(coalesce(a#226, a#236))
       :     +- Join FullOuter, (a#226 = a#236)
       :        :- LocalRelation [a#226, b#227]
       :        +- LocalRelation [a#236, c#237]
       +- LocalRelation [a#246, d#247]
    ```
    
    **A note to the `Committer`**, please also give the credit to 
@dongjoon-hyun who submitted another PR for fixing this issue. 
https://github.com/apache/spark/pull/14580
    
    
    ### How was this patch tested?
    Added test cases

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark fixOuterJoinElimination

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14661.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14661
    
----
commit a537dd07823339e63a10e6628f333f44fa528b54
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-08-16T07:17:09Z

    fix.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to