GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/16067

    [SPARK-17897] [SQL] Fixed IsNotNull Inference Rule

    ### What changes were proposed in this pull request?
    The `constraints` of an operator are the expressions that evaluate to `true` 
for all the rows it produces. That is, a constraint's result is neither `false` 
nor `unknown` (NULL). Thus, we can infer `IsNotNull` for every constraint, 
whether it is generated by the operator's own predicates or propagated from its 
children. A constraint can be a complex expression, so to make these 
constraints more useful we try to push `IsNotNull` down to the lowest-level 
expressions (i.e., `Attribute`s). `IsNotNull` can be pushed through an 
expression when the expression is null-intolerant: a null-intolerant expression 
always evaluates to NULL when any of its inputs is NULL. For example, from the 
constraint `a + 1 > 0` we can infer `isnotnull(a)`, because both `+` and `>` 
are null-intolerant.
    
    Below is the code we have for `IsNotNull` pushdown.
    ```Scala
      private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
        case a: Attribute => Seq(a)
        case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
          expr.children.flatMap(scanNullIntolerantExpr)
        case _ => Seq.empty[Attribute]
      }
    ```
    
    **`IsNotNull` itself is not null-intolerant.** It converts a NULL input to 
`false`, not to NULL. If the constraint contains no `Not`-like expression, the 
function above happens to work; otherwise, it can generate a wrong result. This 
PR fixes the function by removing `IsNotNull` from the null-intolerant cases. 
After the fix, when a constraint contains `IsNotNull`, we infer attribute-level 
`IsNotNull` only when `IsNotNull` appears at the root of the constraint.
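    A minimal sketch of the corrected logic (the actual patch may differ in 
details; the helper name `inferIsNotNullAttributes` is illustrative, not 
necessarily the name used in the patch): `IsNotNull` is unwrapped only at the 
root of a constraint, and the recursion then descends only through 
null-intolerant expressions.
    ```Scala
      // Sketch only: unwrap IsNotNull once, at the root of the constraint,
      // before running the null-intolerant scan.
      private def inferIsNotNullAttributes(constraint: Expression): Seq[Attribute] =
        constraint match {
          case IsNotNull(child) => scanNullIntolerantExpr(child)
          case _ => scanNullIntolerantExpr(constraint)
        }

      private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
        case a: Attribute => Seq(a)
        // IsNotNull is no longer treated as null-intolerant here:
        // it maps a NULL input to false, not to NULL.
        case _: NullIntolerant => expr.children.flatMap(scanNullIntolerantExpr)
        case _ => Seq.empty[Attribute]
      }
    ```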
    
    Without the fix, the following test case returns an empty result.
    ```Scala
    val data = Seq[java.lang.Integer](1, null).toDF("key")
    data.filter("not key is not null").show()
    ```
    Before the fix, the optimized plan is:
    ```
    == Optimized Logical Plan ==
    Project [value#1 AS key#3]
    +- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
       +- LocalRelation [value#1]
    ```
    
    After the fix, the optimized plan is:
    ```
    == Optimized Logical Plan ==
    Project [value#1 AS key#3]
    +- Filter NOT isnotnull(value#1)
       +- LocalRelation [value#1]
    ```
    
    ### How was this patch tested?
    Added a test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark isNotNull2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16067
    
----
commit 33c10a0994c9802df901f211e1f28c52e34df27f
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-11-29T08:00:55Z

    fix.

commit 025632a6897abd4901254688a049079ed7358e93
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-11-29T21:26:02Z

    fix.

----

