[GitHub] spark pull request #21882: [SPARK-24934][SQL] Handle missing upper/lower bou...

HyukjinKwon Thu, 26 Jul 2018 06:15:46 -0700

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/21882


    [SPARK-24934][SQL] Handle missing upper/lower bounds case in in-memory partition pruning

    ## What changes were proposed in this pull request?
    
    It looks like we intentionally set `null` for the upper/lower bounds of complex types and do not use them. However, these bounds are used in in-memory partition pruning, which produces incorrect results.
    
    This PR proposes not to filter on a bound when both bounds are `null`. This can produce false positives (partitions that are scanned unnecessarily), but that is still better than returning incorrect results.
    
    ```scala
    val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol")
    df.cache().filter("arrayCol > array('a', 'b')").show()
    ```
    
    **Before:**
    
    ```
    Predicate isnotnull(arrayCol#3) generates partition filter: ((arrayCol.count#18 - arrayCol.nullCount#17) > 0)
    Predicate (arrayCol#3 > [a,b]) generates partition filter: ([a,b] < arrayCol.upperBound#15)
    Skipping partition based on stats arrayCol.lowerBound: null, arrayCol.upperBound: null, arrayCol.nullCount: 0, arrayCol.count: 1, arrayCol.sizeInBytes: 52
    
    +--------+
    |arrayCol|
    +--------+
    +--------+
    ```
    
    **After:**
    
    ```
    Predicate isnotnull(arrayCol#3) generates partition filter: ((arrayCol.count#18 - arrayCol.nullCount#17) > 0)
    Predicate (arrayCol#3 > [a,b]) generates partition filter: (isnull(arrayCol.upperBound#15) || ([a,b] < arrayCol.upperBound#15))
    
    +--------+
    |arrayCol|
    +--------+
    |  [c, d]|
    +--------+
    ```
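    
    The difference between the two generated filters can be sketched in plain Scala, without Spark. This is only an illustrative model, not the code in this PR: `PruningSketch`, `Stats`, `lessThan`, `or`, `keep`, `filterBefore` and `filterAfter` are hypothetical names, strings stand in for array values, and the sketch assumes that a filter evaluating to `null` causes the partition to be skipped, as the "Skipping partition" log line above suggests.
    
    ```scala
    // Plain-Scala model of the partition filter; no Spark dependency.
    // All names here are hypothetical and exist only for illustration.
    object PruningSketch {
      // Per-partition statistics; `None` models a `null` bound.
      final case class Stats(lowerBound: Option[String], upperBound: Option[String])

      // SQL-style `<`: the result is unknown (None) when the bound is null.
      def lessThan(literal: String, bound: Option[String]): Option[Boolean] =
        bound.map(b => literal < b)

      // SQL-style `||`: true wins; otherwise unknown if either side is unknown.
      def or(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
        (a, b) match {
          case (Some(true), _) | (_, Some(true)) => Some(true)
          case (Some(false), Some(false))        => Some(false)
          case _                                 => None
        }

      // Assumption: a partition is scanned only when the filter is definitely true.
      def keep(result: Option[Boolean]): Boolean = result.contains(true)

      // Before: `literal < upperBound`
      def filterBefore(literal: String, s: Stats): Boolean =
        keep(lessThan(literal, s.upperBound))

      // After: `isnull(upperBound) || (literal < upperBound)`
      def filterAfter(literal: String, s: Stats): Boolean =
        keep(or(Some(s.upperBound.isEmpty), lessThan(literal, s.upperBound)))
    }
    ```
    
    Under these assumptions, `filterBefore` drops a partition whose bounds are missing, while `filterAfter` keeps it and still prunes partitions whose bounds are present and rule out a match.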
    
    ## How was this patch tested?
    
    Unit tests were added, and the change was also tested manually.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark stats-filter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21882.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21882
    
----
commit ea38b56dcdd3010c70fcfbddd4405917278431cd
Author: hyukjinkwon <gurwls223@...>
Date:   2018-07-26T12:34:44Z

    Handle missing upper/lower bounds case in inmemory partition pruning

----


