GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/16184

    [SPARK-18753][SQL] Keep pushed-down null literal as a filter in Spark-side post-filter for FileFormat datasources

    ## What changes were proposed in this pull request?
    
    Currently, `FileSourceStrategy` does not handle the case where the pushed-down filter is `Literal(null)`, and drops it from the Spark-side post-filter.
    
    For example, the code below:
    
    ```scala
    val ds = Seq(Tuple1(Some(true)), Tuple1(None), Tuple1(Some(false))).toDS()
    ds.filter($"_1" === "true").explain(true)
    ```
    
    shows that the `null` is kept properly.
    
    ```
    == Parsed Logical Plan ==
    'Filter ('_1 = true)
    +- LocalRelation [_1#17]
    
    == Analyzed Logical Plan ==
    _1: boolean
    Filter (cast(_1#17 as double) = cast(true as double))
    +- LocalRelation [_1#17]
    
    == Optimized Logical Plan ==
    Filter (isnotnull(_1#17) && null)
    +- LocalRelation [_1#17]
    
    == Physical Plan ==
    *Filter (isnotnull(_1#17) && null)
    +- LocalTableScan [_1#17]
    ```
    
    However, when we write it out and read it back from Parquet,
    
    ```scala
    ds.write.parquet(path)
    spark.read.parquet(path).filter($"_1" === "true").explain(true)
    ```
    
    the `null` is removed from the post-filter.
    
    ```
    == Parsed Logical Plan ==
    'Filter ('_1 = true)
    +- Relation[_1#11] parquet
    
    == Analyzed Logical Plan ==
    _1: boolean
    Filter (cast(_1#11 as double) = cast(true as double))
    +- Relation[_1#11] parquet
    
    == Optimized Logical Plan ==
    Filter (isnotnull(_1#11) && null)
    +- Relation[_1#11] parquet
    
    == Physical Plan ==
    *Project [_1#11]
    +- *Filter isnotnull(_1#11)
       +- *FileScan parquet [_1#11] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/tmp/testfile], PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
    ```
    
    This PR fixes this so that the `null` literal is kept properly. In more detail, in
    
    ```scala
    val partitionKeyFilters =
      ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
    ```
    
    This keeps the `null` in `partitionKeyFilters`, because a `Literal` has no `children`, so its `references` set is empty, and the empty set is always a subset of `partitionSet`.
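    
    The classification above can be sketched as below. This is a minimal model, not Spark's actual classes: `Filt` is a hypothetical stand-in for a Catalyst expression that keeps only the set of column names it references, and the partition columns are made up for illustration.
    
    ```scala
    // Hypothetical stand-in for a Catalyst expression: just a name plus the
    // set of columns it references.
    case class Filt(name: String, references: Set[String])
    
    val partitionSet = Set("year", "month") // hypothetical partition columns
    val normalizedFilters = Seq(
      Filt("isnotnull(_1)", Set("_1")),
      Filt("null", Set.empty[String]) // Literal(null): no children, so no references
    )
    
    // The subset test: an empty reference set is trivially a subset of
    // partitionSet, so the null literal is always classified as a
    // partition-key filter.
    val partitionKeyFilters =
      normalizedFilters.filter(_.references.subsetOf(partitionSet)).toSet
    ```
    
    Here `partitionKeyFilters` ends up containing only the reference-free `null` filter, even though it has nothing to do with the partition columns.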
    
    And then in 
    
    ```scala
    val afterScanFilters = filterSet -- partitionKeyFilters
    ```
    
    the `null` is always removed from the post-filter, since it was classified as a partition-key filter above. So, when a filter references no fields, it should be applied to both the partition columns and the data columns.
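    
    Continuing the minimal model from above (again not Spark's actual code, with a made-up partition column), the intended behavior can be sketched as:
    
    ```scala
    // Hypothetical stand-in for a Catalyst expression.
    case class Filt(name: String, references: Set[String])
    
    val partitionSet = Set("year") // hypothetical partition column
    val filterSet = Set(
      Filt("isnotnull(_1)", Set("_1")),
      Filt("null", Set.empty[String]) // Literal(null): no references
    )
    val partitionKeyFilters = filterSet.filter(_.references.subsetOf(partitionSet))
    
    // Before the fix: `filterSet -- partitionKeyFilters` also subtracts the
    // reference-free null literal, silently dropping it from the post-filter.
    // Sketch of the fix: only subtract filters that actually reference a
    // partition column, so reference-free filters survive on the data side too.
    val afterScanFilters =
      filterSet -- partitionKeyFilters.filter(_.references.nonEmpty)
    ```
    
    With this split, the `null` literal stays in `afterScanFilters` and is still evaluated as a Spark-side post-filter.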
    
    After this PR, the plans become as below:
    
    ```
    == Parsed Logical Plan ==
    'Filter ('_1 = true)
    +- Relation[_1#276] parquet
    
    == Analyzed Logical Plan ==
    _1: boolean
    Filter (cast(_1#276 as double) = cast(true as double))
    +- Relation[_1#276] parquet
    
    == Optimized Logical Plan ==
    Filter (isnotnull(_1#276) && null)
    +- Relation[_1#276] parquet
    
    == Physical Plan ==
    *Project [_1#276]
    +- *Filter (isnotnull(_1#276) && null)
       +- *FileScan parquet [_1#276] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-a5d59bdb-5b..., PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
    ```
    
    ## How was this patch tested?
    
    Unit test in `FileSourceStrategySuite`


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-18753

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16184
    
----
commit c6fe34511fc1ea5c36713d435dc64673deceae7f
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-12-07T02:39:26Z

    keep pushed-down null literal as a filter in Spark-side post-filter

----

