GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/17680

    [SPARK-20364][SQL] Support Parquet predicate pushdown on columns with dots

    ## What changes were proposed in this pull request?
    
    Currently, if there are dots in a column name, predicate pushdown fails for 
Parquet.
    
    **With dots**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("col.dots").write.parquet(path)
    spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    +--------+
    ```
    
    **Without dots**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("coldots").write.parquet(path)
    spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
    ```
    
    ```
    +-------+
    |coldots|
    +-------+
    |      1|
    +-------+
    ```
    
    It seems `FilterApi` splits the column name on dots, producing a 
`ColumnPath` with multiple path segments, whereas the actual column name is 
`col.dots`. (See 
[FilterApi.java#L71](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71), 
which calls 
[ColumnPath.java#L44](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44).)
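    
    A quick way to see the split (a minimal sketch against the parquet-mr 1.8.x 
API; the comments show the output I would expect):
    
    ```scala
    import org.apache.parquet.filter2.predicate.FilterApi
    import org.apache.parquet.hadoop.metadata.ColumnPath
    
    // FilterApi.intColumn goes through ColumnPath.fromDotString, so the name
    // is split on the dot into two path segments.
    val viaFilterApi = FilterApi.intColumn("col.dots").getColumnPath
    println(viaFilterApi.toArray.mkString("[", ", ", "]"))   // [col, dots]
    
    // ColumnPath.get keeps the whole string as a single field name.
    val singleSegment = ColumnPath.get("col.dots")
    println(singleSegment.toArray.mkString("[", ", ", "]"))  // [col.dots]
    ```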
    
    I tried to come up with ways to resolve this and found two options, as 
below:
    
     - One is simply to not push down filters when there are dots in column 
names, so that Parquet reads everything and the filter is applied on the 
Spark side.
    
     - The other is to create a Spark-side equivalent of `FilterApi` for those 
columns (the Parquet one appears to be final) so that a single column path is 
always used on the Spark side (this seems hacky), which is acceptable because 
we are not pushing down nested columns currently. That way, the field name can 
be built via `ColumnPath.get` rather than `ColumnPath.fromDotString`. A rough 
sketch of this idea is shown below.
    
    This PR proposes the latter way, because I think we need to be sure that it 
passes the existing tests.
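    
    For illustration, here is a rough sketch of the latter approach (the helper 
name `SingleFieldColumns` and its placement are hypothetical, not the exact 
code in this PR; it assumes the package-private 
`Operators.IntColumn(ColumnPath)` constructor in parquet-mr 1.8.x):
    
    ```scala
    // Hypothetical sketch only. Placing the helper in parquet's package lets
    // it reach the package-private Operators.IntColumn constructor.
    package org.apache.parquet.filter2.predicate
    
    import org.apache.parquet.hadoop.metadata.ColumnPath
    
    object SingleFieldColumns {
      // Unlike FilterApi.intColumn, this keeps the whole name as one path
      // segment, so "col.dots" is not split into ["col", "dots"].
      def intColumn(name: String): Operators.IntColumn =
        new Operators.IntColumn(ColumnPath.get(name))
    }
    ```
    
    Spark's `ParquetFilters` could then build pushdown predicates with 
`SingleFieldColumns.intColumn("col.dots")` instead of 
`FilterApi.intColumn("col.dots")`.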
    
    **After**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("col.dots").write.parquet(path)
    spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    |       1|
    +--------+
    ```
    
    
    ## How was this patch tested?
    
    Existing tests should cover this, and some tests were additionally added in 
`ParquetFilterSuite.scala`. I ran the related tests manually, and Jenkins will 
run the full test suite.
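    
    For reference, a minimal sketch of the kind of regression test (not 
necessarily the exact test added), using the usual `ParquetFilterSuite` 
helpers:
    
    ```scala
    test("SPARK-20364: predicate pushdown on columns with dots") {
      withTempPath { path =>
        // Write a column whose name contains a dot and filter on it; with the
        // fix, the pushed-down filter should not drop the matching row.
        Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
        val df = spark.read.parquet(path.getAbsolutePath)
          .where("`col.dots` IS NOT NULL")
        checkAnswer(df, Row(1))
      }
    }
    ```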


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-20364

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17680
    
----
commit 297c70bd47cb859d343345849776bd7c21396078
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-04-19T06:24:15Z

    Parquet predicate pushdown on columns with dots return empty results

----

