GitHub user xwu0226 opened a pull request:

    https://github.com/apache/spark/pull/16156

    [SPARK-18539][SQL]: tolerate pushed-down filter on non-existing parquet columns

    ## What changes were proposed in this pull request?
    
    When `spark.sql.parquet.filterPushdown` is on, Spark SQL's Parquet reader
    pushes filters down to the Parquet file when creating the reader, so that it
    starts with already-filtered blocks. However, when the Parquet file does not
    contain the predicate column(s), parquet-mr throws an exception complaining
    that the filter column does not exist. This issue will be fixed in parquet-mr
    1.9, but Spark 2.1 is still on Parquet 1.8.
    
    This PR tolerates the exception thrown by parquet-mr and simply returns all
    the blocks of the current Parquet file to the Spark SQL Parquet reader being
    created. The filters are applied again anyway by a later physical plan
    operator, as the `Filter` node in the following example physical plan shows:
    
    ```
    == Physical Plan ==
    *Project [a#2805, b#2806, c#2807]
    +- *Filter ((isnotnull(a#2805) && isnull(c#2807)) && (a#2805 < 2))
      +- *FileScan parquet [a#2805,b#2806,c#2807] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/Users/xinwu/spark/target/tmp/spark-ed6f0c12-6494-4ac5-b485-5b986ef475cc], PartitionFilters: [], PushedFilters: [IsNotNull(a), IsNull(c), LessThan(a,2)], ReadSchema: struct<a:int,b:string,c:int>
    ```
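
As a rough illustration of the fallback described above (this is not the code in the patch), the idea can be modeled in a few lines of plain Python. Every name here (`Block`, `filter_blocks`, `blocks_for_reader`, `apply_filter`) is hypothetical, and `ValueError` is only a stand-in for whatever exception the Parquet library raises. The block-level filter is also simplified: it inspects rows directly, whereas a real reader would consult row-group metadata.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

Row = Dict[str, Optional[int]]
Predicate = Callable[[Optional[int]], bool]


@dataclass
class Block:
    """Hypothetical stand-in for a Parquet row group and the rows it holds."""
    rows: List[Row]


def filter_blocks(blocks: List[Block], file_columns: Set[str],
                  column: str, predicate: Predicate) -> List[Block]:
    """Strict pushed-down filter: refuses columns the file does not have."""
    if column not in file_columns:
        # Assumption: the underlying library reports this case by raising.
        raise ValueError(f"Column [{column}] was not found in schema!")
    # Keep a block only if at least one of its rows could match.
    return [b for b in blocks if any(predicate(r[column]) for r in b.rows)]


def blocks_for_reader(blocks: List[Block], file_columns: Set[str],
                      column: str, predicate: Predicate) -> List[Block]:
    """Tolerate the failure and hand every block to the reader instead."""
    try:
        return filter_blocks(blocks, file_columns, column, predicate)
    except ValueError:
        return list(blocks)


def apply_filter(blocks: List[Block], column: str,
                 predicate: Predicate) -> List[Row]:
    """Row-level filter applied afterwards, like the Filter plan operator."""
    # A column that is absent from the file is read as None (null).
    return [r for b in blocks for r in b.rows if predicate(r.get(column))]
```

Because the row-level filter runs regardless, skipping the block-level pruning in this model only costs the optimization: the rows that come out of `apply_filter` are the same whether or not the pushed-down filter succeeded.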
    
    ## How was this patch tested?
    A unit test case is added. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xwu0226/spark SPARK-18539

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16156.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16156
    
----
commit 20401598b7661cf9bd18f2b4dbd977c7a5b76832
Author: Xin Wu <xi...@us.ibm.com>
Date:   2016-12-05T21:18:12Z

    SPARK-18539: fix filtered on non-existing parquet column

commit 096ab18887c40761eb7ba79e9c406fe8ca6ce7c0
Author: Xin Wu <xi...@us.ibm.com>
Date:   2016-12-05T21:24:13Z

    SPARK-18539: update testcases

----


