[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

ajacques Thu, 26 Jul 2018 17:29:39 -0700

GitHub user ajacques opened a pull request:

    https://github.com/apache/spark/pull/21889


    [SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt)

    (Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)
    
    **This is a restart of apache/spark#21320. Most of the discussion has taken 
place over there. I've only taken it as a based to make stylistic changes based 
on the code review to help move things along.**
    
    Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this 
PR to collect any feedback. This can be closed if you prefer to continue to the 
work in the original PR.
    
    ### What changes were proposed in this pull request?
    One of the hallmarks of a column-oriented data storage format is the 
ability to read data from a subset of columns, efficiently skipping reads from 
other columns. Spark has long had support for pruning unneeded top-level schema 
fields from the scan of a parquet file. For example, consider a table, 
contacts, backed by parquet with the following Spark SQL schema:
    
    ```
    root
     |-- name: struct
     |    |-- first: string
     |    |-- last: string
     |-- address: string
    ```
    Parquet stores this table's data in three physical columns: name.first, 
name.last and address. To answer the query
    
    `select address from contacts`
    Spark will read only from the address column of parquet data. However, to 
answer the query
    
    `select name.first from contacts`
    Spark will read name.first and name.last from parquet.
    
    This PR modifies Spark SQL to support a finer-grain of schema pruning. With 
this patch, Spark reads only the name.first column to answer the previous query.
    
    ### Implementation
    There are two main components of this patch. First, there is a 
ParquetSchemaPruning optimizer rule for gathering the required schema fields of 
a PhysicalOperation over a parquet file, constructing a new schema based on 
those required fields and rewriting the plan in terms of that pruned schema. 
The pruned schema fields are pushed down to the parquet requested read schema. 
ParquetSchemaPruning uses a new ProjectionOverSchema extractor for rewriting a 
catalyst expression in terms of a pruned schema.
    
    Second, the ParquetRowConverter has been patched to ensure the ordinals of 
the parquet columns read are correct for the pruned schema. ParquetReadSupport 
has been patched to address a compatibility mismatch between Spark's built in 
vectorized reader and the parquet-mr library's reader.
    
    ### Limitation
    Among the complex Spark SQL data types, this patch supports parquet column 
pruning of nested sequences of struct fields only.
    
    ### How was this patch tested?
    Care has been taken to ensure correctness and prevent regressions. A more 
advanced version of this patch incorporating optimizations for rewriting 
queries involving aggregations and joins has been running on a production Spark 
cluster at VideoAmp for several years. In that time, one bug was found and 
fixed early on, and we added a regression test for that bug.
    
    We forward-ported this patch to Spark master in June 2016 and have been 
running this patch against Spark 2.x branches on ad-hoc clusters since then.
    
    ### Known Issues
    Highlighted in 
https://github.com/apache/spark/pull/21320#issuecomment-408271470

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajacques/apache-spark 
spark-4502-parquet_column_pruning-foundation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21889.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21889
    
----
commit 86c572a26918e6380762508d1d868fd5ea231ed1
Author: Michael Allman <michael@...>
Date:   2016-06-24T17:21:24Z

    [SPARK-4502][SQL] Parquet nested column pruning

commit 1ffbc6c8f8eaa27b50048e2124afdd2ff9e1e2b4
Author: Michael Allman <msa@...>
Date:   2018-06-04T09:59:34Z

    Refactor SelectedFieldSuite to make its tests simpler and more
    comprehensible

commit 97cd30b221fe78a4ea61c6c24be370fc2dfdd498
Author: Michael Allman <msa@...>
Date:   2018-06-04T10:02:28Z

    Remove test "select function over nested data" of unknown origin and
    purpose

commit c27e87924ed788a4253d077691401d0852306406
Author: Michael Allman <msa@...>
Date:   2018-06-04T10:45:28Z

    Improve readability of ParquetSchemaPruning and
    ParquetSchemaPruningSuite. Add test to exercise whether the
    requested root fields in a query exclude any attributes

commit 2fc1173499e1fc4ca4205db80e78ca6f110fb5dd
Author: Michael Allman <msa@...>
Date:   2018-06-04T12:04:23Z

    Don't handle non-data-field partition column names specially when
    constructing the new projections in ParquetSchemaPruning. These
    column projections are left unchanged by the transformDown function when
    the projectionOverSchema pattern does not match

commit 40dae02ec31e4817d6afe9e738a20e3c6e6ddd03
Author: Michael Allman <msa@...>
Date:   2018-06-04T12:07:35Z

    Add test coverage for ParquetSchemaPruning for partitioned tables whose
    data schema includes the partition column

commit f4e11770ffe006baaec136ea77bc2cba38c8aad2
Author: Michael Allman <msa@...>
Date:   2018-06-12T00:22:40Z

    Remove the ColumnarFileFormat type to put it in another PR

commit 9749d8cf644a35f28621b12920be6de912b76763
Author: Michael Allman <msa@...>
Date:   2018-06-12T02:34:19Z

    Add test coverage for the enhancements to "is not null" constraint
    inference for complex type extractors

commit 3421041b2da13c269f8811e78719e2d8eaaf6628
Author: Michael Allman <msa@...>
Date:   2018-06-24T19:49:33Z

    Revert changes to QueryPlanConstraints.scala and 
basicPhysicalOperators.scala. Remove related unit tests. This change removes 
support for schema pruning on filters involving attributes of complex type

commit 4ddd78a159300cbb19afb7d12c63279f17f9a32b
Author: Michael Allman <msa@...>
Date:   2018-06-24T19:55:49Z

    Revert a whitespace change in DataSourceScanExec.scala

commit 9fe4208091950e6f7af4b4ebaca6de0b720015f9
Author: Michael Allman <msa@...>
Date:   2018-07-21T09:41:11Z

    Remove modifications to ParquetFileFormat.scala and
    ParquetReadSupport.scala, marking broken tests in
    ParquetSchemaPruningSuite.scala as "ignored"

commit 9a1f80817abfa03430594ef3a9447f030f885bd5
Author: Michael Allman <msa@...>
Date:   2018-07-21T21:47:45Z

    PR review: simplify some syntax and add a code doc

commit 7ee616076f93d6cfd55b6646314f3d4a6d1530d3
Author: Adam Jacques <adam@...>
Date:   2018-07-26T01:14:35Z

    Refactor code slightly based on code review comments

commit e1836d13cfd285a385e8b65f66adb7574d121429
Author: Adam Jacques <adam@...>
Date:   2018-07-26T01:49:07Z

    Minor refactorings for code review

commit cd6ae2c7457b646df73fea04c33d38581a63952f
Author: Adam Jacques <adam@...>
Date:   2018-07-26T05:11:33Z

    Refactorings for code reviews

commit 4b847acd8279d8c40115638faac30a4bc1736307
Author: Adam Jacques <adam@...>
Date:   2018-07-27T00:27:22Z

    Mark as private

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

Reply via email to