[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

mallman Mon, 14 May 2018 03:00:08 -0700

GitHub user mallman opened a pull request:

    https://github.com/apache/spark/pull/21320


    [SPARK-4502][SQL] Parquet nested column pruning - foundation

    (Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)
    
    _N.B. This is a restart of PR #16578 which includes everything in that PR 
except the aggregation and join schema pruning rules. Relevant review comments 
from that PR should be considered incorporated by reference. Please avoid 
duplication in review by reviewing that PR first. The summary below is an 
edited copy of the summary of the previous PR._
    
    ## What changes were proposed in this pull request?
    
    One of the hallmarks of a column-oriented data storage format is the 
ability to read data from a subset of columns, efficiently skipping reads from 
other columns. Spark has long had support for pruning unneeded top-level schema 
fields from the scan of a parquet file. For example, consider a table, 
`contacts`, backed by parquet with the following Spark SQL schema:
    
    ```
    root
     |-- name: struct
     |    |-- first: string
     |    |-- last: string
     |-- address: string
    ```
    
    Parquet stores this table's data in three physical columns: `name.first`, 
`name.last` and `address`. To answer the query
    
    ```SQL
    select address from contacts
    ```
    
    Spark will read only from the `address` column of parquet data. However, to 
answer the query
    
    ```SQL
    select name.first from contacts
    ```
    
    Spark will read `name.first` and `name.last` from parquet.
    
    This PR modifies Spark SQL to support a finer-grain of schema pruning. With 
this patch, Spark reads only the `name.first` column to answer the previous 
query.
    
    ### Implementation
    
    There are two main components of this patch. First, there is a 
`ParquetSchemaPruning` optimizer rule for gathering the required schema fields 
of a `PhysicalOperation` over a parquet file, constructing a new schema based 
on those required fields and rewriting the plan in terms of that pruned schema. 
The pruned schema fields are pushed down to the parquet requested read schema. 
`ParquetSchemaPruning` uses a new `ProjectionOverSchema` extractor for 
rewriting a catalyst expression in terms of a pruned schema.
    
    Second, the `ParquetRowConverter` has been patched to ensure the ordinals 
of the parquet columns read are correct for the pruned schema. 
`ParquetReadSupport` has been patched to address a compatibility mismatch 
between Spark's built in vectorized reader and the parquet-mr library's reader.
    
    ### Limitation
    
    Among the complex Spark SQL data types, this patch supports parquet column 
pruning of nested sequences of struct fields only.
    
    ## How was this patch tested?
    
    Care has been taken to ensure correctness and prevent regressions. A more 
advanced version of this patch incorporating optimizations for rewriting 
queries involving aggregations and joins has been running on a production Spark 
cluster at VideoAmp for several years. In that time, one bug was found and 
fixed early on, and we added a regression test for that bug.
    
    We forward-ported this patch to Spark master in June 2016 and have been 
running this patch against Spark 2.x branches on ad-hoc clusters since then.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public 
spark-4502-parquet_column_pruning-foundation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21320.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21320
    
----
commit 9e301b37bb5863ef5fad8ec15f1d8f3d4b5e6a6f
Author: Michael Allman <michael@...>
Date:   2016-06-24T17:21:24Z

    [SPARK-4502][SQL] Parquet nested column pruning

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

Reply via email to