GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/22880
[SPARK-25407][SQL] Ensure we pass a compatible pruned schema to ParquetRowConverter

## What changes were proposed in this pull request?

(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-25407)

As part of schema clipping in `ParquetReadSupport.scala`, we add fields in the Catalyst requested schema which are missing from the Parquet file schema to the Parquet clipped schema. However, nested schema pruning requires that we ignore unrequested field data when reading from a Parquet file. Therefore, we pass two schemas to `ParquetRecordMaterializer`: the schema of the file data we want to read and the schema of the rows we want to return. The reader is responsible for reconciling the differences between the two.

Aside from checking whether schema pruning is enabled, there is an additional complication to constructing the Parquet requested schema: the manner in which Spark's two Parquet readers reconcile the differences between the Parquet requested schema and the Catalyst requested schema differs.

Spark's vectorized reader does not (currently) support reading Parquet files with complex types in their schema. Further, it assumes that the Parquet requested schema includes all fields requested in the Catalyst requested schema. It includes logic in its read path to skip fields in the Parquet requested schema which are not present in the file.

Spark's parquet-mr based reader supports reading Parquet files with any kind of complex schema, and it supports nested schema pruning as well. Unlike the vectorized reader, the parquet-mr reader requires that the Parquet requested schema include only those fields present in the underlying Parquet file's schema. Therefore, in the case where we use the parquet-mr reader, we intersect the Parquet clipped schema with the Parquet file's schema to construct the Parquet requested schema that's set in the `ReadContext`.

## How was this patch tested?

A previously ignored test case which exercises the failure scenario this PR addresses has been enabled.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-25407-parquet_column_pruning-fix_ignored_pruning_test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22880.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22880

----

commit e5e60ad2d9c130050925220eb4ae93ae3c949e95
Author: Michael Allman <msa@...>
Date: 2018-08-15T23:48:25Z

    Ensure we pass a compatible pruned schema to ParquetRowConverter

----
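For illustration, below is a minimal sketch (in Scala, with a hypothetical file path and hypothetical column names `id`, `contact`, `name`, `phone`) of the kind of read that exercises this code path: nested schema pruning enabled via `spark.sql.optimizer.nestedSchemaPruning.enabled`, and a requested nested subfield that is absent from the Parquet file's schema. It is a sketch of the scenario, not the enabled test case itself.

```scala
// Minimal sketch (hypothetical path and column names) of a read that exercises
// nested schema pruning with a requested subfield missing from the file schema.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SPARK-25407 sketch")
  .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
  .getOrCreate()
import spark.implicits._

// Write a file whose `contact` struct contains only a `name` subfield.
Seq(("1", "Jane")).toDF("id", "name")
  .selectExpr("id", "named_struct('name', name) AS contact")
  .write.mode("overwrite").parquet("/tmp/spark-25407-contacts")

// Read it back with a Catalyst schema whose `contact` struct also declares a
// `phone` subfield that does not exist in the file. With pruning enabled, the
// Parquet requested schema handed to the parquet-mr reader must include only
// fields present in the file, which is what this patch ensures.
val requestedSchema = new StructType()
  .add("id", StringType)
  .add("contact", new StructType()
    .add("name", StringType)
    .add("phone", StringType))

spark.read.schema(requestedSchema)
  .parquet("/tmp/spark-25407-contacts")
  .select("contact.name")
  .show()
```

Under these assumptions, only the `contact.name` column data is read from the file, while the returned rows still conform to the Catalyst requested schema.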