GitHub user mallman opened a pull request:

    https://github.com/apache/spark/pull/22905

    [SPARK-25894][SQL] Add a ColumnarFileFormat type which returns the column count for a given schema

    (link to Jira: https://issues.apache.org/jira/browse/SPARK-25894)
    
    ## What changes were proposed in this pull request?
    
    Knowing how many physical columns Spark will read from a columnar file
    format (such as Parquet) is extremely helpful, if not critical, for
    validating assumptions about which columns a given query reads. For example,
    take a `contacts` table with a `name` struct column like `(name.first,
    name.last)`. Without schema pruning, the following query reads both columns
    in the `name` struct:
    
    ```
    select name.first from contacts
    ```
    
    With schema pruning, the same query reads only the `name.first` column.
    
    This PR adds a metadata field to `FileSourceScanExec` that reports the
    number of columns Spark will read from that file source. This metadata is
    printed as part of the physical plan explanation.
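
To make the column counts in the `contacts` example concrete, here is a minimal Python sketch. It is not Spark code. It assumes a nested schema can be modeled as plain dicts and that each leaf field corresponds to one physical column. The names `physical_column_count` and `prune`, and the `id` field, are hypothetical and are not part of this PR or of Spark's API. Arrays and maps are not handled.

```python
# Hypothetical stand-in for a nested schema: a dict maps each field name to
# either a primitive type name (str) or a nested struct (another dict).
contacts_schema = {
    "id": "int",  # assumed extra column, not mentioned in the PR text
    "name": {"first": "string", "last": "string"},
}


def physical_column_count(schema):
    """Count leaf fields, assuming each leaf maps to one physical column."""
    count = 0
    for field_type in schema.values():
        if isinstance(field_type, dict):
            # A struct contributes the leaves nested beneath it.
            count += physical_column_count(field_type)
        else:
            count += 1
    return count


def prune(schema, path):
    """Keep only the field reached by a dotted path such as 'name.first'."""
    head, _, rest = path.partition(".")
    field_type = schema[head]
    if rest:
        return {head: prune(field_type, rest)}
    return {head: field_type}


# Without schema pruning, `select name.first` reads the whole `name` struct.
unpruned = {"name": contacts_schema["name"]}
# With schema pruning, only `name.first` is read.
pruned = prune(contacts_schema, "name.first")
```

Under these assumptions, `physical_column_count(unpruned)` is 2 and `physical_column_count(pruned)` is 1, which is the kind of difference the new metadata field is meant to make visible in the plan.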
    
    ## How was this patch tested?
    
    A new test was added to `ParquetSchemaPruningSuite.scala`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-25894-file_source_scan_exec_column_count_metadata

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22905.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22905
    
----
commit 4aa8d0454be723f8318e1d0a3ea4e4c138ed5861
Author: Michael Allman <msa@...>
Date:   2018-10-31T12:27:00Z

    Add a ColumnarFileFormat type and implementation for ParquetFileFormat
    which specifies a method for returning the physical column count
    associated with a given StructType. We include this count as metadata in
    FileSourceScanExec

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
