GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/22905
[SPARK-25894][SQL] Add a ColumnarFileFormat type which returns the column count for a given schema

(link to Jira: https://issues.apache.org/jira/browse/SPARK-25894)

## What changes were proposed in this pull request?

Knowing the number of physical columns Spark will read from a columnar file format (such as Parquet) is extremely helpful (if not critical) in validating an assumption about that number of columns based on a given query. For example, take a `contacts` table with a `name` struct column like `(name.first, name.last)`. Without schema pruning the following query reads both columns in the name struct:

```
select name.first from contacts
```

With schema pruning, the same query reads only the `name.first` column.

This PR includes an additional metadata field for `FileSourceScanExec` which identifies the number of columns Spark will read from that file source. This metadata will be printed as part of a physical plan explanation.

## How was this patch tested?

A new test was added to `ParquetSchemaPruningSuite.scala`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-25894-file_source_scan_exec_column_count_metadata

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22905.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22905

----
commit 4aa8d0454be723f8318e1d0a3ea4e4c138ed5861
Author: Michael Allman <msa@...>
Date:   2018-10-31T12:27:00Z

Add a ColumnarFileFormat type and implementation for ParquetFileFormat which specifies a method for returning the physical column count associated with a given StructType.

We include this count as metadata in FileSourceScanExec

----
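
To illustrate the column counts described in the pull request text, here is a minimal, self-contained sketch of counting the leaf (physical) columns of a schema, with and without pruning. It is not the PR's implementation and does not use Spark's API: `FieldType`, `Leaf`, `Struct`, `Field`, and `ColumnCountSketch` are hypothetical stand-ins for a schema made only of structs and primitive fields, and the sketch assumes each primitive field maps to exactly one physical column. Arrays and maps are not modeled.

```scala
// Simplified stand-ins for a nested schema. These are hypothetical types for
// illustration only, NOT Spark's org.apache.spark.sql.types classes.
sealed trait FieldType
case object Leaf extends FieldType // a primitive field, assumed to be one physical column
final case class Struct(fields: List[Field]) extends FieldType
final case class Field(name: String, fieldType: FieldType)

object ColumnCountSketch {
  // Count the leaf columns reachable from a type by recursing through structs.
  def columnCount(t: FieldType): Int = t match {
    case Leaf           => 1
    case Struct(fields) => fields.map(f => columnCount(f.fieldType)).sum
  }

  // The `contacts` example from the description: name is a struct (first, last).
  val fullSchema: Struct =
    Struct(List(Field("name", Struct(List(Field("first", Leaf), Field("last", Leaf))))))

  // The schema that `select name.first from contacts` needs once pruning is applied.
  val prunedSchema: Struct =
    Struct(List(Field("name", Struct(List(Field("first", Leaf))))))

  def main(args: Array[String]): Unit = {
    println(s"columns without pruning: ${columnCount(fullSchema)}")
    println(s"columns with pruning: ${columnCount(prunedSchema)}")
  }
}
```

Comparing the two counts for the same query is the kind of check the description has in mind: a count for the scan that matches the pruned schema, rather than the full one, suggests pruning took effect.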