GitHub user ajacques opened a pull request: https://github.com/apache/spark/pull/21889
[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt) (Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502) **This is a restart of apache/spark#21320. Most of the discussion has taken place over there. I've only taken it as a based to make stylistic changes based on the code review to help move things along.** Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this PR to collect any feedback. This can be closed if you prefer to continue to the work in the original PR. ### What changes were proposed in this pull request? One of the hallmarks of a column-oriented data storage format is the ability to read data from a subset of columns, efficiently skipping reads from other columns. Spark has long had support for pruning unneeded top-level schema fields from the scan of a parquet file. For example, consider a table, contacts, backed by parquet with the following Spark SQL schema: ``` root |-- name: struct | |-- first: string | |-- last: string |-- address: string ``` Parquet stores this table's data in three physical columns: name.first, name.last and address. To answer the query `select address from contacts` Spark will read only from the address column of parquet data. However, to answer the query `select name.first from contacts` Spark will read name.first and name.last from parquet. This PR modifies Spark SQL to support a finer-grain of schema pruning. With this patch, Spark reads only the name.first column to answer the previous query. ### Implementation There are two main components of this patch. First, there is a ParquetSchemaPruning optimizer rule for gathering the required schema fields of a PhysicalOperation over a parquet file, constructing a new schema based on those required fields and rewriting the plan in terms of that pruned schema. The pruned schema fields are pushed down to the parquet requested read schema. ParquetSchemaPruning uses a new ProjectionOverSchema extractor for rewriting a catalyst expression in terms of a pruned schema. Second, the ParquetRowConverter has been patched to ensure the ordinals of the parquet columns read are correct for the pruned schema. ParquetReadSupport has been patched to address a compatibility mismatch between Spark's built in vectorized reader and the parquet-mr library's reader. ### Limitation Among the complex Spark SQL data types, this patch supports parquet column pruning of nested sequences of struct fields only. ### How was this patch tested? Care has been taken to ensure correctness and prevent regressions. A more advanced version of this patch incorporating optimizations for rewriting queries involving aggregations and joins has been running on a production Spark cluster at VideoAmp for several years. In that time, one bug was found and fixed early on, and we added a regression test for that bug. We forward-ported this patch to Spark master in June 2016 and have been running this patch against Spark 2.x branches on ad-hoc clusters since then. ### Known Issues Highlighted in https://github.com/apache/spark/pull/21320#issuecomment-408271470 You can merge this pull request into a Git repository by running: $ git pull https://github.com/ajacques/apache-spark spark-4502-parquet_column_pruning-foundation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21889.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21889 ---- commit 86c572a26918e6380762508d1d868fd5ea231ed1 Author: Michael Allman <michael@...> Date: 2016-06-24T17:21:24Z [SPARK-4502][SQL] Parquet nested column pruning commit 1ffbc6c8f8eaa27b50048e2124afdd2ff9e1e2b4 Author: Michael Allman <msa@...> Date: 2018-06-04T09:59:34Z Refactor SelectedFieldSuite to make its tests simpler and more comprehensible commit 97cd30b221fe78a4ea61c6c24be370fc2dfdd498 Author: Michael Allman <msa@...> Date: 2018-06-04T10:02:28Z Remove test "select function over nested data" of unknown origin and purpose commit c27e87924ed788a4253d077691401d0852306406 Author: Michael Allman <msa@...> Date: 2018-06-04T10:45:28Z Improve readability of ParquetSchemaPruning and ParquetSchemaPruningSuite. Add test to exercise whether the requested root fields in a query exclude any attributes commit 2fc1173499e1fc4ca4205db80e78ca6f110fb5dd Author: Michael Allman <msa@...> Date: 2018-06-04T12:04:23Z Don't handle non-data-field partition column names specially when constructing the new projections in ParquetSchemaPruning. These column projections are left unchanged by the transformDown function when the projectionOverSchema pattern does not match commit 40dae02ec31e4817d6afe9e738a20e3c6e6ddd03 Author: Michael Allman <msa@...> Date: 2018-06-04T12:07:35Z Add test coverage for ParquetSchemaPruning for partitioned tables whose data schema includes the partition column commit f4e11770ffe006baaec136ea77bc2cba38c8aad2 Author: Michael Allman <msa@...> Date: 2018-06-12T00:22:40Z Remove the ColumnarFileFormat type to put it in another PR commit 9749d8cf644a35f28621b12920be6de912b76763 Author: Michael Allman <msa@...> Date: 2018-06-12T02:34:19Z Add test coverage for the enhancements to "is not null" constraint inference for complex type extractors commit 3421041b2da13c269f8811e78719e2d8eaaf6628 Author: Michael Allman <msa@...> Date: 2018-06-24T19:49:33Z Revert changes to QueryPlanConstraints.scala and basicPhysicalOperators.scala. Remove related unit tests. This change removes support for schema pruning on filters involving attributes of complex type commit 4ddd78a159300cbb19afb7d12c63279f17f9a32b Author: Michael Allman <msa@...> Date: 2018-06-24T19:55:49Z Revert a whitespace change in DataSourceScanExec.scala commit 9fe4208091950e6f7af4b4ebaca6de0b720015f9 Author: Michael Allman <msa@...> Date: 2018-07-21T09:41:11Z Remove modifications to ParquetFileFormat.scala and ParquetReadSupport.scala, marking broken tests in ParquetSchemaPruningSuite.scala as "ignored" commit 9a1f80817abfa03430594ef3a9447f030f885bd5 Author: Michael Allman <msa@...> Date: 2018-07-21T21:47:45Z PR review: simplify some syntax and add a code doc commit 7ee616076f93d6cfd55b6646314f3d4a6d1530d3 Author: Adam Jacques <adam@...> Date: 2018-07-26T01:14:35Z Refactor code slightly based on code review comments commit e1836d13cfd285a385e8b65f66adb7574d121429 Author: Adam Jacques <adam@...> Date: 2018-07-26T01:49:07Z Minor refactorings for code review commit cd6ae2c7457b646df73fea04c33d38581a63952f Author: Adam Jacques <adam@...> Date: 2018-07-26T05:11:33Z Refactorings for code reviews commit 4b847acd8279d8c40115638faac30a4bc1736307 Author: Adam Jacques <adam@...> Date: 2018-07-27T00:27:22Z Mark as private ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org