GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/22880
[SPARK-25407][SQL] Ensure we pass a compatible pruned schema to ParquetRowConverter

## What changes were proposed in this pull request?

(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-25407)

As part of schema clipping in `ParquetReadSupport.scala`, we add fields in the Catalyst requested schema which are missing from the Parquet file schema to the Parquet clipped schema. However, nested schema pruning requires that we ignore unrequested field data when reading from a Parquet file. Therefore, we pass two schemas to `ParquetRecordMaterializer`: the schema of the file data we want to read and the schema of the rows we want to return. The reader is responsible for reconciling the differences between the two.

Aside from checking whether schema pruning is enabled, there is an additional complication to constructing the Parquet requested schema: the manner in which Spark's two Parquet readers reconcile the differences between the Parquet requested schema and the Catalyst requested schema differs.

Spark's vectorized reader does not (currently) support reading Parquet files with complex types in their schema. Further, it assumes that the Parquet requested schema includes all fields requested in the Catalyst requested schema. It includes logic in its read path to skip fields in the Parquet requested schema which are not present in the file.

Spark's parquet-mr based reader supports reading Parquet files with any kind of complex schema, and it supports nested schema pruning as well. Unlike the vectorized reader, the parquet-mr reader requires that the Parquet requested schema include only those fields present in the underlying Parquet file's schema. Therefore, in the case where we use the parquet-mr reader, we intersect the Parquet clipped schema with the Parquet file's schema to construct the Parquet requested schema that's set in the `ReadContext`.

## How was this patch tested?

A previously ignored test case which exercises the failure scenario this PR addresses has been enabled.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-25407-parquet_column_pruning-fix_ignored_pruning_test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22880.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22880

----

commit e5e60ad2d9c130050925220eb4ae93ae3c949e95
Author: Michael Allman <msa@...>
Date: 2018-08-15T23:48:25Z

    Ensure we pass a compatible pruned schema to ParquetRowConverter

----
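For illustration, below is a minimal sketch (in Scala, with a hypothetical file path and hypothetical column names `id`, `contact`, `name`, `phone`) of the kind of read that exercises this code path: nested schema pruning enabled via `spark.sql.optimizer.nestedSchemaPruning.enabled`, and a requested nested subfield that is absent from the Parquet file's schema. It is a sketch of the scenario, not the enabled test case itself.

```scala
// Minimal sketch (hypothetical path and column names) of a read that exercises
// nested schema pruning with a requested subfield missing from the file schema.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SPARK-25407 sketch")
  .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
  .getOrCreate()
import spark.implicits._

// Write a file whose `contact` struct contains only a `name` subfield.
Seq(("1", "Jane")).toDF("id", "name")
  .selectExpr("id", "named_struct('name', name) AS contact")
  .write.mode("overwrite").parquet("/tmp/spark-25407-contacts")

// Read it back with a Catalyst schema whose `contact` struct also declares a
// `phone` subfield that does not exist in the file. With pruning enabled, the
// Parquet requested schema handed to the parquet-mr reader must include only
// fields present in the file, which is what this patch ensures.
val requestedSchema = new StructType()
  .add("id", StringType)
  .add("contact", new StructType()
    .add("name", StringType)
    .add("phone", StringType))

spark.read.schema(requestedSchema)
  .parquet("/tmp/spark-25407-contacts")
  .select("contact.name")
  .show()
```

Under these assumptions, only the `contact.name` column data is read from the file, while the returned rows still conform to the Catalyst requested schema.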