[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822111#comment-15822111 ]
Michael Allman commented on SPARK-4502:
---------------------------------------

Hi guys,

I'm going to submit a PR for this shortly. We've had a patch for this functionality in production for a year now, but we are only now getting around to contributing it. I've examined the other two PRs. Our patch is substantially different from theirs and provides a superset of their functionality. We've added over two dozen new unit tests to guard against regressions and to verify the expected pruning. We built and tested the latest patch and found a significant number of test failures from our suite. I also found test failures in the unmodified codebase when enabling the schema pruning functionality. I do not take the idea of submitting a parallel, "competing" PR lightly, but in this case I think we can offer a better foundation for review. Please examine our PR and judge for yourself.

Cheers.

> Spark SQL reads unnecessary nested fields from Parquet
> ------------------------------------------------------
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun
>            Priority: Critical
>
> When reading a field of a nested column from Parquet, Spark SQL reads and
> assembles all the fields of that nested column. This is unnecessary, since
> Parquet supports fine-grained field reads from a nested column. This can
> degrade performance significantly when a nested column has many fields.
> For example, I loaded JSON tweet data into Spark SQL and ran the following
> query:
> {{SELECT User.contributors_enabled FROM Tweets;}}
> User is a nested structure with 38 primitive fields (for the Tweets schema,
> see: https://dev.twitter.com/overview/api/tweets). Here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, even though the first query
> needs only 1 column. I also measured the bytes read within Parquet: in both
> cases, the same number of bytes (99365194 bytes) were read.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
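For readers who want to check this behavior themselves, a minimal sketch follows. It assumes a Spark 2.4+ build in which the schema-pruning work tracked by this issue landed behind the `spark.sql.optimizer.nestedSchemaPruning.enabled` configuration flag, and it assumes a Parquet-backed `Tweets` table with a `User` struct as in the example above; the session setup and table names here are illustrative, not part of the original report:

```scala
// Sketch, assuming Spark 2.4+ and a Parquet-backed "Tweets" table whose
// "User" column is a struct, as in the example above.
import org.apache.spark.sql.SparkSession

object NestedPruningDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NestedSchemaPruningDemo")
      .master("local[*]")
      // When this flag is false (the pre-patch behavior), selecting a single
      // nested field still reads and assembles every field of the User struct.
      .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
      .getOrCreate()

    val pruned = spark.sql("SELECT User.contributors_enabled FROM Tweets")
    // With pruning in effect, the physical plan's ReadSchema should list only
    // the single requested leaf field under User, rather than all 38 fields.
    pruned.explain()

    spark.stop()
  }
}
```

Comparing the `ReadSchema` in the `explain()` output with and without the flag is the quickest way to see whether pruning applied to a given query.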