[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liwen Sun updated SPARK-4502:
-----------------------------
    Description:
When reading a field of a nested column from Parquet, SparkSQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column. It may degrade performance significantly when a nested column has many fields.

For example, I loaded JSON tweet data into SparkSQL and ran the following query:
{{SELECT User.contributors_enabled from Tweets;}}
User is a nested structure that has 38 primitive fields (for the Tweets schema, see: https://dev.twitter.com/overview/api/tweets). Here is the log message:
{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}
For comparison, I also ran:
{{SELECT User FROM Tweets;}}
And here is the log message:
{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
So both queries load 38 columns from Parquet, while the first query only needs 1 column. I also measured the bytes read within Parquet. In both cases, the same number of bytes (99365194) was read.

  was:
When reading a field of a nested column from Parquet, SparkSQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column. It may degrade performance significantly when a nested column has many fields.
For example, I loaded JSON tweet data into SparkSQL and ran the following query:
{{SELECT User.contributors_enabled from Tweets;}}
User is a nested structure that has 38 primitive fields (for the Tweets schema, see: https://dev.twitter.com/overview/api/tweets). Here is the log message:
{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}
For comparison, I also ran:
{{SELECT User FROM Tweets;}}
And here is the log message:
{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
So both queries load 38 columns from Parquet, while the first query only need 1 column. I also measured the bytes read within Parquet. In both cases, the same number of bytes (99365194) was read.

> Spark SQL reads unnecessary fields from Parquet
> -----------------------------------------------
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun
>
> When reading a field of a nested column from Parquet, SparkSQL reads and
> assembles all the fields of that nested column. This is unnecessary, as
> Parquet supports fine-grained field reads out of a nested column. It may
> degrade performance significantly when a nested column has many fields.
> For example, I loaded JSON tweet data into SparkSQL and ran the following
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for the Tweets
> schema, see: https://dev.twitter.com/overview/api/tweets). Here is the log
> message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only
> needs 1 column. I also measured the bytes read within Parquet. In both
> cases, the same number of bytes (99365194) was read.
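
The pruning that the description asks for can be illustrated with a small sketch. This is not Spark's or Parquet's actual code: the dict-based schema model, the function names prune_schema and count_leaves, and every field other than contributors_enabled are assumptions made up for illustration. The idea is that a query naming User.contributors_enabled should request a read schema with one leaf column rather than every field under User.

```python
import copy

# Illustrative sketch only; not Spark's or Parquet's implementation.
# A schema is modeled as a dict: a nested column maps to a dict of its
# fields, and a primitive field maps to a type name.


def prune_schema(schema, paths):
    """Return the sub-schema that contains only the requested dotted paths."""
    pruned = {}
    for path in paths:
        src, dst = schema, pruned
        parts = path.split(".")
        for depth, name in enumerate(parts):
            if not isinstance(src, dict) or name not in src:
                raise KeyError("unknown field: " + path)
            if depth == len(parts) - 1:
                # Last path component: keep this field. If it is a nested
                # column, its whole subtree is kept.
                dst[name] = copy.deepcopy(src[name])
            else:
                # Intermediate component: descend, creating the parent
                # struct in the pruned schema if it is not there yet.
                dst = dst.setdefault(name, {})
                src = src[name]
    return pruned


def count_leaves(schema):
    """Count primitive (leaf) columns, i.e. the columns actually read."""
    return sum(
        count_leaves(field) if isinstance(field, dict) else 1
        for field in schema.values()
    )


# Hypothetical, heavily shortened stand-in for the Tweets schema.
tweets = {
    "text": "string",
    "User": {
        "contributors_enabled": "boolean",
        "id": "int64",
        "screen_name": "string",
    },
}

one_field = prune_schema(tweets, ["User.contributors_enabled"])
whole_struct = prune_schema(tweets, ["User"])
print(one_field)                   # {'User': {'contributors_enabled': 'boolean'}}
print(count_leaves(one_field))     # 1
print(count_leaves(whole_struct))  # 3
```

With the real 38-field User structure, the same reasoning would give a 1-column read for the first query and a 38-column read for the second, which is the difference the log messages above fail to show.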