Rajesh Balamohan created SPARK-32225:
----------------------------------------

             Summary: Parquet footer information is read twice
                 Key: SPARK-32225
                 URL: https://issues.apache.org/jira/browse/SPARK-32225
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Rajesh Balamohan
         Attachments: spark_parquet_footer_reads.png


When running queries, Spark reads Parquet footer information twice. In cloud environments this can be expensive; how expensive depends on the job and the number of splits. It would be nice to reuse the footer information already read in "ParquetFileFormat::buildReaderWithPartitionValues".

!image-2020-07-08-14-24-23-470.png|width=726,height=730!

Lines of interest:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L271]

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L326]

[https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105]

[https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L111]
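
To make the request concrete, here is a minimal, self-contained sketch of the idea. It is not Spark or parquet-mr code: Footer, CountingStore, RecordReader and FooterReuseSketch are hypothetical stand-ins, and the counter only simulates remote footer fetches. It assumes the footer is read once while the reader is being built and once more when the record reader initializes, and shows how handing over the already-read footer would avoid the second read.

```java
// Hypothetical stand-in for Parquet footer metadata; not a real Spark or parquet-mr class.
final class Footer {
    final String path;
    final int rowGroups;

    Footer(String path, int rowGroups) {
        this.path = path;
        this.rowGroups = rowGroups;
    }
}

// Hypothetical storage client that counts footer reads (each one models a remote call).
final class CountingStore {
    private int footerReads = 0;

    Footer readFooter(String path) {
        footerReads++;
        return new Footer(path, 4); // fixed row-group count, illustration only
    }

    int footerReads() {
        return footerReads;
    }
}

// Hypothetical record reader with two ways of obtaining the footer.
final class RecordReader {
    private final Footer footer;

    // Behaviour assumed from the issue: the reader fetches the footer itself.
    RecordReader(CountingStore store, String path) {
        this.footer = store.readFooter(path);
    }

    // Proposed behaviour: the reader reuses a footer that was already read.
    RecordReader(Footer alreadyRead) {
        this.footer = alreadyRead;
    }

    int rowGroups() {
        return footer.rowGroups;
    }
}

final class FooterReuseSketch {
    // Footer read during planning, then read again inside the record reader.
    static int readTwice(CountingStore store, String path) {
        Footer planningFooter = store.readFooter(path);
        RecordReader reader = new RecordReader(store, path);
        return reader.rowGroups();
    }

    // Footer read during planning and passed on to the record reader.
    static int readOnce(CountingStore store, String path) {
        Footer planningFooter = store.readFooter(path);
        RecordReader reader = new RecordReader(planningFooter);
        return reader.rowGroups();
    }

    public static void main(String[] args) {
        CountingStore withoutReuse = new CountingStore();
        readTwice(withoutReuse, "part-00000.parquet");
        System.out.println("without reuse: " + withoutReuse.footerReads() + " footer reads");

        CountingStore withReuse = new CountingStore();
        readOnce(withReuse, "part-00000.parquet");
        System.out.println("with reuse: " + withReuse.footerReads() + " footer reads");
    }
}
```

In this toy model the saving is one footer read per split, which is the cost the issue is concerned about on cloud storage. How the footer would actually be passed between the two real classes is left open here.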