[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/21814
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203638620

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
       // *only* if the file was created by something other than "parquet-mr", so check the actual
       // writer here for this file. We have to do this per-file, as each file in the table may
       // have different writers.
-      def isCreatedByParquetMr(): Boolean = {
-        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
-        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
-      }
+      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --

Also, let's leave a comment here saying that we avoid reading it via short-circuit evaluation. I've already seen multiple people confused here.
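[Editor's note] For readers following the thread, a minimal sketch of how such a comment and the short circuit could look in context. `footerFileMetaData` and `timestampConversion` come from the diffs quoted in this thread; `convertTz` and the `"UTC"` value are hypothetical placeholders, not the committed Spark code:

```scala
// Sketch only, assuming `footerFileMetaData` and `timestampConversion` are in scope
// as in the quoted diff.

// NOTE: `isCreatedByParquetMr` is lazy and only appears on the right-hand side of
// `&&` below, so when `timestampConversion` is false the footer is never read.
lazy val isCreatedByParquetMr: Boolean =
  footerFileMetaData.getCreatedBy().startsWith("parquet-mr")

val convertTz = // hypothetical name for the value guarded by the check
  if (timestampConversion && !isCreatedByParquetMr) Some("UTC") else None
```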
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203638199

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
       // *only* if the file was created by something other than "parquet-mr", so check the actual
       // writer here for this file. We have to do this per-file, as each file in the table may
       // have different writers.
-      def isCreatedByParquetMr(): Boolean = {
-        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
-        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
-      }
+      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --

Hm? `isCreatedByParquetMr` will be evaluated here. We should make `isCreatedByParquetMr` lazy too.
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Github user gengliangwang commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203622451

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
       // *only* if the file was created by something other than "parquet-mr", so check the actual
       // writer here for this file. We have to do this per-file, as each file in the table may
       // have different writers.
-      def isCreatedByParquetMr(): Boolean = {
-        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
-        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
-      }
+      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --

I think making `footerFileMetaData` a lazy value is better.
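[Editor's note] A sketch of what that suggestion amounts to, assuming `sharedConf` and `filePath` are in scope as in the surrounding method of the quoted diff; this is an illustration, not the exact patch:

```scala
import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.FileMetaData

// With `lazy`, the footer is not read when the value is declared; the file is
// only touched the first time downstream code forces footerFileMetaData.
lazy val footerFileMetaData: FileMetaData =
  ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData
```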
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Github user gengliangwang commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203621872

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
       // *only* if the file was created by something other than "parquet-mr", so check the actual
       // writer here for this file. We have to do this per-file, as each file in the table may
       // have different writers.
-      def isCreatedByParquetMr(): Boolean = {
-        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
-        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
-      }
+      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --

Yes, I just pushed the code to make it a lazy value.
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
Github user wangyum commented on a diff in the pull request:
https://github.com/apache/spark/pull/21814#discussion_r203621554

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -384,12 +385,10 @@ class ParquetFileFormat
       // *only* if the file was created by something other than "parquet-mr", so check the actual
       // writer here for this file. We have to do this per-file, as each file in the table may
       // have different writers.
-      def isCreatedByParquetMr(): Boolean = {
-        val footer = ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS)
-        footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr")
-      }
+      val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
--- End diff --

Is `lazy` better? `timestampConversion` defaults to false.

```scala
lazy val isCreatedByParquetMr = footerFileMetaData.getCreatedBy().startsWith("parquet-mr")
```
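[Editor's note] To make the point about the default concrete, here is a small self-contained Scala demonstration (not Spark code) of why a lazy val on the right of `&&` is never evaluated when the left-hand side is false; the counter stands in for the expensive footer read:

```scala
object LazyShortCircuitDemo extends App {
  var footerReads = 0 // stands in for the cost of ParquetFileReader.readFooter

  lazy val isCreatedByParquetMr: Boolean = {
    footerReads += 1
    true
  }

  val timestampConversion = false // the default discussed in this thread

  // `&&` short-circuits: with timestampConversion == false, the lazy val is
  // never forced, so the simulated footer read never happens.
  val needsConversion = timestampConversion && !isCreatedByParquetMr

  println(s"needsConversion = $needsConversion, footerReads = $footerReads") // footerReads = 0
}
```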
[GitHub] spark pull request #21814: [SPARK-24858][SQL] Avoid unnecessary parquet foot...
GitHub user gengliangwang opened a pull request:
https://github.com/apache/spark/pull/21814

[SPARK-24858][SQL] Avoid unnecessary parquet footer reads

## What changes were proposed in this pull request?

Currently the same Parquet footer is read twice in the function `buildReaderWithPartitionValues` of `ParquetFileFormat` if filter push-down is enabled. This PR fixes it by reading the footer metadata once and reusing it.

## How was this patch tested?

Unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark parquetFooter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21814.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21814

commit 5667cc57022d840aecb6c7d0c967e2a3448a4928
Author: Gengliang Wang
Date: 2018-07-19T06:43:45Z

    Avoid unnecessary parquet footer reading
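[Editor's note] As an illustration of the change described above, a rough sketch of the idea. The names `footerFileMetaData`, `isCreatedByParquetMr`, `sharedConf`, and `filePath` follow the diffs quoted in this thread; the method shape and the `pushDownEnabled`/`pushedSchema` pieces are simplified stand-ins, not the real Spark internals:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.FileMetaData

object FooterOnceSketch {
  // Before: filter push-down read the footer, and isCreatedByParquetMr() called
  // readFooter again for the same file. After: read it once and share the result.
  def perFile(sharedConf: Configuration, filePath: Path, pushDownEnabled: Boolean): Unit = {
    // Single, lazily performed footer read per file.
    lazy val footerFileMetaData: FileMetaData =
      ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData

    // Filter push-down path reuses the shared metadata (simplified placeholder).
    val pushedSchema = if (pushDownEnabled) Some(footerFileMetaData.getSchema) else None

    // The created-by check reuses the same metadata instead of re-reading the footer.
    lazy val isCreatedByParquetMr =
      footerFileMetaData.getCreatedBy().startsWith("parquet-mr")

    // ... downstream reader construction would use pushedSchema and isCreatedByParquetMr
  }
}
```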