[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

via GitHub Thu, 30 Mar 2023 09:14:57 -0700


yabola commented on code in PR #39950:
URL: https://github.com/apache/spark/pull/39950#discussion_r1153492439



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:
##########
@@ -89,17 +90,28 @@
   @Override
   public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
       throws IOException, InterruptedException {
+    initialize(inputSplit, taskAttemptContext, Option.empty());
+  }
+
+  public void initialize(
+      InputSplit inputSplit,
+      TaskAttemptContext taskAttemptContext,
+      Option<ParquetMetadata> fileFooter) throws IOException, InterruptedException {
     Configuration configuration = taskAttemptContext.getConfiguration();
     FileSplit split = (FileSplit) inputSplit;
     this.file = split.getPath();
-
-    ParquetReadOptions options = HadoopReadOptions
-      .builder(configuration, file)
-      .withRange(split.getStart(), split.getStart() + split.getLength())
-      .withCodecFactory(new ParquetCodecFactory(configuration, 0))
-      .build();
-    ParquetFileReader fileReader = new ParquetFileReader(
-        HadoopInputFile.fromPath(file, configuration), options);
+    ParquetFileReader fileReader;
+    if (fileFooter.isDefined()) {
+      fileReader = new ParquetFileReader(configuration, file, fileFooter.get());

Review Comment:
   When we pass the footer here, we can no longer pass a `CodecFactory`. Parquet will create the default one, `HadoopCodecs.newFactory(conf, 0)`.
   
   This may need to be solved by upgrading Parquet in the future, and I will also create a PR to remove the `@Deprecated` tag.
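
To make the trade-off concrete, here is a minimal sketch of the two construction paths. Every class and method in it (`FooterReuseSketch`, `Reader`, `CodecFactory`, `fromOptions`, `fromFooter`, `open`) is a hypothetical stand-in, not the Parquet or Spark API; it only models the behavior described above, where the footer-reusing path cannot accept a caller-supplied codec factory.

```java
import java.util.Optional;

// Hypothetical stand-ins for illustration only; none of these are Parquet classes.
public class FooterReuseSketch {

  /** Stand-in for a codec factory; only its label matters here. */
  public static final class CodecFactory {
    public final String label;

    public CodecFactory(String label) {
      this.label = label;
    }
  }

  /** Stand-in for a file reader that can be built in two ways. */
  public static final class Reader {
    public final CodecFactory codecFactory;
    public final boolean footerReadFromFile;

    private Reader(CodecFactory codecFactory, boolean footerReadFromFile) {
      this.codecFactory = codecFactory;
      this.footerReadFromFile = footerReadFromFile;
    }

    // Options-style path: the caller picks the codec factory,
    // and the reader has to read the footer from the file itself.
    public static Reader fromOptions(CodecFactory chosen) {
      return new Reader(chosen, true);
    }

    // Footer-style path: the already-read footer is reused,
    // but the codec factory silently falls back to a default.
    public static Reader fromFooter(String cachedFooter) {
      return new Reader(new CodecFactory("default"), false);
    }
  }

  // Mirrors the shape of the if/else in the diff: reuse the footer when present.
  public static Reader open(Optional<String> fileFooter, CodecFactory custom) {
    return fileFooter.isPresent()
        ? Reader.fromFooter(fileFooter.get())
        : Reader.fromOptions(custom);
  }

  public static void main(String[] args) {
    CodecFactory custom = new CodecFactory("custom");

    Reader withoutFooter = open(Optional.empty(), custom);
    Reader withFooter = open(Optional.of("cached footer"), custom);

    // prints "custom true"
    System.out.println(withoutFooter.codecFactory.label + " " + withoutFooter.footerReadFromFile);
    // prints "default false"
    System.out.println(withFooter.codecFactory.label + " " + withFooter.footerReadFromFile);
  }
}
```

In this model the second read of the footer is avoided only at the cost of losing the custom codec factory, which is the gap a later Parquet upgrade would need to close.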



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

