wgtmac commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1381380912
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -722,14 +722,7 @@ public void writeDataPage(
LOG.debug("{}: write data page content {}", out.getPos(),
compressedPageSize);
bytes.writeAllTo(out);
- // Copying the statistics if it is not initialized yet so we have the correct typed one
- if (currentStatistics == null) {
- currentStatistics = statistics.copy();
- } else {
- currentStatistics.mergeStatistics(statistics);
- }
-
- columnIndexBuilder.add(statistics);
+ mergeColumnStatistics(statistics);
Review Comment:
Do we need to apply the same fix here and below:
https://github.com/apache/parquet-mr/blob/514cc6c257fe8e618b100a19d86d304f6442cb94/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L215
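
For reference, a minimal sketch of what the extracted `mergeColumnStatistics` helper presumably contains, reconstructed from the removed lines above (the exact body in the PR may differ):

```java
// Reconstructed from the removed lines: merge the page statistics into the
// running column statistics and feed the column index builder.
private void mergeColumnStatistics(Statistics<?> statistics) {
  // Copy the statistics if not initialized yet so we have the correctly typed one
  if (currentStatistics == null) {
    currentStatistics = statistics.copy();
  } else {
    currentStatistics.mergeStatistics(statistics);
  }
  columnIndexBuilder.add(statistics);
}
```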
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -392,6 +402,7 @@ private void processChunk(ColumnChunkMetaData chunk,
DictionaryPage dictionaryPage = null;
long readValues = 0;
Statistics<?> statistics = null;
+ boolean needOverwriteColumnStatistics = false;
Review Comment:
```suggestion
boolean isColumnStatisticsMalformed = false;
```
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -324,15 +324,20 @@ private void processBlocksFromReader(IndexCache indexCache) throws IOException {
// Translate compression and/or encryption
writer.startColumn(descriptor, crStore.getColumnReader(descriptor).getTotalValueCount(), newCodecName);
- processChunk(
+ boolean needOverwriteStatistics = processChunk(
chunk,
newCodecName,
columnChunkEncryptorRunTime,
encryptColumn,
indexCache.getBloomFilter(chunk),
indexCache.getColumnIndex(chunk),
indexCache.getOffsetIndex(chunk));
- writer.endColumn();
+ if (needOverwriteStatistics) {
+ // All the column statistics are invalid, so we need to overwrite the column statistics
+ writer.endColumn(chunk.getStatistics());
Review Comment:
Is it cleaner to handle this inside processChunk()? Then we wouldn't have to
bother with endColumn() at the call site.
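
For illustration, a self-contained toy sketch of the two call shapes being weighed here. The stub `Writer` interface and method names are hypothetical stand-ins, not the real parquet-mr API:

```java
// Toy comparison of who picks the endColumn() overload.
class EndColumnShapes {

  interface Writer {
    void endColumn();                        // trust the accumulated page statistics
    void endColumn(Object totalStatistics);  // overwrite with known column totals
  }

  // Shape in the PR as written: processChunk() reports a flag and the
  // caller branches on it.
  static void callerDecides(Writer writer, Object chunkStats) {
    boolean statsMalformed = translatePages();
    if (statsMalformed) {
      writer.endColumn(chunkStats); // page stats unusable: overwrite totals
    } else {
      writer.endColumn();
    }
  }

  // Reviewer's suggestion: processChunk() already knows whether the page
  // statistics were usable, so it could close the column itself and the
  // caller never touches endColumn().
  static void processChunkAndEndColumn(Writer writer, Object chunkStats) {
    boolean statsMalformed = translatePages();
    if (statsMalformed) {
      writer.endColumn(chunkStats);
    } else {
      writer.endColumn();
    }
  }

  static boolean translatePages() {
    return false; // placeholder for the real page-translation loop
  }
}
```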
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -988,6 +988,21 @@ void writeColumnChunk(ColumnDescriptor descriptor,
endColumn();
}
+ /**
+ * Overwrite the column total statistics. This is used when the column total statistics
+ * are known while all the page statistics are invalid, for example when rewriting the column.
+ *
+ * @param totalStatistics the column total statistics
+ * @throws IOException if there is an error while writing
+ */
+ public void endColumn(Statistics<?> totalStatistics) throws IOException {
Review Comment:
Since this is used only for invalid stats, is it better to add a `void
invalidateStatistics()` which simply includes lines 988 and 990? Then you would
just need to call invalidateStatistics() and endColumn() in this case.
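
A minimal sketch of that suggestion, assuming lines 988 and 990 are the statements that set `currentStatistics` and reset the column index builder. The parameter is an assumption (the reviewer's signature omitted one, but the totals have to come from somewhere), and the `ColumnIndexBuilder.getNoOpBuilder()` call is a guess at what invalidating the column index means here:

```java
// Hypothetical helper per the suggestion above; the body is assumed to
// mirror lines 988 and 990 of the new endColumn(Statistics<?>) overload.
public void invalidateStatistics(Statistics<?> totalStatistics) {
  // Overwrite the accumulated (invalid) page statistics with the known totals.
  currentStatistics = totalStatistics;
  // Drop the per-page column index, since it was built from invalid statistics.
  columnIndexBuilder = ColumnIndexBuilder.getNoOpBuilder();
}
```

With this in place, the rewriter would call invalidateStatistics() followed by the existing endColumn(), instead of a new endColumn(Statistics<?>) overload.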