[
https://issues.apache.org/jira/browse/PARQUET-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782491#comment-17782491
]
ASF GitHub Bot commented on PARQUET-2365:
-----------------------------------------
wgtmac commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1381380912
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -722,14 +722,7 @@ public void writeDataPage(
LOG.debug("{}: write data page content {}", out.getPos(),
compressedPageSize);
bytes.writeAllTo(out);
- // Copying the statistics if it is not initialized yet so we have the correct typed one
- if (currentStatistics == null) {
- currentStatistics = statistics.copy();
- } else {
- currentStatistics.mergeStatistics(statistics);
- }
-
- columnIndexBuilder.add(statistics);
+ mergeColumnStatistics(statistics);
Review Comment:
Do we need to fix here and below:
https://github.com/apache/parquet-mr/blob/514cc6c257fe8e618b100a19d86d304f6442cb94/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L215
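The merge-or-copy pattern being factored into `mergeColumnStatistics(statistics)` here can be sketched as follows. This is a simplified illustration, not the actual parquet-mr code; `SimpleStats` is a hypothetical stand-in for Parquet's `Statistics` class:

```java
// Simplified illustration of the copy-on-first-page / merge-afterwards
// statistics pattern. SimpleStats is a hypothetical stand-in for Statistics<?>.
public class SimpleStats {
    private long min;
    private long max;
    private boolean empty = true;

    public void update(long v) {
        if (empty) { min = v; max = v; empty = false; }
        else { min = Math.min(min, v); max = Math.max(max, v); }
    }

    public SimpleStats copy() {
        SimpleStats s = new SimpleStats();
        s.min = min; s.max = max; s.empty = empty;
        return s;
    }

    public void mergeStatistics(SimpleStats other) {
        if (other.empty) return;
        update(other.min);
        update(other.max);
    }

    public long getMin() { return min; }
    public long getMax() { return max; }

    // The factored-out helper: copy on first use so the accumulated object
    // has the correct concrete type, then merge subsequent page statistics.
    public static SimpleStats mergeColumnStatistics(SimpleStats current, SimpleStats page) {
        if (current == null) {
            return page.copy();
        }
        current.mergeStatistics(page);
        return current;
    }
}
```

The point of the refactor is that every call site (including the `ColumnChunkPageWriteStore` line referenced above) goes through one helper instead of repeating the null check.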
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -392,6 +402,7 @@ private void processChunk(ColumnChunkMetaData chunk,
DictionaryPage dictionaryPage = null;
long readValues = 0;
Statistics<?> statistics = null;
+ boolean needOverwriteColumnStatistics = false;
Review Comment:
```suggestion
boolean isColumnStatisticsMalformed = false;
```
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -324,15 +324,20 @@ private void processBlocksFromReader(IndexCache indexCache) throws IOException {
// Translate compression and/or encryption
writer.startColumn(descriptor,
crStore.getColumnReader(descriptor).getTotalValueCount(), newCodecName);
- processChunk(
+ boolean needOverwriteStatistics = processChunk(
chunk,
newCodecName,
columnChunkEncryptorRunTime,
encryptColumn,
indexCache.getBloomFilter(chunk),
indexCache.getColumnIndex(chunk),
indexCache.getOffsetIndex(chunk));
- writer.endColumn();
+ if (needOverwriteStatistics) {
+ // All the column statistics are invalid, so we need to overwrite the column statistics
+ writer.endColumn(chunk.getStatistics());
Review Comment:
Is it cleaner to handle this inside processChunk()? Then we don't have to bother with endColumn().
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -988,6 +988,21 @@ void writeColumnChunk(ColumnDescriptor descriptor,
endColumn();
}
+ /**
+ * Overwrite the column total statistics. This is specifically used when the column total statistics
+ * is known while all the page statistics are invalid, for example when rewriting the column.
+ *
+ * @param totalStatistics the column total statistics
+ * @throws IOException if there is an error while writing
+ */
+ public void endColumn(Statistics<?> totalStatistics) throws IOException {
Review Comment:
Since this is used only for invalid stats, is it better to add a `void invalidateStatistics()` which simply includes lines 988 and 990? Then you would just need to call invalidateStatistics() and endColumn() in this case.
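The refactor being suggested can be sketched as follows. This is a hypothetical illustration of the proposed API shape, not the actual ParquetFileWriter code; `MockWriter` and its `String` statistics stand-in are invented for the example:

```java
// Hypothetical sketch of the suggestion: instead of an endColumn(Statistics)
// overload, expose invalidateStatistics() to replace the accumulated
// (malformed) page statistics with the known column totals, then let the
// regular endColumn() path run unchanged. MockWriter is illustrative only.
public class MockWriter {
    private String currentStatistics = "page-stats"; // stand-in for Statistics<?>
    private String finishedColumnStatistics;

    // Overwrite the accumulated statistics with the known column totals.
    public void invalidateStatistics(String totalStatistics) {
        this.currentStatistics = totalStatistics;
    }

    // The existing endColumn() path stays unchanged.
    public void endColumn() {
        finishedColumnStatistics = currentStatistics;
        currentStatistics = null;
    }

    public String getFinishedColumnStatistics() {
        return finishedColumnStatistics;
    }
}
```

The caller would then write `writer.invalidateStatistics(chunk.getStatistics()); writer.endColumn();`, keeping a single endColumn() code path.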
> Fixes NPE when rewriting column without column index
> ----------------------------------------------------
>
> Key: PARQUET-2365
> URL: https://issues.apache.org/jira/browse/PARQUET-2365
> Project: Parquet
> Issue Type: Bug
> Reporter: Xianyang Liu
> Priority: Major
>
> The ColumnIndex could be null in some cases, for example when a float/double
> column contains NaN or when its size has exceeded the expected value. In
> addition, page header statistics are no longer written since ColumnIndex
> support was added. So we get an NPE when rewriting a column without a
> ColumnIndex, because the page statistics converted from the ColumnIndex
> (null) or from the page header statistics (null) are themselves null. For
> example:
> ```java
> java.lang.NullPointerException
>     at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:727)
>     at org.apache.parquet.hadoop.ParquetFileWriter.innerWriteDataPage(ParquetFileWriter.java:663)
>     at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:650)
>     at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processChunk(ParquetRewriter.java:453)
>     at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:317)
>     at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
> ```
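The failure mode described above can be reproduced in miniature: merging page statistics without a null guard dereferences a null reference. This is a hypothetical stand-alone illustration; `PageStats` is not the actual Parquet class:

```java
// Minimal illustration of the failure mode: when a page carries no
// statistics (null), an unguarded merge throws NullPointerException.
// PageStats is a hypothetical stand-in for Parquet's Statistics<?>.
public class PageStats {
    private long min, max;

    public PageStats(long min, long max) { this.min = min; this.max = max; }

    public void mergeStatistics(PageStats other) {
        // Dereferencing 'other' without a null check reproduces the NPE.
        this.min = Math.min(this.min, other.min);
        this.max = Math.max(this.max, other.max);
    }

    // Returns true if merging 'page' into 'current' throws NPE.
    public static boolean mergeThrowsNpe(PageStats current, PageStats page) {
        try {
            current.mergeStatistics(page);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```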
--
This message was sent by Atlassian Jira
(v8.20.10#820010)