[jira] [Commented] (PARQUET-2365) Fixes NPE when rewriting column without column index

ASF GitHub Bot (Jira) Wed, 18 Oct 2023 20:33:05 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776990#comment-17776990
 ]


ASF GitHub Bot commented on PARQUET-2365:
-----------------------------------------

wgtmac commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1364833487


##########
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##########
@@ -543,6 +546,11 @@ public static ColumnIndex build(
    *          the statistics to be added
    */
   public void add(Statistics<?> stats) {
+    if (stats.isEmpty()) {

Review Comment:
   Let me try to understand what happens here. `convertStatistics` is used to 
recover page statistics from ColumnIndex or original page header if the 
ColumnIndex is unavailable. The problem emerges when ColumnIndex is 
unavailable. Am I correct? If true, then why do we need those changes in the 
ColumnIndexBuilder?



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -335,10 +340,15 @@ private void processBlocksFromReader() throws IOException 
{
     }
   }
 
-  private void processChunk(ColumnChunkMetaData chunk,
-                            CompressionCodecName newCodecName,
-                            ColumnChunkEncryptorRunTime 
columnChunkEncryptorRunTime,
-                            boolean encryptColumn) throws IOException {
+  /**
+   * Rewrite a single column with the given new compression codec or new 
encryptor

Review Comment:
   ```suggestion
      * Rewrite a single column with the given new compression codec and/or new 
encryptor
   ```



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -612,13 +612,13 @@ public void writeDataPage(
    * @throws IOException if any I/O error occurs during writing the file
    */
   public void writeDataPage(
-    int valueCount, int uncompressedPageSize,
-    BytesInput bytes,
-    Statistics<?> statistics,
-    long rowCount,
-    Encoding rlEncoding,
-    Encoding dlEncoding,
-    Encoding valuesEncoding) throws IOException {
+      int valueCount, int uncompressedPageSize,
+      BytesInput bytes,
+      Statistics<?> statistics,
+      long rowCount,
+      Encoding rlEncoding,
+      Encoding dlEncoding,
+      Encoding valuesEncoding) throws IOException {

Review Comment:
   Could you avoid these style changes? They are unrelated.





> Fixes NPE when rewriting column without column index
> ----------------------------------------------------
>
>                 Key: PARQUET-2365
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2365
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The ColumnIndex could be null in some scenes, for example, the float/double 
> column contains NaN or the size has exceeded the expected value. And the page 
> header statistics are not written anymore after we supported ColumnIndex. So 
> we will get NPE when rewriting the column without ColumnIndex due to we need 
> to get NULL page statistics when converted from the ColumnIndex(NULL) or page 
> header statistics(NULL). Such as the following:
> ```java
> java.lang.NullPointerException
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:727)
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter.innerWriteDataPage(ParquetFileWriter.java:663)
>       at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:650)
>       at 
> org.apache.parquet.hadoop.rewrite.ParquetRewriter.processChunk(ParquetRewriter.java:453)
>       at 
> org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:317)
>       at 
> org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2365) Fixes NPE when rewriting column without column index

Reply via email to