[
https://issues.apache.org/jira/browse/PARQUET-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782499#comment-17782499
]
ASF GitHub Bot commented on PARQUET-2365:
-----------------------------------------
ConeyLiu commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1381424815
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -324,15 +324,20 @@ private void processBlocksFromReader(IndexCache indexCache) throws IOException {
// Translate compression and/or encryption
writer.startColumn(descriptor, crStore.getColumnReader(descriptor).getTotalValueCount(), newCodecName);
- processChunk(
+ boolean needOverwriteStatistics = processChunk(
chunk,
newCodecName,
columnChunkEncryptorRunTime,
encryptColumn,
indexCache.getBloomFilter(chunk),
indexCache.getColumnIndex(chunk),
indexCache.getOffsetIndex(chunk));
- writer.endColumn();
+ if (needOverwriteStatistics) {
+ // All the column statistics are invalid, so we need to overwrite the column statistics
+ writer.endColumn(chunk.getStatistics());
Review Comment:
I think that would be better. Then we just need to call the API
`public void invalidateStatistics(Statistics<?> totalStatistics)` in `processChunk`.
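
For illustration only, a minimal self-contained sketch of that idea. `Stats`, `Writer`, and this `processChunk` are hypothetical stand-ins, not the real parquet-mr classes. Only the method name `invalidateStatistics` comes from the API mentioned above, and the behavior shown here is an assumption about how such a call could be used:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-ins only; none of these are the real parquet-mr classes.
class InvalidateStatisticsSketch {

    /** Stand-in for column statistics (min/max of a long column). */
    static final class Stats {
        final long min;
        final long max;

        Stats(long min, long max) {
            this.min = min;
            this.max = max;
        }

        Stats merge(Stats other) {
            return new Stats(Math.min(min, other.min), Math.max(max, other.max));
        }
    }

    /** Stand-in for the per-column state kept by a file writer. */
    static final class Writer {
        private Stats currentStatistics;
        private boolean statisticsInvalidated;

        void startColumn() {
            currentStatistics = null;
            statisticsInvalidated = false;
        }

        // Assumed behavior: install the known chunk-level statistics and
        // stop merging per-page statistics for the rest of the column.
        void invalidateStatistics(Stats totalStatistics) {
            currentStatistics = Objects.requireNonNull(totalStatistics, "totalStatistics");
            statisticsInvalidated = true;
        }

        void writeDataPage(Stats pageStatistics) {
            if (statisticsInvalidated) {
                return; // page statistics are ignored, so null is harmless here
            }
            // Without invalidation, null page statistics fail here,
            // which is analogous to the NPE reported in the issue.
            Objects.requireNonNull(pageStatistics, "pageStatistics");
            currentStatistics = currentStatistics == null
                    ? pageStatistics
                    : currentStatistics.merge(pageStatistics);
        }

        Stats endColumn() {
            return currentStatistics;
        }
    }

    /** Rewrites one chunk; invalidates statistics when there is no column index. */
    static Stats processChunk(Writer writer, Stats chunkStatistics,
                              List<Stats> pageStatistics, boolean hasColumnIndex) {
        writer.startColumn();
        if (!hasColumnIndex) {
            writer.invalidateStatistics(chunkStatistics);
        }
        for (Stats page : pageStatistics) {
            writer.writeDataPage(page);
        }
        return writer.endColumn();
    }

    public static void main(String[] args) {
        Stats total = new Stats(1L, 9L);
        // No column index: page statistics are null, chunk statistics are reused.
        Stats result = processChunk(new Writer(), total,
                Arrays.<Stats>asList(null, null), false);
        System.out.println(result.min + ".." + result.max);
    }
}
```

With this shape, the caller keeps using a plain `endColumn()` and the decision to overwrite the statistics stays inside `processChunk`.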
> Fixes NPE when rewriting column without column index
> ----------------------------------------------------
>
> Key: PARQUET-2365
> URL: https://issues.apache.org/jira/browse/PARQUET-2365
> Project: Parquet
> Issue Type: Bug
> Reporter: Xianyang Liu
> Priority: Major
>
> The ColumnIndex can be null in some cases, for example when a float/double
> column contains NaN or when its size exceeds the expected value. In addition,
> page header statistics are no longer written now that ColumnIndex is supported.
> As a result, rewriting a column without a ColumnIndex throws an NPE, because
> the page statistics are null whether they are converted from the ColumnIndex
> (null) or taken from the page header statistics (null). For example:
> ```java
> java.lang.NullPointerException
> at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:727)
> at org.apache.parquet.hadoop.ParquetFileWriter.innerWriteDataPage(ParquetFileWriter.java:663)
> at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPage(ParquetFileWriter.java:650)
> at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processChunk(ParquetRewriter.java:453)
> at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocksFromReader(ParquetRewriter.java:317)
> at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:250)
> ```