wgtmac commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1381380912
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -722,14 +722,7 @@ public void writeDataPage(
LOG.debug("{}: write data page content {}", out.getPos(),
compressedPageSize);
bytes.writeAllTo(out);
- // Copying the statistics if it is not initialized yet so we have the correct typed one
- if (currentStatistics == null) {
- currentStatistics = statistics.copy();
- } else {
- currentStatistics.mergeStatistics(statistics);
- }
-
- columnIndexBuilder.add(statistics);
+ mergeColumnStatistics(statistics);
Review Comment:
Do we need to apply the same fix here and below:
https://github.com/apache/parquet-mr/blob/514cc6c257fe8e618b100a19d86d304f6442cb94/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L215
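
For reference, a minimal sketch of what the extracted `mergeColumnStatistics` helper presumably contains, reconstructed from the removed lines above (the exact body in the PR may differ):

```java
// Reconstructed from the removed lines: merge the page statistics into the
// running column statistics and feed the column index builder.
private void mergeColumnStatistics(Statistics<?> statistics) {
  // Copy the statistics if not initialized yet so we have the correctly typed one
  if (currentStatistics == null) {
    currentStatistics = statistics.copy();
  } else {
    currentStatistics.mergeStatistics(statistics);
  }
  columnIndexBuilder.add(statistics);
}
```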
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -392,6 +402,7 @@ private void processChunk(ColumnChunkMetaData chunk,
DictionaryPage dictionaryPage = null;
long readValues = 0;
Statistics<?> statistics = null;
+ boolean needOverwriteColumnStatistics = false;
Review Comment:
```suggestion
boolean isColumnStatisticsMalformed = false;
```
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -324,15 +324,20 @@ private void processBlocksFromReader(IndexCache indexCache) throws IOException {
// Translate compression and/or encryption
writer.startColumn(descriptor, crStore.getColumnReader(descriptor).getTotalValueCount(), newCodecName);
- processChunk(
+ boolean needOverwriteStatistics = processChunk(
chunk,
newCodecName,
columnChunkEncryptorRunTime,
encryptColumn,
indexCache.getBloomFilter(chunk),
indexCache.getColumnIndex(chunk),
indexCache.getOffsetIndex(chunk));
- writer.endColumn();
+ if (needOverwriteStatistics) {
+ // All the column statistics are invalid, so we need to overwrite the column statistics
+ writer.endColumn(chunk.getStatistics());
Review Comment:
Is it cleaner to handle this inside processChunk()? Then we wouldn't have to
bother with endColumn() at the call site.
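
For illustration, a self-contained toy sketch of the two call shapes being weighed here. The stub `Writer` interface and method names are hypothetical stand-ins, not the real parquet-mr API:

```java
// Toy comparison of who picks the endColumn() overload.
class EndColumnShapes {

  interface Writer {
    void endColumn();                        // trust the accumulated page statistics
    void endColumn(Object totalStatistics);  // overwrite with known column totals
  }

  // Shape in the PR as written: processChunk() reports a flag and the
  // caller branches on it.
  static void callerDecides(Writer writer, Object chunkStats) {
    boolean statsMalformed = translatePages();
    if (statsMalformed) {
      writer.endColumn(chunkStats); // page stats unusable: overwrite totals
    } else {
      writer.endColumn();
    }
  }

  // Reviewer's suggestion: processChunk() already knows whether the page
  // statistics were usable, so it could close the column itself and the
  // caller never touches endColumn().
  static void processChunkAndEndColumn(Writer writer, Object chunkStats) {
    boolean statsMalformed = translatePages();
    if (statsMalformed) {
      writer.endColumn(chunkStats);
    } else {
      writer.endColumn();
    }
  }

  static boolean translatePages() {
    return false; // placeholder for the real page-translation loop
  }
}
```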
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -988,6 +988,21 @@ void writeColumnChunk(ColumnDescriptor descriptor,
endColumn();
}
+ /**
+ * Overwrite the column total statistics. This is used when the column total statistics
+ * are known while all the page statistics are invalid, for example when rewriting the column.
+ *
+ * @param totalStatistics the column total statistics
+ * @throws IOException if there is an error while writing
+ */
+ public void endColumn(Statistics<?> totalStatistics) throws IOException {
Review Comment:
Since this is used only for invalid stats, is it better to add a `void
invalidateStatistics()` which simply includes lines 988 and 990? Then you would
just need to call invalidateStatistics() and endColumn() in this case.
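
A minimal sketch of that suggestion, assuming lines 988 and 990 are the statements that set `currentStatistics` and reset the column index builder. The parameter is an assumption (the reviewer's signature omitted one, but the totals have to come from somewhere), and the `ColumnIndexBuilder.getNoOpBuilder()` call is a guess at what invalidating the column index means here:

```java
// Hypothetical helper per the suggestion above; the body is assumed to
// mirror lines 988 and 990 of the new endColumn(Statistics<?>) overload.
public void invalidateStatistics(Statistics<?> totalStatistics) {
  // Overwrite the accumulated (invalid) page statistics with the known totals.
  currentStatistics = totalStatistics;
  // Drop the per-page column index, since it was built from invalid statistics.
  columnIndexBuilder = ColumnIndexBuilder.getNoOpBuilder();
}
```

With this in place, the rewriter would call invalidateStatistics() followed by the existing endColumn(), instead of a new endColumn(Statistics<?>) overload.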