wgtmac commented on code in PR #1173:
URL: https://github.com/apache/parquet-mr/pull/1173#discussion_r1371930026
##########
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##########
@@ -543,6 +546,11 @@ public static ColumnIndex build(
* the statistics to be added
*/
public void add(Statistics<?> stats) {
+ if (stats.isEmpty()) {
Review Comment:
This is used by the rewriter to rebuild the ColumnIndex. The change enables
it to invalidate the ColumnIndex if any statistics is invalid, right? This is
also used by the general file writer. Does it affect the behavior? If yes,
probably we need add a test case.
##########
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java:
##########
@@ -543,6 +546,11 @@ public static ColumnIndex build(
* the statistics to be added
*/
public void add(Statistics<?> stats) {
+ if (stats.isEmpty()) {
Review Comment:
From what you have said, it seems that the problem comes from the input file
which has valid aggregate statistics for the column chunk but does not write
page statistics in the page header. Should we just fix the NPE in the page
header and leave other parts as is?
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -988,6 +988,21 @@ void writeColumnChunk(ColumnDescriptor descriptor,
endColumn();
}
+ /**
+ * Overwrite the column total statistics. This special used when the column
total statistics
+ * is known while all the page statistics are invalid, for example when
rewriting the column.
+ *
+ * @param totalStatistics the column total statistics
+ * @throws IOException if there is an error while writing
+ */
+ public void endColumn(Statistics<?> totalStatistics) throws IOException {
+ Preconditions.checkArgument(totalStatistics != null, "Column total
statistics can not be null");
+ currentStatistics = totalStatistics;
+ // Invalid the ColumnIndex
+ columnIndexBuilder = ColumnIndexBuilder.getNoOpBuilder();
Review Comment:
This looks hacky, TBH.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]