zhongyujiang commented on code in PR #6517:
URL: https://github.com/apache/iceberg/pull/6517#discussion_r1061474054


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java:
##########
@@ -580,6 +608,10 @@ static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) {
         && (statistics.getMaxBytes() == null || statistics.getMinBytes() == null);
   }
 
+  static boolean minMaxUndefined(Statistics statistics) {
+    return !statistics.isEmpty() && !statistics.hasNonNullValue();

Review Comment:
   I am thinking we can simplify the handling of unreliable statistics by moving the all-null check forward, as I mentioned [here](https://github.com/apache/iceberg/issues/6516#issuecomment-1370862183). We can use `Statistics#getNumNulls() == ColumnChunkMetaData#getValueCount()` to determine whether all values are null (a check that is not affected by null or undefined min/max statistics). Then we handle the case where `statistics.hasNonNullValue()` is false (it returns false when the statistics have null or unreliable min/max values); in that case we should always return `ROWS_MIGHT_MATCH`, because it means the statistics are not reliable. Finally, we can safely use min/max for comparative evaluation.
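   A minimal sketch of the suggested ordering, shown for an equality predicate. The `Stats` type, `Result` enum, and `evalEq` helper below are simplified stand-ins for illustration only, not the real parquet-mr `Statistics` / `ColumnChunkMetaData` API or the actual `ParquetMetricsRowGroupFilter` code:

   ```java
   public class StatsOrderSketch {
     // simplified stand-in for org.apache.parquet.column.statistics.Statistics
     static class Stats {
       final long numNulls;
       final boolean empty;            // Statistics#isEmpty()
       final boolean hasNonNullValue;  // Statistics#hasNonNullValue()
       final long min, max;            // only meaningful when hasNonNullValue is true

       Stats(long numNulls, boolean empty, boolean hasNonNullValue, long min, long max) {
         this.numNulls = numNulls;
         this.empty = empty;
         this.hasNonNullValue = hasNonNullValue;
         this.min = min;
         this.max = max;
       }
     }

     enum Result { ROWS_MIGHT_MATCH, ROWS_CANNOT_MATCH }

     // evaluate "col == value" against one row group's column statistics;
     // valueCount plays the role of ColumnChunkMetaData#getValueCount()
     static Result evalEq(Stats stats, long valueCount, long value) {
       if (stats.empty) {
         return Result.ROWS_MIGHT_MATCH; // no stats at all: cannot prune
       }
       // 1. all-null check first: independent of null/undefined min, max
       if (stats.numNulls == valueCount) {
         return Result.ROWS_CANNOT_MATCH; // every value is null, col == value cannot hold
       }
       // 2. no usable min/max (null or unreliable): must keep the row group
       if (!stats.hasNonNullValue) {
         return Result.ROWS_MIGHT_MATCH;
       }
       // 3. min/max are now safe to use for comparative evaluation
       if (value < stats.min || value > stats.max) {
         return Result.ROWS_CANNOT_MATCH;
       }
       return Result.ROWS_MIGHT_MATCH;
     }

     public static void main(String[] args) {
       // all-null row group: pruned even though min/max are undefined
       System.out.println(evalEq(new Stats(10, false, false, 0, 0), 10, 5));
     }
   }
   ```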
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
