Re: [PR] API: Include size unit in avg/max value size fields [iceberg]

via GitHub Mon, 13 Apr 2026 04:25:33 -0700


nastra commented on code in PR #15939:
URL: https://github.com/apache/iceberg/pull/15939#discussion_r3072653544



##########
api/src/main/java/org/apache/iceberg/stats/FieldStatistic.java:
##########
@@ -125,13 +125,13 @@ public static Types.StructType fieldStatsFor(Types.NestedField field, int baseFi
               baseFieldId + AVG_VALUE_SIZE.offset(),
               AVG_VALUE_SIZE.fieldName(),
               Types.IntegerType.get(),
-              "Avg value size of variable-length types (String, Binary)"));
+              "Avg value size in bytes of variable-length types (String, Binary)"));
       fields.add(
           optional(
               baseFieldId + MAX_VALUE_SIZE.offset(),
               MAX_VALUE_SIZE.fieldName(),
               Types.IntegerType.get(),
-              "Max value size of variable-length types (String, Binary)"));
+              "Max value size in bytes of variable-length types (String, Binary)"));

Review Comment:
   > I believe they are the compressed and encoded size for variable-length types. Do we want to add that too? There has been past confusion about this.
   
   I've looked into this further, and I think the answer depends on how we end
up calculating these values: from the compressed or from the uncompressed sizes
that Parquet reports. I think avg / max value sizes should represent the
uncompressed / unencoded size, so that CBO in Spark/Trino gets better estimates.
   
   In v1-v3 we have the `columnSizes` metric, which does use the compressed
column size. v4 no longer has that metric, so we should most likely use
`getTotalUncompressedSize()` from `ColumnChunkMetaData` to calculate avg / max
value sizes.
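
   To make the proposal concrete, here is a rough sketch of how an average value size could be derived from uncompressed chunk sizes. It is only an illustration under assumptions, not the planned implementation: `ValueSizeSketch`, `ChunkSizes`, and `avgValueSizeInBytes` are hypothetical names, and `ChunkSizes` stands in for the numbers one would read from Parquet's `ColumnChunkMetaData` (`getTotalUncompressedSize()` and `getValueCount()`) plus a null count from the column statistics. The chunk total also covers page headers and level data, so the result is an estimate rather than an exact per-value size, and a max value size cannot be derived from chunk totals at all.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not Iceberg code: average value size from uncompressed chunk sizes. */
final class ValueSizeSketch {

  /** Stand-in for the per-row-group numbers that a Parquet column chunk reports. */
  static final class ChunkSizes {
    final long totalUncompressedSize; // bytes, assumed to come from getTotalUncompressedSize()
    final long valueCount;            // includes nulls, assumed to come from getValueCount()
    final long nullCount;             // assumed to come from the column statistics

    ChunkSizes(long totalUncompressedSize, long valueCount, long nullCount) {
      this.totalUncompressedSize = totalUncompressedSize;
      this.valueCount = valueCount;
      this.nullCount = nullCount;
    }
  }

  /** Returns the average bytes per non-null value, or -1 if there are no non-null values. */
  static int avgValueSizeInBytes(List<ChunkSizes> chunks) {
    long totalBytes = 0L;
    long nonNullValues = 0L;
    for (ChunkSizes chunk : chunks) {
      totalBytes += chunk.totalUncompressedSize;
      nonNullValues += chunk.valueCount - chunk.nullCount;
    }
    if (nonNullValues <= 0) {
      return -1; // nothing to average over
    }
    // The stats field is an int, so fail loudly instead of silently truncating.
    return Math.toIntExact(totalBytes / nonNullValues);
  }

  public static void main(String[] args) {
    List<ChunkSizes> chunks =
        Arrays.asList(new ChunkSizes(1000L, 100L, 0L), new ChunkSizes(500L, 60L, 10L));
    System.out.println(avgValueSizeInBytes(chunks));
  }
}
```

   Dividing by the non-null count rather than the raw value count is a choice made for this sketch; whether nulls should be included in the average is still an open question.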
   
   @singhpk234 does that make sense to you?
   Also cc @anoopj 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


