nastra commented on code in PR #15939:
URL: https://github.com/apache/iceberg/pull/15939#discussion_r3072653544
##########
api/src/main/java/org/apache/iceberg/stats/FieldStatistic.java:
##########
@@ -125,13 +125,13 @@ public static Types.StructType
fieldStatsFor(Types.NestedField field, int baseFi
baseFieldId + AVG_VALUE_SIZE.offset(),
AVG_VALUE_SIZE.fieldName(),
Types.IntegerType.get(),
- "Avg value size of variable-length types (String, Binary)"));
+ "Avg value size in bytes of variable-length types (String, Binary)"));
fields.add(
optional(
baseFieldId + MAX_VALUE_SIZE.offset(),
MAX_VALUE_SIZE.fieldName(),
Types.IntegerType.get(),
- "Max value size of variable-length types (String, Binary)"));
+ "Max value size in bytes of variable-length types (String, Binary)"));
Review Comment:
> I believe they are compressed and encoded size for variable length types,
> do we wanna add that too? There had been past confusions about this.
I've been looking into this more, and I think the answer depends on whether our
future calculation uses the compressed or the uncompressed sizes from Parquet. I
think avg / max value sizes should represent the uncompressed / unencoded size
so that CBO in Spark/Trino gets better estimates.
In v1-v3 we have the columnSizes metric, which indeed uses the compressed
column size, but for v4 we don't have that metric anymore and we should most
likely use `getTotalUncompressedSize()` from `ColumnChunkMetaData` for the
calculations of avg / max value sizes.
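As a rough sketch of what the avg calculation could look like (not what this PR implements): the helper below takes plain `long` inputs that a caller would presumably read from the Parquet column chunk metadata. The class name, method name, the `-1` sentinel, and the floor rounding are all assumptions made for illustration. Note also that the uncompressed total from Parquet still reflects page encoding and headers, so it only approximates the unencoded size.

```java
// Hypothetical helper, for illustration only; not part of the Iceberg API.
final class ValueSizeSketch {
  private ValueSizeSketch() {}

  /**
   * Returns the average size in bytes per non-null value, rounded down,
   * or -1 when the average is undefined (no non-null values or bad input).
   *
   * @param totalUncompressedSize assumed to come from the column chunk's
   *     uncompressed byte total
   * @param valueCount total number of values in the chunk, including nulls
   * @param nullCount number of null values in the chunk
   */
  static int avgValueSize(long totalUncompressedSize, long valueCount, long nullCount) {
    long nonNullCount = valueCount - nullCount;
    if (totalUncompressedSize < 0 || nullCount < 0 || nonNullCount <= 0) {
      return -1; // assumption: -1 signals "unknown"
    }
    long avg = totalUncompressedSize / nonNullCount; // floor division
    // The stats field is an IntegerType, so clamp instead of overflowing.
    return (int) Math.min(avg, Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    // 4096 uncompressed bytes spread over 128 non-null values.
    System.out.println(avgValueSize(4096L, 128L, 0L));
  }
}
```

Max value size cannot be derived from chunk-level totals in the same way, so this sketch covers only the avg case.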
@singhpk234 does that make sense to you?
Also cc @anoopj
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]