singhpk234 commented on code in PR #15939:
URL: https://github.com/apache/iceberg/pull/15939#discussion_r3074141487
##########
api/src/main/java/org/apache/iceberg/stats/FieldStatistic.java:
##########
@@ -125,13 +125,13 @@ public static Types.StructType
fieldStatsFor(Types.NestedField field, int baseFi
baseFieldId + AVG_VALUE_SIZE.offset(),
AVG_VALUE_SIZE.fieldName(),
Types.IntegerType.get(),
- "Avg value size of variable-length types (String, Binary)"));
+ "Avg value size in bytes of variable-length types (String,
Binary)"));
fields.add(
optional(
baseFieldId + MAX_VALUE_SIZE.offset(),
MAX_VALUE_SIZE.fieldName(),
Types.IntegerType.get(),
- "Max value size of variable-length types (String, Binary)"));
+ "Max value size in bytes of variable-length types (String,
Binary)"));
Review Comment:
Thanks for taking a look into @nastra, I believe it will be a bit tricky if
based on table versions we store different sizes (compressed encoded for v1 -
v3 and then v4 uncompressed and un encoded), think of tables migrated from v3
to v4, i think there have been proposals to just have a dedicated metric for
this as well like in the doc https://github.com/apache/iceberg/issues/10703,
which would be a bit easier to reason about specially for migrated table
For CBO integration if you see engines such as Trino are already having a
constant multiplier added to compensate for this (based on TPC-DS) which is
just a type check on the column Handle
https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L237-L250
Spark CBO by default is off, but i think where injection point is considered
we have estimateStats in the connnector which would also be suff easy to
reason.
I believe from integ pov to have a dedicated metric would be very nice,
specially for migrated tables v3 -> v4
Please let me know wdyt ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]