emkornfield commented on code in PR #10955: URL: https://github.com/apache/iceberg/pull/10955#discussion_r1779248853
##########
format/spec.md:
##########
@@ -230,11 +233,31 @@ Schemas may be evolved by type promotion or adding, deleting, renaming, or reord
 Evolution applies changes to the table's current schema to produce a new schema that is identified by a unique schema ID, is added to the table's list of schemas, and is set as the table's current schema.
 
-Valid type promotions are:
+Valid primitive type promotions are:
+
+| Primitive type   | v1, v2 valid type promotions | v3+ valid type promotions    | Requirements |
+|------------------|------------------------------|------------------------------|--------------|
+| `unknown`        |                              | _any type_                   | |
+| `int`            | `long`                       | `long`                       | |
+| `date`           |                              | `timestamp`, `timestamp_ns`  | Promotion to `timestamptz` or `timestamptz_ns` is **not** allowed |
+| `float`          | `double`                     | `double`                     | |
+| `decimal(P, S)`  | `decimal(P', S)` if `P' > P` | `decimal(P', S)` if `P' > P` | Widen precision only |
+
+Iceberg's Avro manifest format does not store the type of lower and upper bounds, and type promotion does not rewrite existing bounds. For example, when a `float` is promoted to `double`, existing data file bounds are encoded as 4 little-endian bytes rather than 8 little-endian bytes for `double`. To correctly decode the value, the original type at the time the file was written must be inferred according to the following table:
 
-* `int` to `long`
-* `float` to `double`
-* `decimal(P, S)` to `decimal(P', S)` if `P' > P` -- widen the precision of decimal types.
+| Current type     | Length of bounds | Inferred type at write time |
+|------------------|------------------|-----------------------------|
+| `long`           | 4 bytes          | `int`                       |
+| `long`           | 8 bytes          | `long`                      |
+| `double`         | 4 bytes          | `float`                     |
+| `double`         | 8 bytes          | `double`                    |
+| `timestamp`      | 4 bytes          | `date`                      |
+| `timestamp`      | 8 bytes          | `timestamp`                 |
+| `timestamp_ns`   | 4 bytes          | `date`                      |
+| `timestamp_ns`   | 8 bytes          | `timestamp_ns`              |
+| `decimal(P, S)`  | _any_            | `decimal(P', S)`; `P' <= P` |
+
+Type promotion is not allowed for a field that is referenced by `source-id` or `source-ids` of a partition field if the partition transform would produce a different value after promoting the type. For example, `bucket[N]` produces different hash values for `34` and `"34"` (2017239379 != -427558391) but the same value for `34` and `34L`; when an `int` field is the source for a bucket partition field, it may be promoted to `long` but not to `string`.

Review Comment:
   Have we analyzed the impact on other places that use type IDs (i.e., should this be expanded)? For example: equality deletes (I think this is OK, but there might be edge cases for dropped columns) and table-level statistics, since they use theta sketches.

   Also, it might pay to spell out the list of type promotion + transform pairings that are explicitly disallowed.
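
   To make the bounds inference table in the diff concrete, here is a minimal Python sketch of how a reader might pick the write-time type from the length of a stored bound. It is an illustration only, not Iceberg's implementation: `decode_bound` and `_INFERENCE` are hypothetical names, and the sketch assumes fixed-width values are little-endian and decimals are stored as big-endian two's-complement unscaled integers.

```python
import struct

# (current type, bound length in bytes) -> (inferred write-time type, struct format)
# Mirrors the inference table in the diff; the struct formats are an assumption
# about the binary encoding (little-endian fixed-width values).
_INFERENCE = {
    ("long", 4): ("int", "<i"),
    ("long", 8): ("long", "<q"),
    ("double", 4): ("float", "<f"),
    ("double", 8): ("double", "<d"),
    ("timestamp", 4): ("date", "<i"),
    ("timestamp", 8): ("timestamp", "<q"),
    ("timestamp_ns", 4): ("date", "<i"),
    ("timestamp_ns", 8): ("timestamp_ns", "<q"),
}


def decode_bound(current_type: str, raw: bytes):
    """Return (inferred write-time type, decoded value) for one stored bound.

    Hypothetical helper. The value is returned in the write-time
    representation (e.g. days since epoch for `date`); converting it to the
    current type is left to the caller.
    """
    if current_type.startswith("decimal"):
        # Any length is valid for decimals. The original precision P' cannot be
        # recovered from the bytes, so only the unscaled value is returned.
        return "decimal", int.from_bytes(raw, "big", signed=True)
    try:
        inferred, fmt = _INFERENCE[(current_type, len(raw))]
    except KeyError:
        raise ValueError(
            f"no inference rule for {current_type!r} with {len(raw)}-byte bound"
        ) from None
    return inferred, struct.unpack(fmt, raw)[0]
```

   For example, `decode_bound("double", struct.pack("<f", 1.5))` yields `("float", 1.5)`, while an 8-byte bound on the same column is read as a `double`.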
