sfc-gh-bhannel commented on code in PR #12644:
URL: https://github.com/apache/iceberg/pull/12644#discussion_r2014937438
##########
format/spec.md:
##########
@@ -1414,12 +1414,16 @@ Each partition field in `fields` is stored as a JSON
object with the following p
| V1 | V2 | V3 | Field | JSON representation |
Example |
|----------|----------|----------|------------------|---------------------|--------------|
-| required | required | omitted | **`source-id`** | `JSON int` | 1
|
-| | | required | **`source-ids`** | `JSON list of ints` |
`[1,2]` |
+| required | required | required¹ | **`source-id`** | `JSON int` | 1
|
+| | | required¹ | **`source-ids`** | `JSON list of ints` |
`[1,2]` |
| | required | required | **`field-id`** | `JSON int` |
1000 |
| required | required | required | **`name`** | `JSON string` |
`id_bucket` |
| required | required | required | **`transform`** | `JSON string` |
`bucket[16]` |
+Notes:
+
+1. For partition fields with a transform with a single argument, the ID of the
source field is set on `source-id`, and `source-ids` is omitted.
Review Comment:
Would it be more consistent to always use source-ids for v3 writers? Or do
you want to use source-id to allow v3 writers to be compatible with v2 readers
if you don't use v3 features?
##########
format/spec.md:
##########
@@ -1453,13 +1457,15 @@ Each sort field in the fields list is stored as an
object with the following pro
| V1 | V2 | V3 | Field | JSON representation |
Example |
|----------|----------|----------|------------------|---------------------|-------------|
-| required | required | required | **`transform`** | `JSON string` |
`bucket[4]` |
-| required | required | omitted | **`source-id`** | `JSON int` | 1
|
+| required | required | required¹ | **`transform`** | `JSON string` |
`bucket[4]` |
+| required | required | required¹ | **`source-id`** | `JSON int` | 1
|
| | | required | **`source-ids`** | `JSON list of ints` |
`[1,2]` |
| required | required | required | **`direction`** | `JSON string` |
`asc` |
| required | required | required | **`null-order`** | `JSON string` |
`nulls-last`|
-In v3 metadata, writers must use only `source-ids` because v3 requires reader
support for multi-arg transforms.
+Notes:
+
+1. For sort fields with a transform with a single argument, the ID of the
source field is set on `source-id`, and `source-ids` is omitted.
Review Comment:
Did you mean to make the `transform` row refer to this footnote? I would
expect it to be the `source-id` and `source-ids` rows.
##########
format/spec.md:
##########
@@ -540,7 +540,7 @@ Notes:
2. The width, `W`, used to truncate decimal values is applied using the scale
of the decimal column to avoid additional (and potentially conflicting)
parameters.
3. Strings are truncated to a valid UTF-8 string with no more than `L` code
points.
4. In contrast to strings, binary values do not have an assumed encoding and
are truncated to `L` bytes.
-
+5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕
hash(col2) ⊕ ... ⊕ hash(colN)) % W`.
Review Comment:
I think it would be good to be explicit about how to handle nulls and
dropped columns. Should it be calculated as if `hash(null) == 0`?
It also might be worth adding parentheses for those who don't recall the
precedence order between XOR and modulo.
Also - good to meet you, and thanks for working on this! I'm a developer at
Snowflake working on Iceberg functionality, and we really appreciate how
detailed the spec is!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]