Re: [PR] Spec: Clarify multi-arg transform behavior for different versions [iceberg]

via GitHub Fri, 09 Feb 2024 13:49:01 -0800


aokolnychyi commented on code in PR #9661:
URL: https://github.com/apache/iceberg/pull/9661#discussion_r1484814023



##########
format/spec.md:
##########
@@ -301,12 +301,14 @@ Tables are configured with a **partition spec** that 
defines how to produce a tu
 *   A **transform** that is applied to the source column(s) to produce a 
partition value
 *   A **partition name**
 
-The source column, selected by id, must be a primitive type and cannot be 
contained in a map or list, but may be nested in a struct. For details on how 
to serialize a partition spec to JSON, see Appendix C.
+The source column(s), selected by id(s), must be a primitive type and cannot 
be contained in a map or list, but may be nested in a struct. The ability to 
have multiple source columns is added in V3, with a flag to allow tables in V2 
to use this feature.  For serialization and backward compatibility details, see 
Appendix C.

Review Comment:
   Extra space before `For serialization`?



##########
format/spec.md:
##########
@@ -1130,14 +1144,11 @@ Each partition field in the fields list is stored as an 
object. See the table fo
 |**`hour`**|`JSON string: "hour"`|`"hour"`|
 |**`Partition Field`** [1,2]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": 
<id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br 
/>&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": 
<transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br 
/>&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br 
/>&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
 
-In some cases partition specs are stored using only the field list instead of 
the object format that includes the spec ID, like the deprecated 
`partition-spec` field in table metadata. The object format should be used 
unless otherwise noted in this spec.
-
-The `field-id` property was added for each partition field in v2. In v1, the 
reference implementation assigned field ids sequentially in each spec starting 
at 1,000. See Partition Evolution for more details.
-
 Notes:
-
-1. For partition fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.
-2. For partition fields with a transform of multiple arguments, the IDs of the 
source fields are set on `source-ids`. To preserve backward compatibility, 
`source-id` is set to -1.
+1. In some cases partition specs are stored using only the field list instead 
of the object format that includes the spec ID, like the deprecated 
`partition-spec` field in table metadata. The object format should be used 
unless otherwise noted in this spec.
+2. The `field-id` property was added for each partition field in v2. In v1, 
the reference implementation assigned field ids sequentially in each spec 
starting at 1,000. See Partition Evolution for more details.
+3. For partition fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.
+2. For partition fields with a transform with multiple arguments, the IDs of 
the source fields are set on `source-ids`, and `source-id` is set to -1.  This 
is only allowed in tables of version >= V3, or in tables of version >= V2 where 
compatibility.multi-arg-transform.enabled is true.  In the latter case, no 
guarantees are made that all implementations will successfully read/write this 
table metadata.

Review Comment:
   Should this be `4` instead of `2`?



##########
format/spec.md:
##########
@@ -388,6 +390,8 @@ A sort order is defined by a sort order id and a list of 
sort fields. The order
 *   A **sort direction**, that can only be either `asc` or `desc`
 *   A **null order** that describes the order of null values when sorted. Can 
only be either `nulls-first` or `nulls-last`
 
+The ability to have multiple source columns is added in V3, with a flag to 
allow tables in V2 to use this feature.  For serialization and backward 
compatibility details, see Appendix C.

Review Comment:
   Extra space before `For serialization`?



##########
format/spec.md:
##########
@@ -301,12 +301,14 @@ Tables are configured with a **partition spec** that 
defines how to produce a tu
 *   A **transform** that is applied to the source column(s) to produce a 
partition value
 *   A **partition name**
 
-The source column, selected by id, must be a primitive type and cannot be 
contained in a map or list, but may be nested in a struct. For details on how 
to serialize a partition spec to JSON, see Appendix C.
+The source column(s), selected by id(s), must be a primitive type and cannot 
be contained in a map or list, but may be nested in a struct. The ability to 
have multiple source columns is added in V3, with a flag to allow tables in V2 
to use this feature.  For serialization and backward compatibility details, see 
Appendix C.
 
 Partition specs capture the transform from table data to partition values. 
This is used to transform predicates to partition predicates, in addition to 
transforming data values. Deriving partition predicates from column predicates 
on the table data is used to separate the logical queries from physical 
storage: the partitioning can change and the correct partition filters are 
always derived from column predicates. This simplifies queries because users 
don’t have to supply both logical predicates and partition predicates. For more 
information, see Scan Planning below.
 
 Two partition specs are considered equivalent with each other if they have the 
same number of fields and for each corresponding field, the fields have the 
same source column ID, transform definition and partition name. Writers must 
not create a new parition spec if there already exists a compatible partition 
spec defined in the table.
 
+

Review Comment:
   Is this intentional?



##########
format/spec.md:
##########
@@ -1150,13 +1161,17 @@ Sort orders are serialized as a list of JSON object, 
each of which contains the
 
 Each sort field in the fields list is stored as an object with the following 
properties:
 
-|Field|JSON representation|Example|
-|--- |--- |--- |
-|**`Sort Field`** [1,2]|`JSON object: {`<br />&nbsp;&nbsp;`"transform": 
<transform JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br 
/>&nbsp;&nbsp;`"direction": <direction string>,`<br 
/>&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}`|`{`<br 
/>&nbsp;&nbsp;`  "transform": "bucket[4]",`<br />&nbsp;&nbsp;`  "source-id": 
3,`<br />&nbsp;&nbsp;`  "direction": "desc",`<br />&nbsp;&nbsp;`  "null-order": 
"nulls-last"`<br />`}`|
+| V1       | V2       | V3       | Field            | JSON representation | 
Example     |
+|----------|----------|----------|------------------|---------------------|-------------|
+| required | required | required | **`transform`**  | `JSON string`       | 
`bucket[4]` |
+| required | required | required | **`source-id`**  | `JSON int`          | 1  
         |
+|          |          | optional | **`source-ids`** | `JSON list`         | 
`[1,2]`     |
+| required | required | required | **`direction`**  | `JSON string`       | 
`asc`       |
+| required | required | required | **`null-order`** | `JSON string`       | 
`nulls-last`|
 
 Notes:
 1. For sort fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.
-2. For sort fields with a transform of multiple arguments, the IDs of the 
source fields are set on `source-ids`. To preserve backward compatibility, 
`source-id` is set to -1.
+2. For sort fields with a transform with multiple arguments, the IDs of the 
source fields are set on `source-ids`, and `source-id` is set to -1.  This is 
only allowed in tables of version >= V3, or in tables of version >= V2 where 
compatibility.multi-arg-transform.enabled is true.  In the latter case, no 
guarantees are made that all implementations will successfully read/write this 
table metadata.

Review Comment:
   Do we separate new sentences with double space intentionally?



##########
format/spec.md:
##########
@@ -1117,7 +1117,17 @@ Partition specs are serialized as a JSON object with the 
following fields:
 |**`spec-id`**|`JSON int`|`0`|
 |**`fields`**|`JSON list: [`<br />&nbsp;&nbsp;`<partition field JSON>,`<br 
/>&nbsp;&nbsp;`...`<br />`]`|`[ {`<br />&nbsp;&nbsp;`"source-id": 4,`<br 
/>&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "ts_day",`<br 
/>&nbsp;&nbsp;`"transform": "day"`<br />`}, {`<br />&nbsp;&nbsp;`"source-id": 
1,`<br />&nbsp;&nbsp;`"field-id": 1001,`<br />&nbsp;&nbsp;`"name": 
"id_bucket",`<br />&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`} ]`|
 
-Each partition field in the fields list is stored as an object. See the table 
for more detail:
+Each partition field in the `fields` is stored as a JSON object with the 
following properties.
+
+| V1       | V2       | V3       | Field            | JSON representation | 
Example      |

Review Comment:
   There is a section about V1 and V2 versions at the beginning. What if we 
extend it and say that the V3 spec hasn't been adopted yet and under active 
development?
   
   ```
   Versions 1 and 2 of the Iceberg spec are complete and adopted by the 
community. Version 3 is under active development and has not been formally 
adopted.
   ```



##########
format/spec.md:
##########
@@ -1117,7 +1121,17 @@ Partition specs are serialized as a JSON object with the 
following fields:
 |**`spec-id`**|`JSON int`|`0`|
 |**`fields`**|`JSON list: [`<br />&nbsp;&nbsp;`<partition field JSON>,`<br 
/>&nbsp;&nbsp;`...`<br />`]`|`[ {`<br />&nbsp;&nbsp;`"source-id": 4,`<br 
/>&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "ts_day",`<br 
/>&nbsp;&nbsp;`"transform": "day"`<br />`}, {`<br />&nbsp;&nbsp;`"source-id": 
1,`<br />&nbsp;&nbsp;`"field-id": 1001,`<br />&nbsp;&nbsp;`"name": 
"id_bucket",`<br />&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`} ]`|
 
-Each partition field in the fields list is stored as an object. See the table 
for more detail:
+Each partition field in the `fields` is stored as a JSON object with the 
following properties.
+
+| V1       | V2       | V3       | Field            | JSON representation | 
Example      |
+|----------|----------|----------|------------------|---------------------|--------------|
+| required | required | required | **`source-id`**  | `JSON int`          | 1  
          |

Review Comment:
   I wonder whether we actually want to make `source-id` required in V3. The 
spec says:
   
   ```
   The format version number is incremented when new features are added that 
will break forward-compatibility---that is, when older readers would not read 
newer table features correctly.
   ```
   
   I understand that if we have add a compatibility flag to V2 to write this, 
we will have to populate `source-id` with -1. Will it be enough to simply not 
write `source-id` in V3 and make `source-ids` required? We can have some 
special instructions for how to read older metadata in V3.



##########
format/spec.md:
##########
@@ -1130,14 +1144,11 @@ Each partition field in the fields list is stored as an 
object. See the table fo
 |**`hour`**|`JSON string: "hour"`|`"hour"`|
 |**`Partition Field`** [1,2]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": 
<id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br 
/>&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": 
<transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br 
/>&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br 
/>&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
 
-In some cases partition specs are stored using only the field list instead of 
the object format that includes the spec ID, like the deprecated 
`partition-spec` field in table metadata. The object format should be used 
unless otherwise noted in this spec.
-
-The `field-id` property was added for each partition field in v2. In v1, the 
reference implementation assigned field ids sequentially in each spec starting 
at 1,000. See Partition Evolution for more details.
-
 Notes:
-
-1. For partition fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.
-2. For partition fields with a transform of multiple arguments, the IDs of the 
source fields are set on `source-ids`. To preserve backward compatibility, 
`source-id` is set to -1.
+1. In some cases partition specs are stored using only the field list instead 
of the object format that includes the spec ID, like the deprecated 
`partition-spec` field in table metadata. The object format should be used 
unless otherwise noted in this spec.
+2. The `field-id` property was added for each partition field in v2. In v1, 
the reference implementation assigned field ids sequentially in each spec 
starting at 1,000. See Partition Evolution for more details.
+3. For partition fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.
+2. For partition fields with a transform with multiple arguments, the IDs of 
the source fields are set on `source-ids`, and `source-id` is set to -1.  This 
is only allowed in tables of version >= V3, or in tables of version >= V2 where 
compatibility.multi-arg-transform.enabled is true.  In the latter case, no 
guarantees are made that all implementations will successfully read/write this 
table metadata.

Review Comment:
   I don't think it is a good idea to mention the property name or even say 
there is a way to enable this in V2 tables. We may decide to do that in the 
Java library, similar to snapshot ID inheritance, but I am not sure the spec 
should enforce this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spec: Clarify multi-arg transform behavior for different versions [iceberg]

Reply via email to