Re: [PR] Spec: Adds Row Lineage [iceberg]

via GitHub Thu, 19 Sep 2024 13:08:09 -0700


RussellSpitzer commented on code in PR #11130:
URL: https://github.com/apache/iceberg/pull/11130#discussion_r1767533154



##########
format/spec.md:
##########
@@ -298,16 +298,137 @@ Iceberg tables must not use field ids greater than 2147483447 (`Integer.MAX_VALU
 
 The set of metadata columns is:
 
-| Field id, name              | Type          | Description |
-|-----------------------------|---------------|-------------|
-| **`2147483646  _file`**     | `string`      | Path of the file in which a row is stored |
-| **`2147483645  _pos`**      | `long`        | Ordinal position of a row in the source data file |
-| **`2147483644  _deleted`**  | `boolean`     | Whether the row has been deleted |
-| **`2147483643  _spec_id`**  | `int`         | Spec ID used to track the file containing a row |
-| **`2147483642  _partition`** | `struct`     | Partition to which a row belongs |
-| **`2147483546  file_path`** | `string`      | Path of a file, used in position-based delete files |
-| **`2147483545  pos`**       | `long`        | Ordinal position of a row, used in position-based delete files |
-| **`2147483544  row`**       | `struct<...>` | Deleted row values, used in position-based delete files |
+| Field id, name                    | Type          | Description                                                                   |
+|-----------------------------------|---------------|-------------------------------------------------------------------------------|
+| **`2147483646  _file`**           | `string`      | Path of the file in which a row is stored                                     |
+| **`2147483645  _pos`**            | `long`        | Ordinal position of a row in the source data file, starting at `0`            |
+| **`2147483644  _deleted`**        | `boolean`     | Whether the row has been deleted                                              |
+| **`2147483643  _spec_id`**        | `int`         | Spec ID used to track the file containing a row                               |
+| **`2147483642  _partition`**      | `struct`      | Partition to which a row belongs                                              |
+| **`2147483546  file_path`**       | `string`      | Path of a file, used in position-based delete files                           |
+| **`2147483545  pos`**             | `long`        | Ordinal position of a row, used in position-based delete files                |
+| **`2147483544  row`**             | `struct<...>` | Deleted row values, used in position-based delete files                       |
+| **`2147483545  _row_identifier`** | `long`        | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) |
+| **`2147483545  _last_update`**    | `long`        | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage)  |
+
+### Row Lineage
+
+In Specification V3, an Iceberg Table can declare that engines must track row-lineage of all newly created rows. This
+requirement is controlled by setting the field `row-lineage` to true in the table's metadata. When true, two additional
+fields in data files will be available for all rows added to the table.
+
+* `_row_identifier` a unique long for every row. Computed via inheritance for rows in their original datafiles

Review Comment:
   The problem with having a UUID alone is that we can't track row origins. We
would also need some bits to identify the row's origin snapshot/sequence id,
which would mean either using two columns or some custom representation. The
current approach is based on sequence numbers and can be coupled with the
"first_***" columns to determine which snapshot the row was added in.
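
To make the trade-off concrete, here is a minimal sketch of the idea, not the scheme in the PR. It assumes each data file carries a starting identifier that its rows inherit by position, plus the sequence number of the commit that added it. `FileRange`, `first_id`, `row_identifier` and `origin_sequence_number` are hypothetical names made up for this illustration and do not come from the spec.

```python
import bisect
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRange:
    """Hypothetical per-file lineage metadata; field names are illustrative only."""
    first_id: int         # identifier inherited by the row at position 0
    row_count: int        # number of rows in the data file
    sequence_number: int  # sequence number of the commit that added the file


def row_identifier(f: FileRange, pos: int) -> int:
    """Derive a row's identifier from its file's starting id and its position."""
    if not 0 <= pos < f.row_count:
        raise IndexError(pos)
    return f.first_id + pos


def origin_sequence_number(files: list[FileRange], row_id: int) -> int:
    """Map an identifier back to the sequence number that introduced the row.

    Assumes `files` is sorted by `first_id` and the id ranges do not overlap.
    """
    starts = [f.first_id for f in files]
    i = bisect.bisect_right(starts, row_id) - 1
    if i < 0 or row_id >= files[i].first_id + files[i].row_count:
        raise KeyError(row_id)
    return files[i].sequence_number


files = [FileRange(first_id=0, row_count=100, sequence_number=1),
         FileRange(first_id=100, row_count=50, sequence_number=2)]

rid = row_identifier(files[1], 7)
origin = origin_sequence_number(files, rid)

# A random UUID is unique, but nothing in its bits points back to a commit.
opaque = uuid.uuid4()
```

With ids assigned this way, a single `long` column is enough to recover where a row came from, whereas the UUID would need a second column or a custom encoding to carry the same information.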



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

