tomtongue opened a new pull request, #12307:
URL: https://github.com/apache/iceberg/pull/12307
## Overview
Fix the `row-lineage` table property so that setting it at table creation
is reflected in the table metadata through `enableRowLineage`.
## Issue
Currently, enabling the Row Lineage feature through Iceberg table
properties requires two operations:
1. Create an Iceberg table
2. Update the table properties
In the first step, even if you set `row-lineage` to `true` in the table
properties, the property isn't reflected in the top-level fields of the
Iceberg table's metadata.json. Therefore, to enable the feature, you need to
run an additional table properties update after creating the table.
### Details
#### Spark case
Create an Iceberg table using Spark with a query like the following:
```
spark.sql("""
CREATE TABLE db.rowlin (id int, name string, year int) USING iceberg
TBLPROPERTIES ('format-version'='3', 'row-lineage'='true')
LOCATION 's3://bucket/iceberg-v3/row-lineage'
""")
```
The resulting metadata.json is stored in the specified bucket and path, as
shown below:
```
aws s3 ls s3://bucket/iceberg-v3/row-lineage/ --recursive
2025-02-18 16:56:28 1194 iceberg-v3/row-lineage/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json
```
At this point, the (partial) metadata content is shown below. It has no
top-level `row-lineage` field, even though the property appears under
`properties`.
```json
{
"format-version" : 3,
"table-uuid" : "eaf5dec9-7866-49a5-81c6-11af8f344e1f",
"location" : "s3://bucket/iceberg-v3/row-lineage",
"last-sequence-number" : 0,
"last-updated-ms" : 1739865386995,
"last-column-id" : 3,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ { ... } ]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ ]
} ],
"last-partition-id" : 999,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"owner" : "hadoop",
"write.update.mode" : "merge-on-read",
"write.parquet.compression-codec" : "zstd",
"row-lineage" : "true"
},
"current-snapshot-id" : null,
...
}
```
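The mismatch above can also be checked programmatically. The following is a minimal sketch, assuming the metadata.json has already been downloaded and parsed; `row_lineage_state` is a hypothetical helper written for illustration and is not part of Iceberg, and the JSON literal is a trimmed copy of the file shown above.

```python
import json


def row_lineage_state(metadata: dict) -> dict:
    """Report where row lineage shows up in a parsed metadata.json (hypothetical helper)."""
    props = metadata.get("properties", {})
    return {
        # The table property is stored as a string under "properties".
        "requested_in_properties": props.get("row-lineage", "false").lower() == "true",
        # The feature flag itself is a top-level boolean field.
        "enabled_in_metadata": metadata.get("row-lineage") is True,
        "next_row_id": metadata.get("next-row-id"),
    }


# Trimmed copy of the metadata written at table creation (only relevant keys kept).
created = json.loads("""
{
  "format-version": 3,
  "properties": {"owner": "hadoop", "row-lineage": "true"},
  "current-snapshot-id": null
}
""")

state = row_lineage_state(created)
print(state)
```

For the file written at creation time, the property is requested under `properties` while the top-level flag and `next-row-id` are absent, which is the behavior this PR addresses.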
Then set the same table property again, for example with `ALTER
TABLE db.rowlin SET TBLPROPERTIES('row-lineage'= 'true')`.
After the query completes, the content of the new metadata.json is as
follows. The top-level `row-lineage` and `next-row-id` fields are added.
```json
{
"format-version" : 3,
"table-uuid" : "eaf5dec9-7866-49a5-81c6-11af8f344e1f",
"location" : "s3://bucket/iceberg-v3/row-lineage",
"last-sequence-number" : 0,
"last-updated-ms" : 1739865514775,
"last-column-id" : 3,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ { ... } ]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ ]
} ],
"last-partition-id" : 999,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"owner" : "hadoop",
"write.update.mode" : "merge-on-read",
"write.parquet.compression-codec" : "zstd",
"row-lineage" : "true"
},
"current-snapshot-id" : null,
"row-lineage" : true, // <= ADDED
"next-row-id" : 0, // <= ADDED
"refs" : { },
"snapshots" : [ ],
"statistics" : [ ],
"partition-statistics" : [ ],
"snapshot-log" : [ ],
"metadata-log" : [ {
"timestamp-ms" : 1739865386995,
"metadata-file" :
"s3://bucket/iceberg-v3/row-lineage/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json"
} ]
}
```
Here's the diff between two metadata files:
```diff
$ diff 00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json 00001-ebf641c8-9603-45d5-92c6-dafac315375e.metadata.json
6c6
< "last-updated-ms" : 1739865386995,
---
> "last-updated-ms" : 1739865514775,
46a47,48
> "row-lineage" : true,
> "next-row-id" : 0,
52c54,57
< "metadata-log" : [ ]
---
> "metadata-log" : [ {
> "timestamp-ms" : 1739865386995,
> "metadata-file" : "s3://gsweep/iceberg-v3/row-lineage-mor13/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json"
> } ]
```
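The same comparison can be made on the parsed JSON instead of the raw text. This is a rough sketch, assuming both metadata files have been loaded into dictionaries; `added_top_level_keys` is a hypothetical helper, and the two dictionaries are trimmed stand-ins for the `00000` and `00001` files rather than their full content.

```python
def added_top_level_keys(before: dict, after: dict) -> dict:
    """Return top-level keys present in `after` but missing from `before`."""
    return {key: after[key] for key in after if key not in before}


# Trimmed stand-ins for the metadata before and after the property update.
before = {
    "format-version": 3,
    "last-updated-ms": 1739865386995,
    "metadata-log": [],
}
after = {
    "format-version": 3,
    "last-updated-ms": 1739865514775,
    "row-lineage": True,
    "next-row-id": 0,
    "metadata-log": [{"timestamp-ms": 1739865386995}],
}

added = added_top_level_keys(before, after)
print(added)
```

Only keys that are new at the top level are reported, so changed values such as `last-updated-ms` and `metadata-log` are intentionally ignored.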