[PR] Core: Fix setting row-lineage from table properties when initially creating an Iceberg table [iceberg]

via GitHub Tue, 18 Feb 2025 02:51:02 -0800


tomtongue opened a new pull request, #12307:
URL: https://github.com/apache/iceberg/pull/12307


   ## Overview
   Fix the `row-lineage` table property reflection on `enableRowLineage`.
   
   ## Issue
   Currently, enabling the Row Lineage feature through Iceberg table 
properties requires two operations:
   1. Create an Iceberg table 
   2. Update table properties
   
   In the first step, "Create an Iceberg table", setting `row-lineage` to 
`true` in the table properties is not reflected in the top-level fields of the 
table's metadata.json. To enable the feature, you therefore have to run an 
additional table properties update after creating the table.
   
   ### Details
   #### Spark case
   When you create an Iceberg table using Spark with a query like the following:
   
   ```
   spark.sql("""
   CREATE TABLE db.rowlin (id int, name string, year int) USING iceberg
   TBLPROPERTIES ('format-version'='3', 'row-lineage'='true')
   LOCATION 's3://bucket/iceberg-v3/row-lineage'
   """)
   ```
   
   The resulting metadata.json is written to the specified bucket and path:
   
   ```
   aws s3 ls s3://bucket/iceberg-v3/row-lineage/ --recursive
   2025-02-18 16:56:28       1194 iceberg-v3/row-lineage/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json
   ```
   
   At this point, the (partial) metadata content is shown below. There is no 
top-level `row-lineage` field, even though the property is present in the 
`properties` section.
   
   ```json
   {
     "format-version" : 3,
     "table-uuid" : "eaf5dec9-7866-49a5-81c6-11af8f344e1f",
     "location" : "s3://bucket/iceberg-v3/row-lineage",
     "last-sequence-number" : 0,
     "last-updated-ms" : 1739865386995,
     "last-column-id" : 3,
     "current-schema-id" : 0,
     "schemas" : [ {
       "type" : "struct",
       "schema-id" : 0,
       "fields" : [ { ... } ]
     } ],
     "default-spec-id" : 0,
     "partition-specs" : [ {
       "spec-id" : 0,
       "fields" : [ ]
     } ],
     "last-partition-id" : 999,
     "default-sort-order-id" : 0,
     "sort-orders" : [ {
       "order-id" : 0,
       "fields" : [ ]
     } ],
     "properties" : {
       "owner" : "hadoop",
       "write.update.mode" : "merge-on-read",
       "write.parquet.compression-codec" : "zstd",
       "row-lineage" : "true"
     },
     "current-snapshot-id" : null,
   ...
   }
   ```
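
The distinction above (property present, top-level field missing) can be checked mechanically. The following is a minimal sketch, not part of this PR: `row_lineage_state` is a hypothetical helper name, the sample document is a trimmed copy of the metadata shown above, and treating an absent top-level `row-lineage` field as "not enabled" is an assumption made for illustration.

```python
import json


def row_lineage_state(metadata: dict) -> dict:
    """Report where row lineage shows up in a parsed metadata.json document.

    Hypothetical helper for illustration; not an Iceberg API.
    """
    props = metadata.get("properties", {})
    return {
        # The table property is stored as a string inside "properties".
        "property_set": props.get("row-lineage") == "true",
        # The top-level field is a JSON boolean; an absent field is
        # treated here as "not enabled" (assumption).
        "enabled_in_metadata": metadata.get("row-lineage") is True,
        "next_row_id": metadata.get("next-row-id"),
    }


# Trimmed-down copy of the metadata written by CREATE TABLE.
created = json.loads("""
{
  "format-version": 3,
  "properties": {"owner": "hadoop", "row-lineage": "true"},
  "current-snapshot-id": null
}
""")

state = row_lineage_state(created)
print(state)
```

For the freshly created table, `property_set` is true while `enabled_in_metadata` is false, which is the mismatch this PR addresses.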
   
   Next, set the same table property again with 
`ALTER TABLE db.rowlin SET TBLPROPERTIES('row-lineage'= 'true')`.
   
   After the query completes, the new metadata.json contains the following. 
`row-lineage` and `next-row-id` have been added.
   
   ```json
   {
     "format-version" : 3,
     "table-uuid" : "eaf5dec9-7866-49a5-81c6-11af8f344e1f",
     "location" : "s3://bucket/iceberg-v3/row-lineage",
     "last-sequence-number" : 0,
     "last-updated-ms" : 1739865514775,
     "last-column-id" : 3,
     "current-schema-id" : 0,
     "schemas" : [ {
       "type" : "struct",
       "schema-id" : 0,
       "fields" : [ { ... } ]
     } ],
     "default-spec-id" : 0,
     "partition-specs" : [ {
       "spec-id" : 0,
       "fields" : [ ]
     } ],
     "last-partition-id" : 999,
     "default-sort-order-id" : 0,
     "sort-orders" : [ {
       "order-id" : 0,
       "fields" : [ ]
     } ],
     "properties" : {
       "owner" : "hadoop",
       "write.update.mode" : "merge-on-read",
       "write.parquet.compression-codec" : "zstd",
       "row-lineage" : "true"
     },
     "current-snapshot-id" : null,
     "row-lineage" : true,  // <= ADDED
     "next-row-id" : 0, // <= ADDED
     "refs" : { },
     "snapshots" : [ ],
     "statistics" : [ ],
     "partition-statistics" : [ ],
     "snapshot-log" : [ ],
     "metadata-log" : [ {
       "timestamp-ms" : 1739865386995,
       "metadata-file" : 
"s3://bucket/iceberg-v3/row-lineage/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json"
     } ]
   }
   ```
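
Before this fix, the two steps have to be issued together. As a rough sketch of that workaround, the helper below sends both statements through any object exposing a `sql(statement)` method, as a `SparkSession` does. `create_with_row_lineage` and `RecordingSession` are hypothetical names introduced here; the recording stand-in only exists so that the example runs without a Spark cluster.

```python
def create_with_row_lineage(spark, table: str, columns: str, location: str) -> list:
    """Run the two-step workaround and return the SQL statements issued."""
    statements = [
        # Step 1: create the table with the property already set.
        f"CREATE TABLE {table} ({columns}) USING iceberg "
        f"TBLPROPERTIES ('format-version'='3', 'row-lineage'='true') "
        f"LOCATION '{location}'",
        # Step 2: set the same property again so it reaches the metadata.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('row-lineage'='true')",
    ]
    for statement in statements:
        spark.sql(statement)
    return statements


class RecordingSession:
    """Stand-in for a SparkSession that only records statements (hypothetical)."""

    def __init__(self):
        self.executed = []

    def sql(self, statement: str) -> None:
        self.executed.append(statement)


session = RecordingSession()
issued = create_with_row_lineage(
    session,
    table="db.rowlin",
    columns="id int, name string, year int",
    location="s3://bucket/iceberg-v3/row-lineage",
)
print("\n".join(issued))
```

With a real session, passing `spark` instead of `RecordingSession()` would run the same two statements against the catalog.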
   
   Here's the diff between the two metadata files:
   
   ```diff
   $ diff 00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json 00001-ebf641c8-9603-45d5-92c6-dafac315375e.metadata.json
   6c6
   <   "last-updated-ms" : 1739865386995,
   ---
   >   "last-updated-ms" : 1739865514775,
   46a47,48
   >   "row-lineage" : true,
   >   "next-row-id" : 0,
   52c54,57
   <   "metadata-log" : [ ]
   ---
   >   "metadata-log" : [ {
   >     "timestamp-ms" : 1739865386995,
   >     "metadata-file" : "s3://bucket/iceberg-v3/row-lineage/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json"
   >   } ]
   ```
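
The same comparison can be made on the parsed documents rather than on text lines, which avoids noise from formatting. This is a small sketch with hypothetical helper names (`added_top_level_keys`, `changed_top_level_keys`); the two dictionaries are trimmed copies of the metadata files above, keeping only the fields relevant to the diff.

```python
def added_top_level_keys(before: dict, after: dict) -> list:
    """Top-level keys present in `after` but missing from `before`."""
    return sorted(set(after) - set(before))


def changed_top_level_keys(before: dict, after: dict) -> list:
    """Top-level keys present in both documents whose values differ."""
    return sorted(k for k in before.keys() & after.keys() if before[k] != after[k])


# Trimmed copy of 00000-...metadata.json (written by CREATE TABLE).
before = {
    "format-version": 3,
    "last-updated-ms": 1739865386995,
    "properties": {"row-lineage": "true"},
    "metadata-log": [],
}

# Trimmed copy of 00001-...metadata.json (written by ALTER TABLE).
after = {
    "format-version": 3,
    "last-updated-ms": 1739865514775,
    "properties": {"row-lineage": "true"},
    "row-lineage": True,
    "next-row-id": 0,
    "metadata-log": [
        {
            "timestamp-ms": 1739865386995,
            "metadata-file": "s3://bucket/iceberg-v3/row-lineage/metadata/"
            "00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json",
        }
    ],
}

added = added_top_level_keys(before, after)
changed = changed_top_level_keys(before, after)
print("added:", added)
print("changed:", changed)
```

Only `row-lineage` and `next-row-id` are new; `last-updated-ms` and `metadata-log` change as they would for any metadata commit.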
   
   



