amogh-jahagirdar commented on code in PR #13310:
URL: https://github.com/apache/iceberg/pull/13310#discussion_r2197992704
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -260,18 +260,52 @@ public MetadataColumn[] metadataColumns() {
     DataType sparkPartitionType = SparkSchemaUtil.convert(Partitioning.partitionType(table()));
     ImmutableList.Builder<SparkMetadataColumn> metadataColumns = ImmutableList.builder();
     metadataColumns.add(
-        new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, true),
-        new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),
-        new SparkMetadataColumn(MetadataColumns.FILE_PATH.name(), DataTypes.StringType, false),
-        new SparkMetadataColumn(MetadataColumns.ROW_POSITION.name(), DataTypes.LongType, false),
-        new SparkMetadataColumn(MetadataColumns.IS_DELETED.name(), DataTypes.BooleanType, false));
-
-    if (TableUtil.formatVersion(table()) >= 3) {
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.SPEC_ID.name())
+            .dataType(DataTypes.IntegerType)
+            .withNullability(true)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.PARTITION_COLUMN_NAME)
+            .dataType(sparkPartitionType)
+            .withNullability(true)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.FILE_PATH.name())
+            .dataType(DataTypes.StringType)
+            .withNullability(false)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.ROW_POSITION.name())
+            .dataType(DataTypes.LongType)
+            .withNullability(false)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.IS_DELETED.name())
+            .dataType(DataTypes.BooleanType)
+            .withNullability(false)
+            .build());
+
+    if (TableUtil.supportsRowLineage(icebergTable)) {
       metadataColumns.add(
-          new SparkMetadataColumn(MetadataColumns.ROW_ID.name(), DataTypes.LongType, true));
+          SparkMetadataColumn.builder()
+              .name(MetadataColumns.ROW_ID.name())
+              .dataType(DataTypes.LongType)
+              .withNullability(true)
+              .preserveOnReinsert(true)
+              .preserveOnUpdate(true)
+              .preserveOnDelete(false)
+              .build());
+
       metadataColumns.add(
-          new SparkMetadataColumn(
-              MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.name(), DataTypes.LongType, true));
+          SparkMetadataColumn.builder()
+              .name(MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.name())
+              .dataType(DataTypes.LongType)
+              .withNullability(true)
+              .preserveOnReinsert(false)
Review Comment:
Yes, for compaction or any other kind of rewrite we do need to preserve both row IDs and sequence numbers; however, that is a different code path that we'd need to handle separately. These options for preserving metadata column values on reinsert/update/delete are new options added in DSV2 specifically for DML operations like merge/update/delete.
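For context, here's a minimal sketch of the builder shape those new call sites imply (an illustration only, not the actual `SparkMetadataColumn` implementation; the class name and the defaults here are assumptions):

```java
// Illustrative sketch only; the real SparkMetadataColumn may differ.
import org.apache.spark.sql.types.DataType;

public class MetadataColumnSketch {
  private final String name;
  private final DataType dataType;
  private final boolean isNullable;
  private final boolean preservedOnReinsert;
  private final boolean preservedOnUpdate;
  private final boolean preservedOnDelete;

  private MetadataColumnSketch(Builder b) {
    this.name = b.name;
    this.dataType = b.dataType;
    this.isNullable = b.isNullable;
    this.preservedOnReinsert = b.preserveOnReinsert;
    this.preservedOnUpdate = b.preserveOnUpdate;
    this.preservedOnDelete = b.preserveOnDelete;
  }

  public static Builder builder() {
    return new Builder();
  }

  public static class Builder {
    private String name;
    private DataType dataType;
    private boolean isNullable = true;
    // Assumed defaults: column values are dropped on DML rewrites of a row
    // unless the connector opts in per operation.
    private boolean preserveOnReinsert = false;
    private boolean preserveOnUpdate = false;
    private boolean preserveOnDelete = false;

    public Builder name(String columnName) {
      this.name = columnName;
      return this;
    }

    public Builder dataType(DataType type) {
      this.dataType = type;
      return this;
    }

    public Builder withNullability(boolean nullable) {
      this.isNullable = nullable;
      return this;
    }

    public Builder preserveOnReinsert(boolean preserve) {
      this.preserveOnReinsert = preserve;
      return this;
    }

    public Builder preserveOnUpdate(boolean preserve) {
      this.preserveOnUpdate = preserve;
      return this;
    }

    public Builder preserveOnDelete(boolean preserve) {
      this.preserveOnDelete = preserve;
      return this;
    }

    public MetadataColumnSketch build() {
      return new MetadataColumnSketch(this);
    }
  }
}
```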
For compaction we'll need to project the row lineage fields when reading the files (see https://github.com/amogh-jahagirdar/iceberg/blob/master/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackFileRewriteRunner.java#L44, and the same applies to any other rewriting logic) and make sure they are persisted on write.
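As a rough sketch of that read-side projection (using `TypeUtil.join` and the `MetadataColumns` constants that already exist in Iceberg; where exactly this would hook into the rewrite runner is the open part):

```java
// Rough sketch: join the row lineage metadata columns into the schema used
// to read the files being rewritten, so _row_id and
// _last_updated_sequence_number are materialized alongside each row.
import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.TypeUtil;

public class RowLineageProjectionSketch {
  static Schema withRowLineage(Table table) {
    Schema lineageFields =
        new Schema(MetadataColumns.ROW_ID, MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER);
    return TypeUtil.join(table.schema(), lineageFields);
  }
}
```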
The trick will be on the write side: we'll need a way to bypass any analyzer rules that try to check alignment between the table schema (which doesn't include the lineage fields) and the data frame write schema. I'm still looking through options here. Do we need some kind of custom rule that, based on the existence of the "REWRITTEN_FILE_SCAN_TASK_SET_ID" option, hijacks the DSV2 relation with lineage like we do in the 3.4/3.5 DML rules, or can we somehow conditionally leverage
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCapability.java#L94?
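If the `TableCapability` route (presumably `ACCEPT_ANY_SCHEMA`, which makes Spark skip write-schema validation) turns out to be viable, a hedged sketch of conditionally advertising it could look like the following; it assumes the table instance can know it was created for a rewrite, which is exactly the part I'm unsure about:

```java
// Sketch only: advertise ACCEPT_ANY_SCHEMA just for rewrite writes so the
// analyzer's schema-alignment check is skipped and the lineage columns get
// through. The rewriteWrite flag is hypothetical; whether the table can know
// this at capability-check time (e.g. via the rewritten-file-scan-task-set-id
// option) is the open question above.
import java.util.EnumSet;
import java.util.Set;
import org.apache.spark.sql.connector.catalog.TableCapability;

public class RewriteCapabilitySketch {
  private final boolean rewriteWrite; // hypothetical flag set at instantiation

  public RewriteCapabilitySketch(boolean rewriteWrite) {
    this.rewriteWrite = rewriteWrite;
  }

  public Set<TableCapability> capabilities() {
    Set<TableCapability> capabilities =
        EnumSet.of(
            TableCapability.BATCH_READ,
            TableCapability.BATCH_WRITE,
            TableCapability.OVERWRITE_BY_FILTER,
            TableCapability.OVERWRITE_DYNAMIC);
    if (rewriteWrite) {
      // ACCEPT_ANY_SCHEMA disables Spark's write-schema alignment check.
      capabilities.add(TableCapability.ACCEPT_ANY_SCHEMA);
    }
    return capabilities;
  }
}
```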
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]