amogh-jahagirdar commented on code in PR #13310:
URL: https://github.com/apache/iceberg/pull/13310#discussion_r2197992704
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -260,18 +260,52 @@ public MetadataColumn[] metadataColumns() {
     DataType sparkPartitionType = SparkSchemaUtil.convert(Partitioning.partitionType(table()));
     ImmutableList.Builder<SparkMetadataColumn> metadataColumns = ImmutableList.builder();
     metadataColumns.add(
-        new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, true),
-        new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),
-        new SparkMetadataColumn(MetadataColumns.FILE_PATH.name(), DataTypes.StringType, false),
-        new SparkMetadataColumn(MetadataColumns.ROW_POSITION.name(), DataTypes.LongType, false),
-        new SparkMetadataColumn(MetadataColumns.IS_DELETED.name(), DataTypes.BooleanType, false));
-
-    if (TableUtil.formatVersion(table()) >= 3) {
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.SPEC_ID.name())
+            .dataType(DataTypes.IntegerType)
+            .withNullability(true)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.PARTITION_COLUMN_NAME)
+            .dataType(sparkPartitionType)
+            .withNullability(true)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.FILE_PATH.name())
+            .dataType(DataTypes.StringType)
+            .withNullability(false)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.ROW_POSITION.name())
+            .dataType(DataTypes.LongType)
+            .withNullability(false)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.IS_DELETED.name())
+            .dataType(DataTypes.BooleanType)
+            .withNullability(false)
+            .build());
+
+    if (TableUtil.supportsRowLineage(icebergTable)) {
       metadataColumns.add(
-          new SparkMetadataColumn(MetadataColumns.ROW_ID.name(), DataTypes.LongType, true));
+          SparkMetadataColumn.builder()
+              .name(MetadataColumns.ROW_ID.name())
+              .dataType(DataTypes.LongType)
+              .withNullability(true)
+              .preserveOnReinsert(true)
+              .preserveOnUpdate(true)
+              .preserveOnDelete(false)
+              .build());
+
       metadataColumns.add(
-          new SparkMetadataColumn(
-              MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.name(), DataTypes.LongType, true));
+          SparkMetadataColumn.builder()
+              .name(MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.name())
+              .dataType(DataTypes.LongType)
+              .withNullability(true)
+              .preserveOnReinsert(false)
Review Comment:
Yes, for compaction or any other kind of rewrite we do need to preserve both row IDs and sequence numbers; however, that is a different code path that we'd need to handle separately. These options for preserving metadata column values on reinsert/update/delete are new options added in DSV2 specifically for DML operations like merge/update/delete.
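For context, here's a minimal sketch of the builder shape those new call sites imply (an illustration only, not the actual `SparkMetadataColumn` implementation; the class name and the defaults here are assumptions):

```java
// Illustrative sketch only; the real SparkMetadataColumn may differ.
import org.apache.spark.sql.types.DataType;

public class MetadataColumnSketch {
  private final String name;
  private final DataType dataType;
  private final boolean isNullable;
  private final boolean preservedOnReinsert;
  private final boolean preservedOnUpdate;
  private final boolean preservedOnDelete;

  private MetadataColumnSketch(Builder b) {
    this.name = b.name;
    this.dataType = b.dataType;
    this.isNullable = b.isNullable;
    this.preservedOnReinsert = b.preserveOnReinsert;
    this.preservedOnUpdate = b.preserveOnUpdate;
    this.preservedOnDelete = b.preserveOnDelete;
  }

  public static Builder builder() {
    return new Builder();
  }

  public static class Builder {
    private String name;
    private DataType dataType;
    private boolean isNullable = true;
    // Assumed defaults: column values are dropped on DML rewrites of a row
    // unless the connector opts in per operation.
    private boolean preserveOnReinsert = false;
    private boolean preserveOnUpdate = false;
    private boolean preserveOnDelete = false;

    public Builder name(String columnName) {
      this.name = columnName;
      return this;
    }

    public Builder dataType(DataType type) {
      this.dataType = type;
      return this;
    }

    public Builder withNullability(boolean nullable) {
      this.isNullable = nullable;
      return this;
    }

    public Builder preserveOnReinsert(boolean preserve) {
      this.preserveOnReinsert = preserve;
      return this;
    }

    public Builder preserveOnUpdate(boolean preserve) {
      this.preserveOnUpdate = preserve;
      return this;
    }

    public Builder preserveOnDelete(boolean preserve) {
      this.preserveOnDelete = preserve;
      return this;
    }

    public MetadataColumnSketch build() {
      return new MetadataColumnSketch(this);
    }
  }
}
```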
For compaction we'll need to project the row lineage fields when reading the files (see https://github.com/amogh-jahagirdar/iceberg/blob/master/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackFileRewriteRunner.java#L44, and the same applies to any other rewriting logic) and make sure they are persisted on write.
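As a rough sketch of that read-side projection (using `TypeUtil.join` and the `MetadataColumns` constants that already exist in Iceberg; where exactly this would hook into the rewrite runner is the open part):

```java
// Rough sketch: join the row lineage metadata columns into the schema used
// to read the files being rewritten, so _row_id and
// _last_updated_sequence_number are materialized alongside each row.
import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.TypeUtil;

public class RowLineageProjectionSketch {
  static Schema withRowLineage(Table table) {
    Schema lineageFields =
        new Schema(MetadataColumns.ROW_ID, MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER);
    return TypeUtil.join(table.schema(), lineageFields);
  }
}
```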
The trick will be on the write side: we'll need a way to bypass any analyzer rules that try to check alignment between the table schema (which doesn't include the lineage fields) and the data frame write schema. I'm still looking through options here. Do we need some kind of custom rule that, based on the existence of the "REWRITTEN_FILE_SCAN_TASK_SET_ID" option, hijacks the DSV2 relation with lineage like we do in the 3.4/3.5 DML rules, or can we somehow conditionally leverage
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCapability.java#L94?
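If the `TableCapability` route (presumably `ACCEPT_ANY_SCHEMA`, which makes Spark skip write-schema validation) turns out to be viable, a hedged sketch of conditionally advertising it could look like the following; it assumes the table instance can know it was created for a rewrite, which is exactly the part I'm unsure about:

```java
// Sketch only: advertise ACCEPT_ANY_SCHEMA just for rewrite writes so the
// analyzer's schema-alignment check is skipped and the lineage columns get
// through. The rewriteWrite flag is hypothetical; whether the table can know
// this at capability-check time (e.g. via the rewritten-file-scan-task-set-id
// option) is the open question above.
import java.util.EnumSet;
import java.util.Set;
import org.apache.spark.sql.connector.catalog.TableCapability;

public class RewriteCapabilitySketch {
  private final boolean rewriteWrite; // hypothetical flag set at instantiation

  public RewriteCapabilitySketch(boolean rewriteWrite) {
    this.rewriteWrite = rewriteWrite;
  }

  public Set<TableCapability> capabilities() {
    Set<TableCapability> capabilities =
        EnumSet.of(
            TableCapability.BATCH_READ,
            TableCapability.BATCH_WRITE,
            TableCapability.OVERWRITE_BY_FILTER,
            TableCapability.OVERWRITE_DYNAMIC);
    if (rewriteWrite) {
      // ACCEPT_ANY_SCHEMA disables Spark's write-schema alignment check.
      capabilities.add(TableCapability.ACCEPT_ANY_SCHEMA);
    }
    return capabilities;
  }
}
```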
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]