amogh-jahagirdar commented on code in PR #13310:
URL: https://github.com/apache/iceberg/pull/13310#discussion_r2198388161
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -426,17 +428,35 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
.writeProperties(writeProperties)
.build();
+      Function<InternalRow, InternalRow> rowLineageProjector =
+          context.dataSchema() != null
+                  && context.dataSchema().findField(MetadataColumns.ROW_ID.fieldId()) != null
+              ? new ProjectRowLineageFromMetadata()
Review Comment:
> ok. looks like context.dataSchema() can be null. when could it be null? if it is null, row lineage is not carried over. doesn't it violate the spec?
Yeah, I observed during the very initial analysis rules that context.dataSchema() won't technically be defined at certain points, but SparkPositionDeltaWrite will still be built. Ultimately, before execution it will always be non-null because there will be some output schema for the write, so there's no risk of lineage not being carried over.
I added the null check to be defensive; otherwise we'd needlessly fail with an NPE in the middle of analysis when looking up whether the lineage fields are defined.
But I do hear your bigger point: it's probably cleaner to abstract as much of this as possible behind the projection logic.
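The defensive pattern discussed above can be sketched as follows. This is a simplified illustration, not the PR's actual code: `Row`, `ROW_ID_FIELD_ID`, and `rowLineageProjector` here are hypothetical stand-ins for the real Spark/Iceberg types (`InternalRow`, `MetadataColumns.ROW_ID`, `ProjectRowLineageFromMetadata`), which aren't reproduced in this comment.

```java
import java.util.Set;
import java.util.function.Function;

public class LineageProjectorSketch {
  // Hypothetical stand-in for Spark's InternalRow, for illustration only.
  record Row(String data, Long rowId) {}

  // Illustrative field id; the real MetadataColumns.ROW_ID id differs.
  static final int ROW_ID_FIELD_ID = 100;

  // Mirrors the guarded selection in the diff: while the schema is not yet
  // resolved (null during early analysis rules), fall back to an identity
  // projection instead of dereferencing it and hitting an NPE.
  static Function<Row, Row> rowLineageProjector(Set<Integer> schemaFieldIds) {
    boolean hasRowId =
        schemaFieldIds != null && schemaFieldIds.contains(ROW_ID_FIELD_ID);
    return hasRowId
        ? row -> new Row(row.data(), row.rowId()) // placeholder for the lineage projection
        : Function.identity();
  }

  public static void main(String[] args) {
    Row row = new Row("a", 7L);
    // Schema unresolved: identity projector is chosen, so no NullPointerException.
    System.out.println(rowLineageProjector(null).apply(row) == row); // true
  }
}
```

Pushing this null handling inside the projector itself (rather than at every call site) is the kind of abstraction the reviewer's bigger point suggests.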
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]