anuragmantri opened a new pull request, #55518:
URL: https://github.com/apache/spark/pull/55518
**What changes were proposed in this pull request?**
For SPIP: [SPARK-56599](https://issues.apache.org/jira/browse/SPARK-56599)
This PR adds three new default methods to the DSv2 connector API to enable
scan and write-schema narrowing for column-level UPDATEs:
- `updatedColumns()` on `RowLevelOperationInfo`: Spark tells the connector
which columns are being assigned (non-identity assignments only) before the
operation is built.
- `requiredDataAttributes()` on `RowLevelOperation`: the connector declares
the exact set of data columns it needs in the write schema, symmetric with
`requiredMetadataAttributes()`.
- `supportsColumnUpdates()` on `RowLevelOperation`: explicit opt-in for
receiving a partial row instead of the full table row.
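The opt-in contract described in the list above can be sketched with stand-in interfaces. This is a minimal illustration, not the PR's code: `OperationInfoSketch`, `OperationSketch`, `NarrowingOperation`, and `LegacyOperation` are hypothetical names, and the `List<String>` return types and empty-list defaults are assumptions (the real methods may use different types and defaults).

```java
import java.util.List;

// Hypothetical stand-in for RowLevelOperationInfo; the real signature may differ.
interface OperationInfoSketch {
    // Columns with non-identity assignments. Empty default is an assumption.
    default List<String> updatedColumns() { return List.of(); }
}

// Hypothetical stand-in for RowLevelOperation; the real signatures may differ.
interface OperationSketch {
    // Default keeps today's full-row behavior: no opt-in.
    default boolean supportsColumnUpdates() { return false; }

    // Assumption: an empty list means "no narrowing requested".
    default List<String> requiredDataAttributes() { return List.of(); }
}

// A connector that opts in and asks only for the columns being updated.
final class NarrowingOperation implements OperationSketch {
    private final List<String> updated;

    NarrowingOperation(OperationInfoSketch info) {
        this.updated = info.updatedColumns();
    }

    @Override public boolean supportsColumnUpdates() { return true; }
    @Override public List<String> requiredDataAttributes() { return updated; }
}

// An existing connector overrides nothing and inherits the defaults.
final class LegacyOperation implements OperationSketch {}
```

Because every method has a default, `LegacyOperation` compiles unchanged, which is the backward-compatibility property the PR relies on.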
When a connector opts in, Spark removes identity assignments from the write
plan's Project node, which lets ColumnPruning narrow the physical scan
automatically on the merge-on-read (MOR) path. On the copy-on-write (CoW) path,
scan narrowing is done at analysis time via `buildRelationWithAttrs()`, because
GroupBasedRowLevelOperationScanPlanning reads DataSourceV2Relation.output
before ColumnPruning fires.
All three methods have default implementations that preserve today's
full-row behavior. No existing connector is affected.
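The effect of dropping identity assignments can be sketched as follows. This is a simplified model under stated assumptions, not Spark's implementation: `IdentityFilterSketch` and `Assignment` are hypothetical, expressions are plain strings, and an assignment counts as an identity when its expression is exactly the target column name.

```java
import java.util.ArrayList;
import java.util.List;

final class IdentityFilterSketch {
    // Hypothetical model of "column = expression" after UPDATE alignment.
    static final class Assignment {
        final String column;
        final String expression;

        Assignment(String column, String expression) {
            this.column = column;
            this.expression = expression;
        }

        // Simplification: identity means the column is assigned to itself.
        boolean isIdentity() { return expression.equals(column); }
    }

    // Columns kept in the write-side projection. Without opt-in every aligned
    // column is kept; with opt-in only non-identity assignments remain.
    static List<String> projectedColumns(List<Assignment> aligned,
                                         boolean supportsColumnUpdates) {
        List<String> kept = new ArrayList<>();
        for (Assignment a : aligned) {
            if (!supportsColumnUpdates || !a.isIdentity()) {
                kept.add(a.column);
            }
        }
        return kept;
    }
}
```

For `UPDATE t SET price = price * 2` on a table `(id, name, price)`, the sketch keeps all three columns without opt-in and only `price` with it. A real plan would still have to read any column referenced by the remaining expressions or by the `WHERE` condition.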
**Why are the changes needed?**
Today, Spark's analyzer generates identity assignments for every column
during [UPDATE alignment](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentUtils.scala#L62).
These are used to build a Project that references
[all columns](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteUpdateTable.scala#L179),
preventing the
[Optimizer](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1050)
from narrowing the scan. The cost therefore scales as O(table width),
regardless of how many columns are actually updated.
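To see why the cost tracks table width, here is a minimal sketch of alignment as the paragraph above describes it. It is an illustration only: `UpdateAlignmentSketch` and `align` are hypothetical names, and expressions are modeled as plain strings rather than Catalyst expressions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UpdateAlignmentSketch {
    // Produces one expression per table column, in table order. Columns that
    // are missing from the SET clause are filled with an identity assignment.
    static Map<String, String> align(List<String> tableColumns,
                                     Map<String, String> setClause) {
        Map<String, String> aligned = new LinkedHashMap<>();
        for (String column : tableColumns) {
            aligned.put(column, setClause.getOrDefault(column, column));
        }
        return aligned;
    }
}
```

Aligning `SET price = price * 2` against `(id, name, price)` yields three assignments, two of them identities. The projection built from that result references every table column, so the scan cannot be narrowed even though only one column changes.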
This is especially wasteful for columnar formats such as Parquet (for example,
in Iceberg tables), where unread columns could otherwise be skipped entirely,
and is a blocker for efficient column-level update implementations in
connectors (see the [Efficient Column Updates
Proposal](https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?pli=1&tab=t.0)
in Iceberg).
**Does this PR introduce any user-facing change?**
Yes. Three new default methods are added to the public DSv2 connector API:
- `RowLevelOperation.supportsColumnUpdates()`
- `RowLevelOperation.requiredDataAttributes()`
- `RowLevelOperationInfo.updatedColumns()`
All are opt-in with backward-compatible defaults. Existing connectors see no
change.
**How was this patch tested?**
- 31 new tests in DeltaBasedColumnUpdateTableSuite covering scan
narrowing, write-schema narrowing, data correctness, identity assignment
filtering, updatedColumns behavior, and requiredDataAttributes across MOR
(delta), CoW (ReplaceData), and delete-then-reinsert paths.
- 6 new updatedColumns tests in DeltaBasedUpdateTableSuiteBase.
**Was this patch authored or co-authored using generative AI tooling?**
Generated-by: Claude Sonnet 4.6
I used Claude Code to generate code and tests and manually reviewed the
generated code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]