I took a look at the SPIP and PR. The approach seems reasonable to me and I
also see that we generalized it to support MERGE, which is great. I think
this is going to be a valuable addition to Spark DML operations and both
Delta and Iceberg would definitely use it.

- Anton

пт, 29 трав. 2026 р. о 10:06 Anurag Mantripragada <[email protected]>
пише:

> Hi everyone,
>
>
> Thanks for the initial feedback and reviewing the PR. In addition to the
> UPDATE feature, I wanted to ensure that the design supports the MERGE INTO
> use-case as well. Since this SPIP is currently scoped to UPDATE only, I
> added a section [1] in the SPIP that explains how the design can be
> extended to support MERGE INTO in the future.
>
>
> Please let me know your thoughts on this.
>
>
> [1] -
> https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?tab=t.0#bookmark=id.2fyy5gx4gtjl
>
> ~ Anurag
>
> On May 11, 2026, at 3:13 PM, Anurag Mantripragada <
> [email protected]> wrote:
>
> HI all,
>
> I would like to bump this thread. I have cleaned up the PR[1] and SPIP doc
> [2] based on initial feedback. I’m looking for more feedback on the
> approach here before going for a vote. Please take a look.
>
> Thanks,
> Anurag
>
> [1] - https://github.com/apache/spark/pull/55518
> [2] -
> https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?tab=t.0#heading=h.yoitjxhaitk8
>
> On Apr 28, 2026, at 3:24 PM, Anurag Mantripragada <
> [email protected]> wrote:
>
> Hi Peter,
>
> Thanks for reviewing the SPIP doc and PR. I've updated section B.3.B and
> B.3.C in the SPIP to clarify.
>
>
>
> When I traced through the optimizer rule ordering for MOR vs CoW, I
> observed the following (experts here: please correct me if I'm wrong):
>
>
> For MOR (WriteDelta), the DataSourceV2Relation stays in the plan through
> the normal optimizer batches. V2ScanRelationPushDown handles it like any
> other DSv2 scan. It looks at what the plan above references and narrows
> accordingly. Since my implementation produces a Project that only
> references the connector-declared columns, ColumnPruning propagates that
> narrowness down, and V2ScanRelationPushDown picks it up naturally.
>
> For CoW (ReplaceData), I found that
> GroupBasedRowLevelOperationScanPlanning fires in preOptimizationBatches,
> i.e. before ColumnPruning or V2ScanRelationPushDown run. This rule
> pattern-matches only on ReplaceData nodes (never WriteDelta) and converts
> the DataSourceV2Relation into a physical scan reading relation.output
> directly, ignoring any Project above it. By the time the normal optimizer
> runs, there's no DataSourceV2Relation left to narrow.
>
>
> So the implementation narrows DataSourceV2Relation.output at analysis time
> for CoW (in buildRelationWithAttrs).
>
>
>
> In summary:
>
>
>   - MOR: narrow Project → standard optimizer pipeline handles it (no rule
> changes)
>   - CoW: narrow DataSourceV2Relation.output at analysis time →
> GroupBasedRowLevelOperationScanPlanning sees it already narrow
> →RowLevelOperationRuntimeGroupFiltering tolerates missing columns
>
> I’m open to ideas to make this more clean, please let me know.
>
> Thanks,
> Anurag
>
>
>
> On Apr 28, 2026, at 2:36 AM, Peter Toth <[email protected]> wrote:
>
> Thank you Anurag for working on this!
> Let's focus on the SPIP first.
> The schema resolution flow makes sense to me, but I found the differences
> between the "Merge-on-Read"  and "Copy-on-Write" implementations a bit
> challenging to grasp at first. Could you clarify the purpose of the
> mentioned rules and how they are applied/affected in your implementation? I
> left some comments in the doc.
>
> Thanks,
> Peter
>
> On Thu, Apr 23, 2026 at 8:39 PM Anurag Mantripragada <
> [email protected]> wrote:
>
>> Hi everyone,
>>
>> I would like to start a discussion regarding an enhancement to the DSv2
>> API. This proposal allows connectors to declare which columns they need to
>> receive during an update, significantly improving performance and reducing
>> write amplification. This is particularly beneficial for connectors like
>> Iceberg on wide tables, which are increasingly common in AI/ML use cases.
>>
>> I have included a PR with this SPIP that demonstrates the changes. It has
>> been tested on the Iceberg connector and is working well end-to-end.
>>
>> Huaxian Gao has agreed to serve as the shepherd for this SPIP.
>>
>> SPARK-56599 <https://issues.apache.org/jira/browse/SPARK-56599>
>> SPIP Doc
>> <https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?tab=t.0#heading=h.yoitjxhaitk8>
>> PR <https://github.com/apache/spark/pull/55518>
>>
>> Please take a look and provide feedback!
>>
>> Thanks,
>> Anurag Mantripragada
>>
>
>
>
>

Reply via email to