wayneli-vt opened a new pull request, #6827:
URL: https://github.com/apache/paimon/pull/6827
### Purpose
This PR optimizes `MERGE INTO` on Paimon row-tracking (data-evolution)
append tables for the self-merge pattern:
```sql
MERGE INTO target
USING target AS source
ON target._ROW_ID = source._ROW_ID
WHEN MATCHED THEN UPDATE SET col2 = col1 + 1
```
When `source` and `target` are the same table and the merge condition is
`_ROW_ID` equality, the joined result is identical to a scan of the target
table itself. We can therefore skip the join, shuffle, and sort entirely and
write only the updated columns.
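The core of the optimization is recognizing this pattern. A minimal, hedged sketch of the predicate is below; the class and parameter names are illustrative, not Paimon's actual planner types, which operate on resolved Spark plan nodes rather than plain strings:

```java
// Illustrative sketch only: models the self-merge shortcut check with
// plain strings instead of Paimon/Spark planner types.
public class SelfMergeCheck {
    static final String ROW_ID = "_ROW_ID";

    /**
     * True when source and target are the same table and the merge
     * condition is _ROW_ID equality, so the join output is exactly
     * the target scan and the join can be elided.
     */
    public static boolean isSelfMergeOnRowId(
            String targetTable, String sourceTable,
            String leftColumn, String rightColumn) {
        return targetTable.equals(sourceTable)
                && ROW_ID.equals(leftColumn)
                && ROW_ID.equals(rightColumn);
    }
}
```

Only when this check passes do the shortcuts below kick in; any other source table or join key falls back to the regular join-based `MERGE INTO` path.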
Changes include:
1. `targetRelatedSplits`: for self-merge on `_ROW_ID`, directly scan all
splits instead of computing touched splits via source scan/join.
2. **Join-free** `updateActionInvoke`: for self-merge on `_ROW_ID`, remove
the target-source join. Read the required columns from the target scan once,
and rewrite all source-side attribute references in merge actions/conditions to
target attributes so `MergeRows` can be evaluated on the target scan output.
3. **Reduce write-side shuffle/sort**: let the scan report its natural
partitioning and ordering:
- **Partitioned by** `_FIRST_ROW_ID`
- **Ordered by** `(_FIRST_ROW_ID, _ROW_ID)`
With this, Spark can eliminate the explicit shuffle/sort previously
required before writing partial column updates:
```scala
repartitionByRange(firstRowIdColumn)
  .sortWithinPartitions(FIRST_ROW_ID_NAME, ROW_ID_NAME)
```
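The elimination in point 3 relies on the standard rule that an explicit sort is redundant when the ordering the scan already reports begins with the required ordering. A Spark-free sketch of that prefix check, with illustrative names:

```java
import java.util.List;

// Illustrative sketch: a required sort can be dropped when the scan's
// reported ordering starts with exactly the required columns, in order.
public class OrderingCheck {
    public static boolean sortIsRedundant(
            List<String> reported, List<String> required) {
        if (required.isEmpty() || reported.size() < required.size()) {
            return false;
        }
        return reported.subList(0, required.size()).equals(required);
    }
}
```

Under this rule, a scan reporting `(_FIRST_ROW_ID, _ROW_ID)` satisfies both the range partitioning on `_FIRST_ROW_ID` and the within-partition sort, so neither operator needs to be inserted before the write.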
### Tests
A new test case has been added to `RowTrackingTestBase` to specifically
verify this PR:
* `org.apache.paimon.spark.sql.RowTrackingTestBase#Data Evolution: merge
into table with data-evolution with _ROW_ID shortcut`
### API and Format
No
### Documentation
TODO