wayneli-vt opened a new pull request, #6827:
URL: https://github.com/apache/paimon/pull/6827
### Purpose
This PR optimizes `MERGE INTO` on Paimon row-tracking (data-evolution)
append tables for the self-merge pattern:
```sql
MERGE INTO target
USING target AS source
ON target._ROW_ID = source._ROW_ID
WHEN MATCHED THEN UPDATE SET col2 = col1 + 1
```
When `source` and `target` are the same table and the merge condition is
`_ROW_ID` equality, the joined result is identical to a scan of the target
table itself. We can therefore skip the join, shuffle, and sort entirely and
write only the updated columns.
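The core of the optimization is recognizing this pattern. A minimal, hedged sketch of the predicate is below; the class and parameter names are illustrative, not Paimon's actual planner types, which operate on resolved Spark plan nodes rather than plain strings:

```java
// Illustrative sketch only: models the self-merge shortcut check with
// plain strings instead of Paimon/Spark planner types.
public class SelfMergeCheck {
    static final String ROW_ID = "_ROW_ID";

    /**
     * True when source and target are the same table and the merge
     * condition is _ROW_ID equality, so the join output is exactly
     * the target scan and the join can be elided.
     */
    public static boolean isSelfMergeOnRowId(
            String targetTable, String sourceTable,
            String leftColumn, String rightColumn) {
        return targetTable.equals(sourceTable)
                && ROW_ID.equals(leftColumn)
                && ROW_ID.equals(rightColumn);
    }
}
```

Only when this check passes do the shortcuts below kick in; any other source table or join key falls back to the regular join-based `MERGE INTO` path.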
Changes include:
1. `targetRelatedSplits`: for self-merge on `_ROW_ID`, directly scan all
splits instead of computing touched splits via source scan/join.
2. **Join-free** `updateActionInvoke`: for self-merge on `_ROW_ID`, remove
the target-source join. Read the required columns from the target scan once,
and rewrite all source-side attribute references in merge actions/conditions to
target attributes so `MergeRows` can be evaluated on the target scan output.
3. **Reduce write-side shuffle/sort**: let the scan report its natural
partitioning and ordering:
- **Partitioned by** `_FIRST_ROW_ID`
- **Ordered by** `(_FIRST_ROW_ID, _ROW_ID)`
With this, Spark can eliminate the explicit shuffle/sort previously
required before writing partial column updates:
```scala
repartitionByRange(firstRowIdColumn)
  .sortWithinPartitions(FIRST_ROW_ID_NAME, ROW_ID_NAME)
```
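The elimination in point 3 relies on the standard rule that an explicit sort is redundant when the ordering the scan already reports begins with the required ordering. A Spark-free sketch of that prefix check, with illustrative names:

```java
import java.util.List;

// Illustrative sketch: a required sort can be dropped when the scan's
// reported ordering starts with exactly the required columns, in order.
public class OrderingCheck {
    public static boolean sortIsRedundant(
            List<String> reported, List<String> required) {
        if (required.isEmpty() || reported.size() < required.size()) {
            return false;
        }
        return reported.subList(0, required.size()).equals(required);
    }
}
```

Under this rule, a scan reporting `(_FIRST_ROW_ID, _ROW_ID)` satisfies both the range partitioning on `_FIRST_ROW_ID` and the within-partition sort, so neither operator needs to be inserted before the write.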
### Tests
A new test case has been added to `RowTrackingTestBase` to specifically
verify this PR:
* `org.apache.paimon.spark.sql.RowTrackingTestBase#Data Evolution: merge
into table with data-evolution with _ROW_ID shortcut`
### API and Format
No
### Documentation
TODO