[ https://issues.apache.org/jira/browse/SPARK-43775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726767#comment-17726767 ]

Snoot.io commented on SPARK-43775:
----------------------------------

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/41300

> DataSource V2: Allow representing updates as deletes and inserts
> ----------------------------------------------------------------
>
>                 Key: SPARK-43775
>                 URL: https://issues.apache.org/jira/browse/SPARK-43775
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Anton Okolnychyi
>            Priority: Major
>
> For delta-based implementations, it may be beneficial for data sources to 
> represent updates as deletes and inserts. In particular, this can help 
> properly distribute and order records on write. Recall that delete records 
> have only row ID and metadata attributes set, update records have data, row 
> ID, and metadata attributes set, and insert records have only data 
> attributes set.
> For instance, a data source may rely on a metadata column _row_id (a 
> synthetic, internally generated column) to identify rows while being 
> partitioned by bucket(product_id). Splitting updates into deletes and 
> inserts would allow the data source to cluster all update and insert 
> records for the same partition into a single task. Otherwise, the 
> clustering key for updates and inserts will differ (updates have _row_id 
> set). This is critical for reducing the number of generated files.
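The clustering argument in the description can be sketched in plain Python (no Spark APIs; the `bucket` function, record shapes, and field names are illustrative stand-ins for the transforms and metadata columns the issue describes):

```python
# Sketch of why splitting an UPDATE into DELETE + INSERT aligns clustering
# keys. Record shapes follow the issue description: deletes carry only row
# ID / metadata attributes, inserts carry only data attributes.

def bucket(value, num_buckets=4):
    # Toy stand-in for a bucket(product_id) partition transform.
    return hash(value) % num_buckets

def split_update(update):
    """Represent one UPDATE record as a DELETE plus an INSERT.

    The DELETE keeps only the row ID; the INSERT keeps only the data
    attributes, so it looks identical in shape to a plain insert.
    """
    delete = {"_row_id": update["_row_id"]}
    insert = {k: v for k, v in update.items() if k != "_row_id"}
    return delete, insert

update = {"_row_id": 42, "product_id": "p1", "qty": 3}
plain_insert = {"product_id": "p1", "qty": 5}

d, new_data = split_update(update)
# After the split, the new-data half of the update and the plain insert
# share the same clustering key, so a writer can route both to one task
# (and hence one file) instead of two differently keyed groups.
assert bucket(new_data["product_id"]) == bucket(plain_insert["product_id"])
```

Without the split, the update record carries `_row_id` while the insert does not, so the two would hash to different clustering groups even when they target the same `bucket(product_id)` partition.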



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
