Anton Okolnychyi created SPARK-43775:
----------------------------------------

             Summary: DataSource V2: Allow representing updates as deletes and inserts
                 Key: SPARK-43775
                 URL: https://issues.apache.org/jira/browse/SPARK-43775
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Anton Okolnychyi


Delta-based data source implementations may benefit from representing updates 
as deletes and inserts. In particular, this may help to properly distribute and 
order records on write. Recall that delete records have only row ID and 
metadata attributes set, update records have data, row ID, and metadata 
attributes set, and insert records have only data attributes set.
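
To make the three record shapes concrete, here is a minimal Python sketch of 
splitting one update into a delete and an insert. The RowChange type, its 
field names, and split_update are hypothetical and exist only for 
illustration. They are not part of the Spark DataSource V2 API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowChange:
    """Hypothetical model of one row-level change (not a Spark class)."""
    op: str                    # "delete", "update", or "insert"
    row_id: Optional[int]      # row ID attribute, unset for inserts
    metadata: Optional[dict]   # metadata attributes, unset for inserts
    data: Optional[dict]       # data attributes, unset for deletes


def split_update(change: RowChange) -> list:
    """Represent an update as a delete plus an insert; pass others through."""
    if change.op != "update":
        return [change]
    # The delete keeps only the row ID and metadata attributes.
    delete = RowChange("delete", change.row_id, change.metadata, None)
    # The insert keeps only the data attributes.
    insert = RowChange("insert", None, None, change.data)
    return [delete, insert]


update = RowChange(
    op="update",
    row_id=7,
    metadata={"_partition": "bucket=3"},
    data={"product_id": 42, "price": 10},
)
parts = split_update(update)
```

After the split, the delete half carries no data and the insert half carries 
no row ID, which matches the record shapes described above.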

For instance, a data source may rely on a metadata column _row_id (synthetic, 
internally generated) to identify the row and be partitioned by 
bucket(product_id). Splitting updates into inserts and deletes would allow the 
data source to cluster all update and insert records for the same partition 
into a single task. Otherwise, the clustering key for updates and inserts will 
be different (updates have _row_id set), so records destined for one partition 
may end up in different tasks. This is critical to reduce the number of 
generated files.
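
The effect on clustering can be sketched with a small Python simulation. It 
assumes that records with _row_id set are clustered by _row_id and all others 
by bucket(product_id). That rule, the modulo-based bucket function, and the 
dictionary record layout are illustrative assumptions, not Spark's or any 
particular data source's actual behavior.

```python
NUM_BUCKETS = 4  # hypothetical bucket count


def bucket(product_id):
    # Stand-in for the source's bucket transform; a real one would hash.
    return product_id % NUM_BUCKETS


def clustering_key(record):
    # Assumed rule: records with _row_id set cluster by it, others by bucket.
    if record["_row_id"] is not None:
        return ("row_id", record["_row_id"])
    return ("bucket", bucket(record["product_id"]))


def split(record):
    # Represent an update as a delete (row ID only) plus an insert (data only).
    if record["op"] != "update":
        return [record]
    return [
        {"op": "delete", "product_id": None, "_row_id": record["_row_id"]},
        {"op": "insert", "product_id": record["product_id"], "_row_id": None},
    ]


def data_writing_groups(records):
    # Distinct clustering keys among records that carry data to be written.
    return {clustering_key(r) for r in records if r["product_id"] is not None}


# Two inserts and two updates, all landing in the same bucket (1).
records = [
    {"op": "insert", "product_id": 1, "_row_id": None},
    {"op": "insert", "product_id": 5, "_row_id": None},
    {"op": "update", "product_id": 9, "_row_id": 100},
    {"op": "update", "product_id": 13, "_row_id": 101},
]

without_split = data_writing_groups(records)
with_split = data_writing_groups([p for r in records for p in split(r)])
```

Under these assumptions, the unsplit input yields three distinct clustering 
keys for data that all belongs to one bucket, while the split input yields a 
single key, so one task could write all of that bucket's data.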




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
