Anton Okolnychyi created SPARK-43775:
----------------------------------------
             Summary: DataSource V2: Allow representing updates as deletes and inserts
                 Key: SPARK-43775
                 URL: https://issues.apache.org/jira/browse/SPARK-43775
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Anton Okolnychyi


It may be beneficial for data sources with delta-based implementations to represent updates as deletes and inserts. Specifically, this may help them distribute and order records properly on write.

Remember that delete records have only the row ID and metadata attributes set. Update records have the data, row ID, and metadata attributes set. Insert records have only the data attributes set.

For instance, a data source may rely on a metadata column _row_id (a synthetic, internally generated value) to identify the row and be partitioned by bucket(product_id). Splitting updates into inserts and deletes would allow the data source to cluster all update and insert records for the same partition into a single task. Otherwise, the clustering key for updates and inserts will be different (updates have _row_id set). This is critical to reduce the number of generated files.
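
The effect of the split on clustering can be illustrated with a minimal sketch in plain Python. It does not use Spark or the DataSource V2 API. The Record class, the split_updates and clustering_key functions, the _file metadata attribute, the bucket count, and the modulo stand-in for bucket(product_id) are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_BUCKETS = 4  # hypothetical bucket count for bucket(product_id)


@dataclass
class Record:
    op: str                          # "delete", "update" or "insert"
    data: Optional[dict] = None      # data attributes (unset for deletes)
    row_id: Optional[int] = None     # _row_id (unset for inserts)
    metadata: Optional[dict] = None  # metadata attributes (unset for inserts)


def split_updates(records: List[Record]) -> List[Record]:
    """Replace every update with a delete of the old row plus an insert."""
    out: List[Record] = []
    for r in records:
        if r.op == "update":
            # The delete keeps only the row ID and metadata attributes.
            out.append(Record("delete", row_id=r.row_id, metadata=r.metadata))
            # The insert keeps only the data attributes.
            out.append(Record("insert", data=r.data))
        else:
            out.append(r)
    return out


def clustering_key(r: Record) -> Tuple[str, object]:
    """Hypothetical clustering rule: records with _row_id set are clustered
    by a metadata attribute, all other records by bucket(product_id)."""
    if r.row_id is not None:
        return ("file", r.metadata["_file"])
    # Simplified stand-in for a real bucket transform.
    return ("bucket", r.data["product_id"] % NUM_BUCKETS)


records = [
    Record("update", data={"product_id": 7, "price": 10},
           row_id=42, metadata={"_file": "f1"}),
    Record("insert", data={"product_id": 7, "price": 5}),
]

# Without the split, the update and the insert get different keys.
keys_before = [clustering_key(r) for r in records]
# With the split, both data-carrying records share the same bucket key.
keys_after = [clustering_key(r) for r in split_updates(records)]
```

Under these assumptions, the update and the insert for product_id 7 receive different keys before the split, while after the split the two data-carrying records share one bucket key and could be routed to a single task.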