[ https://issues.apache.org/jira/browse/SPARK-43775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726767#comment-17726767 ]
Snoot.io commented on SPARK-43775:
----------------------------------

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/41300

> DataSource V2: Allow representing updates as deletes and inserts
> ----------------------------------------------------------------
>
>                 Key: SPARK-43775
>                 URL: https://issues.apache.org/jira/browse/SPARK-43775
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Anton Okolnychyi
>            Priority: Major
>
> It may be beneficial for data sources with delta-based implementations to
> represent updates as deletes and inserts. Specifically, doing so can help
> properly distribute and order records on write. Recall that delete records
> have only row ID and metadata attributes set, update records have data,
> row ID, and metadata attributes set, and insert records have only data
> attributes set.
>
> For instance, a data source may rely on a synthetic, internally generated
> metadata column _row_id to identify rows and may be partitioned by
> bucket(product_id). Splitting updates into deletes and inserts would allow
> such data sources to cluster all update and insert records for the same
> partition into a single task. Otherwise, the clustering key for updates and
> inserts would differ (updates have _row_id set). This is critical for
> reducing the number of generated files.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
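For illustration, the split the issue describes can be sketched in plain Java. This is a hypothetical model, not the actual Spark DataSource V2 connector API: the `Record`, `Op`, and `splitUpdates` names are invented, and the data / row ID / metadata attributes are collapsed into single strings. It shows only the transformation itself: each UPDATE (which carries data, row ID, and metadata) is rewritten as a DELETE (row ID and metadata only) followed by an INSERT (data only), so the inserts share a clustering key with ordinary inserts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of representing updates as deletes and inserts in a
// delta-based writer. Names and structure are assumptions for illustration;
// this is not the Spark connector API.
public class UpdateSplitter {
    public enum Op { DELETE, UPDATE, INSERT }

    // Per the issue: DELETE has only row ID and metadata set, UPDATE has all
    // three attribute groups set, INSERT has only data attributes set.
    public static final class Record {
        final Op op;
        final String data;     // data attributes (null for DELETE)
        final String rowId;    // row ID attributes, e.g. _row_id (null for INSERT)
        final String metadata; // metadata attributes (null for INSERT)

        Record(Op op, String data, String rowId, String metadata) {
            this.op = op;
            this.data = data;
            this.rowId = rowId;
            this.metadata = metadata;
        }
    }

    // Rewrite each UPDATE as a DELETE (row ID + metadata only) followed by an
    // INSERT (data only); DELETEs and INSERTs pass through unchanged.
    public static List<Record> splitUpdates(List<Record> input) {
        List<Record> out = new ArrayList<>();
        for (Record r : input) {
            if (r.op == Op.UPDATE) {
                out.add(new Record(Op.DELETE, null, r.rowId, r.metadata));
                out.add(new Record(Op.INSERT, r.data, null, null));
            } else {
                out.add(r);
            }
        }
        return out;
    }
}
```

After this rewrite, every record headed for a given bucket(product_id) partition is either a pure delete (keyed by _row_id) or a pure insert (keyed by data attributes), so all inserts for a partition can be clustered into a single task regardless of whether they originated as inserts or updates.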