[ https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-7229:
---------------------------------
    Description:
OLTP workloads on upstream databases often update, delete, or insert different columns in the table on each operation. Currently, Hudi can only support partial updates in cases where the same columns are being mutated in a given write to Hudi (e.g., Spark SQL ETLs with MIT or Update statements). Here, we explore what it takes to support a smarter storage format that encodes only the changed columns into the log, along with the different possible implementations.

h2. Goals
 # Enable partial update functionality for all existing and potential future CDC workloads without major modification or duplication.
 # Achieve performance parity with current full-record updates and with partial updates across the same set of columns.
 # Reduce storage costs by storing only the changed columns.
 # Reduce computation costs by scanning and processing less data.
 # Do not affect the scalability of the existing ingestion system. The number of files generated for partial updates should not increase dramatically.

  was:DMS, Debezium, etc.

> Enable partial updates for CDC work payload
> -------------------------------------------
>
>                 Key: HUDI-7229
>                 URL: https://issues.apache.org/jira/browse/HUDI-7229
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Lin Liu
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.1.0

--
This message was sent by Atlassian Jira
(v8.20.10#820010)