[ 
https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7229:
---------------------------------
    Description: 
OLTP workloads on upstream databases often update, delete, or insert different 
columns in the table on each operation. Currently, Hudi supports partial 
updates only when the same columns are mutated in a given write to Hudi 
(e.g., Spark SQL ETLs with MERGE INTO (MIT) or UPDATE statements). Here, we 
explore what it takes to support a smarter storage format that encodes only 
the changed columns into the log, along with the different implementation 
options.
h2. Goals
 # Enable partial update functionality for all existing and potential future 
CDC workloads without major modification or duplication.
 # Achieve performance parity with current full-record updates or partial 
updates across the same set of columns.
 # Reduce storage costs by storing only the changed columns.
 # Reduce computation costs by scanning and processing less data.
 # Do not affect the scalability of the existing ingestion system. The number 
of files generated for partial updates should not increase dramatically.
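
To make the idea concrete, here is a minimal sketch of applying log entries that each carry only the changed columns, where consecutive entries touch different column sets. It is illustrative only and is not Hudi's log format or merger API: the function name `merge_partial_updates`, the `DELETE` sentinel, and the example columns are all assumptions made for this sketch.

```python
from typing import Any, Dict, Iterable, Optional

Record = Dict[str, Any]

# Hypothetical sentinel marking a delete in the log; not a Hudi construct.
DELETE = None


def merge_partial_updates(
    base: Optional[Record], log_entries: Iterable[Optional[Record]]
) -> Optional[Record]:
    """Apply log entries, oldest first, on top of a base record.

    Each entry holds only the columns changed by that operation, so
    consecutive entries may touch different column sets.
    """
    current = dict(base) if base is not None else None
    for entry in log_entries:
        if entry is DELETE:
            current = None          # a delete drops the whole record
        elif current is None:
            current = dict(entry)   # insert, or re-insert after a delete
        else:
            current.update(entry)   # overwrite only the changed columns
    return current


# Example: two updates from an upstream database, each on a different column.
base = {"id": 1, "name": "a", "city": "SF", "balance": 10}
log = [
    {"balance": 25},   # e.g. UPDATE ... SET balance = 25
    {"city": "NYC"},   # e.g. UPDATE ... SET city = 'NYC'
]
merged = merge_partial_updates(base, log)
```

In this sketch the log stores two single-column entries instead of two full four-column records, which is where the storage and scan savings listed in the goals would come from.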

 

  was:DMS, Debezium, etc.


> Enable partial updates for CDC work payload
> -------------------------------------------
>
>                 Key: HUDI-7229
>                 URL: https://issues.apache.org/jira/browse/HUDI-7229
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Lin Liu
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.1.0
>
>
> OLTP workloads on upstream databases often update, delete, or insert different 
> columns in the table on each operation. Currently, Hudi supports partial 
> updates only when the same columns are mutated in a given write to Hudi 
> (e.g., Spark SQL ETLs with MERGE INTO (MIT) or UPDATE statements). Here, we 
> explore what it takes to support a smarter storage format that encodes only 
> the changed columns into the log, along with the different implementation 
> options.
> h2. Goals
>  # Enable partial update functionality for all existing and potential future 
> CDC workloads without major modification or duplication.
>  # Achieve performance parity with current full-record updates or partial 
> updates across the same set of columns.
>  # Reduce storage costs by storing only the changed columns.
>  # Reduce computation costs by scanning and processing less data.
>  # Do not affect the scalability of the existing ingestion system. The number 
> of files generated for partial updates should not increase dramatically.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
