soma17dec opened a new issue #4729:
URL: https://github.com/apache/hudi/issues/4729


   Hi,
   
   We are building a lakehouse on AWS services and Apache Hudi. We extract 
data from an Oracle DB with AWS DMS and write the files to S3 object storage. 
Because AWS DMS handles both full load and CDC replication, we run two separate 
tasks to load data into S3. The CDC task generates delta files that contain 
only the primary key and the modified columns; all other columns are NULL. 
Unfortunately, we cannot enable supplemental logging on all columns of our 
tables, because it adds overhead and has a performance impact. 
   
   After the data lands in S3 as Parquet files, we build Hudi tables and run 
upserts in MOR (merge-on-read) mode. 
   
   We want to understand whether Hudi can update an existing full record 
(with all columns) from a new version that contains only the PK column and the 
modified columns. 
   
   Example:
   
   Full record  - 101, Rahul, Manager, Engineering, 23-Apr-2020, $50000, Y
   Delta record - 101, , Sr Manager, , 24-Apr-2022, ,
   
   After compaction, the Hudi table returns:
   
   101, , Sr Manager, , 24-Apr-2022, ,
   
   Expected value - 101, Rahul, Sr Manager, Engineering, 24-Apr-2022, $50000, Y
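
To make the expected merge semantics concrete, here is a minimal plain-Python sketch. It is not Hudi code and does not reflect how Hudi merges records internally; the function name `merge_partial` and the column names (`id`, `name`, `title`, and so on) are hypothetical, chosen only to mirror the example rows above. It assumes that a NULL in the delta record always means "column not modified".

```python
from typing import Any, Dict, Optional

# A record maps column name -> value; None stands for a NULL column.
Record = Dict[str, Optional[Any]]


def merge_partial(full: Record, delta: Record, key: str = "id") -> Record:
    """Overlay the non-NULL columns of `delta` on top of `full`.

    Hypothetical helper: columns that are None in the delta keep the
    value already stored in the full record.
    """
    if full[key] != delta[key]:
        raise ValueError("records have different primary keys")
    merged = dict(full)  # copy, so the stored record is not mutated
    for column, value in delta.items():
        if value is not None:
            merged[column] = value
    return merged


# The rows from the example above, with illustrative column names.
full_record: Record = {
    "id": 101,
    "name": "Rahul",
    "title": "Manager",
    "department": "Engineering",
    "modified_date": "23-Apr-2020",
    "salary": "$50000",
    "active": "Y",
}
delta_record: Record = {
    "id": 101,
    "name": None,
    "title": "Sr Manager",
    "department": None,
    "modified_date": "24-Apr-2022",
    "salary": None,
    "active": None,
}

merged_record = merge_partial(full_record, delta_record)
```

One consequence of this rule is that a column genuinely updated to NULL at the source cannot be told apart from an unmodified column, so such updates would be lost.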
   
   Please advise if there is a solution for this problem.
   



