Interesting thoughts. I'm not sure I fully understand this part: "generate 2
records in combineAndGetUpdateValue". Isn't the API defined to return just one
record?
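
To make the question concrete, here is a minimal sketch in plain Java of the
SCD-2 step described in the quoted message below. It is not Hudi code:
Scd2Sketch, Row, mergeScd2 and the "value" field are hypothetical names used
only for illustration. The point is that closing the current version and
opening a new one yields two rows with two different 'id,end_date' keys, so
the result does not fit a merge call that returns a single record.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch; none of these names are Hudi classes.
final class Scd2Sketch {
    // Open-ended end_date used for the currently active version.
    static final LocalDate OPEN_END = LocalDate.of(2099, 12, 31);

    // Hypothetical dimension row; the proposed record key is (id, endDate).
    static final class Row {
        final int id;
        final String value;
        final LocalDate endDate;

        Row(int id, String value, LocalDate endDate) {
            this.id = id;
            this.value = value;
            this.endDate = endDate;
        }
    }

    // SCD-2 merge: close the current version at effectiveDate and open a
    // new version with the open-ended end_date. Always returns two rows.
    static List<Row> mergeScd2(Row current, Row incoming, LocalDate effectiveDate) {
        Row closed = new Row(current.id, current.value, effectiveDate);
        Row opened = new Row(incoming.id, incoming.value, OPEN_END);
        return Arrays.asList(closed, opened);
    }

    public static void main(String[] args) {
        Row current = new Row(1, "old", OPEN_END);
        Row incoming = new Row(1, "new", OPEN_END);
        for (Row r : mergeScd2(current, incoming, LocalDate.of(2022, 10, 21))) {
            System.out.println(r.id + " " + r.value + " " + r.endDate);
        }
    }
}
```

Note that the closed row's key changes from (1, 2099-12-31) to (1, 2022-10-21),
so under the proposed key it would look like a different record rather than an
update of the existing one.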

On Fri, Oct 21, 2022 at 1:07 AM 冯健 <[email protected]> wrote:

> Hi guys,
>     After reading this article on how to implement SCD-2 with Hudi,
> "Build Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and
> Apache Hudi on Amazon EMR"
> <
> https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/
> >,
> I have an idea about implementing embedded SCD-2 support in Hudi by
> using a new Payload. Users would not need to manually join the data and
> then update end_date and status.
>    For example, say the record key is 'id,end_date' and the current
> record has id=1 and end_date=2099-12-31. When a new record with id=1
> arrives, the Payload updates the current record's end_date to 2022-10-21
> and also inserts the new record with end_date '2099-12-31'. So this Payload
> will generate two records in combineAndGetUpdateValue. There will be no
> join cost, and the whole process is transparent to users.
>
>    Any thoughts?
>


-- 
Best,
Shiyan
