Hello, what is the schema of the file being read from S3_INCR_RAW_DATA?

Best,
Danny

Sid Kal <[email protected]> wrote on Monday, February 21, 2022 at 03:49:
>
> We have a use case for which we were planning to use Hudi tables for CDC 
> purposes. My intention is to perform upserts along with deletes: if a 
> record is deleted in my source system, it should be deleted from my 
> target as well.
>
> I went through this link where a user is performing CDC using Hudi.
> https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b
>
> My question is: how does Hudi internally recognize the records in the 
> incremental data load? In other words, what should the incremental file 
> look like so that we can tell which records are meant to be inserted, 
> updated, or deleted?
>
> I am actually confused with this part:
>
> S3_INCR_RAW_DATA = "s3://aws-analytics-course/raw/dms/fossil/coal_prod/20200808-*.csv"
> df_coal_prod_incr = spark.read.csv(S3_INCR_RAW_DATA, header=False, schema=coal_prod_schema)
> df_coal_prod_incr_u_i = df_coal_prod_incr.filter("Mode IN ('U', 'I')")
>
> Here the user is filtering directly on Mode. Is "Mode" a column inside 
> the dataset? Or where does it come from?
>
> I am a newbie to Hudi.
>
> Thanks,
> Sid
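
For context on the Mode column: the raw files in that article come from AWS DMS, which prepends an operation flag to each CDC record, with values such as 'I' (insert), 'U' (update), and 'D' (delete); presumably the article's coal_prod_schema names that first column "Mode". A minimal plain-Python sketch of the filtering idea (the sample rows below are invented for illustration, not taken from the article's data):

```python
import csv
import io

# Illustrative CDC rows: the first field is the operation flag that AWS DMS
# prepends to each change record (I = insert, U = update, D = delete).
raw = """I,1,Australia,2020,500
U,2,Canada,2020,300
D,3,Chile,2019,250
"""

rows = list(csv.reader(io.StringIO(raw)))

# Mirrors the article's df_coal_prod_incr.filter("Mode IN ('U', 'I')"):
# upserts go to the Hudi upsert path, deletes to the delete path.
upserts = [r for r in rows if r[0] in ("U", "I")]
deletes = [r for r in rows if r[0] == "D"]

print(len(upserts))  # 2
print(len(deletes))  # 1
```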

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
