Oops, the image does not come through. By "change flags", I mean: insert,
update (before and after) and delete.
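
For concreteness, these four flags correspond to the row kinds Flink attaches
to changelog rows, usually abbreviated +I, -U, +U and -D. A minimal standalone
sketch of the idea (illustrative only, not the actual Flink class):

```java
// Sketch of the four CDC change flags, mirroring the shorthand Flink
// uses for its changelog row kinds (+I, -U, +U, -D).
// Illustrative only, not the actual org.apache.flink.types.RowKind class.
public enum ChangeFlag {
    INSERT("+I"),         // a newly added row
    UPDATE_BEFORE("-U"),  // the old value of an updated row (the retraction)
    UPDATE_AFTER("+U"),   // the new value of an updated row
    DELETE("-D");         // a removed row

    private final String shorthand;

    ChangeFlag(String shorthand) {
        this.shorthand = shorthand;
    }

    public String shorthand() {
        return shorthand;
    }
}
```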

The Flink engine can propagate the change flags internally between its
operators. If HUDI can send the change flags to Flink, incremental CDC
computation would be very natural (almost transparent to users).
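
To show what that incremental calculation could look like downstream, here is
a toy aggregator that maintains a per-key running sum by applying changelog
events carrying these flags: insert/update-after add a value, while
update-before/delete retract one. This is a standalone sketch with made-up
names, not Hudi or Flink API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy changelog-driven aggregator: maintains SUM(amount) per key by
// applying the four change flags, the way a downstream operator could
// incrementally "correct" an aggregate. Sketch only; the class and
// method names are hypothetical, not Hudi/Flink API.
public class ChangelogSum {
    public enum Flag { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    private final Map<String, Long> sums = new HashMap<>();

    public void apply(Flag flag, String key, long amount) {
        // INSERT and UPDATE_AFTER add the value;
        // UPDATE_BEFORE and DELETE retract it.
        long delta = (flag == Flag.INSERT || flag == Flag.UPDATE_AFTER)
                ? amount : -amount;
        sums.merge(key, delta, Long::sum);
    }

    public long get(String key) {
        return sums.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        ChangelogSum agg = new ChangelogSum();
        agg.apply(Flag.INSERT, "user-1", 10L);        // +I: sum = 10
        agg.apply(Flag.UPDATE_BEFORE, "user-1", 10L); // -U: retract old value
        agg.apply(Flag.UPDATE_AFTER, "user-1", 25L);  // +U: sum = 25
        System.out.println(agg.get("user-1"));        // prints 25
    }
}
```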

Best,
Danny Chan

vino yang <yanghua1...@gmail.com> wrote on Wed, Mar 31, 2021, 11:32 PM:

> Hi Danny,
>
> Thanks for kicking off this discussion thread.
>
> Yes, incremental query (or rather "incremental processing") has always been
> an important feature of the Hudi framework. If we can make this feature
> better, it will be even more exciting.
>
> In the data warehouse, for some complex calculations, I have not found a
> good way to conveniently use incremental change data (similar to the
> concept of a retraction stream in Flink?) to locally "correct" the
> aggregation results (these aggregation results may belong to the DWS layer).
>
> BTW: yes, I do admit that some simple calculation scenarios (a single
> table, or an algorithm whose results can very easily be retracted) can be
> handled with CDC-based incremental calculation.
>
> Of course, what "incremental calculation" means in various contexts is
> sometimes not very clear. Maybe we can discuss it more concretely in
> specific scenarios.
>
> >> If HUDI can keep and propagate these change flags to its consumers, we
> >> can use HUDI as the unified format for the pipeline.
>
> Regarding the "change flags" here, do you mean flags like the ones shown
> in the figure below?
>
> [image: image.png]
>
> Best,
> Vino
>
> Danny Chan <danny0...@apache.org> wrote on Wed, Mar 31, 2021, 6:24 PM:
>
>> Hi dear HUDI community ~ Here I want to start a discussion about using
>> HUDI as the unified storage/format for data warehouse/lake incremental
>> computation.
>>
>> Usually people divide data warehouse production into several layers, such
>> as ODS (operational data store), DWD (data warehouse details), DWS (data
>> warehouse service) and ADS (application data service).
>>
>>
>> ODS -> DWD -> DWS -> ADS
>>
>> In the NEAR-REAL-TIME (or pure realtime) computation cases, a big topic is
>> syncing the change log (the CDC pattern) from all kinds of RDBMS into the
>> warehouse/lake. The CDC pattern records and propagates the change flags
>> (insert, update before/after and delete) to the consumer. With these
>> flags, the downstream engines can perform realtime incremental
>> accumulation.
>>
>> Using a streaming engine like Flink, we can have a fully NEAR-REAL-TIME
>> computation pipeline for each of the layers.
>>
>> If HUDI can keep and propagate these change flags to its consumers, we can
>> use HUDI as the unified format for the pipeline.
>>
>> I'm expecting your nice ideas here ~
>>
>> Best,
>> Danny Chan
>>
>
