Re: [DISCUSS] Incremental computation pipeline for HUDI

vino yang Wed, 31 Mar 2021 08:32:28 -0700

Hi Danny,

Thanks for kicking off this discussion thread.


Yes, incremental query (also called "incremental processing") has always been
an important feature of the Hudi framework. If we can make this feature
better, it will be even more exciting.

In the data warehouse, for some complex calculations, I have not found a
good way to conveniently use incremental change data (similar to the
concept of a retract stream in Flink?) to locally "correct" an
aggregation result (these aggregation results may belong to the DWS layer).

BTW: Yes, I do admit that some simple calculation scenarios (a single table,
or an algorithm whose results can be retracted very easily) can be handled
with incremental calculation based on CDC.
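
To make the simple case concrete, here is a minimal sketch in plain Python
(not Hudi or Flink API). It assumes each change record carries a flag, a
key and an amount; the flag strings follow the short names Flink uses for
its row kinds, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

# Change flags, named after Flink's row-kind short strings (assumption:
# the change data reaches the consumer in this shape).
INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE = "+I", "-U", "+U", "-D"


def apply_changes(totals, changes):
    """Fold (flag, key, amount) change records into per-key running sums."""
    for flag, key, amount in changes:
        if flag in (INSERT, UPDATE_AFTER):
            totals[key] += amount  # accumulate the new row image
        elif flag in (UPDATE_BEFORE, DELETE):
            totals[key] -= amount  # retract the old row image
        else:
            raise ValueError(f"unknown change flag: {flag}")
    return totals


# Hypothetical sample data: the initial load of three rows.
totals = defaultdict(int)
apply_changes(totals, [
    ("+I", "shop_a", 10),
    ("+I", "shop_a", 5),
    ("+I", "shop_b", 7),
])

# A later increment: one row changes from 5 to 8, another row is deleted.
apply_changes(totals, [
    ("-U", "shop_a", 5),
    ("+U", "shop_a", 8),
    ("-D", "shop_b", 7),
])
```

SUM can be corrected locally because subtraction undoes addition. An
aggregate such as MAX has no such inverse: retracting the current maximum
needs the remaining values, which is the kind of case I find hard.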

Of course, what "incremental calculation" means is sometimes not very clear
and varies from case to case. Maybe we can discuss it more clearly in
specific scenarios.

>> If HUDI can keep and propagate these change flags to its consumers, we
>> can use HUDI as the unified format for the pipeline.

Regarding the "change flags" here, do you mean flags like the ones shown
in the figure below?

[image: image.png]

Best,
Vino

Danny Chan <[email protected]> wrote on Wed, Mar 31, 2021 at 6:24 PM:

> Hi dear HUDI community ~ Here I want to start a discussion about using
> HUDI as the unified storage/format for data warehouse/lake incremental
> computation.
>
> Usually people divide data warehouse production into several layers, such
> as ODS (operational data store), DWD (data warehouse details), DWS (data
> warehouse service), and ADS (application data service).
>
>
> ODS -> DWD -> DWS -> ADS
>
> In NEAR-REAL-TIME (or pure real-time) computation cases, a big topic is
> syncing the change log (CDC pattern) from all kinds of RDBMS into the
> warehouse/lake. The CDC pattern records and propagates the change flags
> (insert, update (before and after), and delete) to the consumer. With
> these flags, the downstream engines can do real-time accumulating
> computation.
>
> Using a streaming engine like Flink, we can have a fully NEAR-REAL-TIME
> computation pipeline for each of the layers.
>
> If HUDI can keep and propagate these change flags to its consumers, we can
> use HUDI as the unified format for the pipeline.
>
> I look forward to your ideas here ~
>
> Best,
> Danny Chan
>
