Hi dear Hudi community ~ I'd like to start a discussion about using Hudi as the unified storage/format for data warehouse/lake incremental computation.
Data warehouse production is usually divided into several layers, such as ODS (operational data store), DWD (data warehouse details), DWS (data warehouse service), and ADS (application data service): ODS -> DWD -> DWS -> ADS.

In near-real-time (or pure real-time) computation cases, a big topic is syncing the change log (CDC pattern) from all kinds of RDBMS into the warehouse/lake. The CDC pattern records and propagates the change flags for its consumers: insert, update (before and after images), and delete. With these flags, the downstream engines can do real-time incremental (accumulation) computation. Using a streaming engine like Flink, we can build a fully near-real-time pipeline for each of these layers.

If Hudi can keep and propagate these change flags to its consumers, we can use Hudi as the unified format for the whole pipeline.

I'm looking forward to your ideas here ~

Best,
Danny Chan
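P.S. To make the "accumulation" idea concrete, here is a minimal sketch in plain Java of how a downstream consumer could use the propagated change flags to maintain an incremental aggregate. This is not the Hudi or Flink API; the ChangeFlag/Change/apply names are made up purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ChangeLogAggregation {

    // The four change flags a CDC source emits for each row (hypothetical names).
    enum ChangeFlag { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    // A single change-log record: flag + business key + metric value.
    record Change(ChangeFlag flag, String key, long amount) {}

    // Running SUM(amount) grouped by key, maintained incrementally.
    static final Map<String, Long> sumByKey = new HashMap<>();

    // Apply one change-log record to the aggregate: additions for
    // INSERT / UPDATE_AFTER, retractions for UPDATE_BEFORE / DELETE.
    static void apply(Change c) {
        long delta = switch (c.flag()) {
            case INSERT, UPDATE_AFTER -> c.amount();
            case UPDATE_BEFORE, DELETE -> -c.amount();
        };
        sumByKey.merge(c.key(), delta, Long::sum);
    }

    public static void main(String[] args) {
        // An order of 100 arrives, is corrected to 80, then is cancelled.
        apply(new Change(ChangeFlag.INSERT, "user-1", 100));
        apply(new Change(ChangeFlag.UPDATE_BEFORE, "user-1", 100));
        apply(new Change(ChangeFlag.UPDATE_AFTER, "user-1", 80));
        apply(new Change(ChangeFlag.DELETE, "user-1", 80));

        // With the before-image and delete flags the consumer can retract,
        // so the sum converges to 0 instead of drifting.
        System.out.println(sumByKey); // {user-1=0}
    }
}
```

The point of the sketch: if the format only exposes the upserted after-images (no update-before or delete flags), such retraction-based aggregates cannot stay correct without re-reading the full table, which is why keeping and propagating the flags through Hudi matters for the downstream layers.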