Hi dear Hudi community ~ I'd like to start a discussion about using Hudi as the unified storage/format for data warehouse/lake incremental computation.
Data warehouse production is usually divided into several layers, such as ODS (operational data store), DWD (data warehouse details), DWS (data warehouse service), and ADS (application data service): ODS -> DWD -> DWS -> ADS.

In near-real-time (or pure real-time) computation cases, a big topic is syncing the change log (CDC pattern) from all kinds of RDBMS into the warehouse/lake. The CDC pattern records and propagates the change flags for its consumers: insert, update (before and after images), and delete. With these flags, the downstream engines can do real-time incremental (accumulation) computation. Using a streaming engine like Flink, we can build a fully near-real-time pipeline for each of these layers.

If Hudi can keep and propagate these change flags to its consumers, we can use Hudi as the unified format for the whole pipeline.

I'm looking forward to your ideas here ~

Best,
Danny Chan
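P.S. To make the "accumulation" idea concrete, here is a minimal sketch in plain Java of how a downstream consumer could use the propagated change flags to maintain an incremental aggregate. This is not the Hudi or Flink API; the ChangeFlag/Change/apply names are made up purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ChangeLogAggregation {

    // The four change flags a CDC source emits for each row (hypothetical names).
    enum ChangeFlag { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    // A single change-log record: flag + business key + metric value.
    record Change(ChangeFlag flag, String key, long amount) {}

    // Running SUM(amount) grouped by key, maintained incrementally.
    static final Map<String, Long> sumByKey = new HashMap<>();

    // Apply one change-log record to the aggregate: additions for
    // INSERT / UPDATE_AFTER, retractions for UPDATE_BEFORE / DELETE.
    static void apply(Change c) {
        long delta = switch (c.flag()) {
            case INSERT, UPDATE_AFTER -> c.amount();
            case UPDATE_BEFORE, DELETE -> -c.amount();
        };
        sumByKey.merge(c.key(), delta, Long::sum);
    }

    public static void main(String[] args) {
        // An order of 100 arrives, is corrected to 80, then is cancelled.
        apply(new Change(ChangeFlag.INSERT, "user-1", 100));
        apply(new Change(ChangeFlag.UPDATE_BEFORE, "user-1", 100));
        apply(new Change(ChangeFlag.UPDATE_AFTER, "user-1", 80));
        apply(new Change(ChangeFlag.DELETE, "user-1", 80));

        // With the before-image and delete flags the consumer can retract,
        // so the sum converges to 0 instead of drifting.
        System.out.println(sumByKey); // {user-1=0}
    }
}
```

The point of the sketch: if the format only exposes the upserted after-images (no update-before or delete flags), such retraction-based aggregates cannot stay correct without re-reading the full table, which is why keeping and propagating the flags through Hudi matters for the downstream layers.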