Hi Danny,

Thanks for kicking off this discussion thread.

Yes, incremental query (or "incremental processing") has always been an important feature of the Hudi framework. If we can make this feature better, that will be even more exciting.

In the data warehouse, for some complex calculations, I have not found a good way to conveniently use incremental change data (similar to the concept of a retract stream in Flink?) to locally "correct" aggregation results (these results may belong to the DWS layer).

BTW: Yes, I do admit that some simple calculation scenarios (a single table, or an algorithm that is very easy to retract) can be handled with incremental calculation based on CDC.

Of course, what "incremental calculation" means is sometimes not very clear and varies by occasion. Maybe we can discuss it more clearly in specific scenarios.

>> If HUDI can keep and propagate these change flags to its consumers, we can use HUDI as the unified format for the pipeline.

Regarding the "change flags" here, do you mean flags like the ones shown in the figure below?

[image: image.png]

Best,
Vino

Danny Chan <danny0...@apache.org> wrote on Wed, Mar 31, 2021 at 6:24 PM:

> Hi dear HUDI community ~ Here I want to start a discussion about using HUDI as
> the unified storage/format for data warehouse/lake incremental computation.
>
> Usually people divide data warehouse production into several levels, such
> as the ODS (operational data store), DWD (data warehouse details), DWS (data
> warehouse service), ADS (application data service).
>
> ODS -> DWD -> DWS -> ADS
>
> In the NEAR-REAL-TIME (or pure realtime) computation cases, a big topic is
> syncing the change log (CDC pattern) from all kinds of RDBMS into the
> warehouse/lake. The CDC pattern records and propagates the change flags:
> insert, update (before and after) and delete for the consumer. With these
> flags, the downstream engines can have a realtime accumulation computation.
>
> Using a streaming engine like Flink, we can have a totally NEAR-REAL-TIME
> computation pipeline for each of the layers.
>
> If HUDI can keep and propagate these change flags to its consumers, we can
> use HUDI as the unified format for the pipeline.
>
> I'm expecting your nice ideas here ~
>
> Best,
> Danny Chan
>
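
To make the "change flags" idea in the thread a bit more concrete, here is a minimal Python sketch of how a downstream consumer might fold flagged change records into a running per-key sum. Everything in it is an assumption for illustration: `ChangeFlag`, `apply_changes`, and the sample records are hypothetical names, not Hudi or Flink APIs. The short flag strings (+I, -U, +U, -D) simply mirror the short names Flink uses for its row kinds.

```python
from collections import defaultdict
from enum import Enum


class ChangeFlag(Enum):
    # Hypothetical flag type; the short strings mirror Flink's row-kind names.
    INSERT = "+I"
    UPDATE_BEFORE = "-U"
    UPDATE_AFTER = "+U"
    DELETE = "-D"


def apply_changes(totals, changes):
    """Fold (flag, key, amount) change records into per-key running sums."""
    for flag, key, amount in changes:
        if flag in (ChangeFlag.INSERT, ChangeFlag.UPDATE_AFTER):
            # New or post-update value: accumulate it.
            totals[key] += amount
        else:
            # Pre-update or deleted value: retract its old contribution.
            totals[key] -= amount
    return totals


totals = apply_changes(defaultdict(int), [
    (ChangeFlag.INSERT, "shop_a", 10),
    (ChangeFlag.INSERT, "shop_a", 5),
    (ChangeFlag.INSERT, "shop_b", 7),
    (ChangeFlag.UPDATE_BEFORE, "shop_a", 5),  # the 5 is being changed ...
    (ChangeFlag.UPDATE_AFTER, "shop_a", 8),   # ... to 8
    (ChangeFlag.DELETE, "shop_b", 7),
])
print(dict(totals))
```

A sum works here because each old value can be subtracted back out. Aggregates such as MAX or MIN cannot be corrected from the flags alone, which is the kind of "complex calculation" case raised in the reply above.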