I tried a POC for Flink locally and it works well. In the PR I added a new
metadata column named "_hoodie_change_flag", but I found that actually only
the log format needs this flag, and Spark may not yet be able to handle the
flag for incremental processing.

So should I add the "_hoodie_change_flag" metadata column, or is there a
better solution for this?

Best,
Danny Chan
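For concreteness, here is a minimal Java sketch of how a Flink reader could
translate such a flag into Flink's RowKind, so that downstream operators see
the changes natively. The flag encoding ("I", "-U", "+U", "D") and the mapper
class are hypothetical, not part of the PR; only
org.apache.flink.types.RowKind is existing Flink API.

import org.apache.flink.types.RowKind;

// Hypothetical mapper from a "_hoodie_change_flag" value read out of a
// Hudi log record to Flink's RowKind. The string encoding below is an
// assumption for illustration, not a settled design.
public final class ChangeFlagToRowKind {

    private ChangeFlagToRowKind() {
        // static utility, no instances
    }

    public static RowKind toRowKind(String changeFlag) {
        switch (changeFlag) {
            case "I":  return RowKind.INSERT;        // new row
            case "-U": return RowKind.UPDATE_BEFORE; // old image of an update
            case "+U": return RowKind.UPDATE_AFTER;  // new image of an update
            case "D":  return RowKind.DELETE;        // removed row
            default:
                throw new IllegalArgumentException(
                        "Unknown change flag: " + changeFlag);
        }
    }
}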
On Fri, Apr 2, 2021 at 11:08 AM Danny Chan <danny0...@apache.org> wrote:

> Thanks, cool. Then the remaining questions are:
>
> - Where do we record these changes? Should we add a built-in meta field
> such as _change_flag_, like the other system columns, e.g.
> _hoodie_commit_time?
> - What kind of table should keep these flags? In my thoughts, we should
> only add these flags for the "MERGE_ON_READ" table type, and only for the
> AVRO logs.
> - We should add a config option to switch the flags in the system meta
> fields on/off.
>
> What do you think?
>
> Best,
> Danny Chan
>
> On Thu, Apr 1, 2021 at 10:58 AM vino yang <yanghua1...@gmail.com> wrote:
>
>> > Oops, the image is broken. By "change flags", I mean: insert, update
>> > (before and after) and delete.
>>
>> Yes, the image I attached is also about these flags.
>> [image: image (3).png]
>>
>> +1 for the idea.
>>
>> Best,
>> Vino
>>
>> On Thu, Apr 1, 2021 at 10:03 AM Danny Chan <danny0...@apache.org> wrote:
>>
>>> Oops, the image is broken. By "change flags", I mean: insert, update
>>> (before and after) and delete.
>>>
>>> The Flink engine can propagate the change flags internally between its
>>> operators; if HUDI can send the change flags to Flink, the incremental
>>> calculation of CDC would be very natural (almost transparent to users).
>>>
>>> Best,
>>> Danny Chan
>>>
>>> On Wed, Mar 31, 2021 at 11:32 PM vino yang <yanghua1...@gmail.com>
>>> wrote:
>>>
>>> > Hi Danny,
>>> >
>>> > Thanks for kicking off this discussion thread.
>>> >
>>> > Yes, incremental query (or say, "incremental processing") has always
>>> > been an important feature of the Hudi framework. If we can make this
>>> > feature better, it will be even more exciting.
>>> >
>>> > In the data warehouse, for some complex calculations, I have not
>>> > found a good way to conveniently use incremental change data (similar
>>> > to the concept of a retract stream in Flink?) to locally "correct"
>>> > the aggregation results (these aggregation results may belong to the
>>> > DWS layer).
>>> >
>>> > BTW: Yes, I do admit that some simple calculation scenarios (a single
>>> > table, or an algorithm that can be retracted very easily) can be
>>> > handled with incremental calculation based on CDC.
>>> >
>>> > Of course, what "incremental calculation" means in various scenarios
>>> > is sometimes not very clear. Maybe we will discuss it more clearly in
>>> > specific scenarios.
>>> >
>>> > >> If HUDI can keep and propagate these change flags to its
>>> > >> consumers, we can use HUDI as the unified format for the pipeline.
>>> >
>>> > Regarding the "change flags" here, do you mean flags like the ones
>>> > shown in the figure below?
>>> >
>>> > [image: image.png]
>>> >
>>> > Best,
>>> > Vino
>>> >
>>> > On Wed, Mar 31, 2021 at 6:24 PM Danny Chan <danny0...@apache.org>
>>> > wrote:
>>> >
>>> >> Hi dear HUDI community ~ Here I want to start a discussion about
>>> >> using HUDI as the unified storage/format for data warehouse/lake
>>> >> incremental computation.
>>> >>
>>> >> Usually people divide data warehouse production into several levels,
>>> >> such as ODS (operational data store), DWD (data warehouse details),
>>> >> DWS (data warehouse service) and ADS (application data service).
>>> >>
>>> >> ODS -> DWD -> DWS -> ADS
>>> >>
>>> >> In the NEAR-REAL-TIME (or pure real-time) computation cases, a big
>>> >> topic is syncing the change log (CDC pattern) from all kinds of
>>> >> RDBMS into the warehouse/lake. The CDC pattern records and
>>> >> propagates the change flags (insert, update before/after, and
>>> >> delete) to the consumer; with these flags, the downstream engines
>>> >> can perform real-time accumulative computation.
>>> >>
>>> >> Using a streaming engine like Flink, we can have a totally
>>> >> NEAR-REAL-TIME computation pipeline for each of the layers.
>>> >>
>>> >> If HUDI can keep and propagate these change flags to its consumers,
>>> >> we can use HUDI as the unified format for the pipeline.
>>> >>
>>> >> I'm expecting your nice ideas here ~
>>> >>
>>> >> Best,
>>> >> Danny Chan
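To make the change-flag propagation concrete, below is a minimal,
self-contained Flink (Java) sketch. The table and field names are made up,
and the built-in 'datagen' connector stands in for an ODS feed; a grouped
aggregation turns the stream into an updating table, and toChangelogStream()
exposes each row's change flag (+I, -U, +U, -D), i.e. exactly the flags this
thread proposes to persist in HUDI.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ChangelogFlagsDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Stand-in for an ODS feed ('datagen' is Flink's built-in test source).
        tEnv.executeSql(
                "CREATE TABLE orders (user_id INT, amount BIGINT) WITH ("
                        + "'connector' = 'datagen', 'rows-per-second' = '5', "
                        + "'fields.user_id.min' = '1', 'fields.user_id.max' = '3')");

        // DWS-style aggregation: each new order retracts the old total (-U)
        // and emits the new one (+U).
        Table totals = tEnv.sqlQuery(
                "SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id");

        // Every printed Row is prefixed with its RowKind, e.g.
        // +I[1, 42], then -U[1, 42] and +U[1, 97] on the next order.
        tEnv.toChangelogStream(totals).print();

        env.execute("changelog-flags-demo");
    }
}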