I tried a POC for Flink locally and it works well. In the PR I added a new
metadata column named "_hoodie_change_flag", but I found that actually only
the log format needs this flag, and Spark may not yet be able to handle the
flag for incremental processing.

So should I add the "_hoodie_change_flag" metadata column, or is there a
better solution for this?

Best,
Danny Chan
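For concreteness, here is a minimal Java sketch of how a Flink reader could
translate such a flag into Flink's RowKind, so that downstream operators see
the changes natively. The flag encoding ("I", "-U", "+U", "D") and the mapper
class are hypothetical, not part of the PR; only
org.apache.flink.types.RowKind is existing Flink API.

import org.apache.flink.types.RowKind;

// Hypothetical mapper from a "_hoodie_change_flag" value read out of a
// Hudi log record to Flink's RowKind. The string encoding below is an
// assumption for illustration, not a settled design.
public final class ChangeFlagToRowKind {

    private ChangeFlagToRowKind() {
        // static utility, no instances
    }

    public static RowKind toRowKind(String changeFlag) {
        switch (changeFlag) {
            case "I":  return RowKind.INSERT;        // new row
            case "-U": return RowKind.UPDATE_BEFORE; // old image of an update
            case "+U": return RowKind.UPDATE_AFTER;  // new image of an update
            case "D":  return RowKind.DELETE;        // removed row
            default:
                throw new IllegalArgumentException(
                        "Unknown change flag: " + changeFlag);
        }
    }
}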
On Fri, Apr 2, 2021 at 11:08 AM Danny Chan <danny0...@apache.org> wrote:

> Thanks, cool. Then the remaining questions are:
>
> - Where do we record these changes? Should we add a built-in meta field
> such as _change_flag_, like the other system columns, e.g.
> _hoodie_commit_time?
> - What kind of table should keep these flags? In my thoughts, we should
> only add these flags for the "MERGE_ON_READ" table type, and only for the
> AVRO logs.
> - We should add a config option to switch the flags in the system meta
> fields on/off.
>
> What do you think?
>
> Best,
> Danny Chan
>
> On Thu, Apr 1, 2021 at 10:58 AM vino yang <yanghua1...@gmail.com> wrote:
>
>> > Oops, the image is broken. By "change flags", I mean: insert, update
>> > (before and after) and delete.
>>
>> Yes, the image I attached is also about these flags.
>> [image: image (3).png]
>>
>> +1 for the idea.
>>
>> Best,
>> Vino
>>
>> On Thu, Apr 1, 2021 at 10:03 AM Danny Chan <danny0...@apache.org> wrote:
>>
>>> Oops, the image is broken. By "change flags", I mean: insert, update
>>> (before and after) and delete.
>>>
>>> The Flink engine can propagate the change flags internally between its
>>> operators; if HUDI can send the change flags to Flink, the incremental
>>> calculation of CDC would be very natural (almost transparent to users).
>>>
>>> Best,
>>> Danny Chan
>>>
>>> On Wed, Mar 31, 2021 at 11:32 PM vino yang <yanghua1...@gmail.com>
>>> wrote:
>>>
>>> > Hi Danny,
>>> >
>>> > Thanks for kicking off this discussion thread.
>>> >
>>> > Yes, incremental query (or say, "incremental processing") has always
>>> > been an important feature of the Hudi framework. If we can make this
>>> > feature better, it will be even more exciting.
>>> >
>>> > In the data warehouse, for some complex calculations, I have not
>>> > found a good way to conveniently use incremental change data (similar
>>> > to the concept of a retract stream in Flink?) to locally "correct"
>>> > the aggregation results (these aggregation results may belong to the
>>> > DWS layer).
>>> >
>>> > BTW: Yes, I do admit that some simple calculation scenarios (a single
>>> > table, or an algorithm that can be retracted very easily) can be
>>> > handled with incremental calculation based on CDC.
>>> >
>>> > Of course, what "incremental calculation" means in various scenarios
>>> > is sometimes not very clear. Maybe we will discuss it more clearly in
>>> > specific scenarios.
>>> >
>>> > >> If HUDI can keep and propagate these change flags to its
>>> > >> consumers, we can use HUDI as the unified format for the pipeline.
>>> >
>>> > Regarding the "change flags" here, do you mean flags like the ones
>>> > shown in the figure below?
>>> >
>>> > [image: image.png]
>>> >
>>> > Best,
>>> > Vino
>>> >
>>> > On Wed, Mar 31, 2021 at 6:24 PM Danny Chan <danny0...@apache.org>
>>> > wrote:
>>> >
>>> >> Hi dear HUDI community ~ Here I want to start a discussion about
>>> >> using HUDI as the unified storage/format for data warehouse/lake
>>> >> incremental computation.
>>> >>
>>> >> Usually people divide data warehouse production into several levels,
>>> >> such as ODS (operational data store), DWD (data warehouse details),
>>> >> DWS (data warehouse service) and ADS (application data service).
>>> >>
>>> >> ODS -> DWD -> DWS -> ADS
>>> >>
>>> >> In the NEAR-REAL-TIME (or pure real-time) computation cases, a big
>>> >> topic is syncing the change log (CDC pattern) from all kinds of
>>> >> RDBMS into the warehouse/lake. The CDC pattern records and
>>> >> propagates the change flags (insert, update before/after, and
>>> >> delete) to the consumer; with these flags, the downstream engines
>>> >> can perform real-time accumulative computation.
>>> >>
>>> >> Using a streaming engine like Flink, we can have a totally
>>> >> NEAR-REAL-TIME computation pipeline for each of the layers.
>>> >>
>>> >> If HUDI can keep and propagate these change flags to its consumers,
>>> >> we can use HUDI as the unified format for the pipeline.
>>> >>
>>> >> I'm expecting your nice ideas here ~
>>> >>
>>> >> Best,
>>> >> Danny Chan
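To make the change-flag propagation concrete, below is a minimal,
self-contained Flink (Java) sketch. The table and field names are made up,
and the built-in 'datagen' connector stands in for an ODS feed; a grouped
aggregation turns the stream into an updating table, and toChangelogStream()
exposes each row's change flag (+I, -U, +U, -D), i.e. exactly the flags this
thread proposes to persist in HUDI.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ChangelogFlagsDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Stand-in for an ODS feed ('datagen' is Flink's built-in test source).
        tEnv.executeSql(
                "CREATE TABLE orders (user_id INT, amount BIGINT) WITH ("
                        + "'connector' = 'datagen', 'rows-per-second' = '5', "
                        + "'fields.user_id.min' = '1', 'fields.user_id.max' = '3')");

        // DWS-style aggregation: each new order retracts the old total (-U)
        // and emits the new one (+U).
        Table totals = tEnv.sqlQuery(
                "SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id");

        // Every printed Row is prefixed with its RowKind, e.g.
        // +I[1, 42], then -U[1, 42] and +U[1, 97] on the next order.
        tEnv.toChangelogStream(totals).print();

        env.execute("changelog-flags-demo");
    }
}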