Thanks Vinoth ~

Here is a document about the notion of "Flink Dynamic Tables" [1]. Every
operator that has accumulated state can handle retractions (UPDATE_BEFORE or
DELETE) and then apply new changes (INSERT or UPDATE_AFTER), so that each
operator can consume the CDC-format messages in a streaming way.
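
To make this concrete, below is a minimal sketch (plain Java, assuming only
flink-core on the classpath for org.apache.flink.types.RowKind) of how an
operator with accumulated state could consume such a changelog: it retracts on
UPDATE_BEFORE/DELETE and accumulates on INSERT/UPDATE_AFTER. The aggregation
itself is made up for illustration; it is not Hudi or Flink internal code:

import org.apache.flink.types.RowKind;
import java.util.HashMap;
import java.util.Map;

public class ChangelogSumSketch {
    // Accumulated state: a running sum per key.
    private final Map<String, Long> sums = new HashMap<>();

    public void apply(RowKind kind, String key, long value) {
        switch (kind) {
            case INSERT:
            case UPDATE_AFTER:
                sums.merge(key, value, Long::sum);   // apply the new change
                break;
            case UPDATE_BEFORE:
            case DELETE:
                sums.merge(key, -value, Long::sum);  // handle the retraction
                break;
        }
    }

    public static void main(String[] args) {
        ChangelogSumSketch agg = new ChangelogSumSketch();
        // An UPDATE arrives as a retraction of the old row plus the new row.
        agg.apply(RowKind.INSERT, "user1", 10L);
        agg.apply(RowKind.UPDATE_BEFORE, "user1", 10L);
        agg.apply(RowKind.UPDATE_AFTER, "user1", 15L);
        System.out.println(agg.sums); // {user1=15}
    }
}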

> Another aspect to think about is how the new flag can be added to
> existing tables and whether the schema evolution would be fine.

That is also my concern, but it's not that bad, because adding a new column
is still backward compatible with the old schema in Avro.
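
For example (a minimal sketch with Avro's SchemaBuilder; the exact field name
and the "I" default below are assumptions for illustration), giving the new
column a default value is what keeps old records readable under the evolved
schema:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ChangeFlagEvolutionSketch {
    public static void main(String[] args) {
        // The new field carries a default, so records written with the old
        // schema (without the flag) still deserialize under the new schema.
        Schema evolved = SchemaBuilder.record("ExampleRecord")
            .fields()
            .requiredString("_hoodie_commit_time")
            .requiredString("uuid")
            .name("_hoodie_change_flag").type().stringType().stringDefault("I")
            .endRecord();
        System.out.println(evolved.toString(true));
    }
}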

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html

Best,
Danny Chan

On Fri, Apr 16, 2021 at 9:44 AM Vinoth Chandar <vin...@apache.org> wrote:

> Hi,
>
> Is the intent of the flag to convey whether an insert, delete, or update
> changed the record? If so, I would imagine that we do this even for COW
> tables, since that also supports a logical notion of a change stream using
> the commit_time meta field.
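>
> For reference, this is roughly how that commit_time-based change stream is
> pulled today (a hedged Java/Spark sketch; the table path and begin instant
> are placeholders). Note that it yields the latest state of each changed
> record, without insert/update/delete flags:
>
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> public class IncrementalPullSketch {
>     public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder()
>             .appName("incr-sketch").master("local[*]").getOrCreate();
>         // Incremental query: records changed after the given commit time.
>         Dataset<Row> changes = spark.read().format("hudi")
>             .option("hoodie.datasource.query.type", "incremental")
>             .option("hoodie.datasource.read.begin.instanttime", "20210401000000")
>             .load("/path/to/hudi_table");
>         changes.show();
>     }
> }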
>
> You may be right, but I am trying to understand the use case for this. Any
> links/Flink docs I can read?
>
> Another aspect to think about is how the new flag can be added to existing
> tables and whether the schema evolution would be fine.
>
> Thanks
> Vinoth
>
> On Thu, Apr 8, 2021 at 2:13 AM Danny Chan <danny0...@apache.org> wrote:
>
> > I tried a POC for Flink locally and it works well. In the PR I added a
> > new metadata column named "_hoodie_change_flag", but I found that only
> > the log format needs this flag, and Spark may not yet have the ability
> > to handle the flag for incremental processing.
> >
> > So should I add the "_hoodie_change_flag" metadata column, or is there
> > any better solution for this?
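> >
> > To illustrate what consuming the column could look like on the Flink side,
> > here is a hedged sketch mapping a "_hoodie_change_flag" string value onto
> > Flink's RowKind (the "I"/"-U"/"+U"/"D" encoding is an assumption, not a
> > settled format):
> >
> > import org.apache.flink.types.RowKind;
> >
> > public class ChangeFlagMapper {
> >     public static RowKind toRowKind(String flag) {
> >         switch (flag) {
> >             case "I":  return RowKind.INSERT;
> >             case "-U": return RowKind.UPDATE_BEFORE;
> >             case "+U": return RowKind.UPDATE_AFTER;
> >             case "D":  return RowKind.DELETE;
> >             default:
> >                 throw new IllegalArgumentException("Unknown flag: " + flag);
> >         }
> >     }
> > }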
> >
> > Best,
> > Danny Chan
> >
> > On Fri, Apr 2, 2021 at 11:08 AM Danny Chan <danny0...@apache.org> wrote:
> >
> > > Thanks, cool. Then the remaining questions are:
> > >
> > > - where do we record these changes: should we add a built-in meta field
> > > such as _change_flag_, like the other system columns, e.g.
> > > _hoodie_commit_time
> > > - what kind of table should keep these flags: in my thoughts, we should
> > > only add these flags for the "MERGE_ON_READ" table, and only for Avro logs
> > > - we should add a config to switch the flags in the system meta fields
> > > on/off (see the sketch below)
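> > >
> > > For the third point, a hedged sketch of what the switch might look like
> > > from Flink SQL (the option name 'hoodie.changelog.enabled' is
> > > hypothetical; it is the knob being proposed here, not an existing one):
> > >
> > > import org.apache.flink.table.api.EnvironmentSettings;
> > > import org.apache.flink.table.api.TableEnvironment;
> > >
> > > public class ChangeFlagSwitchSketch {
> > >     public static void main(String[] args) {
> > >         TableEnvironment tEnv = TableEnvironment.create(
> > >             EnvironmentSettings.newInstance().inStreamingMode().build());
> > >         tEnv.executeSql(
> > >             "CREATE TABLE hudi_sink ("
> > >             + "  uuid STRING PRIMARY KEY NOT ENFORCED,"
> > >             + "  name STRING"
> > >             + ") WITH ("
> > >             + "  'connector' = 'hudi',"
> > >             + "  'path' = '/path/to/table',"
> > >             + "  'table.type' = 'MERGE_ON_READ',"
> > >             // Hypothetical on/off switch for the change flags:
> > >             + "  'hoodie.changelog.enabled' = 'true'"
> > >             + ")");
> > >     }
> > > }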
> > >
> > > What do you think?
> > >
> > > Best,
> > > Danny Chan
> > >
> > > On Thu, Apr 1, 2021 at 10:58 AM vino yang <yanghua1...@gmail.com> wrote:
> > >
> > >> >> Oops, the image is broken; by "change flags" I mean: insert,
> > >> >> update (before and after), and delete.
> > >>
> > >> Yes, the image I attached is also about these flags.
> > >> [image: image (3).png]
> > >>
> > >> +1 for the idea.
> > >>
> > >> Best,
> > >> Vino
> > >>
> > >>
> > >> On Thu, Apr 1, 2021 at 10:03 AM Danny Chan <danny0...@apache.org> wrote:
> > >>
> > >>> Oops, the image is broken; by "change flags" I mean: insert,
> > >>> update (before and after), and delete.
> > >>>
> > >>> The Flink engine can propagate the change flags internally between
> > >>> its operators; if HUDI can send the change flags to Flink, the
> > >>> incremental calculation of CDC would be very natural (almost
> > >>> transparent to users).
> > >>>
> > >>> Best,
> > >>> Danny Chan
> > >>>
> > >>> On Wed, Mar 31, 2021 at 11:32 PM vino yang <yanghua1...@gmail.com> wrote:
> > >>>
> > >>> > Hi Danny,
> > >>> >
> > >>> > Thanks for kicking off this discussion thread.
> > >>> >
> > >>> > Yes, incremental query (or "incremental processing") has always
> > >>> > been an important feature of the Hudi framework. If we can make
> > >>> > this feature better, it will be even more exciting.
> > >>> >
> > >>> > In the data warehouse, for some complex calculations, I have not
> > >>> > found a good way to conveniently use incremental change data
> > >>> > (similar to the concept of a retraction stream in Flink?) to
> > >>> > locally "correct" the aggregation result (these aggregation results
> > >>> > may belong to the DWS layer).
> > >>> >
> > >>> > BTW: Yes, I do admit that some simple calculation scenarios (a
> > >>> > single table, or an algorithm that can be very easily retracted)
> > >>> > can be dealt with based on the incremental calculation of CDC.
> > >>> >
> > >>> > Of course, what incremental calculation means in various settings
> > >>> > is sometimes not very clear. Maybe we will discuss it more clearly
> > >>> > in specific scenarios.
> > >>> >
> > >>> > >> If HUDI can keep and propagate these change flags to its
> > >>> > >> consumers, we can use HUDI as the unified format for the
> > >>> > >> pipeline.
> > >>> >
> > >>> > Regarding the "change flags" here, do you mean flags like the ones
> > >>> > shown in the figure below?
> > >>> >
> > >>> > [image: image.png]
> > >>> >
> > >>> > Best,
> > >>> > Vino
> > >>> >
> > >>> > On Wed, Mar 31, 2021 at 6:24 PM Danny Chan <danny0...@apache.org> wrote:
> > >>> >
> > >>> >> Hi dear HUDI community ~ Here I want to start a discussion about
> > >>> >> using HUDI as the unified storage/format for data warehouse/lake
> > >>> >> incremental computation.
> > >>> >>
> > >>> >> Usually people divide data warehouse production into several
> > >>> >> levels, such as ODS (operational data store), DWD (data warehouse
> > >>> >> details), DWS (data warehouse service), and ADS (application data
> > >>> >> service).
> > >>> >>
> > >>> >>
> > >>> >> ODS -> DWD -> DWS -> ADS
> > >>> >>
> > >>> >> In NEAR-REAL-TIME (or pure real-time) computation cases, a big
> > >>> >> topic is syncing the change log (CDC pattern) from all kinds of
> > >>> >> RDBMS into the warehouse/lake. The CDC pattern records and
> > >>> >> propagates the change flags, insert, update (before and after),
> > >>> >> and delete, to the consumer; with these flags, the downstream
> > >>> >> engines can perform real-time accumulation computation.
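> > >>> >>
> > >>> >> As one concrete example of where those flags enter the pipeline
> > >>> >> today, here is a hedged Flink SQL sketch (wrapped in Java) of an
> > >>> >> ODS changelog source using the 'debezium-json' format; topic and
> > >>> >> server names are placeholders:
> > >>> >>
> > >>> >> import org.apache.flink.table.api.EnvironmentSettings;
> > >>> >> import org.apache.flink.table.api.TableEnvironment;
> > >>> >>
> > >>> >> public class OdsChangelogSourceSketch {
> > >>> >>     public static void main(String[] args) {
> > >>> >>         TableEnvironment tEnv = TableEnvironment.create(
> > >>> >>             EnvironmentSettings.newInstance().inStreamingMode().build());
> > >>> >>         // Each Debezium message carries the change flag, which Flink
> > >>> >>         // turns into INSERT/UPDATE_BEFORE/UPDATE_AFTER/DELETE rows.
> > >>> >>         tEnv.executeSql(
> > >>> >>             "CREATE TABLE ods_orders ("
> > >>> >>             + "  order_id BIGINT,"
> > >>> >>             + "  amount DECIMAL(10, 2)"
> > >>> >>             + ") WITH ("
> > >>> >>             + "  'connector' = 'kafka',"
> > >>> >>             + "  'topic' = 'db.orders',"
> > >>> >>             + "  'properties.bootstrap.servers' = 'localhost:9092',"
> > >>> >>             + "  'properties.group.id' = 'demo',"
> > >>> >>             + "  'scan.startup.mode' = 'earliest-offset',"
> > >>> >>             + "  'format' = 'debezium-json'"
> > >>> >>             + ")");
> > >>> >>     }
> > >>> >> }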
> > >>> >>
> > >>> >> Using a streaming engine like Flink, we can have a totally
> > >>> >> NEAR-REAL-TIME computation pipeline for each of the layers.
> > >>> >>
> > >>> >> If HUDI can keep and propagate these change flags to its
> > >>> >> consumers, we can use HUDI as the unified format for the pipeline.
> > >>> >>
> > >>> >> I'm expecting your nice ideas here ~
> > >>> >>
> > >>> >> Best,
> > >>> >> Danny Chan
> > >>> >>
> > >>> >
> > >>>
> > >>
> >
>
