wrt changes: if we plan to add this only to log files, compaction needs to
be fixed to omit this column when writing base files.
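
To make that concrete, here is a minimal sketch (plain Python; the column
name `_hoodie_change_flag` comes from the thread, the record shape is an
assumption) of projecting the change-flag column out when log records are
compacted into base files:

```python
# Sketch: during compaction, log records carry an extra change-flag
# meta column that should not be written to the base file, so the
# base-file schema stays unchanged.
CHANGE_FLAG_COL = "_hoodie_change_flag"  # assumed column name from the thread

def project_for_base_file(log_record: dict) -> dict:
    """Drop the change-flag meta column before writing to the base file."""
    return {k: v for k, v in log_record.items() if k != CHANGE_FLAG_COL}

log_record = {
    "_hoodie_commit_time": "20210416090700",
    "_hoodie_change_flag": "U",   # I / U / D
    "id": 1,
    "name": "foo",
}
base_record = project_for_base_file(log_record)
# base_record keeps the real columns and the standard meta fields,
# but omits _hoodie_change_flag
```

This keeps the flag confined to the avro logs while base files retain the
schema they have today.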

On Fri, Apr 16, 2021 at 9:07 PM Sivabalan <n.siv...@gmail.com> wrote:

> Just got a chance to read about dynamic tables. Sounds interesting.
>
> some thoughts on your questions:
> - yes, just MOR makes sense.
> - But adding this new meta column only to avro logs might incur some
> non-trivial changes, since as of today the schemas of the avro logs and
> base files are in sync. If this new col is going to store just 2 bits
> (insert/update/delete), we might as well add it to base files as well and
> keep it simple. But we can make it configurable so that only those
> interested can enable this new column for their hudi dataset.
> - Wondering: even today we can achieve this by using a transformer to set
> the right values for this new column. I mean, users need to add this col
> to their schema when defining the hudi dataset, and if the incoming data
> has the right values for this col (via deltastreamer's transformer or by
> explicit means), we don't even need to add it as a meta column. Just
> saying that we can achieve this even today. But if we are looking to
> integrate w/ SQL DML, then adding this support would be elegant.
>
>
>
> On Thu, Apr 15, 2021 at 11:33 PM Danny Chan <danny0...@apache.org> wrote:
>
>> Thanks Vinoth ~
>>
>> Here is a document about the notion of "Flink Dynamic Tables" [1]; every
>> operator that has accumulated state can handle retractions (UPDATE_BEFORE
>> or DELETE) and then apply new changes (INSERT or UPDATE_AFTER), so that
>> each operator can consume CDC-format messages in a streaming way.
>>
>> > Another aspect to think about is, how the new flag can be added to
>> existing
>> tables and if the schema evolution would be fine.
>>
>> That is also my concern, but it's not that bad, because adding a new
>> column is still backward-compatible with the old schema in Avro.
>>
>> [1]
>>
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>>
>> Best,
>> Danny Chan
>>
>> Vinoth Chandar <vin...@apache.org> wrote on Fri, Apr 16, 2021 at 9:44 AM:
>>
>> > Hi,
>> >
>> > Is the intent of the flag to convey whether an insert, delete or update
>> > changed the record? If so, I would imagine that we do this even for COW
>> > tables, since that type also supports a logical notion of a change
>> > stream using the commit_time meta field.
>> >
>> > You may be right, but I am trying to understand the use case for this.
>> > Any links/Flink docs I can read?
>> >
>> > Another aspect to think about is, how the new flag can be added to
>> existing
>> > tables and if the schema evolution would be fine.
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Thu, Apr 8, 2021 at 2:13 AM Danny Chan <danny0...@apache.org> wrote:
>> >
>> > > I tried a POC for Flink locally and it works well. In the PR I added a
>> > > new metadata column named "_hoodie_change_flag", but I found that only
>> > > the log format needs this flag, and Spark may not be able to handle the
>> > > flag for incremental processing yet.
>> > >
>> > > So should I add the "_hoodie_change_flag" metadata column, or is there
>> > > any better solution for this?
>> > >
>> > > Best,
>> > > Danny Chan
>> > >
>> > > Danny Chan <danny0...@apache.org> wrote on Fri, Apr 2, 2021 at 11:08 AM:
>> > >
>> > > > Thanks, cool. Then the remaining questions are:
>> > > >
>> > > > - where we record these changes: should we add a built-in meta field
>> > > > such as _change_flag_, like the other system columns, e.g.
>> > > > _hoodie_commit_time
>> > > > - what kinds of tables should keep these flags: in my thoughts, we
>> > > > should only add these flags for the "MERGE_ON_READ" table, and only
>> > > > for AVRO logs
>> > > > - we should add a config to switch the flags in the system meta
>> > > > fields on/off
>> > > >
>> > > > What do you think?
>> > > >
>> > > > Best,
>> > > > Danny Chan
>> > > >
>> > > > vino yang <yanghua1...@gmail.com> wrote on Thu, Apr 1, 2021 at 10:58 AM:
>> > > >
>> > > >> >> Oops, the image is broken; by "change flags" I mean: insert,
>> > > >> >> update (before and after) and delete.
>> > > >>
>> > > >> Yes, the image I attached is also about these flags.
>> > > >> [image: image (3).png]
>> > > >>
>> > > >> +1 for the idea.
>> > > >>
>> > > >> Best,
>> > > >> Vino
>> > > >>
>> > > >>
>> > > >> Danny Chan <danny0...@apache.org> wrote on Thu, Apr 1, 2021 at 10:03 AM:
>> > > >>
>> > > >>> Oops, the image is broken; by "change flags" I mean: insert,
>> > > >>> update (before and after) and delete.
>> > > >>>
>> > > >>> The Flink engine can propagate the change flags internally between
>> > > >>> its operators; if HUDI can send the change flags to Flink, the
>> > > >>> incremental calculation of CDC would be very natural (almost
>> > > >>> transparent to users).
>> > > >>>
>> > > >>> Best,
>> > > >>> Danny Chan
>> > > >>>
>> > > >>> vino yang <yanghua1...@gmail.com> wrote on Wed, Mar 31, 2021 at 11:32 PM:
>> > > >>>
>> > > >>> > Hi Danny,
>> > > >>> >
>> > > >>> > Thanks for kicking off this discussion thread.
>> > > >>> >
>> > > >>> > Yes, incremental query (or say "incremental processing") has
>> > > >>> > always been an important feature of the Hudi framework. If we can
>> > > >>> > make this feature better, it will be even more exciting.
>> > > >>> >
>> > > >>> > In the data warehouse, for some complex calculations, I have not
>> > > >>> > found a good way to conveniently use incremental change data
>> > > >>> > (similar to the concept of a retraction stream in Flink?) to
>> > > >>> > locally "correct" the aggregation results (these aggregation
>> > > >>> > results may belong to the DWS layer).
>> > > >>> >
>> > > >>> > BTW: Yes, I do admit that some simple calculation scenarios (a
>> > > >>> > single table, or an algorithm that can be retracted very easily)
>> > > >>> > can be dealt with based on the incremental calculation of CDC.
>> > > >>> >
>> > > >>> > Of course, the expression of incremental calculation in various
>> > > >>> > contexts is sometimes not very clear. Maybe we will discuss it
>> > > >>> > more clearly in specific scenarios.
>> > > >>> >
>> > > >>> > >> If HUDI can keep and propagate these change flags to its
>> > > >>> > >> consumers, we can use HUDI as the unified format for the
>> > > >>> > >> pipeline.
>> > > >>> >
>> > > >>> > Regarding the "change flags" here, do you mean flags like the
>> > > >>> > one shown in the figure below?
>> > > >>> >
>> > > >>> > [image: image.png]
>> > > >>> >
>> > > >>> > Best,
>> > > >>> > Vino
>> > > >>> >
>> > > >>> > Danny Chan <danny0...@apache.org> wrote on Wed, Mar 31, 2021 at 6:24 PM:
>> > > >>> >
>> > > >>> >> Hi dear HUDI community ~ Here I want to start a discussion about
>> > > >>> >> using HUDI as the unified storage/format for data warehouse/lake
>> > > >>> >> incremental computation.
>> > > >>> >>
>> > > >>> >> Usually people divide data warehouse production into several
>> > > >>> >> levels, such as the ODS (operational data store), DWD (data
>> > > >>> >> warehouse details), DWS (data warehouse service) and ADS
>> > > >>> >> (application data service).
>> > > >>> >>
>> > > >>> >> ODS -> DWD -> DWS -> ADS
>> > > >>> >>
>> > > >>> >> In NEAR-REAL-TIME (or pure real-time) computation cases, a big
>> > > >>> >> topic is syncing the change log (CDC pattern) from all kinds of
>> > > >>> >> RDBMS into the warehouse/lake. The CDC pattern records and
>> > > >>> >> propagates the change flags: insert, update (before and after)
>> > > >>> >> and delete, for the consumer; with these flags, the downstream
>> > > >>> >> engines can have a real-time incremental computation.
>> > > >>> >>
>> > > >>> >> Using a streaming engine like Flink, we can have a totally
>> > > >>> >> NEAR-REAL-TIME computation pipeline for each of the layers.
>> > > >>> >>
>> > > >>> >> If HUDI can keep and propagate these change flags to its
>> > > >>> >> consumers, we can use HUDI as the unified format for the
>> > > >>> >> pipeline.
>> > > >>> >>
>> > > >>> >> I'm expecting your nice ideas here ~
>> > > >>> >>
>> > > >>> >> Best,
>> > > >>> >> Danny Chan
>> > > >>> >>
>> > > >>> >
>> > > >>>
>> > > >>
>> > >
>> >
>>
>
>
> --
> Regards,
> -Sivabalan
>
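
As an aside, the retraction semantics described in the quoted thread (a
stateful operator consuming change flags to keep its aggregates correct)
can be sketched minimally as below. The flag encoding mirrors Flink's
RowKind (+I insert, -U update-before, +U update-after, -D delete); the
record shape and the SUM aggregate are assumptions for illustration:

```python
# Sketch: a stateful operator applies CDC change flags to keep a
# running SUM per key correct under updates and deletes.
# Flags mirror Flink's RowKind: +I insert, -U update-before,
# +U update-after, -D delete.

def apply_change(state: dict, key, flag: str, value: float) -> None:
    if flag in ("+I", "+U"):      # accumulate the new image
        state[key] = state.get(key, 0) + value
    elif flag in ("-U", "-D"):    # retract the old image
        state[key] = state.get(key, 0) - value

state = {}
changelog = [
    ("a", "+I", 10),   # insert a=10         -> sum(a) = 10
    ("a", "-U", 10),   # update-before a=10  (retract old value)
    ("a", "+U", 25),   # update-after  a=25  -> sum(a) = 25
    ("a", "-D", 25),   # delete a            -> sum(a) = 0
]
for key, flag, value in changelog:
    apply_change(state, key, flag, value)
```

If HUDI surfaces these flags in its incremental stream, downstream engines
can maintain aggregates this way instead of recomputing from scratch.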


-- 
Regards,
-Sivabalan
