Hi Danny,
Thanks, I will review this ASAP. It's already in the "review in progress"
column.
Thanks
Vinoth
On Thu, Apr 22, 2021 at 12:49 AM Danny Chan wrote:
> Should we throw together a PoC/test code for an example Flink pipeline
> that will use hudi cdc flags + stateful operators?
I have updated the PR https://github.com/apache/hudi/pull/2854;
see the test case HoodieDataSourceITCase#testStreamReadWithDeletes.
A data source:
change_flag | uuid | na
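To give a flavor of it without opening the PR, here is a minimal sketch of
the streaming read with the Flink SQL API. The table path, the columns
beyond change_flag/uuid, and the option values are illustrative
assumptions, not the exact test code:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    public class StreamReadWithDeletesSketch {
      public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical Hudi MOR source read as an unbounded changelog;
        // deletes in the log should surface as retractions downstream.
        tEnv.executeSql(
            "CREATE TABLE hudi_source ("
                + "  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,"
                + "  name VARCHAR(10),"
                + "  ts TIMESTAMP(3)"
                + ") WITH ("
                + "  'connector' = 'hudi',"
                + "  'path' = '/tmp/hudi_table',"
                + "  'table.type' = 'MERGE_ON_READ',"
                + "  'read.streaming.enabled' = 'true'"
                + ")");

        tEnv.executeSql("SELECT uuid, name FROM hudi_source").print();
      }
    }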
Keeping compatibility is a must, i.e., users should be able to upgrade to the
new release with the _hoodie_cdc_flag meta column,
and be able to query new data (with this new meta col) alongside old data
(without this new meta col).
In fact, they should be able to downgrade back to previous versions (
> Is it providing the ability to author continuous queries on
> Hudi source tables end-to-end,
> given Flink can use the flags to generate retract/upsert streams

Yes, that's the key point. With these flags plus Flink stateful operators,
we can have a real-time incremental ETL pipeline.
For example, a glo
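To make that concrete, a sketch continuing the hypothetical tEnv/hudi_source
from the snippet earlier in the thread: a stateful GROUP BY whose result
Flink keeps incrementally correct as the changelog flows in, written back to
a downstream Hudi table (all names here are illustrative):

    // The GROUP BY is a stateful operator: deletes and updates from the
    // Hudi changelog retract and re-accumulate the counts.
    tEnv.executeSql(
        "CREATE TABLE dws_counts ("
            + "  name VARCHAR(10),"
            + "  cnt BIGINT,"
            + "  PRIMARY KEY (name) NOT ENFORCED"
            + ") WITH ("
            + "  'connector' = 'hudi',"
            + "  'path' = '/tmp/hudi_dws_counts',"
            + "  'table.type' = 'MERGE_ON_READ'"
            + ")");

    tEnv.executeSql(
        "INSERT INTO dws_counts "
            + "SELECT name, COUNT(*) AS cnt FROM hudi_source GROUP BY name");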
Hi Danny,
Read up on the Flink docs as well.
If we don't actually publish data to the meta column, I think the overhead
is pretty low w.r.t. Avro/Parquet. Both are very good at encoding nulls.
But I feel it's worth adding a HoodieWriteConfig to control this, and since
the addition of meta columns mostl
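On "good at encoding nulls": in Avro a nullable column is a union with
null, and the null branch costs roughly one small varint per record. A
sketch of what such a nullable meta column could look like in a record
schema (the column name follows this thread; this is not the actual Hudi
schema definition):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class NullableMetaColSketch {
      public static void main(String[] args) {
        // optionalString declares the union ["null", "string"] with a null
        // default, so records that never populate the column stay cheap.
        Schema schema = SchemaBuilder.record("HoodieRecordSketch")
            .fields()
            .requiredString("uuid")
            .optionalString("_hoodie_cdc_flag")
            .endRecord();
        System.out.println(schema.toString(true));
      }
    }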
Hi, I have created a PR here: https://github.com/apache/hudi/pull/2854/files
In the PR I make these changes:
1. Add a metadata column "_hoodie_cdc_operation". I did not add a config
option because I cannot find a good way to keep the code clean; a metadata
column is very primitive and a config opt
Thanks @Sivabalan ~
I agree that parquet and log files should keep their metadata columns in
sync, to avoid confusion and special handling in some use cases like
compaction.
I also agree that adding a metadata column is easier to use for SQL
connectors. We can add a metadata column named "_hoodie_
W.r.t. changes: if we plan to add this only to log files, compaction needs
to be fixed to omit this column, at a minimum.
On Fri, Apr 16, 2021 at 9:07 PM Sivabalan wrote:
Just got a chance to read about dynamic tables. Sounds interesting.
Some thoughts on your questions:
- yes, just MOR makes sense.
- But adding this new meta column only to avro logs might incur some
non-trivial changes, since as of today the schemas of avro and base files
are in sync. If this new col
Thanks Vinoth ~
Here is a document about the notion of "Flink Dynamic Tables" [1]: every
operator that has accumulated state can handle retractions (UPDATE_BEFORE or
DELETE) and then apply new changes (INSERT or UPDATE_AFTER), so that each
operator can consume CDC-format messages in a streaming way.
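To illustrate that accumulate/retract behavior, a small self-contained
example in the spirit of the Flink documentation (it assumes Flink 1.13+
for fromChangelogStream): an aggregation consumes a hand-built changelog
and retracts the old value before applying the new one:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;
    import org.apache.flink.types.RowKind;

    public class RetractionDemo {
      public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // A hand-built CDC stream: an update is encoded as UPDATE_BEFORE
        // (retract the old row) followed by UPDATE_AFTER (add the new one).
        DataStream<Row> changelog = env.fromElements(
            Row.ofKind(RowKind.INSERT, "user1", 10),
            Row.ofKind(RowKind.INSERT, "user2", 5),
            Row.ofKind(RowKind.UPDATE_BEFORE, "user1", 10),
            Row.ofKind(RowKind.UPDATE_AFTER, "user1", 100));

        Table table = tEnv.fromChangelogStream(changelog);
        tEnv.createTemporaryView("InputTable", table);

        // SUM retracts 10 and accumulates 100 for user1, so the result
        // converges to user1=100, user2=5.
        tEnv.executeSql(
                "SELECT f0 AS name, SUM(f1) AS score FROM InputTable GROUP BY f0")
            .print();
      }
    }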
Hi,
Is the intent of the flag to convey whether an insert, delete, or update
changed the record? If so, I would imagine that we do this even for CoW
tables, since that also supports a logical notion of a change stream using
the commit_time meta field.
You may be right, but I am trying to understand the u
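On the commit_time point: a sketch of the incremental pull that exists
today on CoW tables via Spark (the path and begin instant are placeholders;
double-check the option keys against the release you run):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class IncrementalPullSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-incremental-pull")
            .master("local[*]")
            .getOrCreate();

        // Pull only records committed after the given instant; the
        // _hoodie_commit_time meta field is what gives CoW tables a
        // logical change stream even without a dedicated change flag.
        Dataset<Row> changes = spark.read()
            .format("hudi")
            .option("hoodie.datasource.query.type", "incremental")
            .option("hoodie.datasource.read.begin.instanttime", "20210401000000")
            .load("/tmp/hudi_table");

        changes.select("_hoodie_commit_time", "uuid").show();
      }
    }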
I tried to do a PoC for Flink locally and it works well. In the PR I add a
new metadata column named "_hoodie_change_flag", but I found that only the
log format needs this flag, and Spark may not have the ability to handle
the flag for incremental processing yet.
So should I add the "_hoodie_ch
Cool, thanks. Then the remaining questions are:
- where we record these changes: should we add a builtin meta field such as
_change_flag_, like the other system columns, e.g. _hoodie_commit_time
- what kind of table should keep these flags; in my thoughts, we should
only add these flags for "MERGE_O
>> Oops, the image is broken; for "change flags", I mean: insert,
>> update (before and after) and delete.
Yes, the image I attached is also about these flags.
[image: image (3).png]
+1 for the idea.
Best,
Vino
On Thu, Apr 1, 2021 at 10:03 AM, Danny Chan wrote:
Oops, the image is broken; for "change flags", I mean: insert, update
(before and after) and delete.
The Flink engine can propagate the change flags internally between its
operators. If HUDI can send the change flags to Flink, the incremental
calculation of CDC would be very natural (almost transpare
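For context on "propagate the change flags internally": Flink attaches an
org.apache.flink.types.RowKind to every row for exactly this purpose. A
hypothetical mapping from a Hudi change flag column to RowKind (the flag
values here are assumptions; the actual encoding is what this thread is
discussing) could look like:

    import org.apache.flink.types.RowKind;

    public class ChangeFlagMapper {
      // Hypothetical flag values; the real encoding of the Hudi change
      // flag column is still under discussion in this thread.
      public static RowKind toRowKind(String changeFlag) {
        switch (changeFlag) {
          case "I":  return RowKind.INSERT;
          case "-U": return RowKind.UPDATE_BEFORE;
          case "+U": return RowKind.UPDATE_AFTER;
          case "D":  return RowKind.DELETE;
          default:
            throw new IllegalArgumentException("Unknown change flag: " + changeFlag);
        }
      }
    }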
Hi Danny,
Thanks for kicking off this discussion thread.
Yes, incremental query (or "incremental processing") has always been
an important feature of the Hudi framework. If we can make this feature
better, it will be even more exciting.
In the data warehouse, in some complex calculations, I
Hi dear HUDI community ~ Here I want to start a discussion about using HUDI
as the unified storage/format for data warehouse/lake incremental
computation.
Usually people divide data warehouse production into several levels, such
as ODS (operation data store), DWD (data warehouse details), DWS (data
w