Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Danny Chan
Hi, I have created a PR here: https://github.com/apache/hudi/pull/2854/files In the PR I made these changes: 1. Add a metadata column: "_hoodie_cdc_operation". I did not add a config option because I cannot find a good way to keep the code clean; a metadata column is very primitive and a config opt

Re: Re[2]:Re: About re-run Travis CI

2021-04-20 Thread Vinoth Chandar
Ack. But the "rerun tests" bot should be working. I actually see the GitHub Actions running, so I'm not sure. https://github.com/apache/hudi/actions Maybe we need a JIRA to investigate :) On Fri, Apr 16, 2021 at 6:44 AM Roc Marshal wrote: > > > > Susudong. > Thanks for your help. > Now,

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
Hi Danny, I read up on the Flink docs as well. If we don't actually publish data to the meta column, I think the overhead is pretty low w.r.t. Avro/Parquet. Both are very good at encoding nulls. But I feel it's worth adding a HoodieWriteConfig to control this, and since addition of meta columns mostl

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Danny Chan
> Is it providing the ability to author continuous queries on Hudi source tables end-to-end, given Flink can use the flags to generate retract/upsert streams Yes, that's the key point: with these flags plus Flink stateful operators, we can have a real-time incremental ETL pipeline. For example, a glo
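
To illustrate how per-record change flags let a stateful operator keep an aggregate current, here is a minimal sketch in plain Python. It does not use the Flink or Hudi APIs. The flag strings are the short forms of Flink's RowKind (+I, -U, +U, -D); whether the Hudi meta column stores them in this form is an assumption, and the function name and record layout are hypothetical.

```python
from collections import defaultdict

# Short strings of Flink's RowKind values. Storing them in this form
# in the Hudi meta column is an assumption made for this sketch.
INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE = "+I", "-U", "+U", "-D"


def apply_changelog(totals, records):
    """Fold (flag, key, amount) change records into per-key running sums.

    Hypothetical helper: it mimics what a stateful sum operator does
    when it consumes a retract stream.
    """
    for flag, key, amount in records:
        if flag in (INSERT, UPDATE_AFTER):
            totals[key] += amount  # accumulate the new row image
        elif flag in (UPDATE_BEFORE, DELETE):
            totals[key] -= amount  # retract the old row image
        else:
            raise ValueError(f"unknown change flag: {flag!r}")
    return totals


changes = [
    ("+I", "user_a", 10),
    ("+I", "user_b", 5),
    ("-U", "user_a", 10),  # old image of the updated row
    ("+U", "user_a", 7),   # new image of the updated row
    ("-D", "user_b", 5),
]
totals = apply_changelog(defaultdict(int), changes)
```

Without the flags, a downstream reader sees only the latest row images and cannot tell which earlier contribution to retract; with them, the running sum stays correct across updates and deletes.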

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
Keeping compatibility is a must, i.e., users should be able to upgrade to the new release with the _hoodie_cdc_flag meta column, and be able to query new data (with this new meta col) alongside old data (without this new meta col). In fact, they should be able to downgrade back to previous versions (
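
The read-side half of that requirement can be sketched as follows, assuming records are plain dictionaries and that a record written before the upgrade simply lacks the meta column. The helper name and the choice of None as the fill value are hypothetical and are not taken from the PR.

```python
CDC_COL = "_hoodie_cdc_flag"  # column name as used in the message above


def read_with_cdc_flag(record, default=None):
    """Return a copy of the record that always carries the CDC column.

    Hypothetical helper: records written before the upgrade have no
    such column, so it is filled with a default (None, i.e. null).
    """
    merged = dict(record)  # copy, so the stored record is left untouched
    merged.setdefault(CDC_COL, default)
    return merged


old_record = {"id": 1, "name": "a"}                 # written before upgrade
new_record = {"id": 2, "name": "b", CDC_COL: "+U"}  # written after upgrade
rows = [read_with_cdc_flag(r) for r in (old_record, new_record)]
```

Filling the missing column with null on read leaves old files unchanged, which is also what keeps a downgrade possible in this sketch.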