Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-15 Thread vino yang

Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-14 Thread Vinoth Chandar

Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-09 Thread vino yang
> Typical scenarios I have seen, job A ingests data into table A, job B
> incrementally queries table A and kicks another ETL to build table B.
> Job A and B are typically different and written by different developers.
> If you could help me understand the use-case, that would be awesome.

Yes, the general scenarios are like what you said, especially data
warehousing. However, there are some scenarios that are not about table
processing, e.g. metrics calculation, or quickly aggregating and computing
windows (if we fix the commit cycle, each commit is, in a sense, a fixed
window with processing-time semantics), and so on.

After the data has landed, it can be processed immediately without being
re-read. That is a key advantage.
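
For instance, a per-commit metric could look roughly like this (just a
sketch: "batch" stands for the records of the commit that just completed,
and the column names are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

public class CommitWindowMetrics {
  // With a fixed commit cycle, each commit's batch is effectively one
  // processing-time window, so a windowed metric is just a group-by
  // over the batch that was just written.
  static Dataset<Row> perMinuteCounts(Dataset<Row> batch) {
    return batch
        .groupBy(window(col("ts"), "1 minute"), col("event_type"))
        .count();
  }
}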

> All that said, there are pains around "triggering" job B (downstream
> computations incrementally) and we could solve that by for e.g. supporting
> an Apache Airflow operator that can trigger workflows when commits arrive
> on its upstream tables. What I am trying to say is - there are definitely
> gaps we would like to improve upon to make incremental processing
> mainstream, not sure if the proposed APIs are the highest on that list.

Yes, Airflow can solve the triggering problem. We are using another
scheduler framework, Apache DolphinScheduler[1], which can also trigger
hudi's incremental processing.

The key difference is the performance and the efficiency of the triggering,
right? This proposal tries to add more transformations into the same Spark
DAG as the original data ingestion - roughly what one has to hand-roll
today, as sketched below.
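
(A sketch only: raw types, class locations from memory of the current
client, and the metric predicate is a placeholder.)

import org.apache.hudi.client.HoodieWriteClient;
import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.spark.api.java.JavaRDD;

public class InlineProcessing {
  // Write and process in one application: the batch RDD stays cached, so
  // the post-commit computation reuses it instead of re-reading the table.
  static void ingestAndProcess(HoodieWriteClient writeClient,
                               JavaRDD<HoodieRecord> records,
                               String instantTime) {
    records.cache(); // keep the batch around across write + processing

    JavaRDD<WriteStatus> statuses = writeClient.upsert(records, instantTime);
    long failures = statuses.filter(WriteStatus::hasErrors).count();

    // same Spark application, same cached data: compute a metric right away
    long recent = records
        .filter(r -> r.getPartitionPath().startsWith("2020"))
        .count();
  }
}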

So, in short, this proposal tries to bring:

   - performance: better performance when processing right after data
   ingestion;
   - focus and fluency: inline ingestion and processing logic in some
   scenarios;
   - boundary: at a high level, expose more of the computing engine's
   abilities directly - we already depend on the computing engine, right?
   Why not surface more of it?

Of course, as you said, we will face some problems, such as the problem of
providing an abstraction over multiple computing engines.

Best,
Vino

[1]: https://dolphinscheduler.apache.org/

Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-09 Thread Vinoth Chandar

Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-01 Thread vino yang

Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-01 Thread Vinoth Chandar
Hi,

While I agree on bringing more of these capabilities to Hudi natively, I
have a few questions/concerns about the specific approach.

> And these calculation functions should be engine independent. Therefore, I
> plan to introduce some new APIs that allow users to directly define
> incremental processing capabilities after each writing operation.

Today, if I am a Spark developer, I can write a little program to do a Hudi
upsert and then trigger some other transformation conditionally, based on
whether the upsert/insert happened, right? And I could do that without
losing any of the existing transformation methods I know in Spark. I am not
quite clear on how much value this library adds on top, and in fact am a bit
concerned that we set ourselves up for solving engine-independent problems
that Apache Beam, for example, has already solved.
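
Roughly, something like this already works today with plain Spark (a sketch
from memory, not tested - the option keys, fields, and paths are
illustrative, not exact):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class InlineEtl {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().appName("inline-etl").getOrCreate();
    String lastCommitTime = "20200901000000"; // checkpoint from previous run

    // 1) upsert the batch into table A
    spark.read().parquet("/data/incoming")
        .write().format("hudi")
        .option("hoodie.table.name", "table_a")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .mode(SaveMode.Append)
        .save("/tables/table_a");

    // 2) incrementally pull whatever landed after the checkpoint and keep
    //    transforming it with any Spark method
    Dataset<Row> delta = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", lastCommitTime)
        .load("/tables/table_a");

    delta.filter("event_type = 'purchase'")
        .write().mode(SaveMode.Append).parquet("/tables/table_b");
  }
}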

I also have doubts on whether coupling the incremental processing after the
commit into a single process is itself desirable. In typical scenarios I
have seen, job A ingests data into table A, and job B incrementally queries
table A and kicks off another ETL to build table B. Jobs A and B are
typically different and written by different developers. If you could help
me understand the use-case, that would be awesome.

All that said, there are pains around "triggering" job B (downstream
computations that run incrementally), and we could solve that by, for
example, supporting an Apache Airflow operator that triggers workflows when
commits arrive on their upstream tables. What I am trying to say is: there
are definitely gaps we would like to improve upon to make incremental
processing mainstream; I am not sure the proposed APIs are the highest on
that list.

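(Such an operator would essentially standardize the kind of poller people
hand-roll against the timeline today - roughly like this sketch, with class
names from memory:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.util.Option;

public class CommitTrigger {
  // True when the table has a completed commit newer than the last instant
  // job B processed - the signal on which job B would be kicked off.
  static boolean hasNewCommit(String basePath, String lastSeenInstant) {
    HoodieTableMetaClient meta =
        new HoodieTableMetaClient(new Configuration(), basePath);
    Option<HoodieInstant> latest = meta.getActiveTimeline()
        .getCommitsTimeline().filterCompletedInstants().lastInstant();
    return latest.isPresent()
        && latest.get().getTimestamp().compareTo(lastSeenInstant) > 0;
  }
}
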
Apologies if I am missing something. Please help me understand if so.

Thanks
Vinoth

Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-01 Thread vino yang
Hi,

Does anyone have ideas or disagreements?

I think the introduction of these APIs will greatly enhance Hudi's data
processing capabilities and eliminate the overhead of re-reading data for
processing after it is written.

Best,
Vino


Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-08-31 Thread wangxianghu
+1
This will give hudi more capabilities besides data ingestion and writing, and 
make hudi-based data processing more timely!
Best,
wangxianghu


Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-08-31 Thread Abhishek Modi
+1

This sounds really interesting! I like that this implicitly gives Hudi the
ability to do transformations on ingested data :)


[DISCUSS] Introduce incremental processing API in Hudi

2020-08-30 Thread vino yang
Hi everyone,


For a long time, people in the big data field have hoped that the tools
they use could unlock more of the processing and analysis power of their
data. At present, from an API perspective, Hudi mostly provides APIs
related to data ingestion and relies on the various big data query engines
to surface capabilities on the query side, but it does not provide a
convenient API for processing data right after a transactional write.

Currently, if a user wants to process the incremental data of a commit that
has just completed, they need to go through three steps:


   1. Write data to a hudi table;
   2. Query or check for completion of the commit;
   3. After the data is committed, fetch it through an incremental query,
   and then process it;


If you want a shortcut here, you can use Hudi's recently added write-commit
callback to simplify this into two steps:


   1. Write data to a hudi table;
   2. Use the write-commit callback to trigger an incremental query that
   fetches the data, and then perform the processing;

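(As an illustration of step 2 - enabling the callback on the write so an
external job can be notified; the config keys below are my recollection of
the callback feature and may not be exact:)

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class WriteWithCallback {
  // Each successful commit notifies the HTTP endpoint, which can then
  // start the incremental query of step 2.
  static void write(Dataset<Row> df) {
    df.write().format("hudi")
      .option("hoodie.table.name", "table_a")
      .option("hoodie.write.commit.callback.on", "true")
      .option("hoodie.write.commit.callback.http.url",
              "http://scheduler/hudi-commits")
      .mode(SaveMode.Append)
      .save("/tables/table_a");
  }
}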

However, even two steps are still cumbersome for scenarios that want to
perform more timely and efficient analysis right on the data-ingest
pipeline. Therefore, I propose to merge the entire process into one step
and provide a set of incremental (or pipelined) processing APIs on top of
it:

Write the data to a hudi table, and after obtaining the data as a
JavaRDD<HoodieRecord<T>>, directly apply a user-defined function (UDF) to
process it. The processing behavior can be described via these two steps:


   1. Conventional transformations such as Map/Filter/Reduce;
   2. Aggregation over a fixed time window;


These calculation functions should be engine independent. Therefore, I plan
to introduce some new APIs that allow users to directly define incremental
processing capabilities after each write operation.

The preliminary idea is to introduce a tool class named, for example,
IncrementalProcessingBuilder or PipelineBuilder, which could be used like
this:

IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();

builder.source()       // source table
       .transform()
       .sink()          // derived table
       .build();

IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>> records,
    HudiMapFunction mapFunction);

IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>> records,
    HudiMapFunction mapFunction);

IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>> records,
    HudiFilterFunction filterFunction);

// window function
IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>> records,
    HudiAggregateFunction aggFunction);

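As a usage sketch (every name here is part of the proposal and open for
discussion, not an existing API), a map-style UDF applied right after the
insert completes might look like:

// proposed, not existing: HudiMapFunction as the user-facing hook
public class TagLargeOrders
    implements HudiMapFunction<HoodieRecord, HoodieRecord> {
  @Override
  public HoodieRecord map(HoodieRecord record) {
    // enrich or derive fields on the freshly written record
    return record;
  }
}

// wired inline into the ingestion pipeline:
new IncrementalProcessingBuilder()
    .source("orders")                     // hudi source table
    .mapAfterInsert(new TagLargeOrders())
    .sink("orders_enriched")              // derived table
    .build();
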
It is suitable for scenarios where the commit interval (window) is moderate
and ingestion latency is not the primary concern.


What do you think? Looking forward to your thoughts and opinions.


Best,

Vino