Hi,

While I agree on bringing more of these capabilities to Hudi natively, I
have a few questions/concerns about the specific approach.
> And these calculation functions should be engine independent.
> Therefore, I plan to introduce some new APIs that allow users to
> directly define

Today, if I am a Spark developer, I can write a small program that does a
Hudi upsert and then conditionally triggers some other transformation
based on whether the upsert/insert happened, right? And I could do that
without losing any of the existing transformation methods I know in
Spark. I am not quite clear on how much value this library adds on top,
and in fact I am a bit concerned that we would set ourselves up to solve
engine-independence problems that Apache Beam, for example, has already
solved.

I also have doubts about whether coupling the incremental processing
after a commit into a single process is desirable in itself. In the
typical scenarios I have seen, job A ingests data into table A, and job B
incrementally queries table A and kicks off another ETL to build table B.
Jobs A and B are typically different programs written by different
developers. If you could help me understand the use case, that would be
awesome.

All that said, there are pains around "triggering" job B (downstream
incremental computations), and we could solve those by, for example,
supporting an Apache Airflow operator that triggers workflows when
commits arrive on its upstream tables.

What I am trying to say is: there are definitely gaps we would like to
close to make incremental processing mainstream; I am just not sure the
proposed APIs are the highest on that list. Apologies if I am missing
something. Please help me understand if so.

Thanks
Vinoth

On Tue, Sep 1, 2020 at 4:26 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi,
>
> Does anyone have ideas or disagreements?
>
> I think the introduction of these APIs will greatly enhance Hudi's data
> processing capabilities and eliminate the performance overhead of
> reading data for processing after writing.
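The "upsert, then conditionally trigger a downstream transformation"
pattern Vinoth describes can be sketched without any new API. The sketch
below is hypothetical and engine-free: `WriteResult` and the method names
are invented stand-ins for the counts a real Spark job would derive from
Hudi's `JavaRDD<WriteStatus>` after a write.

```java
// Hypothetical stand-in for the outcome of a Hudi write; in a real
// Spark job these counts would be derived from JavaRDD<WriteStatus>.
class ConditionalTrigger {

    record WriteResult(long insertCount, long updateCount, long errorCount) {}

    // Decide whether the downstream transformation should run, based on
    // what the write actually did -- the conditional step Vinoth mentions.
    static boolean shouldTriggerDownstream(WriteResult result) {
        return result.errorCount() == 0
                && (result.insertCount() > 0 || result.updateCount() > 0);
    }

    static String runPipeline(WriteResult result) {
        if (shouldTriggerDownstream(result)) {
            return "downstream ETL triggered";   // e.g. build table B
        }
        return "no new data; downstream skipped";
    }
}
```

The point is that the conditional logic lives entirely in the user's
driver program, using whatever transformation APIs the engine already
offers.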
> Best,
> Vino
>
> wangxianghu <wxhj...@126.com> wrote on Mon, Aug 31, 2020 at 3:44 PM:
>
> > +1
> >
> > This will give Hudi more capabilities besides data ingestion and
> > writing, and make Hudi-based data processing more timely!
> >
> > Best,
> > wangxianghu
> >
> > From: Abhishek Modi
> > Sent: August 31, 2020, 15:01
> > To: dev@hudi.apache.org
> > Subject: Re: [DISCUSS] Introduce incremental processing API in Hudi
> >
> > +1
> >
> > This sounds really interesting! I like that this implicitly gives Hudi
> > the ability to do transformations on ingested data :)
> >
> > On Sun, Aug 30, 2020 at 10:59 PM vino yang <vinoy...@apache.org> wrote:
> >
> > > Hi everyone,
> > >
> > > For a long time, people in the big data field have hoped that the
> > > tools they use would make fuller use of big data's processing and
> > > analysis capabilities. At present, from an API perspective, Hudi
> > > mostly provides APIs related to data ingestion and relies on the
> > > various big data query engines on the query side, but it does not
> > > provide a convenient API for processing data after a transactional
> > > write.
> > >
> > > Currently, if a user wants to process the incremental data of a
> > > commit that has just completed, they need to go through three steps:
> > >
> > > 1. Write data to a Hudi table;
> > > 2. Query or check completion of the commit;
> > > 3. After the data is committed, fetch it via an incremental query,
> > >    and then process it.
> > >
> > > To shorten this path, you may use Hudi's recently added commit
> > > callback function to simplify it into two steps:
> > >
> > > 1. Write data to a Hudi table;
> > > 2. From the commit callback, trigger an incremental query to fetch
> > >    the data, and then process it.
> > >
> > > However, splitting the work into two steps is still cumbersome for
> > > scenarios that want more timely and efficient analysis on the
> > > data-ingestion pipeline. Therefore, I propose to merge the entire
> > > process into one step and provide a set of incremental (or
> > > "pipelined") processing APIs on top of it:
> > >
> > > Write the data to a Hudi table and, after obtaining the data through
> > > JavaRDD<WriteStatus>, directly apply a user-defined function (UDF)
> > > to process it. The processing behavior can be described via these
> > > two kinds of operations:
> > >
> > > 1. Conventional transformations such as Map/Filter/Reduce;
> > > 2. Aggregations over a fixed time window.
> > >
> > > These calculation functions should be engine independent. Therefore,
> > > I plan to introduce some new APIs that allow users to directly
> > > define incremental processing capabilities after each write
> > > operation.
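The second kind of operation above, aggregation over a fixed time window,
can be sketched engine-independently. Everything in this sketch is
hypothetical (the `Event` shape, the method name); a real implementation
would run over `JavaRDD<HoodieRecord<T>>` rather than a `List`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical record shape: an event timestamp (epoch millis) and a value.
class WindowAggregation {

    record Event(long timestampMs, double value) {}

    // Assign each event to a fixed window of `windowMs` milliseconds
    // (keyed by window start) and sum the values per window -- the
    // "aggregation over a fixed time window" the proposal mentions.
    static Map<Long, Double> sumPerWindow(List<Event> events, long windowMs) {
        return events.stream().collect(Collectors.groupingBy(
                e -> (e.timestampMs() / windowMs) * windowMs,  // window start
                TreeMap::new,
                Collectors.summingDouble(Event::value)));
    }
}
```

Because the logic is expressed over plain collections and pure functions,
the same window assignment could in principle be translated to any
engine's grouping primitive.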
> > > The preliminary idea is that we can introduce a tool class, named,
> > > for example, IncrementalProcessingBuilder or PipelineBuilder, which
> > > can be used like this:
> > >
> > > IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
> > >
> > > builder.source()    // source table
> > >        .transform()
> > >        .sink()      // derived table
> > >        .build();
> > >
> > > IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>>
> > > records, HudiMapFunction mapFunction);
> > >
> > > IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>>
> > > records, HudiMapFunction mapFunction);
> > >
> > > IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>>
> > > records, HudiFilterFunction filterFunction);
> > >
> > > // window function
> > > IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>>
> > > records, HudiAggregateFunction aggFunction);
> > >
> > > This is suitable for scenarios where the commit interval (window) is
> > > moderate and ingestion latency is not a major concern.
> > >
> > > What do you think? Looking forward to your thoughts and opinions.
> > >
> > > Best,
> > >
> > > Vino
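To make the proposed builder shape concrete, here is a minimal,
self-contained sketch of what `source().transform().sink().build()`
could mean. Every name here is hypothetical and the record type is a
plain `List<T>` for illustration; the real API would carry
`JavaRDD<HoodieRecord<T>>` through the stages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical, engine-agnostic sketch of the proposed builder.
class Pipeline<T> {
    private List<T> records = new ArrayList<>();
    private final List<Function<List<T>, List<T>>> stages = new ArrayList<>();
    private List<T> sinkTarget;

    Pipeline<T> source(List<T> committedRecords) {   // source table
        this.records = committedRecords;
        return this;
    }

    Pipeline<T> transform(Function<List<T>, List<T>> fn) {
        stages.add(fn);
        return this;
    }

    Pipeline<T> filter(Predicate<T> p) {             // convenience stage
        return transform(rs -> rs.stream().filter(p).toList());
    }

    Pipeline<T> sink(List<T> derivedTable) {         // derived table
        this.sinkTarget = derivedTable;
        return this;
    }

    // Run every stage over the freshly committed records and write the
    // result into the derived table -- one step, no separate query job.
    List<T> build() {
        List<T> out = records;
        for (Function<List<T>, List<T>> stage : stages) {
            out = stage.apply(out);
        }
        sinkTarget.addAll(out);
        return out;
    }
}
```

The design question Vinoth raises still applies: this couples the write
and the downstream transform into one driver program, whereas today the
two are usually separate jobs owned by different teams.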