Hi,

Does anyone have ideas or disagreements?
I think the introduction of these APIs will greatly enhance Hudi's data processing capabilities and eliminate the performance overhead of reading data back for processing after writing.

Best,
Vino

wangxianghu <wxhj...@126.com> wrote on Mon, Aug 31, 2020 at 3:44 PM:
> +1
> This will give Hudi more capabilities besides data ingestion and writing,
> and make Hudi-based data processing more timely!
> Best,
> wangxianghu
>
> From: Abhishek Modi
> Sent: Monday, August 31, 2020 15:01
> To: dev@hudi.apache.org
> Subject: Re: [DISCUSS] Introduce incremental processing API in Hudi
>
> +1
>
> This sounds really interesting! I like that this implicitly gives Hudi the
> ability to do transformations on ingested data :)
>
> On Sun, Aug 30, 2020 at 10:59 PM vino yang <vinoy...@apache.org> wrote:
> >
> > Hi everyone,
> >
> > For a long time, people in the big data field have hoped that the tools
> > they use could unlock more of the processing and analysis power of big
> > data. At present, from the API perspective, Hudi mostly provides APIs
> > related to data ingestion, and relies on the various big data query
> > engines to expose capabilities on the query side, but it does not
> > provide a convenient API for processing data after a transactional
> > write.
> >
> > Currently, if a user wants to process the incremental data of a commit
> > that has just completed, they need to go through three steps:
> >
> > 1. Write data to a Hudi table;
> > 2. Query or check the completion of the commit;
> > 3. After the data is committed, fetch it via an incremental query, and
> > then process it.
> >
> > If you want a quicker path here, you can use Hudi's recently added
> > write-commit callback function to simplify this into two steps:
> >
> > 1. Write data to a Hudi table;
> > 2.
> > Use the write-commit callback to trigger an incremental query that
> > fetches the data, and then perform the processing.
> >
> > However, splitting the work into two steps is still cumbersome for
> > scenarios that want more timely and efficient analysis on the data
> > ingestion pipeline. Therefore, I propose to merge the entire process
> > into one step and provide a set of incremental (or, say, pipelined)
> > processing APIs based on it:
> >
> > Write the data to a Hudi table; after obtaining the resulting
> > JavaRDD<WriteStatus>, directly apply a user-defined function (UDF) to
> > process the data. The processing behavior can be described via these
> > two kinds of operations:
> >
> > 1. Conventional transformations such as Map/Filter/Reduce;
> > 2. Aggregation over a fixed time window.
> >
> > These calculation functions should be engine independent. Therefore, I
> > plan to introduce some new APIs that allow users to directly define
> > incremental processing after each write operation.
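[Editor's note: to make the "engine independent functions" idea above concrete, here is a minimal, self-contained sketch. The interface names HudiMapFunction / HudiFilterFunction / HudiAggregateFunction follow the proposal's own naming, but their shapes and the tiny in-memory driver are assumptions for illustration only, not an agreed design and not Hudi's real API.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, engine-independent function interfaces for incremental
// processing, plus a tiny driver that applies them to one batch of freshly
// written records. Nothing here depends on Spark or any other engine.
public class IncrementalFunctions {

    @FunctionalInterface
    interface HudiMapFunction<I, O> {
        O map(I record);                       // per-record transformation
    }

    @FunctionalInterface
    interface HudiFilterFunction<I> {
        boolean keep(I record);                // per-record predicate
    }

    @FunctionalInterface
    interface HudiAggregateFunction<I, A> {
        A aggregate(A accumulator, I record);  // fold over one commit window
    }

    static <I, O> List<O> applyMap(List<I> batch, HudiMapFunction<I, O> fn) {
        List<O> out = new ArrayList<>();
        for (I r : batch) out.add(fn.map(r));
        return out;
    }

    static <I> List<I> applyFilter(List<I> batch, HudiFilterFunction<I> fn) {
        List<I> out = new ArrayList<>();
        for (I r : batch) if (fn.keep(r)) out.add(r);
        return out;
    }

    static <I, A> A applyAggregate(List<I> batch, A init,
                                   HudiAggregateFunction<I, A> fn) {
        A acc = init;
        for (I r : batch) acc = fn.aggregate(acc, r);
        return acc;
    }

    public static void main(String[] args) {
        // Pretend this batch is the freshly committed records of one write.
        List<Integer> batch = List.of(1, 2, 3, 4);
        List<Integer> doubled = applyMap(batch, r -> r * 2);
        List<Integer> evens = applyFilter(batch, r -> r % 2 == 0);
        int sum = applyAggregate(batch, 0, (acc, r) -> acc + r);
        System.out.println(doubled + " " + evens + " " + sum);
    }
}
```

[In a real integration, `batch` would come from the JavaRDD<WriteStatus> of the just-finished write rather than an in-memory list; the point is that the UDF interfaces themselves carry no engine types.]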
> > The preliminary idea is to introduce a tool class named, for example,
> > IncrementalProcessingBuilder or PipelineBuilder, which could be used
> > like this:
> >
> > IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
> >
> > builder.source()     // source table
> >        .transform()
> >        .sink()       // derived table
> >        .build();
> >
> > IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>>
> > records, HudiMapFunction mapFunction);
> >
> > IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>>
> > records, HudiMapFunction mapFunction);
> >
> > IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>>
> > records, HudiFilterFunction filterFunction);
> >
> > // window function
> > IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>>
> > records, HudiAggregateFunction aggFunction);
> >
> > This is suitable for scenarios where the commit interval (window) is
> > moderate and data ingestion latency is not a major concern.
> >
> > What do you think? Looking forward to your thoughts and opinions.
> >
> > Best,
> >
> > Vino
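[Editor's note: the fluent source/transform/sink/build sketch above can be mocked end to end as follows. Everything here is a hypothetical illustration of the proposed shape: in-memory lists stand in for the source and derived Hudi tables, `transform` takes a per-record function, and `build` runs eagerly. A real version would wire the source to the JavaRDD<WriteStatus> output of a write.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Minimal self-contained mock of the proposed fluent builder API.
public class IncrementalProcessingBuilder {
    private List<String> source;
    private Function<String, String> transform = Function.identity();
    private List<String> sink;

    public IncrementalProcessingBuilder source(List<String> sourceTable) {
        this.source = sourceTable;   // source table: freshly committed records
        return this;
    }

    public IncrementalProcessingBuilder transform(Function<String, String> fn) {
        this.transform = fn;         // per-record transformation (Map-style UDF)
        return this;
    }

    public IncrementalProcessingBuilder sink(List<String> derivedTable) {
        this.sink = derivedTable;    // derived table receiving processed records
        return this;
    }

    // Runs the pipeline eagerly: read source, apply transform, write to sink.
    public List<String> build() {
        for (String record : source) {
            sink.add(transform.apply(record));
        }
        return sink;
    }

    public static void main(String[] args) {
        List<String> derived = new ArrayList<>();
        new IncrementalProcessingBuilder()
                .source(List.of("a", "b"))       // source table
                .transform(String::toUpperCase)
                .sink(derived)                   // derived table
                .build();
        System.out.println(derived);
    }
}
```

[One design question this mock surfaces: whether `build` should run eagerly per commit, as here, or register a callback that fires on each future commit; the proposal's commit-callback framing suggests the latter.]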