Hi,

Does anyone have ideas or disagreements?
I think the introduction of these APIs will greatly enhance Hudi's data processing capabilities and eliminate the performance overhead of reading data back for processing after writing.

Best,
Vino

wangxianghu <wxhj...@126.com> wrote on Mon, Aug 31, 2020 at 3:44 PM:
> +1
> This will give Hudi more capabilities besides data ingestion and writing,
> and make Hudi-based data processing more timely!
> Best,
> wangxianghu
>
> From: Abhishek Modi
> Sent: Monday, August 31, 2020 15:01
> To: dev@hudi.apache.org
> Subject: Re: [DISCUSS] Introduce incremental processing API in Hudi
>
> +1
>
> This sounds really interesting! I like that this implicitly gives Hudi the
> ability to do transformations on ingested data :)
>
> On Sun, Aug 30, 2020 at 10:59 PM vino yang <vinoy...@apache.org> wrote:
> >
> > Hi everyone,
> >
> > For a long time, people in the big data field have hoped that the tools
> > they use could unlock more of the processing and analysis power of big
> > data. At present, from the API perspective, Hudi mostly provides APIs
> > related to data ingestion, and relies on the various big data query
> > engines to expose capabilities on the query side, but it does not
> > provide a convenient API for processing data after a transactional
> > write.
> >
> > Currently, if a user wants to process the incremental data of a commit
> > that has just completed, they need to go through three steps:
> >
> > 1. Write data to a Hudi table;
> > 2. Query or check the completion of the commit;
> > 3. After the data is committed, fetch it via an incremental query, and
> > then process it.
> >
> > If you want a quicker path here, you can use Hudi's recently added
> > write-commit callback function to simplify this into two steps:
> >
> > 1. Write data to a Hudi table;
> > 2.
> > Use the write-commit callback to trigger an incremental query that
> > fetches the data, and then perform the processing.
> >
> > However, splitting the work into two steps is still cumbersome for
> > scenarios that want more timely and efficient analysis on the data
> > ingestion pipeline. Therefore, I propose to merge the entire process
> > into one step and provide a set of incremental (or, say, pipelined)
> > processing APIs based on it:
> >
> > Write the data to a Hudi table; after obtaining the resulting
> > JavaRDD<WriteStatus>, directly apply a user-defined function (UDF) to
> > process the data. The processing behavior can be described via these
> > two kinds of operations:
> >
> > 1. Conventional transformations such as Map/Filter/Reduce;
> > 2. Aggregation over a fixed time window.
> >
> > These calculation functions should be engine independent. Therefore, I
> > plan to introduce some new APIs that allow users to directly define
> > incremental processing after each write operation.
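[Editor's note: to make the "engine independent functions" idea above concrete, here is a minimal, self-contained sketch. The interface names HudiMapFunction / HudiFilterFunction / HudiAggregateFunction follow the proposal's own naming, but their shapes and the tiny in-memory driver are assumptions for illustration only, not an agreed design and not Hudi's real API.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, engine-independent function interfaces for incremental
// processing, plus a tiny driver that applies them to one batch of freshly
// written records. Nothing here depends on Spark or any other engine.
public class IncrementalFunctions {

    @FunctionalInterface
    interface HudiMapFunction<I, O> {
        O map(I record);                       // per-record transformation
    }

    @FunctionalInterface
    interface HudiFilterFunction<I> {
        boolean keep(I record);                // per-record predicate
    }

    @FunctionalInterface
    interface HudiAggregateFunction<I, A> {
        A aggregate(A accumulator, I record);  // fold over one commit window
    }

    static <I, O> List<O> applyMap(List<I> batch, HudiMapFunction<I, O> fn) {
        List<O> out = new ArrayList<>();
        for (I r : batch) out.add(fn.map(r));
        return out;
    }

    static <I> List<I> applyFilter(List<I> batch, HudiFilterFunction<I> fn) {
        List<I> out = new ArrayList<>();
        for (I r : batch) if (fn.keep(r)) out.add(r);
        return out;
    }

    static <I, A> A applyAggregate(List<I> batch, A init,
                                   HudiAggregateFunction<I, A> fn) {
        A acc = init;
        for (I r : batch) acc = fn.aggregate(acc, r);
        return acc;
    }

    public static void main(String[] args) {
        // Pretend this batch is the freshly committed records of one write.
        List<Integer> batch = List.of(1, 2, 3, 4);
        List<Integer> doubled = applyMap(batch, r -> r * 2);
        List<Integer> evens = applyFilter(batch, r -> r % 2 == 0);
        int sum = applyAggregate(batch, 0, (acc, r) -> acc + r);
        System.out.println(doubled + " " + evens + " " + sum);
    }
}
```

[In a real integration, `batch` would come from the JavaRDD<WriteStatus> of the just-finished write rather than an in-memory list; the point is that the UDF interfaces themselves carry no engine types.]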
> > The preliminary idea is to introduce a tool class named, for example,
> > IncrementalProcessingBuilder or PipelineBuilder, which could be used
> > like this:
> >
> > IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
> >
> > builder.source()     // source table
> >        .transform()
> >        .sink()       // derived table
> >        .build();
> >
> > IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>>
> > records, HudiMapFunction mapFunction);
> >
> > IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>>
> > records, HudiMapFunction mapFunction);
> >
> > IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>>
> > records, HudiFilterFunction filterFunction);
> >
> > // window function
> > IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>>
> > records, HudiAggregateFunction aggFunction);
> >
> > This is suitable for scenarios where the commit interval (window) is
> > moderate and data ingestion latency is not a major concern.
> >
> > What do you think? Looking forward to your thoughts and opinions.
> >
> > Best,
> >
> > Vino
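[Editor's note: the fluent source/transform/sink/build sketch above can be mocked end to end as follows. Everything here is a hypothetical illustration of the proposed shape: in-memory lists stand in for the source and derived Hudi tables, `transform` takes a per-record function, and `build` runs eagerly. A real version would wire the source to the JavaRDD<WriteStatus> output of a write.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Minimal self-contained mock of the proposed fluent builder API.
public class IncrementalProcessingBuilder {
    private List<String> source;
    private Function<String, String> transform = Function.identity();
    private List<String> sink;

    public IncrementalProcessingBuilder source(List<String> sourceTable) {
        this.source = sourceTable;   // source table: freshly committed records
        return this;
    }

    public IncrementalProcessingBuilder transform(Function<String, String> fn) {
        this.transform = fn;         // per-record transformation (Map-style UDF)
        return this;
    }

    public IncrementalProcessingBuilder sink(List<String> derivedTable) {
        this.sink = derivedTable;    // derived table receiving processed records
        return this;
    }

    // Runs the pipeline eagerly: read source, apply transform, write to sink.
    public List<String> build() {
        for (String record : source) {
            sink.add(transform.apply(record));
        }
        return sink;
    }

    public static void main(String[] args) {
        List<String> derived = new ArrayList<>();
        new IncrementalProcessingBuilder()
                .source(List.of("a", "b"))       // source table
                .transform(String::toUpperCase)
                .sink(derived)                   // derived table
                .build();
        System.out.println(derived);
    }
}
```

[One design question this mock surfaces: whether `build` should run eagerly per commit, as here, or register a callback that fires on each future commit; the proposal's commit-callback framing suggests the latter.]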