Hi,

While I agree on bringing more of these capabilities to Hudi natively, I
have a few questions/concerns about the specific approach.
> And these calculation functions should be engine independent.
> Therefore, I plan to introduce some new APIs that allow users to
> directly define

Today, if I am a Spark developer, I can write a small program that does a
Hudi upsert and then conditionally triggers some other transformation
based on whether the upsert/insert happened, right? And I could do that
without losing any of the existing transformation methods I know in
Spark. I am not quite clear on how much value this library adds on top,
and in fact I am a bit concerned that we would set ourselves up to solve
engine-independence problems that Apache Beam, for example, has already
solved.

I also have doubts about whether coupling the incremental processing
after a commit into a single process is desirable in itself. In the
typical scenarios I have seen, job A ingests data into table A, and job B
incrementally queries table A and kicks off another ETL to build table B.
Jobs A and B are typically different programs written by different
developers. If you could help me understand the use case, that would be
awesome.

All that said, there are pains around "triggering" job B (downstream
incremental computations), and we could solve those by, for example,
supporting an Apache Airflow operator that triggers workflows when
commits arrive on its upstream tables.

What I am trying to say is: there are definitely gaps we would like to
close to make incremental processing mainstream; I am just not sure the
proposed APIs are the highest on that list. Apologies if I am missing
something. Please help me understand if so.

Thanks
Vinoth

On Tue, Sep 1, 2020 at 4:26 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi,
>
> Does anyone have ideas or disagreements?
>
> I think the introduction of these APIs will greatly enhance Hudi's data
> processing capabilities and eliminate the performance overhead of
> reading data for processing after writing.
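The "upsert, then conditionally trigger a downstream transformation"
pattern Vinoth describes can be sketched without any new API. The sketch
below is hypothetical and engine-free: `WriteResult` and the method names
are invented stand-ins for the counts a real Spark job would derive from
Hudi's `JavaRDD<WriteStatus>` after a write.

```java
// Hypothetical stand-in for the outcome of a Hudi write; in a real
// Spark job these counts would be derived from JavaRDD<WriteStatus>.
class ConditionalTrigger {

    record WriteResult(long insertCount, long updateCount, long errorCount) {}

    // Decide whether the downstream transformation should run, based on
    // what the write actually did -- the conditional step Vinoth mentions.
    static boolean shouldTriggerDownstream(WriteResult result) {
        return result.errorCount() == 0
                && (result.insertCount() > 0 || result.updateCount() > 0);
    }

    static String runPipeline(WriteResult result) {
        if (shouldTriggerDownstream(result)) {
            return "downstream ETL triggered";   // e.g. build table B
        }
        return "no new data; downstream skipped";
    }
}
```

The point is that the conditional logic lives entirely in the user's
driver program, using whatever transformation APIs the engine already
offers.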
> Best,
> Vino
>
> wangxianghu <wxhj...@126.com> wrote on Mon, Aug 31, 2020 at 3:44 PM:
>
> > +1
> >
> > This will give Hudi more capabilities besides data ingestion and
> > writing, and make Hudi-based data processing more timely!
> >
> > Best,
> > wangxianghu
> >
> > From: Abhishek Modi
> > Sent: August 31, 2020, 15:01
> > To: dev@hudi.apache.org
> > Subject: Re: [DISCUSS] Introduce incremental processing API in Hudi
> >
> > +1
> >
> > This sounds really interesting! I like that this implicitly gives Hudi
> > the ability to do transformations on ingested data :)
> >
> > On Sun, Aug 30, 2020 at 10:59 PM vino yang <vinoy...@apache.org> wrote:
> >
> > > Hi everyone,
> > >
> > > For a long time, people in the big data field have hoped that the
> > > tools they use would make fuller use of big data's processing and
> > > analysis capabilities. At present, from an API perspective, Hudi
> > > mostly provides APIs related to data ingestion and relies on the
> > > various big data query engines on the query side, but it does not
> > > provide a convenient API for processing data after a transactional
> > > write.
> > >
> > > Currently, if a user wants to process the incremental data of a
> > > commit that has just completed, they need to go through three steps:
> > >
> > > 1. Write data to a Hudi table;
> > > 2. Query or check completion of the commit;
> > > 3. After the data is committed, fetch it via an incremental query,
> > >    and then process it.
> > >
> > > To shorten this path, you may use Hudi's recently added commit
> > > callback function to simplify it into two steps:
> > >
> > > 1. Write data to a Hudi table;
> > > 2. From the commit callback, trigger an incremental query to fetch
> > >    the data, and then process it.
> > >
> > > However, splitting the work into two steps is still cumbersome for
> > > scenarios that want more timely and efficient analysis on the
> > > data-ingestion pipeline. Therefore, I propose to merge the entire
> > > process into one step and provide a set of incremental (or
> > > "pipelined") processing APIs on top of it:
> > >
> > > Write the data to a Hudi table and, after obtaining the data through
> > > JavaRDD<WriteStatus>, directly apply a user-defined function (UDF)
> > > to process it. The processing behavior can be described via these
> > > two kinds of operations:
> > >
> > > 1. Conventional transformations such as Map/Filter/Reduce;
> > > 2. Aggregations over a fixed time window.
> > >
> > > These calculation functions should be engine independent. Therefore,
> > > I plan to introduce some new APIs that allow users to directly
> > > define incremental processing capabilities after each write
> > > operation.
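The second kind of operation above, aggregation over a fixed time window,
can be sketched engine-independently. Everything in this sketch is
hypothetical (the `Event` shape, the method name); a real implementation
would run over `JavaRDD<HoodieRecord<T>>` rather than a `List`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical record shape: an event timestamp (epoch millis) and a value.
class WindowAggregation {

    record Event(long timestampMs, double value) {}

    // Assign each event to a fixed window of `windowMs` milliseconds
    // (keyed by window start) and sum the values per window -- the
    // "aggregation over a fixed time window" the proposal mentions.
    static Map<Long, Double> sumPerWindow(List<Event> events, long windowMs) {
        return events.stream().collect(Collectors.groupingBy(
                e -> (e.timestampMs() / windowMs) * windowMs,  // window start
                TreeMap::new,
                Collectors.summingDouble(Event::value)));
    }
}
```

Because the logic is expressed over plain collections and pure functions,
the same window assignment could in principle be translated to any
engine's grouping primitive.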
> > > The preliminary idea is that we can introduce a tool class, named,
> > > for example, IncrementalProcessingBuilder or PipelineBuilder, which
> > > can be used like this:
> > >
> > > IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
> > >
> > > builder.source()    // source table
> > >        .transform()
> > >        .sink()      // derived table
> > >        .build();
> > >
> > > IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>>
> > > records, HudiMapFunction mapFunction);
> > >
> > > IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>>
> > > records, HudiMapFunction mapFunction);
> > >
> > > IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>>
> > > records, HudiFilterFunction filterFunction);
> > >
> > > // window function
> > > IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>>
> > > records, HudiAggregateFunction aggFunction);
> > >
> > > This is suitable for scenarios where the commit interval (window) is
> > > moderate and ingestion latency is not a major concern.
> > >
> > > What do you think? Looking forward to your thoughts and opinions.
> > >
> > > Best,
> > >
> > > Vino
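To make the proposed builder shape concrete, here is a minimal,
self-contained sketch of what `source().transform().sink().build()`
could mean. Every name here is hypothetical and the record type is a
plain `List<T>` for illustration; the real API would carry
`JavaRDD<HoodieRecord<T>>` through the stages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical, engine-agnostic sketch of the proposed builder.
class Pipeline<T> {
    private List<T> records = new ArrayList<>();
    private final List<Function<List<T>, List<T>>> stages = new ArrayList<>();
    private List<T> sinkTarget;

    Pipeline<T> source(List<T> committedRecords) {   // source table
        this.records = committedRecords;
        return this;
    }

    Pipeline<T> transform(Function<List<T>, List<T>> fn) {
        stages.add(fn);
        return this;
    }

    Pipeline<T> filter(Predicate<T> p) {             // convenience stage
        return transform(rs -> rs.stream().filter(p).toList());
    }

    Pipeline<T> sink(List<T> derivedTable) {         // derived table
        this.sinkTarget = derivedTable;
        return this;
    }

    // Run every stage over the freshly committed records and write the
    // result into the derived table -- one step, no separate query job.
    List<T> build() {
        List<T> out = records;
        for (Function<List<T>, List<T>> stage : stages) {
            out = stage.apply(out);
        }
        sinkTarget.addAll(out);
        return out;
    }
}
```

The design question Vinoth raises still applies: this couples the write
and the downstream transform into one driver program, whereas today the
two are usually separate jobs owned by different teams.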