+1 This sounds really interesting! I like that this implicitly gives Hudi the ability to do transformations on ingested data :)
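To make sure I'm reading the proposal (quoted below) the same way, here is a rough, non-compiling sketch of how I imagine the builder being used. IncrementalProcessingBuilder, HudiMapFunction and the source()/transform()/sink()/build() methods come from the proposal itself; the table names, the idea that transform() accepts the function, the EventPayload type and the enrich() helper are purely hypothetical, just to make the shape concrete:

    // Sketch only -- none of this API exists yet; all signatures are guesses.
    IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();

    builder.source("raw_events")            // hudi table the ingestion job writes to
           .transform((HudiMapFunction<EventPayload>) record -> {
               // engine-independent, per-record transformation that runs
               // right after the commit that produced the record succeeds
               return enrich(record);       // enrich() = hypothetical user logic
           })
           .sink("enriched_events")         // derived hudi table
           .build();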
On Sun, Aug 30, 2020 at 10:59 PM vino yang <vinoy...@apache.org> wrote:
> Hi everyone,
>
> For a long time, people in the big data field have hoped that the tools
> they use would expose more of their processing and analysis capabilities.
> At present, from an API perspective, Hudi mostly provides APIs for data
> ingestion and relies on the various big data query engines to unlock
> capabilities on the query side; it does not provide a convenient API for
> processing data after a transactional write.
>
> Currently, if a user wants to process the incremental data of a commit
> that has just completed, they need to go through three steps:
>
> 1. Write data to a Hudi table;
> 2. Query or check completion of the commit;
> 3. After the data is committed, retrieve it through an incremental query
>    and then process it.
>
> If you want a quicker path here, you can use Hudi's recently introduced
> write commit callback function to simplify this into two steps:
>
> 1. Write data to a Hudi table;
> 2. Use the commit callback to trigger an incremental query that retrieves
>    the data, and then process it.
>
> However, even these two steps are still cumbersome for scenarios that want
> to perform more timely and efficient analysis on the data ingestion
> pipeline. Therefore, I propose to merge the entire process into one step
> and provide a set of incremental (or pipelined) processing APIs on top of
> it:
>
> Write the data to a Hudi table and, after obtaining the result as
> JavaRDD<WriteStatus>, directly apply a user-defined function (UDF) to
> process the data. The processing behavior can be described via these two
> kinds of operations:
>
> 1. Conventional transformations such as Map/Filter/Reduce;
> 2. Aggregations based on a fixed time window.
>
> These calculation functions should be engine independent. Therefore, I
> plan to introduce some new APIs that allow users to directly define
> incremental processing logic after each write operation.
>
> The preliminary idea is to introduce a tool class, for example named
> IncrementalProcessingBuilder or PipelineBuilder, which can be used like
> this:
>
> IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
>
> builder.source()     // source table
>        .transform()
>        .sink()       // derived table
>        .build();
>
> IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiMapFunction mapFunction);
>
> IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>>
> records, HudiMapFunction mapFunction);
>
> IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiFilterFunction filterFunction);
>
> // window function
> IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiAggregateFunction aggFunction);
>
> This is suitable for scenarios where the commit interval (window) is
> moderate and the latency of data ingestion is not a major concern.
>
> What do you think? Looking forward to your thoughts and opinions.
>
> Best,
>
> Vino
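One follow-up thought on the fixed-time-window aggregation: HudiAggregateFunction is only named in the proposal, so the accumulator-style shape below (createAccumulator/add/getResult), the generic parameters and the EventPayload type are entirely my own guess, just to illustrate what a per-commit-window aggregation could look like:

    // Hypothetical user-defined aggregate: count the records ingested by one
    // commit, treating the commit interval as the aggregation window.
    public class RecordCountPerCommitWindow
        implements HudiAggregateFunction<EventPayload, Long> {

      @Override
      public Long createAccumulator() {
        return 0L;                 // a new commit interval opens a new window
      }

      @Override
      public Long add(EventPayload event, Long countSoFar) {
        return countSoFar + 1;     // count every record ingested in this window
      }

      @Override
      public Long getResult(Long countSoFar) {
        return countSoFar;         // emitted when the commit (window) completes
      }
    }

    // e.g. wired into the proposed builder method:
    // builder.aggregateAfterInsert(records, new RecordCountPerCommitWindow());

If the function interfaces end up looking roughly like this, they would indeed stay engine independent, which I agree is the right goal.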