Hi everyone,
For a long time, users in the big data field have hoped that their tools could unlock more of the processing and analysis power of their data. At present, from an API perspective, Hudi mostly provides APIs for data ingestion and relies on the various big data query engines to provide capabilities on the query side; it does not offer a convenient API for processing data right after a transactional write.

Currently, if a user wants to process the incremental data of a commit that has just completed, they need three steps:

1. Write data to a Hudi table;
2. Query or check the completion of the commit;
3. After the data is committed, fetch it via an incremental query, then process it.

If you want a shorter path, you can use Hudi's commit callback function to simplify this into two steps:

1. Write data to a Hudi table;
2. In the commit callback, trigger an incremental query to fetch the data, then process it.

However, even two steps are cumbersome for scenarios that want more timely and efficient analysis directly on the ingestion pipeline. Therefore, I propose to merge the entire process into one step and to provide a set of incremental (or, say, pipelined) processing APIs on top of it: write the data to a Hudi table and, after obtaining the JavaRDD<WriteStatus>, directly apply a user-defined function (UDF) to process the data.

The processing behavior covers two kinds of operations:

1. Conventional transformations such as Map/Filter/Reduce;
2. Aggregations over a fixed time window.

These functions should be engine independent. Therefore, I plan to introduce some new APIs that allow users to define incremental processing directly after each write operation. The preliminary idea is to introduce a utility class, for example named IncrementalProcessingBuilder (or PipelineBuilder), which could be used like this:

IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
builder.source()      // source table
       .transform()
       .sink()        // derived table
       .build();

IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>> records, HudiMapFunction mapFunction);
IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>> records, HudiMapFunction mapFunction);
IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>> records, HudiFilterFunction filterFunction);

// window function
IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>> records, HudiAggregateFunction aggFunction);

This is suitable for scenarios where the commit interval (window) is moderate and the latency of data ingestion is not a major concern.

What do you think? Looking forward to your thoughts and opinions.

Best,
Vino
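P.S. To make the proposal a bit more concrete, below is a rough sketch of how the engine-independent UDF interfaces mentioned above might look. Nothing below exists in Hudi today; every name and signature is provisional and only illustrates the idea.

import java.io.Serializable;
import java.util.List;

import org.apache.hudi.common.model.HoodieRecord;

// Sketch: engine-independent UDF interfaces (all names provisional).
// They extend Serializable so an engine like Spark can ship them to executors.
@FunctionalInterface
interface HudiMapFunction<T, R> extends Serializable {
  // Applied to each freshly committed record, e.g. in mapAfterInsert/mapAfterUpsert.
  R map(HoodieRecord<T> record) throws Exception;
}

@FunctionalInterface
interface HudiFilterFunction<T> extends Serializable {
  // Return true to keep the record in the derived output.
  boolean filter(HoodieRecord<T> record) throws Exception;
}

@FunctionalInterface
interface HudiAggregateFunction<T, R> extends Serializable {
  // Aggregates all records belonging to one fixed time window;
  // here the window would be aligned to the commit interval.
  R aggregate(List<HoodieRecord<T>> windowRecords) throws Exception;
}

With interfaces like these, mapAfterInsert could, for instance, simply apply the function to the freshly written records inside the write path (conceptually records.map(r -> fn.map(r)) on the JavaRDD), so the user no longer needs a separate incremental query step. Again, this is only a sketch to anchor the discussion.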