Hi everyone,

For a long time, people in the big data field have hoped that their tools
could better unlock the processing and analysis capabilities over big
data. At present, from the API perspective, Hudi mostly provides APIs
related to data ingestion and relies on the various big data query engines
to expose capabilities on the query side; it does not provide a convenient
API for processing data right after a transactional write.

Currently, if a user wants to process the incremental data of a commit that
has just completed, they need to go through three steps (see the sketch
below):

1. Write data to a Hudi table;
2. Query or check the completion of the commit;
3. After the data is committed, fetch it through an incremental query, and
   then process it.
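
To make the pain concrete, here is a rough sketch of these three steps
with the Spark DataSource API. The option keys follow the Hudi docs, and
HoodieDataSourceHelpers ships with hudi-spark, but inputDf, fs, basePath,
previousCommit and process() are placeholders:

// Step 1: write a batch to the Hudi table.
inputDf.write().format("hudi")
    .option("hoodie.table.name", "source_table")
    .mode(SaveMode.Append)
    .save(basePath);

// Step 2: check that the commit completed and capture its instant time.
String latestCommit = HoodieDataSourceHelpers.latestCommit(fs, basePath);

// Step 3: pull the new records with an incremental query (records strictly
// after previousCommit, i.e. exactly the latest commit), then process them.
Dataset<Row> incremental = spark.read().format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", previousCommit)
    .load(basePath);
incremental.foreach(row -> process(row)); // user-defined processing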


If you want a quicker link here, you can use Hudi's recently added write
commit callback function to simplify this into two steps (see the sketch
below):

1. Write data to a Hudi table;
2. In the write commit callback, trigger an incremental query to fetch the
   data, and then process it.
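
For reference, step 2 could be wired through the write commit callback
hook roughly like this. The HoodieWriteCommitCallback interface and the
callback configs exist in Hudi; the processing body and the helper it
calls are placeholders:

import org.apache.hudi.callback.HoodieWriteCommitCallback;
import org.apache.hudi.callback.common.HoodieWriteCommitCallbackMessage;

public class IncrementalProcessingCallback implements HoodieWriteCommitCallback {
  @Override
  public void call(HoodieWriteCommitCallbackMessage message) {
    // Invoked once the commit completes; the message carries the commit
    // time, from which an incremental query can start.
    triggerIncrementalQueryAndProcess(message.getCommitTime()); // placeholder
  }
}

// Enabled through the write configs:
//   hoodie.write.commit.callback.on = true
//   hoodie.write.commit.callback.class = com.example.IncrementalProcessingCallback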


However, even two steps are still cumbersome for scenarios that want more
timely and efficient analysis on the data ingestion pipeline. Therefore, I
propose to merge the entire process into one step and provide a set of
incremental (or pipelined) processing APIs on top of it:

Write the data to a Hudi table; after obtaining the result as a
JavaRDD<WriteStatus>, directly apply a user-defined function (UDF) to
process the data. The processing behavior falls into two categories:


1. Conventional transformations such as Map/Filter/Reduce;
2. Aggregations over a fixed time window.


These calculation functions should be engine-independent. Therefore, I
plan to introduce some new APIs that let users define incremental
processing directly after each write operation.
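
To make that concrete, the engine-independent function interfaces could
look roughly like this. This is purely illustrative: none of these types
exist yet, and the aggregate shape simply borrows the familiar
accumulator style from Flink:

// Illustrative only: proposed engine-independent function interfaces.
public interface HudiMapFunction<T, R> extends java.io.Serializable {
  R map(HoodieRecord<T> record);
}

public interface HudiFilterFunction<T> extends java.io.Serializable {
  boolean filter(HoodieRecord<T> record);
}

// Windowed aggregation: accumulate the records of a fixed time window,
// then emit one result per window.
public interface HudiAggregateFunction<T, ACC, R> extends java.io.Serializable {
  ACC createAccumulator();
  ACC add(HoodieRecord<T> record, ACC accumulator);
  R getResult(ACC accumulator);
}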

The preliminary idea is to introduce a builder class, named for example
IncrementalProcessingBuilder or PipelineBuilder, which could be used like
this:

IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();

builder.source()      // source table
    .transform()
    .sink()           // derived table
    .build();

IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>> records, HudiMapFunction mapFunction);

IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>> records, HudiMapFunction mapFunction);

IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>> records, HudiFilterFunction filterFunction);

// window function
IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>> records, HudiAggregateFunction aggFunction);
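
Putting the pieces together, end-to-end usage might look like the
following. Everything here is part of the proposal rather than an existing
API, and the table paths and the enrich() lambda body are placeholders:

// Hypothetical usage: map every newly inserted record of the source
// table and write the results to a derived table, after each commit.
IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
builder.source(sourceTablePath)              // upstream Hudi table
    .transform(record -> enrich(record))     // e.g. a HudiMapFunction
    .sink(derivedTablePath)                  // derived table
    .build();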

This approach suits scenarios where the commit interval (window) is
moderate and ingestion latency is not a primary concern.


What do you think? Looking forward to your thoughts and opinions.


Best,

Vino
