+1

This sounds really interesting! I like that this implicitly gives Hudi the
ability to do transformations on ingested data :)

On Sun, Aug 30, 2020 at 10:59 PM vino yang <vinoy...@apache.org> wrote:

> Hi everyone,
>
>
> For a long time, users in the big data field have hoped that their
> tools would let them exploit the full processing and analysis
> capabilities of their data. At present, Hudi's APIs mostly cover data
> ingestion; on the query side it relies on the various big data query
> engines, and it does not provide a convenient API for processing data
> right after a transactional write.
>
> Currently, if a user wants to process the incremental data of a commit
> that has just completed, they need to go through three steps:
>
>    1. Write data to a hudi table;
>    2. Query or check the completion of the commit;
>    3. Once the commit has completed, fetch the new data through an
>       incremental query and then process it (a rough sketch of this
>       step follows below).
>
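> For illustration, step 3 with the existing Spark DataSource incremental
> query might look roughly like the following (a minimal sketch only:
> option keys are those of recent Hudi releases, and `basePath` /
> `lastCommitTime` are placeholders):
>
>     import org.apache.spark.sql.Dataset;
>     import org.apache.spark.sql.Row;
>     import org.apache.spark.sql.SparkSession;
>
>     // Pull only the records committed after `lastCommitTime`, then hand
>     // them to ordinary Spark transformations for processing.
>     public class IncrementalReadSketch {
>       static Dataset<Row> readSince(SparkSession spark, String basePath,
>                                     String lastCommitTime) {
>         return spark.read()
>             .format("org.apache.hudi")
>             .option("hoodie.datasource.query.type", "incremental")
>             .option("hoodie.datasource.read.begin.instanttime",
>                 lastCommitTime)
>             .load(basePath);
>       }
>     }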
>
> To shorten this chain, you can use Hudi's recently added commit
> callback function to reduce it to two steps:
>
>
>    1. Write data to a hudi table;
>    2. Let the commit callback trigger an incremental query to fetch the
>       newly committed data, and then process it (a sketch of such a
>       callback follows below).
>
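> A sketch of step 2 as a custom callback (class, method, and config
> names follow the commit-callback hook in recent Hudi releases, but
> treat them as approximate):
>
>     import org.apache.hudi.callback.HoodieWriteCommitCallback;
>     import org.apache.hudi.callback.common.HoodieWriteCommitCallbackMessage;
>
>     // Invoked right after a commit completes; registered through the
>     // hoodie.write.commit.callback.on / hoodie.write.commit.callback.class
>     // write configs.
>     public class TriggerDownstreamProcessing
>         implements HoodieWriteCommitCallback {
>       @Override
>       public void call(HoodieWriteCommitCallbackMessage message) {
>         String commitTime = message.getCommitTime();
>         // ... launch the incremental query / processing for commitTime ...
>       }
>     }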
>
> However, even two steps are cumbersome for scenarios that want more
> timely and efficient data analysis on the ingest pipeline. Therefore, I
> propose to merge the entire process into one step and provide a set of
> incremental (or pipelined) processing APIs based on this:
>
> Write the data to a hudi table and, after obtaining the result as a
> JavaRDD<WriteStatus>, directly apply a user-defined function (UDF) to
> process the data (a hand-rolled approximation with plain Spark is
> sketched after the list). The processing behavior can be described via
> these two kinds of operations:
>
>
>    1. Conventional transformations such as Map/Filter/Reduce;
>    2. Aggregation over a fixed time window.
>
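> To make the "process right after writing" idea concrete, here is a
> trivially small example of what can already be done by hand today,
> assuming `statuses` is the JavaRDD<WriteStatus> returned by the write
> client's upsert/insert call (package paths vary by release):
>
>     import org.apache.hudi.client.WriteStatus;
>     import org.apache.spark.api.java.JavaRDD;
>
>     // Ordinary Spark transformations run directly on the write result,
>     // e.g. counting the records that failed to be written.
>     public class PostWriteProcessing {
>       static long countErrors(JavaRDD<WriteStatus> statuses) {
>         return statuses.filter(WriteStatus::hasErrors).count();
>       }
>     }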
>
> These calculation functions should be engine independent. Therefore, I
> plan to introduce some new APIs that allow users to define incremental
> processing directly after each write operation.
>
> The preliminary idea is that we can introduce a tool class, for
> example, named IncrementalProcessingBuilder or PipelineBuilder, which
> can be used like this:
>
> IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
>
> builder.source()      // source table
>        .transform()
>        .sink()        // derived table
>        .build();
>
> IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiMapFunction mapFunction);
>
> IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>>
> records, HudiMapFunction mapFunction);
>
> IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiFilterFunction filterFunction);
>
> // window function
> IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>>
> records, HudiAggregateFunction aggFunction);
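>
> None of the function types above exist yet; purely as an illustration,
> an engine-independent map function could be as small as this
> hypothetical interface:
>
>     import java.io.Serializable;
>
>     // Hypothetical engine-independent map function over written records;
>     // the name and shape are only illustrative, not an existing API.
>     @FunctionalInterface
>     public interface HudiMapFunction<I, O> extends Serializable {
>       O map(I record) throws Exception;
>     }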
>
> It is suitable for scenarios where the commit interval (window) is
> moderate and ingestion latency is not a major concern.
>
>
> What do you think? Looking forward to your thoughts and opinions.
>
>
> Best,
>
> Vino
>
