Hi vc,

Thanks for your feedback.

> Today, if I am a Spark developer, I can write a little program to do a
> Hudi upsert and then trigger some other transformation conditionally
> based on whether upsert/insert happened, right?

Yes, technically it can be implemented.

> and I could do that without losing any of the existing transformation
> methods I know in Spark.

Yes.

> I am not quite clear on how much value this library adds on top and in
> fact, bit concerned that we set ourselves up for solving
> engine-independent problems that Apache Beam for e.g has already solved.

Introducing these APIs would give users a more fluent experience than the
chained program you described above: we could directly define how the
committed data is processed right after it lands on the filesystem.
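
To make the "fluent" point concrete, here is a rough sketch of how the
proposed API might be used (a hypothetical example only; the class and
method names come from the original proposal and do not exist in Hudi
today):

IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();

builder.source(sourceTable)                    // the table the ingestion writes to
       .transform(records -> process(records)) // process() stands in for the user's logic
       .sink(derivedTable)                      // derived table or metrics store
       .build();

The chain would run in the same application as the write, right after the
commit completes.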

And yes, if Hudi's goal is to be an engine-independent data lake library
and we want to introduce transformation capabilities, we will inevitably
touch the space that Apache Beam is trying to solve.

> I also have doubts on whether coupling the incremental processing after
> commit into a single process itself is desirable.

Actually, I just want to introduce a utility class (or a utility
module/library, if we define engine-independent transformation classes)
that gives users another way to build a data processing pipeline, so that
Hudi is not only a data ingestion library.

The existing functionality stays the same; for example, reading and
writing remain completely decoupled.

I am also wondering whether it is a good idea to frame this as an
incremental-processing proposal. Since it only processes the most recently
committed batch, it offers a more limited capability than the existing
incremental view.

Maybe it would be better to frame it as a pipeline proposal that links
ingestion and processing?

> Typical scenarios I have seen, job A ingests data into table A, job B
> incrementally queries table A and kicks another ETL to build table B.
> Job A and B are typically different and written by different developers.
> If you could help me understand the use-case, that would be awesome.

Yes, the general scenarios are as you described, especially in data
warehousing. However, there are scenarios that are not about building
derived tables, e.g. calculating metrics or quickly aggregating over
windows (if the commit cycle is fixed, each commit is, in a sense, a fixed
window under processing-time semantics), and so on.

Once the data has landed, it can be processed immediately without being
re-read from storage; that is the key advantage.
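
As a minimal sketch (assuming the batch of records that was just committed
is still available as a JavaRDD in the same Spark job; the Event type and
metricsSink below are made up for illustration), treating each commit as
one fixed processing-time window could look like this:

// committedBatch: JavaRDD<Event> -- the records that were just written,
// still held by the same Spark job, so no incremental read is needed.
JavaPairRDD<String, Long> perKeyCounts = committedBatch
    .mapToPair(e -> new Tuple2<>(e.getMetricKey(), 1L))
    .reduceByKey(Long::sum);        // one aggregate per commit = one window

perKeyCounts.collectAsMap()
    .forEach((key, count) -> metricsSink.report(key, count));  // hypothetical sink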

> All that said, there are pains around "triggering" job B (downstream
> computations incrementally) and we could solve that by for e.g supporting
> an Apache Airflow operator that can trigger workflows when commits arrive
> on its upstream tables. What I am trying to say is - there is definitely
> gaps we would like to improve upon to make incremental processing
> mainstream, not sure if the proposed APIs are the highest on that list.

Yes, Airflow can solve the triggering problem. We are using another
scheduler framework, Apache DolphinScheduler [1], which can also trigger
Hudi's incremental processing.

The key difference is performance and efficient triggering, right? This
proposal tries to add the downstream transformations to the same Spark DAG
as the original data ingestion.
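
To illustrate, this is roughly what a Spark developer has to hand-write
today to keep ingestion and the follow-up processing in one driver program
(a sketch only, assuming the current HoodieWriteClient API; the downstream
logic is a placeholder):

String instantTime = writeClient.startCommit();
JavaRDD<WriteStatus> statuses = writeClient.upsert(inputRecords, instantTime);
boolean hasErrors = statuses.filter(WriteStatus::hasErrors).count() > 0;
if (!hasErrors && writeClient.commit(instantTime, statuses)) {
  // Continue processing the already-loaded input in the same DAG,
  // instead of launching a separate job that re-reads the table.
  inputRecords.map(HoodieRecord::getRecordKey)
              .foreach(key -> System.out.println(key));  // placeholder logic
}

The proposed builder would wrap this pattern so that users only declare the
source, the transformation, and the sink.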

So, in short, this proposal tries to bring:

   - performance: better performance when processing data right after
   ingestion;
   - focus and fluency: the ability to inline ingestion and processing
   logic in some scenarios;
   - boundary: at a high level, exposing more of the compute engine's
   capabilities directly; we already depend on the compute engine, so why
   not surface more of it?

Of course, as you said, we will face some challenges, such as providing an
abstraction over multiple compute engines.

Best,
Vino

[1]: https://dolphinscheduler.apache.org/



Vinoth Chandar <vin...@apache.org> wrote on Wed, Sep 2, 2020 at 12:13 AM:

> Hi,
>
> While I agree on bringing more of these capabilities to Hudi natively, I
> have few questions/concerns on the specific approach.
>
> > And these calculation functions should be engine independent. Therefore,
> I plan to introduce some new APIs that allow users to directly define
>
> Today, if I am a Spark developer, I can write a little program to do a Hudi
> upsert and then trigger some other transformation conditionally based on
> whether upsert/insert happened, right?
> and I could do that without losing any of the existing transformation
> methods I know in Spark. I am not quite clear on how much value this
> library adds on top and in fact, bit concerned
> that we set ourselves up for solving engine-independent problems that
> Apache Beam for e.g has already solved.
>
> I also have doubts on whether coupling the incremental processing after
> commit into a single process itself is desirable. Typical scenarios I have
> seen, job A ingests data into table A, job B
> incrementally queries table A and kicks another ETL to build table B. Job A
> and B are typically different and written by different developers.
> If you could help me understand the use-case, that would be awesome.
>
> All that said, there are pains around "triggering" job B (downstream
> computations incrementally) and we could solve that by for e.g supporting
> an Apache Airflow operator that can trigger workflows
> when commits arrive on its upstream tables. What I am trying to say is -
> there is definitely gaps we would like to improve upon to make incremental
> processing mainstream, not sure if the proposed
> APIs are the highest on that list.
>
> Apologies if I am missing something. Please help me understand if so.
>
> Thanks
> Vinoth
>
>
>
>
> On Tue, Sep 1, 2020 at 4:26 AM vino yang <yanghua1...@gmail.com> wrote:
>
> > Hi,
> >
> > Does anyone have ideas or disagreements?
> >
> > I think the introduction of these APIs will greatly enhance Hudi's data
> > processing capabilities and eliminate the performance overhead of reading
> > data for processing after writing.
> >
> > Best,
> > Vino
> >
> > wangxianghu <wxhj...@126.com> 于2020年8月31日周一 下午3:44写道:
> >
> > > +1
> > > This will give hudi more capabilities besides data ingestion and
> writing,
> > > and make hudi-based data processing more timely!
> > > Best,
> > > wangxianghu
> > >
> > > 发件人: Abhishek Modi
> > > 发送时间: 2020年8月31日 15:01
> > > 收件人: dev@hudi.apache.org
> > > 主题: Re: [DISCUSS] Introduce incremental processing API in Hudi
> > >
> > > +1
> > >
> > > This sounds really interesting! I like that this implicitly gives Hudi
> > the
> > > ability to do transformations on ingested data :)
> > >
> > > On Sun, Aug 30, 2020 at 10:59 PM vino yang <vinoy...@apache.org>
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > For a long time, in the field of big data, people hope that the tools
> > > they
> > > > use can give greater play to the processing and analysis capabilities
> > of
> > > > big data. At present, from the perspective of API, Hudi mostly
> provides
> > > > APIs related to data ingestion, and relies on various big data query
> > > > engines on the query side to release capabilities, but does not
> > provide a
> > > > more convenient API for data processing after transactional writing.
> > > >
> > > > Currently, if a user wants to process the incremental data of a
> commit
> > > that
> > > > has just recently taken. It needs to go through three steps:
> > > >
> > > >
> > > >    1.
> > > >
> > > >    Write data to a hudi table;
> > > >    2.
> > > >
> > > >    Query or check completion of commit;
> > > >    3.
> > > >
> > > >    After the data is committed, the data is found out through
> > incremental
> > > >    query, and then the data is processed;
> > > >
> > > >
> > > > If you want a quick link here, you may use Hudi's recent written
> commit
> > > > callback function to simplify it into two steps:
> > > >
> > > >
> > > >    1.
> > > >
> > > >    Write data to a hudi table;
> > > >    2.
> > > >
> > > >    Based on the written commit callback function to trigger an
> > > incremental
> > > >    query to find out the data, and then perform data processing;
> > > >
> > > >
> > > > However, it is still very troublesome to split into two steps for
> > > scenarios
> > > > that want to perform more timely and efficient data analysis on the
> > data
> > > > ingest pipeline. Therefore, I propose to merge the entire process
> into
> > > one
> > > > step and provide a set of incremental(or saying Pipelined) processing
> > API
> > > > based on this:
> > > >
> > > > Write the data to a hudi table, after obtaining the data through
> > > > JavaRDD<WriteStatus>, directly apply the user-defined function(UDF)
> to
> > > > process the data. The processing behavior can be described via these
> > two
> > > > steps:
> > > >
> > > >
> > > >    1.
> > > >
> > > >    Conventional conversion such as Map/Filter/Reduce;
> > > >    2.
> > > >
> > > >    Aggregation calculation based on fixed time window;
> > > >
> > > >
> > > > And these calculation functions should be engine independent.
> > Therefore,
> > > I
> > > > plan to introduce some new APIs that allow users to directly define
> > > > incremental processing capabilities after each writing operation.
> > > >
> > > > The preliminary idea is that we can introduce a tool class, for
> > example,
> > > > named: IncrementalProcessingBuilder or PipelineBuilder, which can be
> > used
> > > > like this:
> > > >
> > > > IncrementalProcessingBuilder builder = new
> > > IncrementalProcessingBuilder();
> > > >
> > > > builder.source() //soure table
> > > >
> > > > .transform()
> > > >
> > > > .sink()          //derived table
> > > >
> > > > .build();
> > > >
> > > > IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord<T>>
> > > > records, HudiMapFunction mapFunction);
> > > >
> > > > IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord<T>>
> > > > records, HudiMapFunction mapFunction);
> > > >
> > > >
> IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord<T>>
> > > > records, HudiFilterFunction mapFunction);
> > > >
> > > > //window function
> > > >
> > > >
> > >
> >
> IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord<T>>
> > > > records, HudiAggregateFunction aggFunction);
> > > >
> > > > It is suitable for scenarios where the commit interval (window) is
> > > moderate
> > > > and the delay of data ingestion is not very concerned.
> > > >
> > > >
> > > > What do you think? Looking forward to your thoughts and opinions.
> > > >
> > > >
> > > > Best,
> > > >
> > > > Vino
> > > >
> > >
> > >
> > >
> >
>
