I have filed a Jira issue[1] to track this work.

[1]: https://issues.apache.org/jira/browse/HUDI-613
vino yang <yanghua1...@gmail.com> wrote on Thu, Feb 13, 2020 at 9:51 PM:

> Hi hamid,
>
> Agree with your opinion.
>
> Let's move forward step by step.
>
> I will file an issue to track the Transformer refactoring.
>
> Best,
> Vino
>
> hamid pirahesh <hpirah...@gmail.com> wrote on Thu, Feb 13, 2020 at 6:38 PM:
>
>> I think it is a good idea to decouple the Transformer from Spark so that
>> it can be used with other flow engines.
>> Once you do that, it is worth considering a much bigger play rather than
>> another incremental one. Given the scale of Hudi, we need to look at
>> Airflow, particularly in the context of what Google is doing with
>> Composer, addressing autoscaling, scheduling, monitoring, etc. You need
>> all of that to manage a serious ETL/ELT flow.
>>
>> On Thu, Feb 6, 2020 at 8:25 PM vino yang <yanghua1...@gmail.com> wrote:
>>
>> > Currently, Hudi has a component that has not been widely used: the
>> > Transformer. As we all know, before raw data lands in the data lake, a
>> > very common step is data preprocessing and ETL. This is also one of the
>> > most common use cases for computing engines such as Flink and Spark.
>> > Since Hudi already builds on the power of a computing engine, it can
>> > naturally take advantage of that engine's data-preprocessing
>> > capabilities as well. We can refactor the Transformer to make it more
>> > flexible. To summarize, we can refactor along the following lines:
>> >
>> > - Decouple the Transformer from Spark
>> > - Enrich the Transformer and provide built-in transformers
>> > - Support Transformer chains
>> >
>> > For the first point, the Transformer interface is tightly coupled to
>> > Spark in its design, and it contains a Spark-specific context. This
>> > makes it impossible for us to take advantage of the transform
>> > capabilities provided by other engines (such as Flink) once multiple
>> > engines are supported. Therefore, we need to decouple it from Spark at
>> > the design level.
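The decoupling described above could look roughly like the following sketch: the Transformer receives an engine-neutral context and a generic dataset type instead of a Spark session and `Dataset<Row>`. All names here (`EngineContext`, `Transformer`, `UPPERCASE`) are hypothetical illustrations, not Hudi's actual API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of an engine-agnostic Transformer: the interface takes a generic
// context and dataset type instead of a Spark-specific context.
// EngineContext and Transformer are illustrative names, not Hudi's API.
public class TransformerSketch {
    interface EngineContext {
        String engineName(); // each engine (Spark, Flink, ...) supplies its own
    }

    interface Transformer<T> {
        T apply(EngineContext ctx, T dataset);
    }

    // A toy "dataset" is just a List<String>; a real engine binding would
    // plug in Dataset<Row> for Spark or DataStream<Row> for Flink.
    static final Transformer<List<String>> UPPERCASE =
            (ctx, rows) -> rows.stream()
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());

    public static void main(String[] args) {
        EngineContext flinkCtx = () -> "flink"; // any engine can provide a context
        System.out.println(UPPERCASE.apply(flinkCtx, List.of("a", "b")));
    }
}
```

Because the transform logic depends only on the generic interfaces, the same Transformer implementation could be reused unchanged across engine bindings.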
>> >
>> > For the second point, we can enhance the Transformer and provide some
>> > out-of-the-box transformers, such as FilterTransformer,
>> > FlatMapTransformer, and so on.
>> >
>> > For the third point, the most common pattern for data processing is the
>> > pipeline model, and a common implementation of the pipeline model is
>> > the chain-of-responsibility pattern, comparable to Apache Commons
>> > Chain[1]. Combining multiple Transformers would make data processing
>> > more flexible and extensible.
>> >
>> > If we enhance the capabilities of the Transformer component, Hudi will
>> > provide richer data-processing capabilities on top of the computing
>> > engine.
>> >
>> > What do you think?
>> >
>> > Any opinions and feedback are welcome and appreciated.
>> >
>> > Best,
>> > Vino
>> >
>> > [1]: https://commons.apache.org/proper/commons-chain/
>> >
>>
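The second and third points together could be sketched as below: two built-in transformers (a filter and a map, loosely standing in for the proposed FilterTransformer and FlatMapTransformer) composed through a chain-of-responsibility style pipeline. All class and method names are illustrative assumptions, not Hudi's actual API, and a toy `List<String>` stands in for an engine dataset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch of a Transformer chain: built-in transformers are
// composed in order, each stage consuming the previous stage's output.
public class TransformerChainSketch {
    interface Transformer {
        List<String> apply(List<String> rows);
    }

    // Built-in "FilterTransformer"-style factory: keeps rows matching p.
    static Transformer filter(Predicate<String> p) {
        return rows -> rows.stream().filter(p).collect(Collectors.toList());
    }

    // Built-in "map"-style factory: rewrites each row with f.
    static Transformer map(UnaryOperator<String> f) {
        return rows -> rows.stream().map(f).collect(Collectors.toList());
    }

    // The chain is itself a Transformer, so chains can be nested.
    static class Chain implements Transformer {
        private final List<Transformer> stages = new ArrayList<>();

        Chain add(Transformer t) {
            stages.add(t);
            return this;
        }

        @Override
        public List<String> apply(List<String> rows) {
            for (Transformer t : stages) {
                rows = t.apply(rows); // pass each stage's output downstream
            }
            return rows;
        }
    }

    public static void main(String[] args) {
        Chain chain = new Chain()
                .add(filter(s -> !s.isEmpty()))  // drop empty rows
                .add(map(String::toUpperCase));  // normalize case
        System.out.println(chain.apply(List.of("ad", "", "click")));
    }
}
```

Since `Chain` implements the same interface as its stages, pipelines compose naturally, which is the flexibility the proposal is after.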