I have filed a Jira issue[1] to track this work.

[1]: https://issues.apache.org/jira/browse/HUDI-613
vino yang <yanghua1...@gmail.com> wrote on Thu, Feb 13, 2020 at 9:51 PM:

> Hi hamid,
>
> Agree with your opinion.
>
> Let's move forward step by step.
>
> I will file an issue to track the Transformer refactoring.
>
> Best,
> Vino
>
> hamid pirahesh <hpirah...@gmail.com> wrote on Thu, Feb 13, 2020 at 6:38 PM:
>
>> I think it is a good idea to decouple the Transformer from Spark so that
>> it can be used with other flow engines.
>> Once you do that, it is worth considering a much bigger play rather than
>> another incremental one. Given the scale of Hudi, we need to look at
>> Airflow, particularly in the context of what Google is doing with
>> Composer, addressing autoscaling, scheduling, monitoring, etc. You need
>> all of that to manage a serious ETL/ELT flow.
>>
>> On Thu, Feb 6, 2020 at 8:25 PM vino yang <yanghua1...@gmail.com> wrote:
>>
>> > Currently, Hudi has a component that has not been widely used: the
>> > Transformer. As we all know, before raw data lands in the data lake, a
>> > very common step is data preprocessing and ETL. This is also one of the
>> > most common use cases for computing engines such as Flink and Spark.
>> > Since Hudi already builds on the power of a computing engine, it can
>> > naturally take advantage of that engine's data-preprocessing
>> > capabilities as well. We can refactor the Transformer to make it more
>> > flexible. To summarize, we can refactor along the following lines:
>> >
>> > - Decouple the Transformer from Spark
>> > - Enrich the Transformer and provide built-in transformers
>> > - Support Transformer chains
>> >
>> > For the first point, the Transformer interface is tightly coupled to
>> > Spark in its design, and it contains a Spark-specific context. This
>> > makes it impossible for us to take advantage of the transform
>> > capabilities provided by other engines (such as Flink) once multiple
>> > engines are supported. Therefore, we need to decouple it from Spark at
>> > the design level.
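The decoupling described above could look roughly like the following sketch: the Transformer receives an engine-neutral context and a generic dataset type instead of a Spark session and `Dataset<Row>`. All names here (`EngineContext`, `Transformer`, `UPPERCASE`) are hypothetical illustrations, not Hudi's actual API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of an engine-agnostic Transformer: the interface takes a generic
// context and dataset type instead of a Spark-specific context.
// EngineContext and Transformer are illustrative names, not Hudi's API.
public class TransformerSketch {
    interface EngineContext {
        String engineName(); // each engine (Spark, Flink, ...) supplies its own
    }

    interface Transformer<T> {
        T apply(EngineContext ctx, T dataset);
    }

    // A toy "dataset" is just a List<String>; a real engine binding would
    // plug in Dataset<Row> for Spark or DataStream<Row> for Flink.
    static final Transformer<List<String>> UPPERCASE =
            (ctx, rows) -> rows.stream()
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());

    public static void main(String[] args) {
        EngineContext flinkCtx = () -> "flink"; // any engine can provide a context
        System.out.println(UPPERCASE.apply(flinkCtx, List.of("a", "b")));
    }
}
```

Because the transform logic depends only on the generic interfaces, the same Transformer implementation could be reused unchanged across engine bindings.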
>> >
>> > For the second point, we can enhance the Transformer and provide some
>> > out-of-the-box transformers, such as FilterTransformer,
>> > FlatMapTransformer, and so on.
>> >
>> > For the third point, the most common pattern for data processing is the
>> > pipeline model, and a common implementation of the pipeline model is
>> > the chain-of-responsibility pattern, comparable to Apache Commons
>> > Chain[1]. Combining multiple Transformers would make data processing
>> > more flexible and extensible.
>> >
>> > If we enhance the capabilities of the Transformer component, Hudi will
>> > provide richer data-processing capabilities on top of the computing
>> > engine.
>> >
>> > What do you think?
>> >
>> > Any opinions and feedback are welcome and appreciated.
>> >
>> > Best,
>> > Vino
>> >
>> > [1]: https://commons.apache.org/proper/commons-chain/
>> >
>>
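The second and third points together could be sketched as below: two built-in transformers (a filter and a map, loosely standing in for the proposed FilterTransformer and FlatMapTransformer) composed through a chain-of-responsibility style pipeline. All class and method names are illustrative assumptions, not Hudi's actual API, and a toy `List<String>` stands in for an engine dataset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch of a Transformer chain: built-in transformers are
// composed in order, each stage consuming the previous stage's output.
public class TransformerChainSketch {
    interface Transformer {
        List<String> apply(List<String> rows);
    }

    // Built-in "FilterTransformer"-style factory: keeps rows matching p.
    static Transformer filter(Predicate<String> p) {
        return rows -> rows.stream().filter(p).collect(Collectors.toList());
    }

    // Built-in "map"-style factory: rewrites each row with f.
    static Transformer map(UnaryOperator<String> f) {
        return rows -> rows.stream().map(f).collect(Collectors.toList());
    }

    // The chain is itself a Transformer, so chains can be nested.
    static class Chain implements Transformer {
        private final List<Transformer> stages = new ArrayList<>();

        Chain add(Transformer t) {
            stages.add(t);
            return this;
        }

        @Override
        public List<String> apply(List<String> rows) {
            for (Transformer t : stages) {
                rows = t.apply(rows); // pass each stage's output downstream
            }
            return rows;
        }
    }

    public static void main(String[] args) {
        Chain chain = new Chain()
                .add(filter(s -> !s.isEmpty()))  // drop empty rows
                .add(map(String::toUpperCase));  // normalize case
        System.out.println(chain.apply(List.of("ad", "", "click")));
    }
}
```

Since `Chain` implements the same interface as its stages, pipelines compose naturally, which is the flexibility the proposal is after.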