Re: Refactor and enhance Hudi Transformer

vino yang Sun, 23 Feb 2020 18:34:54 -0800

Hi Shiyan,

Really sorry, I forgot to attach the reference, the relevant Jira ID is
HUDI-561: https://issues.apache.org/jira/browse/HUDI-561


It seems both of you faced the same issue, although the proposed solutions
differ. Never mind, you can move the discussion to that issue.

Best,
Vino


Shiyan Xu <xu.shiyan.raym...@gmail.com> wrote on Mon, Feb 24, 2020 at 10:21 AM:

> Thanks Vino. Are you referring to HUDI-613? How about making it an umbrella
> task due to its big scope? (btw it is stated as "bug", which should be
> fixed too). I can create another specific task under it for the idea of
> datetime -> partition path transformer, if it makes sense.
>
> On Sun, Feb 23, 2020 at 5:57 PM vino yang <yanghua1...@gmail.com> wrote:
>
> > Hi Shiyan,
> >
> > Thanks for raising this thread again and sharing your thoughts. They are
> > valuable.
> >
> > Regarding the date-time specific transform, there is an issue[1] that
> > describes this business requirement.
> >
> > Best,
> > Vino
> >
> > Shiyan Xu <xu.shiyan.raym...@gmail.com> wrote on Mon, Feb 24, 2020 at 7:22 AM:
> >
> > > Late to the party. :P
> > >
> > > I really favor the idea of enriching the built-in support. A very
> > > common case is deriving the partition path from a datetime field. We
> > > could have built-in support to normalize ISO format / unix timestamp.
> > > For example, `HourlyPartitionTransformer` would normalize whatever
> > > field the user specified as the partition path. Say the user sets
> > > `create_ts` as the partition path field; the transformer will then
> > > apply the change create_ts => _hoodie_partition_path:
> > >
> > >
> > >    - 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
> > >    - 1582497702.123456789 => 2020/02/23/22
> > >
> > > Does that make sense? If so, I may file a jira for this.
> > >
> > > As for FilterTransformer or FlatMapTransformer, which are designed for
> > > generic purposes, they seem to belong to Spark's or Flink's realm.
> > > You can do these 2 transformations with Spark Dataset now. Or, once
> > > decoupled from Spark, you'll probably have an abstract Dataset class
> > > to perform engine-agnostic transformations.
> > >
> > > My understanding is that a transformer in HUDI is more specifically
> > > purposed, while the underlying transformation is handled by the actual
> > > processing engine (Spark or Flink).
> > >
> > >
> > > On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > Thanks Hamid and Vinoyang for the great discussion
> > > >
> > > > > On Fri, Feb 14, 2020 at 5:18 AM vino yang <yanghua1...@gmail.com> wrote:
> > > >
> > > > > I have filed a Jira issue[1] to track this work.
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > > > >
> > > > > > vino yang <yanghua1...@gmail.com> wrote on Thu, Feb 13, 2020 at 9:51 PM:
> > > > >
> > > > > > Hi hamid,
> > > > > >
> > > > > > Agree with your opinion.
> > > > > >
> > > > > > Let's move forward step by step.
> > > > > >
> > > > > > > I will file an issue to track the Transformer refactoring.
> > > > > >
> > > > > > Best,
> > > > > > Vino
> > > > > >
> > > > > > > hamid pirahesh <hpirah...@gmail.com> wrote on Thu, Feb 13, 2020 at 6:38 PM:
> > > > > >
> > > > > > >> I think it is a good idea to decouple the transformer from Spark so
> > > > > > >> that it can be used with other flow engines.
> > > > > > >> Once you do that, it is worth considering a much bigger play
> > > > > > >> rather than another incremental play.
> > > > > > >> Given the scale of Hudi, we need to look at Airflow, particularly
> > > > > > >> in the context of what Google is doing with Composer, addressing
> > > > > > >> autoscaling, scheduling, monitoring, etc.
> > > > > > >> You need all of that to manage a serious ETL/ELT flow.
> > > > > >>
> > > > > > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang <yanghua1...@gmail.com> wrote:
> > > > > >>
> > > > > > >> > Currently, Hudi has a component that has not been widely used:
> > > > > > >> > Transformer. As we all know, data preprocessing and ETL are very
> > > > > > >> > common operations before raw data lands in the data lake. This is
> > > > > > >> > also the most common use scenario of many computing engines, such
> > > > > > >> > as Flink and Spark. Now that Hudi takes advantage of the power of
> > > > > > >> > the computing engine, it can also naturally take advantage of the
> > > > > > >> > engine's data preprocessing ability. We can refactor the
> > > > > > >> > Transformer to make it more flexible. To summarize, we can
> > > > > > >> > refactor it in the following aspects:
> > > > > >> >
> > > > > >> >    - Decouple Transformer from Spark
> > > > > >> >    - Enrich the Transformer and provide built-in transformer
> > > > > >> >    - Support Transformer-chain
> > > > > >> >
> > > > > > >> > For the first point, the Transformer interface is tightly coupled
> > > > > > >> > with Spark in design, and it contains a Spark-specific context.
> > > > > > >> > This makes it impossible for us to take advantage of the transform
> > > > > > >> > capabilities provided by other engines (such as Flink) after
> > > > > > >> > supporting multiple engines. Therefore, we need to decouple it
> > > > > > >> > from Spark in design.
> > > > > >> >
> > > > > > >> > For the second point, we can enhance the Transformer and provide
> > > > > > >> > some out-of-the-box Transformers, such as FilterTransformer,
> > > > > > >> > FlatMapTransformer, and so on.
> > > > > >> >
> > > > > > >> > For the third point, the most common pattern for data processing
> > > > > > >> > is the pipeline model, and the common implementation of the
> > > > > > >> > pipeline model is the chain-of-responsibility pattern, which can
> > > > > > >> > be compared to Apache Commons Chain[1]. Combining multiple
> > > > > > >> > Transformers can make data processing more flexible and
> > > > > > >> > extensible.
> > > > > >> >
> > > > > > >> > If we enhance the capabilities of Transformer components, Hudi
> > > > > > >> > will provide richer data processing capabilities based on the
> > > > > > >> > computing engine.
> > > > > >> >
> > > > > >> > What do you think?
> > > > > >> >
> > > > > >> > Any opinions and feedback are welcome and appreciated.
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Vino
> > > > > >> >
> > > > > >> > [1]: https://commons.apache.org/proper/commons-chain/
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
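
The datetime-to-partition-path normalization proposed in the thread (ISO
format / unix timestamp => yyyy/MM/dd/HH) could look roughly like the sketch
below. It is plain Python for illustration only, not Hudi code: the function
name `hourly_partition_path`, the choice to bucket by UTC hour, and the
accepted input shapes (ISO-8601 with `Z` or a `+hh:mm` offset, or epoch
seconds with an optional fraction) are all assumptions.

```python
import math
import re
from datetime import datetime, timedelta, timezone
from decimal import Decimal, InvalidOperation

# Assumed input shape: ISO-8601 with seconds, optional fraction, and an
# explicit zone ("Z" or +hh:mm / -hh:mm). Fraction digits are ignored because
# they cannot change the hour bucket.
_ISO = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.\d+)?(Z|[+-]\d{2}:\d{2})"
)


def hourly_partition_path(value: str) -> str:
    """Hypothetical helper: map a datetime field value to 'yyyy/MM/dd/HH' (UTC)."""
    text = value.strip()
    match = _ISO.fullmatch(text)
    if match:
        year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
        offset = match.group(7)
        if offset == "Z":
            tz = timezone.utc
        else:
            sign = 1 if offset[0] == "+" else -1
            tz = timezone(sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6])))
        moment = datetime(year, month, day, hour, minute, second, tzinfo=tz)
    else:
        try:
            # Epoch seconds; floor so that fractional (and negative) values
            # land in the hour that contains them.
            seconds = math.floor(Decimal(text))
        except (InvalidOperation, ValueError, OverflowError) as exc:
            raise ValueError(f"unsupported datetime value: {value!r}") from exc
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")


if __name__ == "__main__":
    # The two sample values from the thread.
    print(hourly_partition_path("2020-02-23T22:41:42.123456789Z"))
    print(hourly_partition_path("1582497702.123456789"))
```

Both sample values from the thread fall in the same UTC hour, so they map to
the same partition path. A real transformer would apply such a function to
the user-specified field (e.g. `create_ts`) on whatever engine runs the
pipeline.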
