Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Thanks. After reading the discussion in HUDI-561, I just realized that the
previously-mentioned built-in partition transformer is better suited to a
custom key generator. Hopefully other suitable ideas for built-in
transformers will come up later.

On Sun, Feb 23, 2020 at 6:34 PM vino yang  wrote:

> Hi Shiyan,
>
> Really sorry, I forgot to attach the reference; the relevant Jira ID is
> HUDI-561: https://issues.apache.org/jira/browse/HUDI-561
>
> It seems both of you faced the same issue, though the solutions are not the
> same. Never mind; you can move the discussion to that issue.
>
> Best,
> Vino
>
>
> Shiyan Xu wrote on Mon, Feb 24, 2020 at 10:21 AM:
>
> > Thanks Vino. Are you referring to HUDI-613? How about making it an
> > umbrella task due to its big scope? (btw it is labeled as a "bug", which
> > should be fixed too). I can create another specific task under it for the
> > idea of a datetime -> partition path transformer, if it makes sense.
> >
> > On Sun, Feb 23, 2020 at 5:57 PM vino yang  wrote:
> >
> > > Hi Shiyan,
> > >
> > > Thanks for raising this thread again and sharing your thoughts. They
> > > are valuable.
> > >
> > > Regarding the date-time specific transform, there is an issue[1] that
> > > describes this business requirement.
> > >
> > > Best,
> > > Vino
> > >
> > > Shiyan Xu wrote on Mon, Feb 24, 2020 at 7:22 AM:
> > >
> > > > Late to the party. :P
> > > >
> > > > I really favor the idea of enriching the built-in support. It is a very
> > > > common case where we want to use datetime fields for the partition path.
> > > > We could have built-in support to normalize ISO format / Unix timestamps.
> > > > For example, `HourlyPartitionTransformer` will normalize whatever field
> > > > the user specified as the partition path. Let's say the user sets
> > > > `create_ts` as the partition path field; the transformer will apply the
> > > > change create_ts => _hoodie_partition_path
> > > >
> > > >
> > > >- 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
> > > >- 1582497702.123456789 => 2020/02/23/22
> > > >
> > > > Does that make sense? If so, I may file a jira for this.
> > > >
> > > > As for FilterTransformer or FlatMapTransformer, which are designed for
> > > > generic purposes, they seem to belong to Spark or Flink's realm.
> > > > You can do these 2 transformations with Spark Dataset now. Or, once
> > > > decoupled from Spark, you'll probably have an abstract Dataset class
> > > > to perform engine-agnostic transformations.
> > > >
> > > > My understanding is that the Transformer in Hudi is more specifically
> > > > purposed, where the underlying transformation is handled by the actual
> > > > processing engine (Spark or Flink).
> > > >
> > > >
> > > > On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Thanks Hamid and Vinoyang for the great discussion
> > > > >
> > > > > On Fri, Feb 14, 2020 at 5:18 AM vino yang 
> > > wrote:
> > > > >
> > > > > > I have filed a Jira issue[1] to track this work.
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > > > > >
> > > > > > vino yang wrote on Thu, Feb 13, 2020 at 9:51 PM:
> > > > > >
> > > > > > > Hi hamid,
> > > > > > >
> > > > > > > Agree with your opinion.
> > > > > > >
> > > > > > > Let's move forward step by step.
> > > > > > >
> > > > > > > Will file an issue to track the Transformer refactoring.
> > > > > > >
> > > > > > > Best,
> > > > > > > Vino
> > > > > > >
> > > > > > > hamid pirahesh wrote on Thu, Feb 13, 2020 at 6:38 PM:
> > > > > > >
> > > > > > >> I think it is a good idea to decouple the transformer from Spark
> > > > > > >> so that it can be used with other flow engines.
> > > > > > >> Once you do that, then it is worth considering a much bigger play
> > > > > > >> rather than another incremental play.
> > > > > > >> Given the scale of Hudi, we need to look at Airflow, particularly
> > > > > > >> in the context of what Google is doing with Composer, addressing
> > > > > > >> autoscaling, scheduling, monitoring, etc.
> > > > > > >> You need all of that to manage a serious ETL/ELT flow.
> > > > > > >>
> > > > > > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang <
> yanghua1...@gmail.com
> > >
> > > > > wrote:
> > > > > > >>
> > > > > > >> > Currently, Hudi has a component that has not been widely used:
> > > > > > >> > Transformer.
> > > > > > >> > As we all know, before the original data lands in the data lake,
> > > > > > >> > a very common operation is data preprocessing and ETL. This is
> > > > > > >> > also the most common usage scenario of many computing engines,
> > > > > > >> > such as Flink and Spark. Now that Hudi has taken advantage of the
> > > > > > >> > power of the computing engine, it can also naturally take
> > > > > > >> > advantage of its data preprocessing abilities. We can refactor
> > > > > > >> > the Transformer to make it more flexible.

Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread vino yang
Hi Shiyan,

Really sorry, I forgot to attach the reference; the relevant Jira ID is
HUDI-561: https://issues.apache.org/jira/browse/HUDI-561

It seems both of you faced the same issue, though the solutions are not the
same. Never mind; you can move the discussion to that issue.

Best,
Vino


Shiyan Xu wrote on Mon, Feb 24, 2020 at 10:21 AM:

> Thanks Vino. Are you referring to HUDI-613? How about making it an umbrella
> task due to its big scope? (btw it is labeled as a "bug", which should be
> fixed too). I can create another specific task under it for the idea of a
> datetime -> partition path transformer, if it makes sense.
>
> On Sun, Feb 23, 2020 at 5:57 PM vino yang  wrote:
>
> > Hi Shiyan,
> >
> > Thanks for raising this thread again and sharing your thoughts. They
> > are valuable.
> >
> > Regarding the date-time specific transform, there is an issue[1] that
> > describes this business requirement.
> >
> > Best,
> > Vino
> >
> > Shiyan Xu wrote on Mon, Feb 24, 2020 at 7:22 AM:
> >
> > > Late to the party. :P
> > >
> > > I really favor the idea of enriching the built-in support. It is a very
> > > common case where we want to use datetime fields for the partition path.
> > > We could have built-in support to normalize ISO format / Unix timestamps.
> > > For example, `HourlyPartitionTransformer` will normalize whatever field
> > > the user specified as the partition path. Let's say the user sets
> > > `create_ts` as the partition path field; the transformer will apply the
> > > change create_ts => _hoodie_partition_path
> > >
> > >
> > >- 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
> > >- 1582497702.123456789 => 2020/02/23/22
> > >
> > > Does that make sense? If so, I may file a jira for this.
> > >
> > > As for FilterTransformer or FlatMapTransformer, which are designed for
> > > generic purposes, they seem to belong to Spark or Flink's realm.
> > > You can do these 2 transformations with Spark Dataset now. Or, once
> > > decoupled from Spark, you'll probably have an abstract Dataset class
> > > to perform engine-agnostic transformations.
> > >
> > > My understanding is that the Transformer in Hudi is more specifically
> > > purposed, where the underlying transformation is handled by the actual
> > > processing engine (Spark or Flink).
> > >
> > >
> > > On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Thanks Hamid and Vinoyang for the great discussion
> > > >
> > > > On Fri, Feb 14, 2020 at 5:18 AM vino yang 
> > wrote:
> > > >
> > > > > I have filed a Jira issue[1] to track this work.
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > > > >
> > > > > vino yang wrote on Thu, Feb 13, 2020 at 9:51 PM:
> > > > >
> > > > > > Hi hamid,
> > > > > >
> > > > > > Agree with your opinion.
> > > > > >
> > > > > > Let's move forward step by step.
> > > > > >
> > > > > > Will file an issue to track the Transformer refactoring.
> > > > > >
> > > > > > Best,
> > > > > > Vino
> > > > > >
> > > > > > hamid pirahesh wrote on Thu, Feb 13, 2020 at 6:38 PM:
> > > > > >
> > > > > >> I think it is a good idea to decouple the transformer from Spark
> > > > > >> so that it can be used with other flow engines.
> > > > > >> Once you do that, then it is worth considering a much bigger play
> > > > > >> rather than another incremental play.
> > > > > >> Given the scale of Hudi, we need to look at Airflow, particularly
> > > > > >> in the context of what Google is doing with Composer, addressing
> > > > > >> autoscaling, scheduling, monitoring, etc.
> > > > > >> You need all of that to manage a serious ETL/ELT flow.
> > > > > >>
> > > > > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang  >
> > > > wrote:
> > > > > >>
> > > > > >> > Currently, Hudi has a component that has not been widely used:
> > > > > >> > Transformer.
> > > > > >> > As we all know, before the original data lands in the data lake, a
> > > > > >> > very common operation is data preprocessing and ETL. This is also
> > > > > >> > the most common usage scenario of many computing engines, such as
> > > > > >> > Flink and Spark. Now that Hudi has taken advantage of the power of
> > > > > >> > the computing engine, it can also naturally take advantage of its
> > > > > >> > data preprocessing abilities. We can refactor the Transformer to
> > > > > >> > make it more flexible. To summarize, we can refactor from the
> > > > > >> > following aspects:
> > > > > >> >
> > > > > >> >- Decouple Transformer from Spark
> > > > > >> >- Enrich the Transformer and provide built-in transformers
> > > > > >> >- Support Transformer-chain
> > > > > >> >
> > > > > >> > For the first point, the Transformer interface is tightly coupled
> > > > > >> > with Spark in design, and it contains a Spark-specific context.
> > > > > >> > This makes it impossible for us to take advantage of the transform
> > > > > >> > capabilities

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

2020-02-23 Thread vino yang
Hi Sivabalan,

Thanks for your proposal.

Big +1 from my side; indexing at record granularity is really good for
performance. It is also a step towards streaming processing.

Best,
Vino

Sivabalan wrote on Sun, Feb 23, 2020 at 12:52 AM:

> As Apache Hudi is getting widely adopted, performance has become the need
> of the hour. This RFC focuses on improving the performance of the Hudi index
> by introducing a record-level index. The proposal is to implement a new index
> format that is a mapping of (recordKey <-> (partitionPath, fileId)) or
> ((recordKey, partitionPath) → fileId). This mapping will be stored and
> maintained by Hudi as another implementation of HoodieIndex. This record
> level indexing will definitely give a boost to both read and write
> performance.
>
> Here
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets
> >
> is the link to the RFC.
>
> Appreciate your review and thoughts.
>
> --
> Regards,
> -Sivabalan
>
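
For illustration, a minimal sketch of the proposed mapping; the class and key
layout below are hypothetical stand-ins, and the actual storage format is
specified in the RFC linked above:

```java
// Illustrative only: the ((recordKey, partitionPath) -> fileId) mapping
// proposed in RFC-08. Names here are hypothetical; the real mapping would
// be stored and maintained by Hudi as another HoodieIndex implementation.
import java.util.HashMap;
import java.util.Map;

public class RecordLevelIndexSketch {
  public static void main(String[] args) {
    // Composite key flattened to "partitionPath|recordKey" for brevity.
    Map<String, String> index = new HashMap<>();
    index.put("2020/02/23|uuid-123", "file-group-0007");

    // An upsert can be routed straight to the file group that already
    // holds the record, instead of probing candidate files per partition.
    String fileId = index.get("2020/02/23|uuid-123");
    System.out.println(fileId); // file-group-0007
  }
}
```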


Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Thanks Vino. Are you referring to HUDI-613? How about making it an umbrella
task due to its big scope? (btw it is labeled as a "bug", which should be
fixed too). I can create another specific task under it for the idea of a
datetime -> partition path transformer, if it makes sense.

On Sun, Feb 23, 2020 at 5:57 PM vino yang  wrote:

> Hi Shiyan,
>
> Thanks for raising this thread again and sharing your thoughts. They are
> valuable.
>
> Regarding the date-time specific transform, there is an issue[1] that
> describes this business requirement.
>
> Best,
> Vino
>
> Shiyan Xu wrote on Mon, Feb 24, 2020 at 7:22 AM:
>
> > Late to the party. :P
> >
> > I really favor the idea of enriching the built-in support. It is a very
> > common case where we want to use datetime fields for the partition path.
> > We could have built-in support to normalize ISO format / Unix timestamps.
> > For example, `HourlyPartitionTransformer` will normalize whatever field
> > the user specified as the partition path. Let's say the user sets
> > `create_ts` as the partition path field; the transformer will apply the
> > change create_ts => _hoodie_partition_path
> >
> >
> >- 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
> >- 1582497702.123456789 => 2020/02/23/22
> >
> > Does that make sense? If so, I may file a jira for this.
> >
> > As for FilterTransformer or FlatMapTransformer, which are designed for
> > generic purposes, they seem to belong to Spark or Flink's realm.
> > You can do these 2 transformations with Spark Dataset now. Or, once
> > decoupled from Spark, you'll probably have an abstract Dataset class
> > to perform engine-agnostic transformations.
> >
> > My understanding is that the Transformer in Hudi is more specifically
> > purposed, where the underlying transformation is handled by the actual
> > processing engine (Spark or Flink).
> >
> >
> > On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar 
> wrote:
> >
> > > Thanks Hamid and Vinoyang for the great discussion
> > >
> > > On Fri, Feb 14, 2020 at 5:18 AM vino yang 
> wrote:
> > >
> > > > I have filed a Jira issue[1] to track this work.
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > > >
> > > > vino yang wrote on Thu, Feb 13, 2020 at 9:51 PM:
> > > >
> > > > > Hi hamid,
> > > > >
> > > > > Agree with your opinion.
> > > > >
> > > > > Let's move forward step by step.
> > > > >
> > > > > Will file an issue to track the Transformer refactoring.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > hamid pirahesh wrote on Thu, Feb 13, 2020 at 6:38 PM:
> > > > >
> > > > >> I think it is a good idea to decouple the transformer from Spark
> > > > >> so that it can be used with other flow engines.
> > > > >> Once you do that, then it is worth considering a much bigger play
> > > > >> rather than another incremental play.
> > > > >> Given the scale of Hudi, we need to look at Airflow, particularly
> > > > >> in the context of what Google is doing with Composer, addressing
> > > > >> autoscaling, scheduling, monitoring, etc.
> > > > >> You need all of that to manage a serious ETL/ELT flow.
> > > > >>
> > > > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang 
> > > wrote:
> > > > >>
> > > > >> > Currently, Hudi has a component that has not been widely used:
> > > > >> > Transformer.
> > > > >> > As we all know, before the original data lands in the data lake, a
> > > > >> > very common operation is data preprocessing and ETL. This is also the
> > > > >> > most common usage scenario of many computing engines, such as Flink
> > > > >> > and Spark. Now that Hudi has taken advantage of the power of the
> > > > >> > computing engine, it can also naturally take advantage of its data
> > > > >> > preprocessing abilities. We can refactor the Transformer to make it
> > > > >> > more flexible. To summarize, we can refactor from the following
> > > > >> > aspects:
> > > > >> >
> > > > >> >- Decouple Transformer from Spark
> > > > >> >- Enrich the Transformer and provide built-in transformers
> > > > >> >- Support Transformer-chain
> > > > >> >
> > > > >> > For the first point, the Transformer interface is tightly coupled
> > > > >> > with Spark in design, and it contains a Spark-specific context. This
> > > > >> > makes it impossible for us to take advantage of the transform
> > > > >> > capabilities provided by other engines (such as Flink) after
> > > > >> > supporting multiple engines. Therefore, we need to decouple it from
> > > > >> > Spark in design.
> > > > >> >
> > > > >> > For the second point, we can enhance the Transformer and provide some
> > > > >> > out-of-the-box Transformers, such as FilterTransformer,
> > > > >> > FlatMapTransformer, and so on.
> > > > >> >
> > > > >> > For the third point, the most common pattern for data processing is
> > > > >> > the pipeline model, and the common implementation of the pipeline
> > > > >> > model is

Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread vino yang
Hi Shiyan,

Thanks for raising this thread again and sharing your thoughts. They are
valuable.

Regarding the date-time specific transform, there is an issue[1] that
describes this business requirement.

Best,
Vino

Shiyan Xu wrote on Mon, Feb 24, 2020 at 7:22 AM:

> Late to the party. :P
>
> I really favor the idea of enriching the built-in support. It is a very
> common case where we want to use datetime fields for the partition path.
> We could have built-in support to normalize ISO format / Unix timestamps.
> For example, `HourlyPartitionTransformer` will normalize whatever field
> the user specified as the partition path. Let's say the user sets
> `create_ts` as the partition path field; the transformer will apply the
> change create_ts => _hoodie_partition_path
>
>
>- 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
>- 1582497702.123456789 => 2020/02/23/22
>
> Does that make sense? If so, I may file a jira for this.
>
> As for FilterTransformer or FlatMapTransformer, which are designed for
> generic purposes, they seem to belong to Spark or Flink's realm.
> You can do these 2 transformations with Spark Dataset now. Or, once
> decoupled from Spark, you'll probably have an abstract Dataset class
> to perform engine-agnostic transformations.
>
> My understanding is that the Transformer in Hudi is more specifically
> purposed, where the underlying transformation is handled by the actual
> processing engine (Spark or Flink).
>
>
> On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar  wrote:
>
> > Thanks Hamid and Vinoyang for the great discussion
> >
> > On Fri, Feb 14, 2020 at 5:18 AM vino yang  wrote:
> >
> > > I have filed a Jira issue[1] to track this work.
> > >
> > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > >
> > > vino yang wrote on Thu, Feb 13, 2020 at 9:51 PM:
> > >
> > > > Hi hamid,
> > > >
> > > > Agree with your opinion.
> > > >
> > > > Let's move forward step by step.
> > > >
> > > > Will file an issue to track the Transformer refactoring.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > hamid pirahesh wrote on Thu, Feb 13, 2020 at 6:38 PM:
> > > >
> > > >> I think it is a good idea to decouple the transformer from Spark
> > > >> so that it can be used with other flow engines.
> > > >> Once you do that, then it is worth considering a much bigger play
> > > >> rather than another incremental play.
> > > >> Given the scale of Hudi, we need to look at Airflow, particularly
> > > >> in the context of what Google is doing with Composer, addressing
> > > >> autoscaling, scheduling, monitoring, etc.
> > > >> You need all of that to manage a serious ETL/ELT flow.
> > > >>
> > > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang 
> > wrote:
> > > >>
> > > >> > Currently, Hudi has a component that has not been widely used:
> > > >> > Transformer.
> > > >> > As we all know, before the original data lands in the data lake, a very
> > > >> > common operation is data preprocessing and ETL. This is also the most
> > > >> > common usage scenario of many computing engines, such as Flink and
> > > >> > Spark. Now that Hudi has taken advantage of the power of the computing
> > > >> > engine, it can also naturally take advantage of its data preprocessing
> > > >> > abilities. We can refactor the Transformer to make it more flexible. To
> > > >> > summarize, we can refactor from the following aspects:
> > > >> >
> > > >> >- Decouple Transformer from Spark
> > > >> >- Enrich the Transformer and provide built-in transformers
> > > >> >- Support Transformer-chain
> > > >> >
> > > >> > For the first point, the Transformer interface is tightly coupled with
> > > >> > Spark in design, and it contains a Spark-specific context. This makes
> > > >> > it impossible for us to take advantage of the transform capabilities
> > > >> > provided by other engines (such as Flink) after supporting multiple
> > > >> > engines. Therefore, we need to decouple it from Spark in design.
> > > >> >
> > > >> > For the second point, we can enhance the Transformer and provide some
> > > >> > out-of-the-box Transformers, such as FilterTransformer,
> > > >> > FlatMapTransformer, and so on.
> > > >> >
> > > >> > For the third point, the most common pattern for data processing is the
> > > >> > pipeline model, and the common implementation of the pipeline model is
> > > >> > the responsibility chain model, which can be compared to Apache Commons
> > > >> > Chain[1]; combining multiple Transformers can make data processing more
> > > >> > flexible and extensible.
> > > >> >
> > > >> > If we enhance the capabilities of Transformer components, Hudi will
> > > >> > provide richer data processing capabilities based on the computing
> > > >> > engine.
> > > >> >
> > > >> > What do you think?
> > > >> >
> > > >> > Any opinions and feedback are welcome and appreciated.
> > > >> >
> > > >> > Best,
> > > >> > Vino
> > > >> >
> > > >> > [1]: https://commons.apache.org/proper/commons-chain/

Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Late to the party. :P

I really favor the idea of enriching the built-in support. It is a very
common case where we want to use datetime fields for the partition path.
We could have built-in support to normalize ISO format / Unix timestamps.
For example, `HourlyPartitionTransformer` will normalize whatever field
the user specified as the partition path. Let's say the user sets
`create_ts` as the partition path field; the transformer will apply the
change create_ts => _hoodie_partition_path


   - 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
   - 1582497702.123456789 => 2020/02/23/22

Does that make sense? If so, I may file a jira for this.
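
A minimal sketch of what such a built-in could look like, assuming the
Spark-based Transformer interface from hudi-utilities (0.5.x-era signature
and import paths, which may differ in later releases); the class name, config
key, and output column are illustrative only:

```java
// Illustrative sketch only: normalizes an ISO-8601 or Unix-timestamp field
// into an hourly partition path. The config key and output column are made
// up; only the Transformer signature mirrors hudi-utilities (0.5.x era).
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.*;

public class HourlyPartitionTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    // Hypothetical config key naming the source field, e.g. "create_ts".
    String field = properties.getString("hoodie.transformer.partition.field");

    // Numeric values are treated as epoch seconds, everything else as an
    // ISO-8601 string; both normalize to a proper timestamp column.
    Dataset<Row> withTs = rowDataset.withColumn("_event_ts",
        when(col(field).cast("double").isNotNull(),
             col(field).cast("double").cast("timestamp"))
            .otherwise(to_timestamp(col(field))));

    // 2020-02-23T22:41:42.123Z and 1582497702.123 both become 2020/02/23/22;
    // a key generator could then pick this column up as the partition path.
    return withTs
        .withColumn("partition_path", date_format(col("_event_ts"), "yyyy/MM/dd/HH"))
        .drop("_event_ts");
  }
}
```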

As for FilterTransformer or FlatMapTransformer, which are designed for
generic purposes, they seem to belong to Spark or Flink's realm.
You can do these 2 transformations with Spark Dataset now. Or, once
decoupled from Spark, you'll probably have an abstract Dataset class
to perform engine-agnostic transformations.
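
For reference, a minimal self-contained example of those two transformations
done directly with the plain Spark Dataset API (no Hudi involved; the data is
made up):

```java
// Plain Spark only, nothing Hudi-specific: a filter and a flatMap applied
// directly on a Dataset.
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;

public class DatasetTransforms {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("dataset-transforms").master("local[*]").getOrCreate();

    Dataset<String> lines = spark.createDataset(
        Arrays.asList("a b", "skip", "c d"), Encoders.STRING());

    // filter drops unwanted rows; flatMap expands each row into many.
    Dataset<String> words = lines
        .filter((FilterFunction<String>) s -> !s.equals("skip"))
        .flatMap((FlatMapFunction<String, String>) s ->
            Arrays.asList(s.split(" ")).iterator(), Encoders.STRING());

    words.show(); // a, b, c, d
    spark.stop();
  }
}
```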

My understanding is that the Transformer in Hudi is more specifically
purposed, where the underlying transformation is handled by the actual
processing engine (Spark or Flink).


On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar  wrote:

> Thanks Hamid and Vinoyang for the great discussion
>
> On Fri, Feb 14, 2020 at 5:18 AM vino yang  wrote:
>
> > I have filed a Jira issue[1] to track this work.
> >
> > [1]: https://issues.apache.org/jira/browse/HUDI-613
> >
> > vino yang wrote on Thu, Feb 13, 2020 at 9:51 PM:
> >
> > > Hi hamid,
> > >
> > > Agree with your opinion.
> > >
> > > Let's move forward step by step.
> > >
> > > Will file an issue to track the Transformer refactoring.
> > >
> > > Best,
> > > Vino
> > >
> > > hamid pirahesh wrote on Thu, Feb 13, 2020 at 6:38 PM:
> > >
> > >> I think it is a good idea to decouple the transformer from Spark
> > >> so that it can be used with other flow engines.
> > >> Once you do that, then it is worth considering a much bigger play
> > >> rather than another incremental play.
> > >> Given the scale of Hudi, we need to look at Airflow, particularly
> > >> in the context of what Google is doing with Composer, addressing
> > >> autoscaling, scheduling, monitoring, etc.
> > >> You need all of that to manage a serious ETL/ELT flow.
> > >>
> > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang 
> wrote:
> > >>
> > >> > Currently, Hudi has a component that has not been widely used:
> > >> > Transformer.
> > >> > As we all know, before the original data lands in the data lake, a very
> > >> > common operation is data preprocessing and ETL. This is also the most
> > >> > common usage scenario of many computing engines, such as Flink and
> > >> > Spark. Now that Hudi has taken advantage of the power of the computing
> > >> > engine, it can also naturally take advantage of its data preprocessing
> > >> > abilities. We can refactor the Transformer to make it more flexible. To
> > >> > summarize, we can refactor from the following aspects:
> > >> >
> > >> >- Decouple Transformer from Spark
> > >> >- Enrich the Transformer and provide built-in transformers
> > >> >- Support Transformer-chain
> > >> >
> > >> > For the first point, the Transformer interface is tightly coupled
> > >> > with Spark in design, and it contains a Spark-specific context. This
> > >> > makes it impossible for us to take advantage of the transform
> > >> > capabilities provided by other engines (such as Flink) after
> > >> > supporting multiple engines. Therefore, we need to decouple it from
> > >> > Spark in design.
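
One hedged sketch of what that decoupling could look like; `EngineContext`,
`HoodieDataset`, and `EngineAgnosticTransformer` are hypothetical names, not
existing Hudi types:

```java
// Hypothetical shapes only; none of these types exist in Hudi. The engine-
// specific dataset hides behind a small interface so that a Spark or Flink
// implementation can plug in underneath without leaking into signatures.
import java.util.Properties;

// Marker for an engine-specific execution context (SparkSession, Flink
// StreamExecutionEnvironment, ...).
interface EngineContext {}

// Minimal engine-neutral handle over a dataset of records.
interface HoodieDataset<T> {
  long count();
}

// The decoupled transformer: no Spark types anywhere in the signature.
interface EngineAgnosticTransformer<T> {
  HoodieDataset<T> apply(EngineContext ctx, HoodieDataset<T> input, Properties props);
}
```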
> > >> >
> > >> > For the second point, we can enhance the Transformer and provide
> > >> > some out-of-the-box Transformers, such as FilterTransformer,
> > >> > FlatMapTransformer, and so on.
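
A sketch of one such built-in against the Spark-based Transformer interface
that exists today (0.5.x-era signature); the config key is made up for
illustration:

```java
// Sketch of a built-in FilterTransformer: applies a user-supplied SQL
// predicate to the incoming batch. Only the Transformer signature mirrors
// hudi-utilities (0.5.x era); the config key is hypothetical.
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FilterTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    // e.g. "event_type != 'heartbeat' AND amount > 0"
    String condition = properties.getString("hoodie.transformer.filter.condition");
    return rowDataset.filter(condition);
  }
}
```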
> > >> >
> > >> > For the third point, the most common pattern for data processing is
> > >> > the pipeline model, and the common implementation of the pipeline
> > >> > model is the responsibility chain model, which can be compared to
> > >> > Apache Commons Chain[1]; combining multiple Transformers can make
> > >> > data processing more flexible and extensible.
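
A sketch of the chain idea on top of the same interface, a composite that
feeds each transformer's output into the next (illustrative; not an existing
Hudi class at the time of this thread). A pipeline would then be built as,
e.g., `new ChainedTransformer(new FilterTransformer(), new HourlyPartitionTransformer())`:

```java
// Illustrative responsibility chain: each delegate's output feeds the next.
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

public class ChainedTransformer implements Transformer {
  private final List<Transformer> chain;

  public ChainedTransformer(Transformer... transformers) {
    this.chain = Arrays.asList(transformers);
  }

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    Dataset<Row> current = rowDataset;
    for (Transformer t : chain) {  // apply in declaration order
      current = t.apply(jsc, sparkSession, current, properties);
    }
    return current;
  }
}
```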
> > >> >
> > >> > If we enhance the capabilities of Transformer components, Hudi will
> > >> > provide richer data processing capabilities based on the computing
> > >> > engine.
> > >> >
> > >> > What do you think?
> > >> >
> > >> > Any opinions and feedback are welcome and appreciated.
> > >> >
> > >> > Best,
> > >> > Vino
> > >> >
> > >> > [1]: https://commons.apache.org/proper/commons-chain/
> > >> >
> > >>
> > >
> >
>


Bring back support for spark 2.3?

2020-02-23 Thread Pratyaksh Sharma
Hi,

As discussed in the weekly sync two weeks ago, I want to put forward this
point on our mailing list as well. Since the 0.5.1 release upgraded Spark to
2.4 on the master branch, we are facing difficulties after rebasing our
codebase onto master. At our organisation we are using a Spark 2.3.2 cluster
in production, and the Spark 2.3 to 2.4 upgrade has breaking changes.

So I wanted to know if any workaround is possible to bring back support for
Spark 2.3. I am sure there are other people and organisations using Hudi who
are not yet on Spark 2.4.

We really do not want to miss out on any of the cool features being
developed by the community.


Need clarity on these test cases in TestHoodieDeltaStreamer

2020-02-23 Thread Pratyaksh Sharma
Hi,

While working on one of my PRs, I am stuck with the following test cases in
TestHoodieDeltaStreamer -
1. testUpsertsCOWContinuousMode
2. testUpsertsMORContinuousMode

For both of them, at lines [1] and [2], we are adding 200 to totalRecords
while asserting the record count and distance count respectively. I am unable
to understand what these 200 records correspond to. Any leads are
appreciated.

I feel I am probably missing some piece of code where I need to make changes
for the above tests to pass.

[1]
https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L425
.
[2]
https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L426
.


Multiple clean instants with same timestamp

2020-02-23 Thread Pratyaksh Sharma
Hi,

I recently came across a strange issue for table T. For the same timestamp,
2 clean instants were present in the .hoodie folder, one of them in completed
state and the other in inflight state. As a result, if I tried to run the
cleaner or DeltaStreamer for this table T, it failed with the below
exception:

20/02/23 09:44:25 INFO HoodieCleanClient: There were previously unfinished
cleaner operations. Finishing Instant=[==>20200210174836__clean__INFLIGHT]
20/02/23 09:44:25 INFO HoodieCleanClient: hoodie clean instant in
execution: [20200210174836__clean__COMPLETED], with state: COMPLETED
20/02/23 09:44:25 INFO HoodieCleanClient: clean instant is inflight: false
20/02/23 09:44:25 ERROR ApplicationMaster: User class threw exception:
java.lang.IllegalArgumentException
java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:145)
at
org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:85)
at
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
at
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)

Has anyone else faced a similar situation? What is the workaround to fix
this, apart from manually deleting the file itself from the S3 folder?

The attached screenshot shows the concerned instants. The code is also
attached, with custom logs printed.