Another +1. The HoodieData abstraction will go a long way in reducing LoC.

Happy to work with you to see this through! I strongly encourage the top
contributors to the Flink and Java clients to actively review all PRs as
well, given there are subtle differences everywhere.

This will help us smoothly provide all the core features across engines.
It will also help us easily write a DataSet/Row based client for Spark.

Onwards and upwards
Vinoth

On Wed, Sep 15, 2021 at 4:57 AM vino yang <[email protected]> wrote:

> Hi Ethan,
>
> Big +1 for the proposal.
>
> Actually, we have discussed this topic before.[1]
>
> Will review your refactor PR later.
>
> Best,
> Vino
>
> [1]:
> https://lists.apache.org/thread.html/r71d96d285c735b1611920fb3e7224c9ce6fd53d09bf0e8f144f4fcbd%40%3Cdev.hudi.apache.org%3E
>
>
> Y Ethan Guo <[email protected]> wrote on Wed, Sep 15, 2021 at 3:34 PM:
>
> > Hi all,
> >
> > The hudi-client module contains the core Hudi abstractions and client
> > logic for different engines such as Spark, Flink, and Java.  While a
> > previous effort (HUDI-538 [1]) decoupled the integration with Spark,
> > there is still considerable code duplication across engines for almost
> > identical logic due to the current interface design.  Some parts have
> > also diverged among engines, making debugging and support difficult.
> >
> > I propose to further refactor the hudi-client module with the goal of
> > improving code reuse across engines and reducing the divergence of
> > logic among them, so that the core Hudi action execution logic is shared
> > across engines, except for engine-specific transformations.  Such a
> > pattern also makes it easy to support core Hudi functionality for all
> > engines in the future.  Specifically,
> >
> > (1) Abstract the transformation boilerplate inside the
> > HoodieEngineContext and implement the engine-specific data
> > transformation logic in its subclasses.  Type casts can be done inside
> > the engine context.
> > (2) Create a new HoodieData abstraction for passing input and output
> > along the flow of execution, and use it in different Hudi abstractions,
> > e.g., HoodieTable, HoodieIOHandle, and BaseActionExecutor, instead of
> > enforcing type parameters such as RDD<HoodieRecord> and
> > List<HoodieRecord>, which are one source of duplication.
> > (3) Extract common execution logic from multiple engines into the
> > hudi-client-common module.
> >
> > As a first step and an exploration of items (1) and (3) above, I've
> > tried to refactor the rollback actions; the PR is here: [HUDI-2433][2].
> > In this PR, I remove all engine-specific rollback packages and keep only
> > one rollback package in hudi-client-common, adding ~350 LoC while
> > deleting 1.3K LoC.  My next step is to refactor the commit action, which
> > encompasses item (2) above.
> >
> > What do you folks think?  Any other suggestions?
> >
> > [1] [HUDI-538] [UMBRELLA] Restructuring hudi client module for multi
> > engine support
> > https://issues.apache.org/jira/browse/HUDI-538
> > [2] PR: [HUDI-2433] Refactor rollback actions in hudi-client module
> > https://github.com/apache/hudi/pull/3664/files
> >
> > Best,
> > - Ethan
> >
>
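
To make item (2) of the quoted proposal a bit more concrete, here is a
minimal Java sketch of the general idea. It is not the actual Hudi API:
SketchData, ListSketchData, and SketchRollbackPlanner are hypothetical names
invented for illustration, and the "DELETE <path>" strings only stand in for
real rollback requests. The point is that shared logic is written once
against an engine-neutral container, while each engine supplies its own
implementation (a Spark one would presumably wrap an RDD instead of a List).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical, simplified stand-in for the proposed HoodieData abstraction.
interface SketchData<T> {
    <R> SketchData<R> map(Function<T, R> fn);
    List<T> collectAsList();
}

// In-memory implementation, roughly what a plain Java engine could use.
// A Spark-backed implementation would delegate map() to the RDD (assumption).
final class ListSketchData<T> implements SketchData<T> {
    private final List<T> items;

    ListSketchData(List<T> items) {
        this.items = new ArrayList<>(items); // defensive copy
    }

    @Override
    public <R> SketchData<R> map(Function<T, R> fn) {
        List<R> out = new ArrayList<>(items.size());
        for (T item : items) {
            out.add(fn.apply(item));
        }
        return new ListSketchData<>(out);
    }

    @Override
    public List<T> collectAsList() {
        return new ArrayList<>(items);
    }
}

// Engine-agnostic logic, written once against the abstraction only.
final class SketchRollbackPlanner {
    static SketchData<String> planRollback(SketchData<String> filePaths) {
        // Placeholder transformation standing in for real rollback planning.
        return filePaths.map(path -> "DELETE " + path);
    }
}
```

With this shape, SketchRollbackPlanner never mentions RDD or List in its
signature, so the same class could be reused by every engine.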
