Huge +1. Recently I have been working on making the Flink writer run in a
streaming fashion and found that the List<HoodieRecord> interface limits
the streaming power of Flink. By switching from
HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to using
AppendHandle.write(HoodieRecord) directly, the throughput almost doubled
and the checkpoint time of the writer dropped from minutes to seconds.
But I found it really difficult to fit this change into the current
client interface.
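
To make the change concrete, here is a minimal sketch of the per-record
path (the AppendHandleLike interface and StreamingWriter wrapper below
are hypothetical; only the write(HoodieRecord) and close() calls mirror
the actual AppendHandle usage):

    // Hypothetical stand-in for the handle API; write() and close()
    // mirror AppendHandle.write(HoodieRecord) and AppendHandle.close().
    interface AppendHandleLike<R> {
      void write(R record);  // append a single record
      void close();          // flush and seal, e.g. on checkpoint
    }

    // Before: buffer a List<R> and call upsert(batch) once per checkpoint.
    // After: stream each record into the handle as it arrives, and only
    // close on checkpoint, spreading the cost over the whole interval.
    class StreamingWriter<R> {
      private final AppendHandleLike<R> handle;

      StreamingWriter(AppendHandleLike<R> handle) {
        this.handle = handle;
      }

      void processElement(R record) {
        handle.write(record);  // per-record write, no batch buffering
      }

      void snapshotState() {
        handle.close();  // the checkpoint now only flushes, so it is fast
      }
    }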

My 2 cents:

   - The HoodieIOHandle should only handle the IO, and should not hold a
   HoodieTable inside.
   - We need a more streaming-friendly Handle. For Flink, we can
   definitely change all the batch-mode List<HoodieRecord> processing to
   handling one HoodieRecord at a time, just like
   AppendHandle.write(HoodieRecord) and AppendHandle.close(). This will
   spread out the computing cost and flatten the curve (a sketch of such
   an interface follows this list).
   - We can use the Handle to precisely control the JVM memory footprint
   and avoid OOMs, so we don't need to implement another memory-control
   mechanism in the compute engine itself.
   - HoodieClient, HoodieTable, HoodieIOHandle, HoodieTimeline,
   HoodieFileSystemView, etc. should each have a well-defined role in a
   well-defined layer. We should know when to use what, and whether each
   is used by the driver in a single thread or by the workers in a
   distributed way.
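
To make the Handle idea concrete, here is a rough sketch of what a
streaming-friendly, IO-only handle could look like (a hypothetical
interface, not the current Hudi API):

    import java.io.IOException;

    // IO-only: no HoodieTable inside, per-record writes, and a memory
    // accounting hook so the engine does not need its own memory control.
    interface StreamingIOHandle<R> extends AutoCloseable {
      void write(R record) throws IOException;  // one record at a time
      long memoryInUse();                       // lets the caller bound buffering
      @Override
      void close() throws IOException;          // flush + seal on checkpoint
    }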

This is a big project and could benefit Hudi in the long term. Happy to
discuss more in the design doc or PRs.

Best,
Gary

On Thu, Sep 16, 2021 at 3:21 AM Raymond Xu <[email protected]> wrote:

> +1, that's a great improvement.
>
> On Wed, Sep 15, 2021 at 10:40 AM Sivabalan <[email protected]> wrote:
>
> > ++1. This definitely helps Hudi scale and makes it more maintainable.
> > Thanks for driving this effort. Most devs show interest in major
> > features and don't like to spend time on such foundational work, but
> > as the project scales, this foundational work will yield higher
> > returns in the long run.
> >
> > On Wed, Sep 15, 2021 at 8:29 AM Vinoth Chandar <[email protected]> wrote:
> >
> > > Another +1. The HoodieData abstraction will go a long way in
> > > reducing LoC.
> > >
> > > Happy to work with you to see this through! I really encourage top
> > > contributors to the Flink and Java clients to actively review all
> > > PRs as well, given there are subtle differences everywhere.
> > >
> > > This will help us smoothly provide all the core features across
> > > engines. It will also help us easily write a DataSet/Row-based
> > > client for Spark.
> > >
> > > Onwards and upwards
> > > Vinoth
> > >
> > > On Wed, Sep 15, 2021 at 4:57 AM vino yang <[email protected]> wrote:
> > >
> > > > Hi Ethan,
> > > >
> > > > Big +1 for the proposal.
> > > >
> > > > Actually, we have discussed this topic before.[1]
> > > >
> > > > Will review your refactor PR later.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > [1]:
> > > > https://lists.apache.org/thread.html/r71d96d285c735b1611920fb3e7224c9ce6fd53d09bf0e8f144f4fcbd%40%3Cdev.hudi.apache.org%3E
> > > >
> > > >
> > > > On Wed, Sep 15, 2021 at 3:34 PM Y Ethan Guo <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > The hudi-client module has core Hudi abstractions and client
> > > > > logic for different engines like Spark, Flink, and Java.  While a
> > > > > previous effort (HUDI-538 [1]) decoupled the integration with
> > > > > Spark, there is quite some code duplication across different
> > > > > engines for almost the same logic due to the current interface
> > > > > design.  Some parts have also diverged among engines, making
> > > > > debugging and support difficult.
> > > > >
> > > > > I propose to further refactor the hudi-client module with the
> > > > > goal of improving code reuse across multiple engines and reducing
> > > > > the divergence of the logic among them, so that the core Hudi
> > > > > action execution logic is shared across engines, except for
> > > > > engine-specific transformations.  Such a pattern also allows easy
> > > > > support of core Hudi functionality for all engines in the future.
> > > > > Specifically,
> > > > >
> > > > > (1) Abstracts the transformation boilerplate inside the
> > > > > HoodieEngineContext and implements the engine-specific data
> > > > > transformation logic in the subclasses.  Type casts can be done
> > > > > inside the engine context.
> > > > > (2) Creates a new HoodieData abstraction for passing input and
> > > > > output along the flow of execution, and uses it in the different
> > > > > Hudi abstractions, e.g., HoodieTable, HoodieIOHandle,
> > > > > BaseActionExecutor, instead of enforcing type parameters such as
> > > > > RDD<HoodieRecord> and List<HoodieRecord>, which are one source of
> > > > > duplication (a rough sketch follows this list).
> > > > > (3) Extracts common execution logic from the multiple engines
> > > > > into the hudi-client-common module.
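> > > > >
> > > > > To illustrate item (2), here is a minimal sketch of what
> > > > > HoodieData could look like (the class and method names below are
> > > > > illustrative, not a final API):
> > > > >
> > > > > import java.util.List;
> > > > > import java.util.function.Function;
> > > > > import java.util.stream.Collectors;
> > > > >
> > > > > // Engine-agnostic container: action executors program against
> > > > > // this instead of RDD<HoodieRecord> or List<HoodieRecord>.
> > > > > abstract class HoodieData<T> {
> > > > >   abstract <O> HoodieData<O> map(Function<T, O> fn);
> > > > >   abstract List<T> collectAsList();
> > > > > }
> > > > >
> > > > > // Java/Flink flavor backed by a plain List; a Spark flavor would
> > > > > // wrap an RDD behind the same interface.
> > > > > class HoodieListData<T> extends HoodieData<T> {
> > > > >   private final List<T> data;
> > > > >
> > > > >   HoodieListData(List<T> data) {
> > > > >     this.data = data;
> > > > >   }
> > > > >
> > > > >   @Override
> > > > >   <O> HoodieData<O> map(Function<T, O> fn) {
> > > > >     return new HoodieListData<>(
> > > > >         data.stream().map(fn).collect(Collectors.toList()));
> > > > >   }
> > > > >
> > > > >   @Override
> > > > >   List<T> collectAsList() {
> > > > >     return data;
> > > > >   }
> > > > > }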
> > > > >
> > > > > As a first step and an exploration of items (1) and (3) above,
> > > > > I've tried to refactor the rollback actions, and the PR is here
> > > > > [HUDI-2433] [2].  In this PR, I completely remove all
> > > > > engine-specific rollback packages and only keep one rollback
> > > > > package in hudi-client-common, adding ~350 LoC while deleting
> > > > > 1.3K LoC.  My next step is to refactor the commit action, which
> > > > > encompasses item (2) above.
> > > > >
> > > > > What do you folks think?  Any other suggestions?
> > > > >
> > > > > [1] [HUDI-538] [UMBRELLA] Restructuring hudi client module for
> > > > > multi engine support
> > > > > https://issues.apache.org/jira/browse/HUDI-538
> > > > > [2] PR: [HUDI-2433] Refactor rollback actions in hudi-client module
> > > > > https://github.com/apache/hudi/pull/3664/files
> > > > >
> > > > > Best,
> > > > > - Ethan
> > > > >
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>
