Hi Gary,

Thanks for the detailed response. Let me add my take on it.
>> HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to use the
>> AppendHandle.write(HoodieRecord) directly

I have the same issue on the Java client, for the Kafka Connect implementation. I have an idea of how we can implement this (rough sketch below) and will raise a PR to get your thoughts. We can then see if this can be leveraged across the Flink and Java clients.

On the IOHandle not having the Table inside: I think the file reader/writer abstraction already exists, and having the Table in the I/O layers helps us perform I/O while maintaining consistency with the timeline.

+1 on the next two points. I think these layers have well-defined roles, which is probably why we have been able to get this far :) Maybe we need to pull I/O up into hudi-common?

For this project, we can trim the scope to code reuse and to moving all the engine-specific implementations up into hudi-client-common. To make Ethan's item (2) a bit more concrete, I have also appended a rough, illustrative sketch of the HoodieData idea below the quoted thread. What do you think?
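To make the first point concrete, here is a minimal, illustrative sketch of the record-at-a-time write path. RecordWriteHandle, writeAll, and the string "statuses" are placeholders I made up for illustration, not the actual Hudi classes; the real AppendHandle carries a lot more state (table, file group, log writer, etc.).

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Illustrative placeholder for an append-style handle; not the real
    // org.apache.hudi AppendHandle API.
    interface RecordWriteHandle<R> {
      void write(R record);    // hand over one record at a time
      List<String> close();    // seal the handle and return per-record statuses
    }

    public final class RecordAtATimeWriteSketch {

      // Drain an unbounded iterator through the handle instead of materializing
      // a List<HoodieRecord> up front; memory stays bounded by the handle's own
      // buffering, not by the size of the incoming batch.
      static <R> List<String> writeAll(Iterator<R> records, RecordWriteHandle<R> handle) {
        while (records.hasNext()) {
          handle.write(records.next());
        }
        return handle.close();
      }

      public static void main(String[] args) {
        // Stand-in for a stream of records coming from Kafka Connect or Flink.
        Iterator<String> incoming = List.of("record-1", "record-2", "record-3").iterator();

        RecordWriteHandle<String> handle = new RecordWriteHandle<String>() {
          private final List<String> statuses = new ArrayList<>();
          @Override public void write(String record) { statuses.add("appended " + record); }
          @Override public List<String> close()      { return statuses; }
        };

        System.out.println(writeAll(incoming, handle));  // [appended record-1, ...]
      }
    }

The idea is that close() happens at a checkpoint or flush boundary, so the cost is spread across the stream instead of concentrating at commit time, along the lines of what Gary described.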
Thanks,
Vinoth

On Thu, Sep 16, 2021 at 6:55 AM Gary Li <[email protected]> wrote:

> Huge +1. Recently I have been working on making the Flink writer work in a streaming fashion, and found that the List<HoodieRecord> interface limits the streaming power of Flink. By switching from HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to using AppendHandle.write(HoodieRecord) directly, the throughput almost doubled and the checkpoint time of the writer dropped from minutes to seconds. But I found it really difficult to fit this change into the current client interface.
>
> My 2 cents:
>
> - The HoodieIOHandle should only handle the I/O, and should not have a HoodieTable inside.
> - We need a more streaming-friendly Handle. For Flink, we can definitely change all the batch-mode List<HoodieRecord> processing to handling HoodieRecord one by one, just like AppendHandle.write(HoodieRecord) and AppendHandle.close(). This will spread out the computing cost and flatten the curve.
> - We can use the Handle to precisely control the JVM to avoid OOM and optimize the memory footprint. Then we don't need to implement another memory-control mechanism in the compute engine itself.
> - HoodieClient, HoodieTable, HoodieIOHandle, HoodieTimeline, HoodieFileSystemView, etc. should each have a well-defined role and a well-defined layer. We should know when to use what, and whether each should be used by the driver in a single thread or by the workers in a distributed way.
>
> This is a big project and could benefit Hudi in the long term. Happy to discuss more in the design doc or PRs.
>
> Best,
> Gary
>
> On Thu, Sep 16, 2021 at 3:21 AM Raymond Xu <[email protected]> wrote:
>
> > +1, that's a great improvement.
> >
> > On Wed, Sep 15, 2021 at 10:40 AM Sivabalan <[email protected]> wrote:
> >
> > > ++1. This definitely helps Hudi scale and makes it more maintainable. Thanks for driving this effort. Most devs show interest in major features and don't like to spend time on such foundational work, but as the project scales, this foundational work will have higher returns in the long run.
> > >
> > > On Wed, Sep 15, 2021 at 8:29 AM Vinoth Chandar <[email protected]> wrote:
> > >
> > > > Another +1. The HoodieData abstraction will go a long way in reducing LoC.
> > > >
> > > > Happy to work with you to see this through! I really encourage top contributors to the Flink and Java clients to actively review all PRs as well, given there are subtle differences everywhere.
> > > >
> > > > This will help us smoothly provide all the core features across engines, and also help us easily write a DataSet/Row-based client for Spark as well.
> > > >
> > > > Onwards and upwards,
> > > > Vinoth
> > > >
> > > > On Wed, Sep 15, 2021 at 4:57 AM vino yang <[email protected]> wrote:
> > > >
> > > > > Hi Ethan,
> > > > >
> > > > > Big +1 for the proposal.
> > > > >
> > > > > Actually, we have discussed this topic before. [1]
> > > > >
> > > > > Will review your refactoring PR later.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > [1]:
> > > > > https://lists.apache.org/thread.html/r71d96d285c735b1611920fb3e7224c9ce6fd53d09bf0e8f144f4fcbd%40%3Cdev.hudi.apache.org%3E
> > > > >
> > > > > On Wed, Sep 15, 2021 at 3:34 PM Y Ethan Guo <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > The hudi-client module has core Hudi abstractions and client logic for different engines like Spark, Flink, and Java. While a previous effort (HUDI-538 [1]) decoupled the integration with Spark, there is quite a bit of code duplication across the engines for almost the same logic, due to the current interface design. Some parts have also diverged among engines, making debugging and support difficult.
> > > > > >
> > > > > > I propose to further refactor the hudi-client module with the goal of improving code reuse across multiple engines and reducing the divergence of the logic among them, so that the core Hudi action execution logic is shared across engines, except for engine-specific transformations. Such a pattern also makes it easy to support core Hudi functionality for all engines in the future. Specifically:
> > > > > >
> > > > > > (1) Abstract the transformation boilerplate inside the HoodieEngineContext and implement the engine-specific data transformation logic in the subclasses. Type casts can be done inside the engine context.
> > > > > > (2) Create a new HoodieData abstraction for passing input and output along the flow of execution, and use it in the different Hudi abstractions, e.g., HoodieTable, HoodieIOHandle, BaseActionExecutor, instead of enforcing concrete type parameters such as RDD<HoodieRecord> and List<HoodieRecord>, which are one source of duplication.
> > > > > > (3) Extract common execution logic from the individual engines into the hudi-client-common module.
> > > > > >
> > > > > > As a first step and an exploration of items (1) and (3) above, I've tried to refactor the rollback actions; the PR is here: [HUDI-2433] [2]. In this PR, I completely remove all engine-specific rollback packages and keep only one rollback package in hudi-client-common, adding ~350 LoC while deleting ~1.3K LoC. My next step is to refactor the commit action, which encompasses item (2) above.
> > > > > >
> > > > > > What do you folks think? Any other suggestions?
> > > > > > [1] [HUDI-538] [UMBRELLA] Restructuring hudi client module for multi engine support
> > > > > >     https://issues.apache.org/jira/browse/HUDI-538
> > > > > > [2] PR: [HUDI-2433] Refactor rollback actions in hudi-client module
> > > > > >     https://github.com/apache/hudi/pull/3664/files
> > > > > >
> > > > > > Best,
> > > > > > - Ethan
> > >
> > > --
> > > Regards,
> > > -Sivabalan
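As promised above, here is a very rough sketch of what the HoodieData abstraction from Ethan's item (2) could look like. EngineData, ListBackedData, and tagRecords are names I made up purely for illustration, not the actual Hudi API; the real design may differ.

    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    // Illustrative only: a tiny engine-agnostic data handle. Action executors in
    // hudi-client-common would program against this interface instead of against
    // RDD<HoodieRecord> or List<HoodieRecord> directly.
    interface EngineData<T> {
      <O> EngineData<O> map(Function<T, O> fn);  // engine-specific transform underneath
      List<T> collectAsList();                   // materialize on the driver when needed
    }

    // Java/Flink-style implementation backed by an in-memory List. A Spark
    // implementation would instead wrap a JavaRDD<T> and delegate to it.
    final class ListBackedData<T> implements EngineData<T> {
      private final List<T> data;
      ListBackedData(List<T> data) { this.data = data; }

      @Override public <O> EngineData<O> map(Function<T, O> fn) {
        return new ListBackedData<>(data.stream().map(fn).collect(Collectors.toList()));
      }
      @Override public List<T> collectAsList() { return data; }
    }

    public final class EngineDataSketch {
      // Shared "action" logic written once against the abstraction.
      static EngineData<String> tagRecords(EngineData<Integer> input) {
        return input.map(i -> "record-" + i);
      }

      public static void main(String[] args) {
        EngineData<Integer> input = new ListBackedData<>(List.of(1, 2, 3));
        System.out.println(tagRecords(input).collectAsList());  // [record-1, record-2, record-3]
      }
    }

The point is that the executor code never needs to know which engine it runs on; only the EngineData implementation (and the engine context that creates it) is engine-specific.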
