Hi Vinoth,

Thanks for your response. For HoodieIOHandle, IMO we could define the scope
of the Handle during the initialization, so we don't need to care about the
timeline and table view when actually writing the data. Is that possible? A
HoodieTable could have many Handles writing data at the same time and it
will look cleaner if we can keep the timeline and file system view inside
the table itself.

Best,
Gary

On Sat, Sep 18, 2021 at 12:06 AM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Gary,
>
> Thanks for the detailed response. Let me add my take on it.
>
> >>HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to use the
> AppendHandle.write(HoodieRecord) directly,
>
> I have the same issue on JavaClient, for the Kafka Connect implementation.
> I have an idea of how we can implement this. Will raise a PR and get your
> thoughts.
> We can then see if this can be leveraged across Flink and Java clients.
>
> On the IOHandle not having the Table inside, I think the File
> reader/writer  abstraction exists already and having the Table in the io
> layers helps us perform I/O
> while maintaining consistency with the timeline.
>
> +1 on the next two points.
>
> I think these layers have well defined roles, and probably why we are able
> to get this far :) . May be we need to pull I/O up into hudi-common ?
>
> For this project, we can trim the scope to code reuse and moving all the
> different engine specific implementations up into hudi-client-common.
>
> What do you think?
>
> Thanks
> Vinoth
>
>
> On Thu, Sep 16, 2021 at 6:55 AM Gary Li <gar...@apache.org> wrote:
>
> > Huge +1. Recently I am working on making the Flink writer in a streaming
> > fashion and found the List<HoodieRecord> interface is limiting the
> > streaming power of Flink. By switching from
> > HoodieFlinkMergeOnReadTable.upsert(List<HoodieRecord>) to use the
> > AppendHandle.write(HoodieRecord) directly, the throughput was almost
> > doubled and the checkpoint time of the writer was reduced from minutes to
> > seconds. But I found it really difficult to fit this change into the
> > current client interface.
> >
> > My 2 cents:
> >
> >    - The HoodieIOHandle should only handle the IO, and not having
> >    HoodieTable inside.
> >    - We need a more streaming-friendly Handle. For Flink, we can
> definitely
> >    change all the batch mode List<HoodieRecord> to processing
> HoodieRecord
> > one
> >    by one, just like the AppendHandle.write(HoodieRecord) and
> >    AppendHandle.close(). This will spread the computing cost and
> >    flattening the curve.
> >    - We can use the Handle to precisely control the JVM to avoid OOM and
> >    optimize the memory footprint. Then we don't need to implement another
> >    memory control mechanism in the compute engine itself.
> >    - HoodieClient, HoodieTable, HoodieIOHandle, HoodieTimeline,
> >    HoodieFileSystemView e.t.c should have a well-defined role and
> > well-defined
> >    layer. We should know when to use what, it should be used by the
> driver
> > in
> >    a single thread or used by the worker in a distributed way.
> >
> > This is a big project and could benefit Hudi in long term. Happy to
> discuss
> > more in the design doc or PRs.
> >
> > Best,
> > Gary
> >
> > On Thu, Sep 16, 2021 at 3:21 AM Raymond Xu <xu.shiyan.raym...@gmail.com>
> > wrote:
> >
> > > +1 that's a great improvement.
> > >
> > > On Wed, Sep 15, 2021 at 10:40 AM Sivabalan <n.siv...@gmail.com> wrote:
> > >
> > > > ++1. definitely help's Hudi scale and makes it more maintainable.
> > Thanks
> > > > for driving this effort. Mostly devs show interest in major features
> > and
> > > > don't like to spend time in such foundational work. But as the
> project
> > > > scales, these foundational work will have a higher returns in the
> long
> > > run.
> > > >
> > > > On Wed, Sep 15, 2021 at 8:29 AM Vinoth Chandar <vin...@apache.org>
> > > wrote:
> > > >
> > > > > Another +1 ,  HoodieData abstraction will go a long way in reducing
> > > LoC.
> > > > >
> > > > > Happy to work with you to see this through! I really encourage top
> > > > > contributors to the Flink and Java clients as well,
> > > > > actively review all PRs, given there are subtle differences
> > everywhere.
> > > > >
> > > > > This will help us smoothly provide all the core features across
> > > engines.
> > > > > Also help us easily write a DataSet/Row based
> > > > > client for Spark as well.
> > > > >
> > > > > Onwards and upwards
> > > > > Vinoth
> > > > >
> > > > > On Wed, Sep 15, 2021 at 4:57 AM vino yang <yanghua1...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Ethan,
> > > > > >
> > > > > > Big +1 for the proposal.
> > > > > >
> > > > > > Actually, we have discussed this topic before.[1]
> > > > > >
> > > > > > Will review your refactor PR later.
> > > > > >
> > > > > > Best,
> > > > > > Vino
> > > > > >
> > > > > > [1]:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/r71d96d285c735b1611920fb3e7224c9ce6fd53d09bf0e8f144f4fcbd%40%3Cdev.hudi.apache.org%3E
> > > > > >
> > > > > >
> > > > > > Y Ethan Guo <ethan.guoyi...@gmail.com> 于2021年9月15日周三 下午3:34写道:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > hudi-client module has core Hudi abstractions and client logic
> > for
> > > > > > > different engines like Spark, Flink, and Java.  While previous
> > > effort
> > > > > > > (HUDI-538 [1]) has decoupled the integration with Spark, there
> is
> > > > quite
> > > > > > > some code duplication across different engines for almost the
> > same
> > > > > logic
> > > > > > > due to the current interface design.  Some part also has
> > divergence
> > > > > among
> > > > > > > engines, making debugging and support difficult.
> > > > > > >
> > > > > > > I propose to further refactor the hudi-client module with the
> > goal
> > > of
> > > > > > > improving the code reuse across multiple engines and reducing
> the
> > > > > > > divergence of the logic among them, so that the core Hudi
> action
> > > > > > execution
> > > > > > > logic should be shared across engines, except for engine
> specific
> > > > > > > transformations.  Such a pattern also allows easy support of
> core
> > > > Hudi
> > > > > > > functionality for all engines in the future.  Specifically,
> > > > > > >
> > > > > > > (1) Abstracts the transformation boilerplates inside the
> > > > > > > HoodieEngineContext and implements the engine-specific data
> > > > > > transformation
> > > > > > > logic in the subclasses.  Type cast can be done inside the
> engine
> > > > > > context.
> > > > > > > (2) Creates new HoodieData abstraction for passing input and
> > output
> > > > > along
> > > > > > > the flow of execution, and uses it in different Hudi
> > abstractions,
> > > > > e.g.,
> > > > > > > HoodieTable, HoodieIOHandle, BaseActionExecutor, instead of
> > > enforcing
> > > > > > type
> > > > > > > parameters encountering RDD<HoodieRecord> and
> List<HoodieRecord>
> > > > which
> > > > > > are
> > > > > > > one source of duplication.
> > > > > > > (3) Extracts common execution logic to hudi-client-common
> module
> > > from
> > > > > > > multiple engines.
> > > > > > >
> > > > > > > As a first step and exploration for item (1) and (3) above,
> I've
> > > > tried
> > > > > to
> > > > > > > refactor the rollback actions and the PR is here
> [HUDI-2433][2].
> > > In
> > > > > this
> > > > > > > PR, I completely remove all engine-specific rollback packages
> and
> > > > only
> > > > > > keep
> > > > > > > one rollback package in hudi-client-common, adding ~350 LoC
> while
> > > > > > deleting
> > > > > > > 1.3K LoC.  My next step is to refactor the commit action which
> > > > > > encompasses
> > > > > > > item (2) above.
> > > > > > >
> > > > > > > What do you folks think and any other suggestions?
> > > > > > >
> > > > > > > [1] [HUDI-538] [UMBRELLA] Restructuring hudi client module for
> > > multi
> > > > > > engine
> > > > > > > support
> > > > > > > https://issues.apache.org/jira/browse/HUDI-538
> > > > > > > [2] PR: [HUDI-2433] Refactor rollback actions in hudi-client
> > module
> > > > > > > https://github.com/apache/hudi/pull/3664/files
> > > > > > >
> > > > > > > Best,
> > > > > > > - Ethan
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
>

Reply via email to