Hi,

> I think the proposed interfaces indeed look more intuitive and could
> simplify the code structure. My concern is mostly about the ROI of such
> refactoring work. Perhaps I lack direct involvement in the Flink client
> work, but it looks like it's mainly about code restructuring and
> simplification for a new engine implementation?

My original intention for this proposal is, as you said, to refactor the
code abstraction and simplify the client implementation. But Danny also
has an idea to redesign the abstraction around the DataFlow model. It
depends on whether we want to solve all the problems in one shot; maybe
splitting the work into multiple steps would keep each one more focused.

Regarding ROI, I explained it in the original proposal. The current
implementation's abstraction is too deep, and the class expansion is
serious: any new engine implementation has to create a dozen new classes.
And when we make adjustments in the Spark write client (such as adding a
new feature or fixing a bug), the other engines pay a high cost to pick
up the change.

> Speaking of simplifying the client implementation, I wonder about the
> possibility of removing the table type concept, i.e., making COW/MOR
> tables the same thing by configuring each insert/upsert operation

About this idea, I will let @Vinoth Chandar <vin...@apache.org> chime in.

Best,
Vino

On Thu, Jan 21, 2021 at 11:04 AM Raymond Xu <xu.shiyan.raym...@gmail.com> wrote:

> I think the proposed interfaces indeed look more intuitive and could
> simplify the code structure. My concern is mostly about the ROI of such
> refactoring work. Perhaps I lack direct involvement in the Flink client
> work, but it looks like it's mainly about code restructuring and
> simplification for a new engine implementation?
>
> Speaking of simplifying the client implementation, I wonder about the
> possibility of removing the table type concept, i.e., making COW/MOR
> tables the same thing by configuring each insert/upsert operation
> - A COW table should be able to take in a new delta commit by just
> writing log files alongside the base files
> - An MOR table should be able to do a one-time compaction of all log
> files and the incoming records
> This also looks like big refactoring work, so I'm also concerned about
> the ROI.
> - Benefits I see: unifying the concepts of Hudi as a table format,
> fewer classes to implement for clients, more flexibility in writing
> - Some downsides: too much code change; MOR->COW can be too expensive
> (maybe skip this case?)
>
> Just thinking: if we do carry out the client abstraction work, could
> this table type simplification also be done at the same time?
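>
> To make this concrete, here is a rough sketch (every name below is
> hypothetical, not an existing Hudi API) where the write mode is chosen
> per operation instead of being fixed by the table type:
>
>   // Hypothetical sketch: a per-operation write mode replaces the
>   // table-level COW/MOR type. None of these names exist in Hudi today.
>   public enum WriteMode {
>     MERGE_ON_WRITE,  // rewrite base files (COW-style behavior)
>     LOG_ON_WRITE     // append log files (MOR-style behavior)
>   }
>
>   public interface WriteOperations<T> {
>     // A COW-style table takes a delta commit under LOG_ON_WRITE; an
>     // MOR-style table compacts inline under MERGE_ON_WRITE.
>     List<WriteStatus> upsert(List<HoodieRecord<T>> records,
>                              String instantTime,
>                              WriteMode mode);
>   }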
>
> On Tue, Jan 19, 2021 at 1:38 AM vino yang <vinoy...@apache.org> wrote:
>
> > Hi guys,
> >
> > *I'm opening this thread to discuss whether we can separate the
> > attributes and behaviors of HoodieTable, and rethink the abstraction
> > of the client.*
> >
> > Currently, in the hudi-client-common module, there is a HoodieTable
> > class, which contains a set of attributes and behaviors. It has a
> > different implementation for each engine. The existing classes
> > include:
> >
> >    - HoodieSparkTable;
> >    - HoodieFlinkTable;
> >    - HoodieJavaTable;
> >
> > In addition, for the two table types, COW and MOR, these classes are
> > further split. For example, HoodieSparkTable is split into:
> >
> >    - HoodieSparkCopyOnWriteTable;
> >    - HoodieSparkMergeOnReadTable;
> >
> > HoodieSparkTable degenerates into a factory to initialize these classes.
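> >
> > Roughly, the current shape is something like this (a simplified
> > sketch; the real constructors and signatures differ):
> >
> >    public abstract class HoodieSparkTable<T> extends HoodieTable<T> {
> >      public static <T> HoodieSparkTable<T> create(
> >          HoodieWriteConfig config, HoodieEngineContext context,
> >          HoodieTableMetaClient metaClient) {
> >        // Simplified: the factory just dispatches on the table type.
> >        switch (metaClient.getTableType()) {
> >          case COPY_ON_WRITE:
> >            return new HoodieSparkCopyOnWriteTable<>(config, context,
> >                metaClient);
> >          case MERGE_ON_READ:
> >            return new HoodieSparkMergeOnReadTable<>(config, context,
> >                metaClient);
> >          default:
> >            throw new HoodieException("Unsupported table type");
> >        }
> >      }
> >    }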
> >
> > This model looks clear but brings some problems.
> >
> > First of all, HoodieTable is a mixture of attributes and behaviors.
> > The attributes are engine-independent, but the behaviors vary by
> > engine. Semantically speaking, HoodieTable should belong in
> > hudi-common, not be tied only to hudi-client-common.
> >
> > Second, the behaviors contained in HoodieTable, such as:
> >
> >    - upsert
> >    - insert
> >    - delete
> >    - insertOverwrite
> >
> > These are similar to the APIs provided by the client, but they are
> > not implemented directly in HoodieTable. Instead, the implementation
> > is handed over to a set of actions (executors), such as:
> >
> >    - commit
> >    - compact
> >    - clean
> >    - rollback
> >
> > In addition, these actions do not fully contain the implementation
> > logic either. Part of it is split out into helper classes under the
> > same package (see the sketch after this list), such as:
> >
> >    - SparkWriteHelper
> >    - SparkMergeHelper
> >    - SparkDeleteHelper
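> >
> > For example, a single upsert call today travels roughly like this
> > (these class names are from the current code; the chain is
> > abbreviated):
> >
> >    // SparkRDDWriteClient.upsert(records, instantTime)
> >    //   -> HoodieSparkCopyOnWriteTable.upsert(...)
> >    //     -> SparkUpsertCommitActionExecutor.execute()
> >    //       -> SparkWriteHelper.write(...)  // dedup, tag, write
> >    // A new engine has to recreate this whole chain with its own
> >    // classes.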
> >
> > To sum up, in the name of abstraction the implementation is pushed
> > backward layer by layer (mainly into the executor and helper
> > classes). As a result, each client needs a large number of similarly
> > patterned classes just to implement the basic API, and the number of
> > classes expands dramatically.
> >
> > Let us reorganize it:
> >
> > What a write client does is insert or upsert a batch of records into
> > a table with transaction semantics, and provide some additional
> > operations on the table. This involves three components:
> >
> >    - Two objects: a table and a batch of records;
> >    - One type of operation: insert or upsert (focused on the
> >    records);
> >    - One type of additional operation: compact / clean (focused on
> >    the table itself).
> >
> > Therefore, the following improvements are proposed here:
> >
> >    - The table object contains no behavior; the table should be
> >    public and engine-independent;
> >    - Classify and abstract the operation behaviors (a rough sketch
> >    follows this list):
> >       - TableInsertOperation (interface)
> >       - TableUpsertOperation (interface)
> >       - TableTransactionOperation
> >       - TableManageOperation (compact/clean…)
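> >
> > Here is what these interfaces could look like (the signatures are
> > illustrative only; the generics and I/O types are open questions):
> >
> >    public interface TableInsertOperation<I, O> {
> >      O insert(HoodieTable table, String instantTime, I records);
> >    }
> >
> >    public interface TableUpsertOperation<I, O> {
> >      O upsert(HoodieTable table, String instantTime, I records);
> >    }
> >
> >    public interface TableTransactionOperation {
> >      void begin(String instantTime);
> >      void commit(String instantTime);
> >      void rollback(String instantTime);
> >    }
> >
> >    public interface TableManageOperation {
> >      void compact(HoodieTable table, String instantTime);
> >      void clean(HoodieTable table, String instantTime);
> >    }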
> >
> > This kind of abstraction is more intuitive and focused, so that each
> > behavior is materialized in exactly one place. For example, for the
> > insert operation, the Spark engine would yield the following concrete
> > implementation classes:
> >
> >    - CoWTableSparkInsertOperation;
> >    - MoRTableSparkInsertOperation;
> >
> > Of course, we can also provide an optional factory class named
> > TableSparkInsertOperation.
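> >
> > That optional factory might look like this (again illustrative; it
> > just picks the implementation by table type):
> >
> >    public class TableSparkInsertOperation {
> >      public static TableInsertOperation<JavaRDD<HoodieRecord>,
> >          JavaRDD<WriteStatus>> create(HoodieTable table) {
> >        // Names follow the proposal above; signatures are sketches.
> >        return table.getMetaClient().getTableType()
> >                == HoodieTableType.COPY_ON_WRITE
> >            ? new CoWTableSparkInsertOperation()
> >            : new MoRTableSparkInsertOperation();
> >      }
> >    }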
> >
> > Based on the new abstraction, a new engine only needs to implement
> > the behavior interfaces above and then provide a new client to
> > instantiate them.
> >
> > To stay focused here, I have deliberately ignored an important
> > object: the index. The index should also live in the hudi-common
> > module; its implementation may be engine-specific, and it provides
> > acceleration for both writing and querying.
> >
> > The above is just a preliminary idea; there are still many details
> > that have not been considered. I hope to hear your thoughts on this.
> >
> > Any opinions and thoughts are appreciated and welcome.
> >
> > Best,
> > Vino
> >
>
