Sorry for the late reply. Standard excuse: 0.7.0 release.

+1 on the need to rethink this.

Some comments on the issues in this thread, IMO.

1. Agree that the hierarchy has gotten much taller now, and we need to
immediately pull back more code into hudi-client-common. IMO what we lack
is some kind of abstraction for the input data itself, i.e.,
RDD<HoodieRecord>, Dataset<HoodieRecord> (see the sketch below this list).
I have had my fair share of trying to merge these classes; it typically
boils down to needing some RDD method called in the Spark* class.
2. (Not pushing back, just explaining.) The current code split between the
WriteClient and HoodieTable/subclasses is that the WriteClient actually
contains all the runtime machinery and delegates individual operations to
the specific type of table. The *Operation classes you proposed are very
similar to the ActionExecutor classes we have today, right?
3. On moving HoodieTable to hudi-common, I am a bit undecided. HoodieTable
just has all the "write" operations. This stems from the fact that we write
data out using a bunch of different actions, but expose the same standard
way of querying via SQL from different engines. I am wondering if it
suffices to, say, rename HoodieTable to HoodieWritableTable or something to
make the intent clearer. Moving it to hudi-common may not bring much value
IMO, unless reads can also be implemented there in the same way. But we all
know that Hive, Spark, and Presto all have different/non-standard ways of
integrating Hudi-like table formats, and as such, we have to work with a
bunch of different abstractions to list files, read stats, or read records
out.
4. Fully agree with you on all these Helper classes that have crept up. I
wanted to first trim down the ActionExecutor implementations before we did
multi-engine, but it kind of happened the other way around.
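
For (1), here is a minimal sketch of the kind of input abstraction I mean
(all names made up, just to illustrate the idea):

   // Engine-agnostic handle over the input data, so that shared code in
   // hudi-client-common never touches RDD/Dataset/List directly.
   public interface HoodieData<T> {
     // a serializable function type would be needed for Spark; elided here
     <R> HoodieData<R> map(SerializableFunction<T, R> func);
     long count();
   }

A Spark implementation would wrap JavaRDD<T>, while the Flink/Java clients
could wrap a simple List<T>. Shared write logic could then be written once
against HoodieData and live in hudi-client-common.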

All that said, I am spending my time reading all (I mean all) query engine
abstractions for reading and writing, and I want to start an RFC on a new
"bedrock" for the project. Happy to combine efforts!

In the meantime, if someone can spend time thinking about how to reduce the
code duplication and shorten the class hierarchies in the client module,
that would be great.
I imagine such a large refactoring undertaking will run for a good chunk of
time this year. So, immediately paving the way for, say, the Flink/Java
clients to get the benefits of metadata table access has great value for
our users.

my 2c.
Vinoth

On Tue, Feb 2, 2021 at 7:31 PM vino yang <yanghua1...@gmail.com> wrote:

> Hi,
>
> > I think the proposed interfaces indeed look more intuitive and could
> > simplify the code structures. My concern is mostly around the ROI of
> > such refactoring work. Probably I lack some direct involvement in the
> > Flink client work, but it looks like it's mainly about code
> > restructuring and simplification for a new engine implementation?
>
> My original intention for this proposal is, as you said, refactoring the
> code abstraction and simplifying the client implementation. But Danny
> also has an idea to redesign the abstraction around the DataFlow model.
> It depends on whether we want to solve all the problems in one shot;
> maybe splitting it into multiple steps would help us focus a bit more.
>
> Regarding ROI, I have explained it in the original proposal. The current
> implementation is too deep in its abstraction, and the expansion of
> classes is serious. Any new engine implementation will create a dozen new
> classes. When we make some adjustments in the Spark write client (such as
> adding a new feature, or fixing a bug), the other engines have to bear a
> high cost to keep up.
>
> > Speaking of simplifying the client implementation, I wonder about the
> > possibility of removing the table type concept, i.e., making COW/MOR
> > tables the same thing by configuring each insert/upsert operation
>
> About this idea, I will let @Vinoth Chandar <vin...@apache.org> chime in.
>
> Best,
> Vino
>
> On Thu, Jan 21, 2021 at 11:04 AM Raymond Xu <xu.shiyan.raym...@gmail.com> wrote:
>
>> I think the proposed interfaces indeed look more intuitive and could
>> simplify the code structures. My concern is mostly around the ROI of such
>> refactoring work. Probably I lack some direct involvement in the Flink
>> client work, but it looks like it's mainly about code restructuring and
>> simplification for a new engine implementation?
>>
>> Speaking of simplifying the client implementation, I wonder about the
>> possibility of removing the table type concept, i.e., making COW/MOR
>> tables the same thing by configuring each insert/upsert operation (see
>> the sketch at the end of this message):
>> - A COW table should be OK to take in a new delta commit by just adding
>> log files alongside the base files
>> - A MOR table should be OK to do a one-time compaction over all the log
>> files and the incoming records
>> This also looks like a big piece of refactoring work, so I am also
>> concerned about the ROI.
>> - Benefits I see: unifying the concepts for Hudi as a table format, fewer
>> classes to implement for clients, more flexibility in writing
>> - Some downsides: too much code change; MOR->COW can be too expensive
>> (skip this case maybe?)
>>
>> Just thinking: if we do carry out the client abstraction work, could this
>> table type simplification also be done at the same time?
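>>
>> A rough sketch of that per-operation choice (all names hypothetical, none
>> of this is existing Hudi config):
>>
>>    // Hypothetical: the write mode becomes a property of each write,
>>    // rather than a fixed property of the table.
>>    enum WriteMode {
>>      BASE_FILES, // COW-style: merge incoming records into new base files
>>      LOG_FILES   // MOR-style: append a delta commit as log files
>>    }
>>
>>    // then each call picks its own mode:
>>    writeClient.upsert(records, instantTime, WriteMode.LOG_FILES);
>>    writeClient.upsert(records, instantTime, WriteMode.BASE_FILES);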
>>
>> On Tue, Jan 19, 2021 at 1:38 AM vino yang <vinoy...@apache.org> wrote:
>>
>> > Hi guys,
>> >
>> > *I open this thread to discuss if we can separate the attributes and
>> > behaviors of HoodieTable, and rethink the abstraction of the client.*
>> >
>> > Currently, in the hudi-client-common module, there is a HoodieTable
>> > class, which contains a set of attributes and behaviors. For different
>> > engines, it has different implementations. The existing classes include:
>> >
>> >    - HoodieSparkTable;
>> >    - HoodieFlinkTable;
>> >    - HoodieJavaTable;
>> >
>> > In addition, for the two table types, COW and MOR, these classes are
>> > further split. For example, HoodieSparkTable is split into:
>> >
>> >    - HoodieSparkCopyOnWriteTable;
>> >    - HoodieSparkMergeOnReadTable;
>> >
>> > HoodieSparkTable degenerates into a factory to initialize these classes.
>> >
>> > This model looks clear but brings some problems.
>> >
>> > First of all, HoodieTable is a mixture of attributes and behaviors. The
>> > attributes are independent of the engines, but the behaviors vary
>> > depending on the engine. Semantically speaking, HoodieTable should
>> > belong to hudi-common, and should not be tied only to
>> > hudi-client-common.
>> >
>> > Second, the behaviors contained in HoodieTable, such as:
>> >
>> >    - upsert
>> >    - insert
>> >    - delete
>> >    - insertOverwrite
>> >
>> > They are similar to the APIs provided by the client, but they are not
>> > implemented directly in HoodieTable. Instead, the implementation is
>> > handed over to a bunch of actions (executors), such as:
>> >
>> >    - commit
>> >    - compact
>> >    - clean
>> >    - rollback
>> >
>> > In addition, these actions do not completely contain the implementation
>> > logic. Part of their logic is separated into some Helper classes under
>> > the same package, such as:
>> >
>> >    - SparkWriteHelper
>> >    - SparkMergeHelper
>> >    - SparkDeleteHelper
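>> >
>> > Schematically, today's call path for a single API looks roughly like
>> > this (Spark upsert on a COW table; names approximate):
>> >
>> >    writeClient.upsert(records, instantTime)
>> >      -> HoodieSparkCopyOnWriteTable.upsert(...)
>> >        -> SparkUpsertCommitActionExecutor.execute()
>> >          -> SparkWriteHelper.write(...)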
>> >
>> > To sum up, in the name of abstraction, the implementation is pushed
>> > backward layer by layer (mainly into the executor + helper classes).
>> > This means each client needs a lot of similarly patterned classes just
>> > to implement the basic API, and the class expansion is very serious.
>> >
>> > Let us reorganize it:
>> >
>> > What a write client does is insert or upsert a batch of records into a
>> > table with transaction semantics, and provide some additional operations
>> > on the table. It involves three components:
>> >
>> >    - Two objects: a table, a batch of records;
>> >    - One type of operation: insert or upsert (focus on records)
>> >    - One type of additional operation: compact / clean (focus on the
>> >    table itself)
>> >
>> > Therefore, the following improvements are proposed here:
>> >
>> >    - The table object does not contain behavior; the table should be
>> >    public and engine-independent;
>> >    - Classify and abstract the operation behavior:
>> >       - TableInsertOperation(interface)
>> >       - TableUpsertOperation(interface)
>> >       - TableTransactionOperation
>> >       - TableManageOperation(compact/clean…)
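>> >
>> > A minimal sketch of what these interfaces could look like (names and
>> > generics are illustrative, not final):
>> >
>> >    // I is the engine-specific input, e.g. JavaRDD<HoodieRecord> for
>> >    // Spark or List<HoodieRecord> for the Java client; O is the write
>> >    // result type.
>> >    public interface TableInsertOperation<I, O> {
>> >      O insert(HoodieTable table, String instantTime, I records);
>> >    }
>> >
>> >    public interface TableUpsertOperation<I, O> {
>> >      O upsert(HoodieTable table, String instantTime, I records);
>> >    }
>> >
>> >    // table-level maintenance, independent of the record input type
>> >    public interface TableManageOperation {
>> >      void compact(HoodieTable table, String instantTime);
>> >      void clean(HoodieTable table, String instantTime);
>> >    }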
>> >
>> > This kind of abstraction is more intuitive and focused, so that there
>> > is only one point of materialization. For example, for the insert
>> > operation, the Spark engine would yield the following concrete
>> > implementation classes:
>> >
>> >    - CoWTableSparkInsertOperation;
>> >    - MoRTableSparkInsertOperation;
>> >
>> > Of course, we can provide a factory class named
>> > TableSparkInsertOperation, which is optional.
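>> >
>> > Such a factory could be as simple as (illustrative only):
>> >
>> >    public class TableSparkInsertOperation {
>> >      // pick the COW or MOR implementation based on the table's type
>> >      public static TableInsertOperation<JavaRDD<HoodieRecord>, JavaRDD<WriteStatus>>
>> >          of(HoodieTable table) {
>> >        return table.getMetaClient().getTableType() == HoodieTableType.MERGE_ON_READ
>> >            ? new MoRTableSparkInsertOperation()
>> >            : new CoWTableSparkInsertOperation();
>> >      }
>> >    }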
>> >
>> > Based on the new abstraction, a new engine only needs to reimplement the
>> > interfaces of the above behaviors, and then provide a new client to
>> > instantiate them.
>> >
>> > In order to focus here, I deliberately ignored an important object: the
>> > index. The index should also be in the hudi-common module, and its
>> > implementation may be engine-related, providing acceleration
>> > capabilities for writing and querying at the same time.
>> >
>> > The above is just a preliminary idea; there are still many details that
>> > have not been considered. I hope to hear your thoughts on this.
>> >
>> > Any opinions and thoughts are appreciated and welcome.
>> >
>> > Best,
>> > Vino
>> >
>>
>
