> It contains three components:
> - Two objects: a table, a batch of records;
For the Spark client, that is true: no matter whether it is Spark batch or Spark Streaming, the engine writes in batches. Things are different for a pure streaming engine like Flink, which writes record by record and does not accumulate buffers.

The current Spark client also encapsulates logic such as the *index* and the bucketing/partitioning strategies, both of which are resource-intensive operations. That may be easy work for Spark, because with a SparkEngineContext at hand the pipeline can fire a new sub-pipeline whenever it wishes. Things are different for Flink here too: the Flink engine cannot fire sub-pipelines; all the pipelines are compiled into a single job, and once it fires, it runs there forever.

The client also encapsulates the file roll-over action, which makes it hard for Flink to trigger roll-over on checkpoints for exactly-once semantics.

So I suggest:

1. The new operation should support writing per record, not in batches.
2. The new operation should expose plugins such as indexing/partitioning, so that the engine can control them freely.
3. The new operation should expose the write handle (or file handle), so that the engine can roll over when necessary.

vino yang <[email protected]> wrote on Tue, Jan 19, 2021 at 5:39 PM:

> Hi guys,
>
> *I open this thread to discuss whether we can separate the attributes and
> behaviors of HoodieTable, and rethink the abstraction of the client.*
>
> Currently, in the hudi-client-common module, there is a HoodieTable class
> which contains a set of attributes and behaviors. For different engines, it
> has different implementations. The existing classes include:
>
> - HoodieSparkTable;
> - HoodieFlinkTable;
> - HoodieJavaTable;
>
> In addition, for the two different table types, COW and MOR, these classes
> are further split. For example, HoodieSparkTable is split into:
>
> - HoodieSparkCopyOnWriteTable;
> - HoodieSparkMergeOnReadTable;
>
> HoodieSparkTable degenerates into a factory that initializes these classes.
>
> This model looks clear but brings some problems.
> First of all, HoodieTable is a mixture of attributes and behaviors. The
> attributes are independent of the engines, but the behaviors vary depending
> on the engine. Semantically speaking, HoodieTable should belong to
> hudi-common, and should not be associated only with hudi-client-common.
>
> Second, the behaviors contained in HoodieTable, such as:
>
> - upsert
> - insert
> - delete
> - insertOverwrite
>
> are similar to the APIs provided by the client, but they are not implemented
> directly in HoodieTable. Instead, the implementation is handed over to a
> bunch of actions (executors), such as:
>
> - commit
> - compact
> - clean
> - rollback
>
> In addition, these actions do not completely contain the implementation
> logic. Part of their logic is separated into some helper classes under the
> same package, such as:
>
> - SparkWriteHelper
> - SparkMergeHelper
> - SparkDeleteHelper
>
> To sum up, the implementation is pushed back layer by layer (mainly into
> the executor and helper classes), so that each client needs a lot of
> similarly patterned classes to implement the basic API, and the resulting
> class explosion is serious.
>
> Let us reorganize it:
>
> What a write client does is insert or upsert a batch of records into a
> table with transaction semantics, and provide some additional operations on
> the table.
> It contains three components:
>
> - Two objects: a table, a batch of records;
> - One type of operation: insert or upsert (focused on the records);
> - One type of additional operation: compact / clean (focused on the table
>   itself).
>
> Therefore, the following improvements are proposed here:
>
> - The table object does not contain behavior; the table should be public
>   and engine independent;
> - Classify and abstract the operation behaviors:
>   - TableInsertOperation (interface)
>   - TableUpsertOperation (interface)
>   - TableTransactionOperation
>   - TableManageOperation (compact/clean…)
>
> This kind of abstraction is more intuitive and focused, so that there is
> only one point of materialization. For example, for the insert operation the
> Spark engine would hatch the following concrete implementation classes:
>
> - CoWTableSparkInsertOperation;
> - MoRTableSparkInsertOperation;
>
> Optionally, we could provide a factory class named TableSparkInsertOperation.
>
> Based on the new abstraction, a new engine only needs to reimplement the
> interfaces of the above behaviors, and then provide a new client to
> instantiate them.
>
> In order to stay focused here, I deliberately ignored an important object:
> the index. The index should also live in the hudi-common module, and its
> implementation may be engine-related, providing acceleration for both
> writing and querying.
>
> The above is just a preliminary idea; there are still many details that
> have not been considered. I hope to hear your thoughts on this.
>
> Any opinions and thoughts are appreciated and welcome.
>
> Best,
> Vino
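For readers following along, the separation proposed in the quoted mail could be sketched roughly as below. This is only an illustrative sketch: the interface and class names (TableInsertOperation, CoWTableSparkInsertOperation) follow the proposal, but the method signatures and the TableMeta attribute holder are assumptions for the example, not actual Hudi APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Attributes only, engine independent -- per the proposal this could
// live in hudi-common. (Hypothetical shape, not Hudi's real class.)
class TableMeta {
    final String basePath;
    TableMeta(String basePath) { this.basePath = basePath; }
}

// One small interface per write behavior, materialized once per
// engine and table type -- the single "point of materialization".
interface TableInsertOperation<R> {
    List<String> insert(TableMeta table, List<R> records);
}

// Example materialization: Spark + copy-on-write. A real implementation
// would launch a Spark job; here we just fabricate write statuses so the
// shape of the abstraction is visible.
class CoWTableSparkInsertOperation implements TableInsertOperation<String> {
    public List<String> insert(TableMeta table, List<String> records) {
        List<String> statuses = new ArrayList<>();
        for (String r : records) {
            statuses.add("written:" + table.basePath + "/" + r);
        }
        return statuses;
    }
}
```

A new engine would then only implement these small behavior interfaces and wire them into its own client, instead of subclassing a monolithic HoodieTable.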

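The per-record write handle suggested in the reply at the top of the thread (suggestions 1 and 3) might look like the sketch below. All names and signatures here are hypothetical: a real handle would append to an open file, and a Flink sink would call rollOver() when a checkpoint barrier arrives, so that file boundaries line up with checkpoints for exactly-once semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-record write handle: the engine writes one record at
// a time (no internal batching) and decides itself when to roll over.
interface RecordWriteHandle<R> {
    void write(R record);   // per-record write, called for each element
    String rollOver();      // close the current file, return its name
}

// Toy in-memory implementation, just to show the control flow the
// engine would drive; a real handle would manage actual data files.
class InMemoryWriteHandle implements RecordWriteHandle<String> {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> closedFiles = new ArrayList<>();
    private int fileSeq = 0;

    public void write(String record) {
        buffer.add(record);  // stands in for appending to an open file
    }

    public String rollOver() {
        // Seal the current "file" and start a new one; in Flink this
        // would be triggered by the checkpoint, not by the handle itself.
        String name = "file-" + fileSeq++ + ":" + buffer.size() + "records";
        closedFiles.add(name);
        buffer.clear();
        return name;
    }

    public List<String> closedFiles() { return closedFiles; }
}
```

The key design point is that roll-over is an externally triggered action on the handle, not something buried inside the client, so a streaming engine can align it with its own consistency mechanism.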