Hi all,

The hudi-client module contains the core Hudi abstractions and the client logic for different engines such as Spark, Flink, and Java. While the previous effort (HUDI-538 [1]) decoupled the integration with Spark, there is still considerable code duplication across engines for nearly identical logic, due to the current interface design. Some parts have also diverged among engines, making debugging and support difficult.

I propose to further refactor the hudi-client module to improve code reuse across engines and reduce the divergence of logic among them, so that the core Hudi action execution logic is shared across engines, except for engine-specific transformations. Such a pattern also makes it easy to support core Hudi functionality for all engines in the future. Specifically:

(1) Abstract the transformation boilerplate inside HoodieEngineContext and implement the engine-specific data transformation logic in its subclasses. Type casts can be done inside the engine context.
(2) Create a new HoodieData abstraction for passing input and output along the flow of execution, and use it in the different Hudi abstractions, e.g., HoodieTable, HoodieIOHandle, BaseActionExecutor, instead of enforcing type parameters that involve RDD<HoodieRecord> and List<HoodieRecord>, which are one source of duplication.
(3) Extract the common execution logic from the multiple engines into the hudi-client-common module.

As a first step and an exploration of items (1) and (3) above, I've tried refactoring the rollback actions; the PR is here: [HUDI-2433] [2]. In this PR, I completely remove all engine-specific rollback packages and keep only one rollback package in hudi-client-common, adding ~350 LoC while deleting 1.3K LoC. My next step is to refactor the commit action, which encompasses item (2) above.

What do you folks think? Any other suggestions are welcome.

[1] [HUDI-538] [UMBRELLA] Restructuring hudi client module for multi engine support
https://issues.apache.org/jira/browse/HUDI-538
[2] PR: [HUDI-2433] Refactor rollback actions in hudi-client module
https://github.com/apache/hudi/pull/3664/files

Best,
- Ethan
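
P.S. To make items (1) and (2) a bit more concrete, here is a rough Java sketch of the idea. It is only an illustration under stated assumptions: the names HoodieDataSketch, ListBackedData, and SharedRollbackSketch are hypothetical and are not taken from the PR or from the existing code base, and the method signatures are guesses at what such an abstraction might look like. The point is that the shared logic is written once against the abstract data type, while each engine supplies its own subclass (a list-backed one here; an RDD-backed one would live in the Spark module).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical engine-agnostic data abstraction (illustrative only,
// not the actual Hudi API). Shared code depends only on this type.
abstract class HoodieDataSketch<T> {
    // Apply a transformation to every element; how it runs is engine-specific.
    abstract <O> HoodieDataSketch<O> map(Function<T, O> fn);

    // Materialize the elements on the caller's side.
    abstract List<T> collectAsList();
}

// Hypothetical implementation for a list-based engine (e.g. Java or Flink clients).
// A Spark implementation would wrap an RDD instead and delegate map() to it.
final class ListBackedData<T> extends HoodieDataSketch<T> {
    private final List<T> items;

    ListBackedData(List<T> items) {
        this.items = new ArrayList<>(items);
    }

    @Override
    <O> HoodieDataSketch<O> map(Function<T, O> fn) {
        List<O> out = new ArrayList<>();
        for (T item : items) {
            out.add(fn.apply(item));
        }
        return new ListBackedData<>(out);
    }

    @Override
    List<T> collectAsList() {
        return new ArrayList<>(items);
    }
}

// Hypothetical shared logic that would live in hudi-client-common:
// it never mentions RDD or List, only the abstraction.
final class SharedRollbackSketch {
    static HoodieDataSketch<String> describeRollback(HoodieDataSketch<String> filePaths) {
        return filePaths.map(path -> "rolled back " + path);
    }
}

class RollbackSketchDemo {
    public static void main(String[] args) {
        HoodieDataSketch<String> input =
            new ListBackedData<>(Arrays.asList("a.parquet", "b.log"));
        for (String line : SharedRollbackSketch.describeRollback(input).collectAsList()) {
            System.out.println(line);
        }
    }
}
```

With this shape, adding a new engine would mean adding one more subclass of the data abstraction rather than another copy of the action logic.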