Hi guys,

Taher, Vinay, and I are currently working on issue HUDI-184 [1].
As a first step, we are discussing the design doc. After diving into the code, we listed the classes relevant to the Spark delta writer:

- module: hoodie-utilities
    com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
    com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
    com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
    com.uber.hoodie.utilities.schema.SchemaProvider
    com.uber.hoodie.utilities.transform.Transformer

- module: hoodie-client
    com.uber.hoodie.HoodieWriteClient (to commit compaction)

*hoodie-utilities* depends on *hoodie-client*. However, *hoodie-client* is not a pure Hudi component either: it also depends on the Spark libraries. So I propose that hoodie provide a pure hoodie-client that is decoupled from Spark, and that the Flink and Spark modules then depend on it.

Moreover, based on the earlier discussion [2], we all agree that Spark is not the only choice for Hudi; it could also be Flink or Beam.

IMO, we should decouple Hudi from Spark at the project level, including but not limited to module splitting and renaming. I am not sure whether this requires a HIP to drive it. We should first listen to the opinions of the community.

Any ideas and suggestions are welcome and appreciated.

Best,
Vino

[1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
[2]: https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E