Hi guys,

Taher, Vinay and I are currently working on issue HUDI-184 [1].

As a first step, we are discussing the design doc.

After diving into the code, we listed the classes relevant to the Spark
delta writer:

   - module: hoodie-utilities

com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
com.uber.hoodie.utilities.schema.SchemaProvider
com.uber.hoodie.utilities.transform.Transformer

   - module: hoodie-client

com.uber.hoodie.HoodieWriteClient (to commit compaction)


*hoodie-utilities* depends on *hoodie-client*. However, *hoodie-client* is
not a pure Hudi component either: it also depends on the Spark libraries.

So I propose that Hudi provide a pure hoodie-client that is decoupled from
Spark. The Flink and Spark modules would then both depend on it.
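
To make the idea concrete, here is a rough sketch of what "pure" could mean
in code. It is only an illustration under my own assumptions: EngineContext,
LocalEngineContext and EngineAgnosticWriteClient are hypothetical names, not
existing Hudi classes, and the real split would need a proper design. The
point is that the client only sees an engine-neutral interface, and each
engine module supplies its own implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical engine-neutral abstraction; this is NOT an existing Hudi API.
interface EngineContext {
    // Apply fn to every element; an engine module decides how (locally,
    // on Spark, on Flink, ...).
    <I, O> List<O> map(List<I> input, Function<I, O> fn);
}

// Stand-in for an engine binding that would live in a Spark or Flink module.
// It runs in-process here so the sketch needs no external dependency.
final class LocalEngineContext implements EngineContext {
    @Override
    public <I, O> List<O> map(List<I> input, Function<I, O> fn) {
        List<O> out = new ArrayList<>(input.size());
        for (I item : input) {
            out.add(fn.apply(item));
        }
        return out;
    }
}

// Hypothetical "pure" client: it depends only on EngineContext and has no
// Spark import anywhere.
final class EngineAgnosticWriteClient {
    private final EngineContext engine;

    EngineAgnosticWriteClient(EngineContext engine) {
        this.engine = engine;
    }

    // Toy operation: prefix each record key with the commit time.
    List<String> tagRecords(List<String> keys, String commitTime) {
        return engine.map(keys, key -> commitTime + ":" + key);
    }
}
```

With this shape, a Spark module and a Flink module would each ship their own
EngineContext implementation, and the client module would compile without
either engine on its classpath.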

Moreover, based on the earlier discussion [2], we all agree that Spark is
not the only choice for Hudi; it could also be Flink or Beam.

IMO, we should decouple Hudi from Spark at the project level, including but
not limited to module splitting and renaming.

I am not sure whether this requires a HIP to drive it.

We should first listen to the opinions of the community. Any ideas and
suggestions are welcome and appreciated.

Best,
Vino

[1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
[2]: https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E
