>>More for my own edification, how does the recently introduced timeline service play into the delta writer components?
TimelineService runs in the Spark driver (DeltaStreamer is a Hudi Spark app) and answers metadata/timeline API calls from the executors. It is not aware of Spark vs. Flink or any other runtime concerns.

On Sat, Aug 3, 2019 at 12:50 PM Vinoth Chandar <mail.vinoth.chan...@gmail.com> wrote:

> Decoupling Spark and Hudi is the first step to bring in a Flink runtime,
> and it's also the hardest part.
>
> On the decoupling itself, the IOHandle classes are (almost) unaware of
> Spark itself, whereas the Write/ReadClient and the Table classes are very
> much aware. The first step here is probably to draw out the current
> hierarchy and figure out what the abstraction points are.
> In my opinion, the runtime (Spark, Flink) should be handled at the
> hoodie-client level and just used by hoodie-utilities seamlessly.
>
> My 2c for folks working on this: maybe pick up a few bugs/issues across
> these areas to get more familiarity with the code, and then draw up the
> proposals (not a requirement, but it will build more understanding of all
> the devils-in-the-details).
>
> >> Not sure if this requires a HIP to drive.
> I think this definitely needs a HIP. It's a large enough change :)
>
> Also, we do have some Beam experts on the mailing list. Can you please
> weigh in on the viability of using Beam as the intermediate abstraction
> between Spark/Flink?
> Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> reduceByKey, and countByKey, and also does a lot of custom partitioning.
>
> On Fri, Aug 2, 2019 at 9:46 AM Aaron Langford <aaron.langfor...@gmail.com>
> wrote:
>
>> More for my own edification, how does the recently introduced timeline
>> service play into the delta writer components?
>>
>> On Fri, Aug 2, 2019 at 7:53 AM vino yang <yanghua1...@gmail.com> wrote:
>>
>> > Hi Suneel,
>> >
>> > Thank you for your suggestion, let me clarify.
>> >
>> > *The context of this email is that we are evaluating how to implement a
>> > streaming delta writer based on Flink.*
>> >
>> > About the discussion between me, Taher, and Vinay: those are just some
>> > trivial details in the preparation of the document, and that discussion
>> > is also happening over mail.
>> >
>> > While we don't yet have a first draft, discussing the details on the
>> > mailing list may confuse others and easily drift off topic. Our initial
>> > plan was to facilitate community discussion and review once we had a
>> > draft of the documentation available to the community.
>> >
>> > Best,
>> > Vino
>> >
>> > Suneel Marthi <smar...@apache.org> wrote on Fri, Aug 2, 2019 at 10:37 PM:
>> >
>> > > Please keep all discussions on the mailing lists here - no offline
>> > > discussions, please.
>> > >
>> > > On Fri, Aug 2, 2019 at 10:22 AM vino yang <yanghua1...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi guys,
>> > > >
>> > > > Currently, Taher, Vinay, and I are working on issue HUDI-184. [1]
>> > > >
>> > > > As a first step, we are discussing the design doc.
>> > > >
>> > > > After diving into the code, we listed some relevant classes for the
>> > > > Spark delta writer.
>> > > >
>> > > > - module: hoodie-utilities
>> > > >
>> > > > com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
>> > > > com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
>> > > > com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
>> > > > com.uber.hoodie.utilities.schema.SchemaProvider
>> > > > com.uber.hoodie.utilities.transform.Transformer
>> > > >
>> > > > - module: hoodie-client
>> > > >
>> > > > com.uber.hoodie.HoodieWriteClient (to commit compaction)
>> > > >
>> > > > The fact is that *hoodie-utilities* depends on *hoodie-client*;
>> > > > however, *hoodie-client* is not a pure Hudi component either, since
>> > > > it also depends on Spark libs.
>> > > >
>> > > > So I propose that Hudi should provide a pure hoodie-client,
>> > > > decoupled from Spark. The Flink and Spark modules would then depend
>> > > > on it.
>> > > >
>> > > > Moreover, based on the earlier discussion [2], we all agree that
>> > > > Spark is not the only choice for Hudi; it could also be Flink/Beam.
>> > > >
>> > > > IMO, we should decouple Hudi from Spark at the project level,
>> > > > including but not limited to module splitting and renaming.
>> > > >
>> > > > Not sure if this requires a HIP to drive.
>> > > >
>> > > > We should first listen to the opinions of the community. Any ideas
>> > > > and suggestions are welcome and appreciated.
>> > > >
>> > > > Best,
>> > > > Vino
>> > > >
>> > > > [1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
>> > > > [2]:
>> > > > https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E
>> > > >
>> > >
>> >
>>
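[Editor's note: to make the decoupling discussed above concrete, here is a minimal, hypothetical Java sketch — the interface and class names are illustrative, not actual Hudi APIs. The idea is that the RDD-style operations the thread mentions (reduceByKey, countByKey, etc.) could hide behind a runtime-neutral interface in a pure hoodie-client, with Spark and Flink modules each supplying a binding. A trivial local-JVM binding built on plain Java collections is shown; a Spark binding would wrap a JavaPairRDD behind the same interface.]

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Runtime-neutral view of keyed records, mirroring the subset of RDD
// operations the thread lists. hoodie-client would program against this;
// hoodie-spark / hoodie-flink would implement it. (Hypothetical names.)
interface HoodiePairData<K, V> {
    HoodiePairData<K, V> reduceByKey(BinaryOperator<V> reducer);
    Map<K, Long> countByKey();
    Map<K, List<V>> collectAsMap();
}

// Trivial single-JVM binding, useful for tests or a local runner.
final class LocalPairData<K, V> implements HoodiePairData<K, V> {
    private final List<Map.Entry<K, V>> entries;

    LocalPairData(List<Map.Entry<K, V>> entries) {
        this.entries = entries;
    }

    @Override
    public HoodiePairData<K, V> reduceByKey(BinaryOperator<V> reducer) {
        // Merge values per key, preserving first-seen key order.
        Map<K, V> reduced = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : entries) {
            reduced.merge(e.getKey(), e.getValue(), reducer);
        }
        return new LocalPairData<>(new ArrayList<>(reduced.entrySet()));
    }

    @Override
    public Map<K, Long> countByKey() {
        return entries.stream().collect(
            Collectors.groupingBy(Map.Entry::getKey, Collectors.counting()));
    }

    @Override
    public Map<K, List<V>> collectAsMap() {
        return entries.stream().collect(Collectors.groupingBy(
            Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }
}

public class DecouplingSketch {
    public static void main(String[] args) {
        // Toy records: (partitionPath, recordCount) pairs.
        HoodiePairData<String, Integer> data = new LocalPairData<>(List.of(
            Map.entry("p1", 1), Map.entry("p1", 2), Map.entry("p2", 5)));

        // reduceByKey: sum values per key.
        System.out.println(data.reduceByKey(Integer::sum)
                               .collectAsMap().get("p1")); // [3]

        // countByKey: number of records per key.
        System.out.println(data.countByKey().get("p1"));   // 2
    }
}
```

With such an interface in place, the module split the proposal describes becomes mechanical: hoodie-client keeps only the interface, and each engine module ships its own implementation.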