>>More for my own edification, how does the recently introduced timeline service play into the delta writer components?
TimelineService runs in the Spark driver (DeltaStreamer is a Hudi Spark app) and answers metadata/timeline API calls from the executors. It is not aware of Spark vs. Flink or any other runtime concerns.

On Sat, Aug 3, 2019 at 12:50 PM Vinoth Chandar <mail.vinoth.chan...@gmail.com> wrote:

> Decoupling Spark and Hudi is the first step to bring in a Flink runtime,
> and it's also the hardest part.
>
> On the decoupling itself, the IOHandle classes are (almost) unaware of
> Spark itself, whereas the Write/ReadClient and the Table classes are very
> much aware. The first step here is probably to draw out the current
> hierarchy and figure out what the abstraction points are.
> In my opinion, the runtime (Spark, Flink) should be handled at the
> hoodie-client level and just used by hoodie-utilities seamlessly.
>
> My 2c for folks working on this: maybe pick up a few bugs/issues across
> these areas to get more familiarity with the code, and then draw up the
> proposals (not a requirement, but it will build more understanding of all
> the devils-in-the-details).
>
> >> Not sure if this requires a HIP to drive.
> I think this definitely needs a HIP. It's a large enough change :)
>
> Also, we do have some Beam experts on the mailing list. Can you please
> weigh in on the viability of using Beam as the intermediate abstraction
> between Spark/Flink?
> Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> reduceByKey, and countByKey, and also does a lot of custom partitioning.
>
> On Fri, Aug 2, 2019 at 9:46 AM Aaron Langford <aaron.langfor...@gmail.com>
> wrote:
>
>> More for my own edification, how does the recently introduced timeline
>> service play into the delta writer components?
>>
>> On Fri, Aug 2, 2019 at 7:53 AM vino yang <yanghua1...@gmail.com> wrote:
>>
>> > Hi Suneel,
>> >
>> > Thank you for your suggestion, let me clarify.
>> >
>> > *The context of this email is that we are evaluating how to implement a
>> > streaming delta writer based on Flink.*
>> >
>> > About the discussion between me, Taher, and Vinay: those are just some
>> > trivial details in the preparation of the document, and that discussion
>> > is also happening over mail.
>> >
>> > While we don't yet have a first draft, discussing the details on the
>> > mailing list may confuse others and easily drift off topic. Our initial
>> > plan was to facilitate community discussion and review once we had a
>> > draft of the documentation available to the community.
>> >
>> > Best,
>> > Vino
>> >
>> > Suneel Marthi <smar...@apache.org> wrote on Fri, Aug 2, 2019 at 10:37 PM:
>> >
>> > > Please keep all discussions on the mailing lists here - no offline
>> > > discussions, please.
>> > >
>> > > On Fri, Aug 2, 2019 at 10:22 AM vino yang <yanghua1...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi guys,
>> > > >
>> > > > Currently, Taher, Vinay, and I are working on issue HUDI-184. [1]
>> > > >
>> > > > As a first step, we are discussing the design doc.
>> > > >
>> > > > After diving into the code, we listed some relevant classes for the
>> > > > Spark delta writer.
>> > > >
>> > > > - module: hoodie-utilities
>> > > >
>> > > > com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
>> > > > com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
>> > > > com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
>> > > > com.uber.hoodie.utilities.schema.SchemaProvider
>> > > > com.uber.hoodie.utilities.transform.Transformer
>> > > >
>> > > > - module: hoodie-client
>> > > >
>> > > > com.uber.hoodie.HoodieWriteClient (to commit compaction)
>> > > >
>> > > > The fact is that *hoodie-utilities* depends on *hoodie-client*;
>> > > > however, *hoodie-client* is not a pure Hudi component either, since
>> > > > it also depends on Spark libs.
>> > > >
>> > > > So I propose that Hudi should provide a pure hoodie-client,
>> > > > decoupled from Spark. The Flink and Spark modules would then depend
>> > > > on it.
>> > > >
>> > > > Moreover, based on the earlier discussion [2], we all agree that
>> > > > Spark is not the only choice for Hudi; it could also be Flink/Beam.
>> > > >
>> > > > IMO, we should decouple Hudi from Spark at the project level,
>> > > > including but not limited to module splitting and renaming.
>> > > >
>> > > > Not sure if this requires a HIP to drive.
>> > > >
>> > > > We should first listen to the opinions of the community. Any ideas
>> > > > and suggestions are welcome and appreciated.
>> > > >
>> > > > Best,
>> > > > Vino
>> > > >
>> > > > [1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
>> > > > [2]:
>> > > > https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E
>> > > >
>> > >
>> >
>>
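[Editor's note: to make the decoupling discussed above concrete, here is a minimal, hypothetical Java sketch — the interface and class names are illustrative, not actual Hudi APIs. The idea is that the RDD-style operations the thread mentions (reduceByKey, countByKey, etc.) could hide behind a runtime-neutral interface in a pure hoodie-client, with Spark and Flink modules each supplying a binding. A trivial local-JVM binding built on plain Java collections is shown; a Spark binding would wrap a JavaPairRDD behind the same interface.]

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Runtime-neutral view of keyed records, mirroring the subset of RDD
// operations the thread lists. hoodie-client would program against this;
// hoodie-spark / hoodie-flink would implement it. (Hypothetical names.)
interface HoodiePairData<K, V> {
    HoodiePairData<K, V> reduceByKey(BinaryOperator<V> reducer);
    Map<K, Long> countByKey();
    Map<K, List<V>> collectAsMap();
}

// Trivial single-JVM binding, useful for tests or a local runner.
final class LocalPairData<K, V> implements HoodiePairData<K, V> {
    private final List<Map.Entry<K, V>> entries;

    LocalPairData(List<Map.Entry<K, V>> entries) {
        this.entries = entries;
    }

    @Override
    public HoodiePairData<K, V> reduceByKey(BinaryOperator<V> reducer) {
        // Merge values per key, preserving first-seen key order.
        Map<K, V> reduced = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : entries) {
            reduced.merge(e.getKey(), e.getValue(), reducer);
        }
        return new LocalPairData<>(new ArrayList<>(reduced.entrySet()));
    }

    @Override
    public Map<K, Long> countByKey() {
        return entries.stream().collect(
            Collectors.groupingBy(Map.Entry::getKey, Collectors.counting()));
    }

    @Override
    public Map<K, List<V>> collectAsMap() {
        return entries.stream().collect(Collectors.groupingBy(
            Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }
}

public class DecouplingSketch {
    public static void main(String[] args) {
        // Toy records: (partitionPath, recordCount) pairs.
        HoodiePairData<String, Integer> data = new LocalPairData<>(List.of(
            Map.entry("p1", 1), Map.entry("p1", 2), Map.entry("p2", 5)));

        // reduceByKey: sum values per key.
        System.out.println(data.reduceByKey(Integer::sum)
                               .collectAsMap().get("p1")); // [3]

        // countByKey: number of records per key.
        System.out.println(data.countByKey().get("p1"));   // 2
    }
}
```

With such an interface in place, the module split the proposal describes becomes mechanical: hoodie-client keeps only the interface, and each engine module ships its own implementation.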