Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

Vinoth Chandar Mon, 05 Aug 2019 09:10:07 -0700

Great discussions! Responded on the original thread on decoupling.
Let's continue there?


On Mon, Aug 5, 2019 at 1:39 AM Semantic Beeng <n...@semanticbeeng.com>
wrote:

> "design is more important. When we have a clear idea, it is not too late
> to create an issue"
>
> 100% with Vino
>
>
> On August 5, 2019 at 2:50 AM taher koitawala <taher...@gmail.com> wrote:
>
> Sounds good. Let's do that first.
>
> On Mon, Aug 5, 2019, 11:59 AM vino yang < yanghua1...@gmail.com> wrote:
>
> Hi Taher,
>
> IMO, let's listen to more comments first; after all, this discussion took place
> over the weekend. Then we can take in Vinoth's and the community's comments and
> suggestions.
>
> I personally think that design is more important. When we have a clear
> idea, it is not too late to create an issue.
>
> I am sorting out classes that depend on Spark. Maybe we can discuss how to
> decouple.
>
> What do you think?
>
> Best,
> Vino
>
> On Mon, Aug 5, 2019 at 2:17 PM taher koitawala < taher...@gmail.com> wrote:
>
> If everyone agrees that we should decouple Hudi and Spark to enable
> processing engine abstraction, should I open a Jira ticket for that?
>
> On Sun, Aug 4, 2019 at 6:59 PM taher koitawala < taher...@gmail.com>
> wrote:
>
> If anyone wants to see a Flink Streaming pipeline here is a really small
> and basic Flink pipeline.
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example
>
> Consider users playing a game across multiple platforms, where we only get
> the timestamp, username and the current score as the record. The pipeline
> has a custom source function which produces this stream of records.
>
> The pipeline does aggregations (summing the score of the current window with
> the total score of the user) every 2 seconds, based on the event time attached
> to the record.
>
> User's score keeps increasing as new windows are fired and new outputs are
> emitted. That's where Hudi fits as per my vision now, where Hudi
> intelligently shows only the latest records written.
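
The pipeline described above can be sketched in plain Java, without any Flink or Hudi dependency. This is only an illustration under assumptions: the class and method names (ScoreUpsertSketch, aggregate, upsertLatest) are hypothetical and are not taken from the linked repository, and the "latest record wins" step is a simplified stand-in for a key-based upsert, not Hudi's actual write path.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration only; it uses no Flink or Hudi APIs. */
final class ScoreUpsertSketch {

    static final long WINDOW_MS = 2_000L; // 2-second tumbling event-time windows

    /** One input record: event time (ms, assumed non-negative), user name, score. */
    static final class ScoreEvent {
        final long timestampMs;
        final String username;
        final long score;

        ScoreEvent(long timestampMs, String username, long score) {
            this.timestampMs = timestampMs;
            this.username = username;
            this.score = score;
        }
    }

    /** One emitted record: a user's running total when a window fires. */
    static final class UserTotal {
        final String username;
        final long windowEndMs;
        final long totalScore;

        UserTotal(String username, long windowEndMs, long totalScore) {
            this.username = username;
            this.windowEndMs = windowEndMs;
            this.totalScore = totalScore;
        }
    }

    /** Sums scores per (window, user), then adds each window sum to the user's running total. */
    static List<UserTotal> aggregate(List<ScoreEvent> events) {
        // window start -> (user -> sum within that window), ordered by window start
        TreeMap<Long, Map<String, Long>> windows = new TreeMap<>();
        for (ScoreEvent e : events) {
            long windowStart = e.timestampMs - (e.timestampMs % WINDOW_MS);
            windows.computeIfAbsent(windowStart, k -> new LinkedHashMap<>())
                   .merge(e.username, e.score, Long::sum);
        }
        Map<String, Long> runningTotals = new LinkedHashMap<>();
        List<UserTotal> out = new ArrayList<>();
        for (Map.Entry<Long, Map<String, Long>> w : windows.entrySet()) {
            for (Map.Entry<String, Long> u : w.getValue().entrySet()) {
                long total = runningTotals.merge(u.getKey(), u.getValue(), Long::sum);
                out.add(new UserTotal(u.getKey(), w.getKey() + WINDOW_MS, total));
            }
        }
        return out;
    }

    /** Upsert by key: a later output for the same user replaces the earlier one. */
    static Map<String, UserTotal> upsertLatest(List<UserTotal> outputs) {
        Map<String, UserTotal> table = new LinkedHashMap<>();
        for (UserTotal t : outputs) {
            table.put(t.username, t);
        }
        return table;
    }
}
```

Each fired window emits a new total per user, so the raw output grows over time, while the keyed table keeps one row per user. That is the role the message above assigns to Hudi.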
>
>
>
> On Sun, Aug 4, 2019, 6:43 PM taher koitawala < taher...@gmail.com> wrote:
>
> Fully agreed with Vino. I think let's chalk out the classes. Make
> hierarchies and start decoupling everything. Then we can move forward with
> the Flink and Beam streaming components.
>
> On Sun, Aug 4, 2019, 1:52 PM vino yang < yanghua1...@gmail.com> wrote:
>
> Hi Nick,
>
> Thank you for your more detailed thoughts, and I fully agree with your
> thoughts about HudiLink, which should also be part of the long-term
> planning of the Hudi Ecology.
>
>
> *But I found that the angle of our thinking and the starting point are not
> consistent. I pay more attention to the rationality of the existing
> architecture and whether the dependence on the computing engine is
> pluggable. Don't get me wrong, I know very well that although we have
> different perspectives, these views have value for Hudi.*
> Let me give more details on the discussion I made earlier.
>
> Currently, multiple submodules of the Hudi project are tightly coupled to
> Spark's design and dependencies. You can see that many of the class files
> contain statements such as "import org.apache.spark.xxx".
>
> I first put forward a discussion: "Integrate Hudi with Apache Flink", and
> then came up with a discussion: "Decouple Hudi and Spark".
>
> I think the word "Integrate" I used for the first discussion may not be
> accurate enough. My intention is to make the computing engine used by Hudi
> pluggable. To Hudi, Spark is just a library; it is not the
> core of Hudi and should not be strongly coupled with it. The features
> currently provided by Spark are also available from Flink. But in order to
> achieve this, we need to decouple Hudi from its use of Spark at the code
> level.
>
> This makes sense both in terms of structural rationality and community
> ecology.
>
> Best,
> Vino
>
>
> On Sun, Aug 4, 2019 at 2:21 PM Semantic Beeng < n...@semanticbeeng.com> wrote:
>
> "+1 for both Beam and Flink" - what I propose implies this indeed.
>
> But/and am working from the desired functionality and a proposed design.
>
> (as opposed to starting with refactoring Hudi with the goal of close
> integration with Flink)
>
> I feel this is not necessary - but am not an expert in Hudi implementation.
>
> But am pretty sure it is not sufficient for the use cases I have in mind.
> The gist is using Hudi as a file based data lake + ML feature store that
> enables incremental analyses done with a combination of Flink, Beam, Spark,
> TensorFlow (see Petastorm from UberEng for an idea.)
>
> Let us call this HudiLink from now on (think of it as a mediator, not
> another Hudi).
>
> The intuition behind looking at more than Flink is that both Beam and
> Flink have good design abstractions we might reuse and extend.
>
> Like I said before, do not believe in point to point integrations.
>
> Alternatively / in parallel, if you care to share your use cases it would
> be very useful. Working with explicit use cases helps others to relate and
> help.
>
> Also, if some of you believe in (see) the value of refactoring the Hudi
> implementation for a hard integration with Flink (but have no time to argue
> for it), of course please go ahead.
>
> That may be a valid bottom up approach but I cannot relate to it myself
> (due to lack of use cases).
>
> Working on some material on HudiLink - if anyone is interested I might
> publish it when it is more mature.
>
> Hint: this was part of the inspiration https://eng.uber.com/michelangelo/
>
> One well thought use case will get you "in". :-) Kidding, ofc.
>
> Cheers
>
> Nick
>
>
> On August 3, 2019 at 10:55 PM vino yang < yanghua1...@gmail.com> wrote:
>
>
> +1 for both Beam and Flink
>
> First step here is to probably draw out the current hierarchy and figure out
> what the abstraction points are.
> In my opinion, the runtime (Spark, Flink) should be handled at the
> hoodie-client level and just used by hoodie-utilities seamlessly.
>
>
> +1 for Vinoth's opinion, it should be the first step.
>
> No matter which computing framework we hope Hudi integrates with,
> we need to decouple the Hudi client and Spark.
>
> We may need a pure client module named for example
> hoodie-client-core(common)
>
> Then we could have: hoodie-client-spark, hoodie-client-flink and
> hoodie-client-beam
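
A minimal sketch of how such a split might look, assuming a hypothetical EngineContext interface in the core module and one implementation per engine module. None of these type names exist in Hudi; LocalEngineContext is an in-memory stand-in for what hoodie-client-spark or hoodie-client-flink would provide, and reduceByKey is chosen only because it is one of the RDD operations mentioned in this thread.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Hypothetical contract for hoodie-client-core: no engine imports allowed here. */
interface EngineContext {
    <T, K, V> Map<K, V> reduceByKey(List<T> input,
                                    Function<T, K> keyFn,
                                    Function<T, V> valueFn,
                                    BinaryOperator<V> reducer);
}

/** In-memory stand-in for an engine module; a real one would delegate to its engine. */
final class LocalEngineContext implements EngineContext {
    @Override
    public <T, K, V> Map<K, V> reduceByKey(List<T> input,
                                           Function<T, K> keyFn,
                                           Function<T, V> valueFn,
                                           BinaryOperator<V> reducer) {
        Map<K, V> out = new LinkedHashMap<>();
        for (T item : input) {
            // Combine values that share a key, as the engines' reduceByKey-style operators do.
            out.merge(keyFn.apply(item), valueFn.apply(item), reducer);
        }
        return out;
    }
}

/** Client logic written only against the core contract, so the engine is pluggable. */
final class RecordCounter {
    private final EngineContext engine;

    RecordCounter(EngineContext engine) {
        this.engine = engine;
    }

    /** Counts how many records target each partition path. */
    Map<String, Long> countPerPartition(List<String> partitionPaths) {
        return engine.reduceByKey(partitionPaths, p -> p, p -> 1L, Long::sum);
    }
}
```

With this shape, RecordCounter never imports an engine class; swapping engines means passing a different EngineContext implementation.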
>
> On Sun, Aug 4, 2019 at 10:45 AM Suneel Marthi < smar...@apache.org> wrote:
>
> +1 for Beam -- agree with Semantic Beeng's analysis.
>
> On Sat, Aug 3, 2019 at 10:30 PM taher koitawala < taher...@gmail.com>
> wrote:
>
> So the way to go about this is to file a HIP, chalk out all the classes,
> and start moving towards a pure client.
>
> Secondly, do we want to try Beam?
>
> I think there is too much going on here and I'm not able to follow. If we
> want to try out Beam all along, I don't think it makes sense to do anything
> on Flink then.
>
> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng < n...@semanticbeeng.com>
> wrote:
>
> >> +1 My money is on this approach.
> >>
> >> The existing abstractions from Beam seem enough for the use cases as I
> >> imagine them.
> >>
> >> Flink also has "dynamic table", "table source" and "table sink" which
> >> seem very useful abstractions where Hudi might fit nicely.
> >>
> >>
> >>
> >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> >>
> >>
> >> Attached a screen shot.
> >>
> >> This seems to fit with the original premise of Hudi as well.
> >>
> >> Am exploring this avenue with a use case that involves "temporal joins on
> >> streams" which I need for feature extraction.
> >>
> >> If anyone is interested in this or has concrete enough needs and use cases,
> >> please let me know.
> >>
> >> Best to go from an agreed upon set of 2-3 use cases.
> >>
> >> Cheers
> >>
> >> Nick
> >>
> >>
> >> > Also, we do have some Beam experts on the mailing list. Can you please
> >> > weigh in on the viability of using Beam as the intermediate abstraction
> >> > here between Spark/Flink?
> >> > Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> >> > reduceByKey, countByKey and also does custom partitioning a lot.
> >>
> >> >
> >>
> >
>
>
