Fully agree with Vino. Let's chalk out the classes, draw up the hierarchies, and start decoupling everything. Then we can move forward with the Flink and Beam streaming components.
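[Editor's note: to make "chalk out the classes" concrete, here is a minimal sketch of what a decoupled hierarchy could look like. Every name below (HoodieData, SerializableFunction, HoodieEngineContext, the module split) is a hypothetical illustration, not existing Hudi code: the idea is that a pure hoodie-client-core module would depend only on interfaces like these, with engine-specific modules implementing them.]

    import java.io.Serializable;
    import java.util.List;

    // Hypothetical sketch: hoodie-client-core would hold only
    // engine-agnostic interfaces like these, with no
    // "import org.apache.spark.xxx" anywhere in the module.

    /** Engine-agnostic view of a distributed collection of records. */
    interface HoodieData<T> extends Serializable {
        <R> HoodieData<R> map(SerializableFunction<T, R> fn);
        List<T> collectAsList();
    }

    /** Serializable function, since engines ship closures to workers. */
    interface SerializableFunction<T, R> extends Serializable {
        R apply(T input);
    }

    /** Pluggable handle to the underlying engine (Spark, Flink, Beam). */
    interface HoodieEngineContext extends Serializable {
        <T> HoodieData<T> parallelize(List<T> items, int parallelism);
    }

The point of such a split is that hoodie-client-core would compile with no Spark on the classpath at all.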
On Sun, Aug 4, 2019, 1:52 PM vino yang <yanghua1...@gmail.com> wrote:

> Hi Nick,
>
> Thank you for your more detailed thoughts. I fully agree with your
> thoughts about HudiLink, which should also be part of the long-term
> planning of the Hudi ecosystem.
>
> *But I found that the angle of our thinking and the starting point are
> not consistent. I pay more attention to the rationality of the existing
> architecture and whether the dependence on the computing engine is
> pluggable. Don't get me wrong: I know very well that although we have
> different perspectives, both views have value for Hudi.*
>
> Let me give more details on the discussion I raised earlier.
>
> Currently, multiple submodules of the Hudi project are tightly coupled
> to Spark's design and dependencies. You can see that many of the class
> files contain statements such as "import org.apache.spark.xxx".
>
> I first put forward one discussion, "Integrate Hudi with Apache Flink",
> and then came up with another, "Decouple Hudi and Spark".
>
> The word "Integrate" I used for the first discussion may not be accurate
> enough. My intention is to make the computing engine used by Hudi
> pluggable. To Hudi, Spark is just a library; it is not the core of Hudi
> and should not be strongly coupled with it. The features currently
> provided by Spark are also available from Flink. But in order to achieve
> this, we need to decouple Hudi from Spark at the code level.
>
> This makes sense both in terms of architectural rationality and
> community ecology.
>
> Best,
> Vino
>
>
> Semantic Beeng <n...@semanticbeeng.com> 于2019年8月4日周日 下午2:21写道:
>
>> "+1 for both Beam and Flink" - what I propose implies this indeed.
>>
>> But/and am working from the desired functionality and a proposed design
>> (as opposed to starting with refactoring Hudi with the goal of close
>> integration with Flink).
>>
>> I feel this is not necessary - but am not an expert in the Hudi
>> implementation.
>>
>> But am pretty sure it is not sufficient for the use cases I have in
>> mind. The gist is using Hudi as a file-based data lake + ML feature
>> store that enables incremental analyses done with a combination of
>> Flink, Beam, Spark, and TensorFlow (see Petastorm from Uber Engineering
>> for an idea).
>>
>> Let us call this HudiLink from now on (think of it as a mediator, not
>> another Hudi).
>>
>> The intuition behind looking at more than Flink is that both Beam and
>> Flink have good design abstractions we might reuse and extend.
>>
>> Like I said before, I do not believe in point-to-point integrations.
>>
>> Alternatively / in parallel, if you care to share your use cases, that
>> would be very useful. Working with explicit use cases helps others to
>> relate and help.
>>
>> Also, if some of you believe in (see) the value of refactoring the Hudi
>> implementation for a hard integration with Flink (but have no time to
>> argue for it), ofc please go ahead.
>>
>> That may be a valid bottom-up approach, but I cannot relate to it
>> myself (due to lack of use cases).
>>
>> Am working on a write-up on HudiLink - if any are interested I might
>> publish it when it is more mature.
>>
>> Hint: this was part of the inspiration:
>> https://eng.uber.com/michelangelo/
>>
>> One well-thought-out use case will get you "in". :-) Kidding, ofc.
>>
>> Cheers
>>
>> Nick
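[Editor's note: as a hedged illustration of Vino's pluggability point above, a Spark-backed implementation of the hypothetical interfaces from the earlier sketch could then live entirely in a hoodie-client-spark module, so the "import org.apache.spark.xxx" statements disappear from the core. These classes are illustrative, not existing Hudi code.]

    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical sketch: Spark imports appear only in this module,
    // never in hoodie-client-core.

    class SparkHoodieData<T> implements HoodieData<T> {
        private final JavaRDD<T> rdd;

        SparkHoodieData(JavaRDD<T> rdd) {
            this.rdd = rdd;
        }

        @Override
        public <R> HoodieData<R> map(SerializableFunction<T, R> fn) {
            // Adapt our serializable function to Spark's Function interface.
            return new SparkHoodieData<>(rdd.map(fn::apply));
        }

        @Override
        public List<T> collectAsList() {
            return rdd.collect();
        }
    }

    class SparkEngineContext implements HoodieEngineContext {
        // JavaSparkContext is not serializable; it is driver-side only.
        private final transient JavaSparkContext jsc;

        SparkEngineContext(JavaSparkContext jsc) {
            this.jsc = jsc;
        }

        @Override
        public <T> HoodieData<T> parallelize(List<T> items, int parallelism) {
            return new SparkHoodieData<>(jsc.parallelize(items, parallelism));
        }
    }

A hoodie-client-flink module would then supply the same two classes on top of Flink's APIs, without touching hoodie-client-core.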
>>
>> On August 3, 2019 at 10:55 PM vino yang <yanghua1...@gmail.com> wrote:
>>
>>> +1 for both Beam and Flink
>>>
>>>> First step here is probably to draw out the current hierarchy and
>>>> figure out what the abstraction points are.. In my opinion, the
>>>> runtime (Spark, Flink) should be done at the hoodie-client level and
>>>> just used by hoodie-utilities seamlessly..
>>>
>>> +1 for Vinoth's opinion, it should be the first step.
>>>
>>> No matter which computing framework we hope Hudi to integrate with,
>>> we need to decouple the Hudi client and Spark.
>>>
>>> We may need a pure client module named, for example,
>>> hoodie-client-core (common).
>>>
>>> Then we could have: hoodie-client-spark, hoodie-client-flink and
>>> hoodie-client-beam.
>>>
>>> Suneel Marthi <smar...@apache.org> 于2019年8月4日周日 上午10:45写道:
>>>
>>>> +1 for Beam -- agree with Semantic Beeng's analysis.
>>>>
>>>> On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com>
>>>> wrote:
>>>>
>>>>> So the way to go around this is to file a HIP, chalk all the
>>>>> classes out, and start moving towards a pure client.
>>>>>
>>>>> Secondly, should we want to try Beam?
>>>>>
>>>>> I think there is too much going on here and I'm not able to follow.
>>>>> If we want to try out Beam all along, I don't think it makes sense
>>>>> to do anything on Flink then.
>>>>>
>>>>> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com>
>>>>> wrote:
>>>>>
>>>>>> +1 My money is on this approach.
>>>>>>
>>>>>> The existing abstractions from Beam seem enough for the use cases
>>>>>> as I imagine them.
>>>>>>
>>>>>> Flink also has "dynamic table", "table source" and "table sink",
>>>>>> which seem very useful abstractions where Hudi might fit nicely.
>>>>>>
>>>>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>>>>>>
>>>>>> Attached a screen shot.
>>>>>>
>>>>>> This seems to fit with the original premise of Hudi as well.
>>>>>>
>>>>>> Am exploring this venue with a use case that involves "temporal
>>>>>> joins on streams" which I need for feature extraction.
>>>>>>
>>>>>> Anyone who is interested in this or has concrete enough needs and
>>>>>> use cases, please let me know.
>>>>>>
>>>>>> Best to go from an agreed-upon set of 2-3 use cases.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>>> Also, we do have some Beam experts on the mailing list.. Can you
>>>>>>> please weigh in on the viability of using Beam as the intermediate
>>>>>>> abstraction here between Spark/Flink? Hudi uses RDD APIs like
>>>>>>> groupBy, mapToPair, sortAndRepartition, reduceByKey, countByKey,
>>>>>>> and also does custom partitioning a lot.
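[Editor's note: on Vinoth's quoted question about Beam as the intermediate abstraction, below is a rough, non-authoritative sketch of how the RDD-style operations he lists might map onto Beam's Java SDK. The record format and key types are placeholders, not Hudi types; the point is only that the core primitives line up, not that the port is trivial.]

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamOpsSketch {
      public static void main(String[] args) {
        Pipeline p =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder "recordKey,value" strings standing in for records.
        PCollection<String> records =
            p.apply(Create.of("key1,3", "key2,5", "key1,7"));

        // mapToPair -> MapElements producing KV pairs.
        PCollection<KV<String, Integer>> keyed = records.apply(
            MapElements.into(TypeDescriptors.kvs(
                    TypeDescriptors.strings(), TypeDescriptors.integers()))
                .via((String r) -> KV.of(
                    r.split(",")[0], Integer.parseInt(r.split(",")[1]))));

        // groupBy -> GroupByKey.
        PCollection<KV<String, Iterable<Integer>>> grouped =
            keyed.apply(GroupByKey.<String, Integer>create());

        // reduceByKey -> Combine.perKey (here via the Sum shortcut).
        PCollection<KV<String, Integer>> reduced =
            keyed.apply(Sum.integersPerKey());

        // countByKey -> Count.perKey.
        PCollection<KV<String, Long>> counted =
            keyed.apply(Count.<String, Integer>perKey());

        // Note: sortAndRepartition and custom partitioning have no direct
        // Beam equivalent; physical partitioning is largely left to the
        // runner, which is an open question for this approach.

        p.run().waitUntilFinish();
      }
    }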