Thanks a ton Vinoth.

On Wed, Aug 7, 2019 at 4:34 PM Vinoth Chandar <vin...@apache.org> wrote:
> >> Are there some tasks I can take up to ramp up the code?
>
> Certainly. There are some open tasks that touch the hoodie-client and
> hoodie-utilities modules:
> https://issues.apache.org/jira/browse/HUDI-37
> https://issues.apache.org/jira/browse/HUDI-194
> https://issues.apache.org/jira/browse/HUDI-145
> https://issues.apache.org/jira/browse/HUDI-130
> https://issues.apache.org/jira/browse/HUDI-62
>
> IMO, getting hands dirty with a few of these, and maybe 1-2 more involved
> ones, would give enough context to drive the hudi-on-flink project.
>
> On Tue, Aug 6, 2019 at 1:04 PM nishith agarwal <n3.nas...@gmail.com> wrote:
>
> > +1 for Approach 1, point integration with each framework.
> >
> > Pros for point integration:
> > - The Hudi community is already familiar with Spark and Spark-based
> > actions/shuffles etc. Since both modules can be decoupled, this enables
> > us to keep a steady release cadence for Hudi on one execution engine
> > (Spark) while we hone our skills and iterate on making the Flink DAG
> > optimized and performant with the right configuration.
> > - This might be a stepping stone towards rewriting the entire code base
> > to be agnostic of Spark/Flink. This approach will help us fix tests and
> > intricacies, and help make the code base ready for a larger rework.
> > - It seems like the easiest way to add Flink support.
> >
> > Cons:
> > - More code paths to maintain and reason about, since the Spark and
> > Flink integrations will naturally diverge over time.
> >
> > Theoretically, I like the idea of being able to run the Hudi DAG on
> > Beam more than point integrations, since there is one API/logic to
> > reason about. But practically, that may not be the right direction.
> >
> > Pros:
> > - Less cognitive burden in maintaining, evolving and releasing the
> > project, with one API to reason with.
> > - Theoretically, going forward, assuming Beam is adopted as a standard
> > programming paradigm for stream/batch, this would enable consumers to
> > leverage the power of Hudi more easily.
> >
> > Cons:
> > - A massive rewrite of the code base. Additionally, since we would have
> > moved away from directly using Spark APIs, there is a bigger risk of
> > regression. We would have to be very thorough with all the intricacies
> > and ensure the same stability in new releases.
> > - Managing future features (which may be very Spark-driven) will either
> > clash, or pause, or need to be reworked.
> > - Tuning jobs for Spark/Flink-type execution frameworks individually
> > might be difficult, and will get harder over time as the project
> > evolves, where some Beam integrations with Spark/Flink may not work as
> > expected.
> > - Also, as pointed out above, we would probably still need to support
> > the hoodie-spark module as first-class.
> >
> > Thanks,
> > Nishith
> >
> > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala <taher...@gmail.com> wrote:
> >
> > > Hi Vinoth,
> > >     Are there some tasks I can take up to ramp up the code? I want to
> > > get more used to the code and understand the existing implementation
> > > better.
> > >
> > > Thanks,
> > > Taher Koitawala
> > >
> > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > Let's see if others have any thoughts as well. We can plan to fix
> > > > the approach by EOW.
> > > >
> > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang <yanghua1...@gmail.com> wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > Also, +1 for Approach 1, like Taher.
> > > > >
> > > > > > If we can do a comprehensive analysis of this model and come up
> > > > > > with means to refactor this cleanly, this would be promising.
> > > > >
> > > > > Yes, once we reach a conclusion, we can start this work.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > On Tue, Aug 6, 2019 at 12:28 AM taher koitawala <taher...@gmail.com> wrote:
> > > > >
> > > > > > +1 for Approach 1, point integration with each framework.
> > > > > >
> > > > > > Approach 2 has a problem, as you said: "Developers need to think
> > > > > > about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in
> > > > > > the end, this may not be the panacea that it seems to be."
> > > > > >
> > > > > > We have seen various pipelines in the Beam DAG being expressed
> > > > > > differently than we had them in our original use case. Also,
> > > > > > switching between the Spark and Flink runners in Beam has
> > > > > > various impacts on the pipelines; for example, some features
> > > > > > available in Flink are not available on the Spark runner, etc.
> > > > > > Refer to this capability matrix ->
> > > > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > > > >
> > > > > > Hence my vote is for Approach 1: let's decouple and build the
> > > > > > abstraction for each framework. That is a much better option.
> > > > > > We will also have more control over each framework's
> > > > > > implementation.
> > > > > >
> > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > > > >
> > > > > > > I would like to highlight that there are two distinct
> > > > > > > approaches here, with different tradeoffs. Think of this as
> > > > > > > my braindump, as I have been thinking about this quite a bit
> > > > > > > in the past.
> > > > > > >
> > > > > > > *Approach 1: Point integration with each framework*
> > > > > > >
> > > > > > > >> We may need a pure client module named, for example,
> > > > > > > >> hoodie-client-core (common)
> > > > > > > >> Then we could have: hoodie-client-spark, hoodie-client-flink
> > > > > > > >> and hoodie-client-beam
> > > > > > >
> > > > > > > (+) This is the safest to do IMO, since we can isolate the
> > > > > > > current Spark execution (hoodie-spark, hoodie-client-spark)
> > > > > > > from the changes for Flink, while it stabilizes over a few
> > > > > > > releases.
> > > > > > > (-) The downside is that the utilities need to be redone:
> > > > > > > hoodie-utilities-spark and hoodie-utilities-flink and
> > > > > > > hoodie-utilities-core? hoodie-cli?
> > > > > > >
> > > > > > > If we can do a comprehensive analysis of this model and come
> > > > > > > up with means to refactor this cleanly, this would be
> > > > > > > promising.
> > > > > > >
> > > > > > > *Approach 2: Beam as the compute abstraction*
> > > > > > >
> > > > > > > Another, more drastic, approach is to remove Spark as the
> > > > > > > compute abstraction for writing data and replace it with Beam.
> > > > > > >
> > > > > > > (+) All of the code remains more or less similar, and there
> > > > > > > is one compute API to reason about.
> > > > > > >
> > > > > > > (-) The (very big) assumption here is that we are able to
> > > > > > > tune the Spark runtime the same way using Beam: custom
> > > > > > > partitioners, support for all the RDD operations we invoke,
> > > > > > > caching, etc.
> > > > > > > (-) It would be a massive rewrite, and testing such a large
> > > > > > > rewrite would also be really challenging, since we would need
> > > > > > > to pay attention to all the intricate details to ensure that
> > > > > > > today's Spark users experience no regressions/side-effects.
> > > > > > > (-) Note that we would still probably need to support the
> > > > > > > hoodie-spark module, and maybe a similar first-class
> > > > > > > integration with Flink, for native Flink/Spark pipeline
> > > > > > > authoring. Users of, say, DeltaStreamer need to pass in Spark
> > > > > > > or Flink configs anyway.. Developers need to think about
> > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the
> > > > > > > end, this may not be the panacea that it seems to be.
> > > > > > >
> > > > > > > One goal for the HIP is to get us all to agree as a community
> > > > > > > on which one to pick, with sufficient investigation, testing
> > > > > > > and benchmarking..
> > > > > > >
> > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > >
> > > > > > > > +1 for both Beam and Flink
> > > > > > > >
> > > > > > > > > First step here is probably to draw out the current
> > > > > > > > > hierarchy and figure out what the abstraction points are..
> > > > > > > > > In my opinion, the runtime (spark, flink) should be done
> > > > > > > > > at the hoodie-client level and just used by
> > > > > > > > > hoodie-utilities seamlessly..
> > > > > > > >
> > > > > > > > +1 for Vinoth's opinion; it should be the first step.
> > > > > > > >
> > > > > > > > No matter which computing framework we want Hudi to
> > > > > > > > integrate with, we need to decouple the Hudi client from
> > > > > > > > Spark.
> > > > > > > >
> > > > > > > > We may need a pure client module named, for example,
> > > > > > > > hoodie-client-core (common)
> > > > > > > >
> > > > > > > > Then we could have: hoodie-client-spark, hoodie-client-flink
> > > > > > > > and hoodie-client-beam
> > > > > > > >
> > > > > > > > On Sun, Aug 4, 2019 at 10:45 AM Suneel Marthi <smar...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > > > > > > > >
> > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > So the way to go around this is to file a HIP, chalk
> > > > > > > > > > out all the classes, and start moving towards a pure
> > > > > > > > > > client.
> > > > > > > > > >
> > > > > > > > > > Secondly, do we want to try Beam?
> > > > > > > > > >
> > > > > > > > > > I think there is too much going on here and I'm not
> > > > > > > > > > able to follow. If we want to try out Beam all along, I
> > > > > > > > > > don't think it makes sense to do anything on Flink.
> > > > > > > > > >
> > > > > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com> wrote:
> > > > > > > > > >
> > > > > > > > > >> +1 My money is on this approach.
> > > > > > > > > >>
> > > > > > > > > >> The existing abstractions from Beam seem enough for
> > > > > > > > > >> the use cases as I imagine them.
> > > > > > > > > >>
> > > > > > > > > >> Flink also has "dynamic table", "table source" and
> > > > > > > > > >> "table sink", which seem like very useful abstractions
> > > > > > > > > >> where Hudi might fit nicely.
> > > > > > > > > >>
> > > > > > > > > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > > > > > > > >>
> > > > > > > > > >> Attached a screenshot.
> > > > > > > > > >>
> > > > > > > > > >> This seems to fit with the original premise of Hudi as
> > > > > > > > > >> well.
> > > > > > > > > >>
> > > > > > > > > >> I am exploring this avenue with a use case that
> > > > > > > > > >> involves "temporal joins on streams", which I need for
> > > > > > > > > >> feature extraction.
> > > > > > > > > >>
> > > > > > > > > >> If anyone is interested in this or has concrete enough
> > > > > > > > > >> needs and use cases, please let me know.
> > > > > > > > > >>
> > > > > > > > > >> Best to go from an agreed-upon set of 2-3 use cases.
> > > > > > > > > >>
> > > > > > > > > >> Cheers
> > > > > > > > > >>
> > > > > > > > > >> Nick
> > > > > > > > > >>
> > > > > > > > > >> > Also, we do have some Beam experts on the mailing
> > > > > > > > > >> > list.. Can you please weigh in on the viability of
> > > > > > > > > >> > using Beam as the intermediate abstraction here
> > > > > > > > > >> > between Spark/Flink?
> > > > > > > > > >> > Hudi uses RDD APIs like groupBy, mapToPair,
> > > > > > > > > >> > sortAndRepartition, reduceByKey, countByKey and also
> > > > > > > > > >> > does custom partitioning a lot.
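To make the question quoted just above concrete, here is a minimal sketch of how the listed RDD operations roughly map onto the Beam Java SDK. The record format and key extraction are invented placeholders; the point is that groupBy/mapToPair/reduceByKey/countByKey have Beam counterparts, while custom partitioning does not, which is exactly the gap flagged in Approach 2.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamRddMappingSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Placeholder records of the form "key:payload".
    PCollection<String> records = p.apply(Create.of("key1:a", "key2:b", "key1:c"));

    // RDD mapToPair -> MapElements producing KV pairs.
    PCollection<KV<String, String>> keyed =
        records.apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
            .via((String r) -> KV.of(r.split(":")[0], r)));

    // RDD groupBy / reduceByKey -> GroupByKey (plus Combine.perKey for the reduce step).
    PCollection<KV<String, Iterable<String>>> grouped =
        keyed.apply(GroupByKey.<String, String>create());

    // RDD countByKey -> Count.perKey (stays distributed, unlike Spark's driver-side map).
    PCollection<KV<String, Long>> counts = keyed.apply(Count.<String, String>perKey());

    // RDD sortAndRepartition / custom Partitioner: no direct Beam equivalent;
    // physical partitioning is owned by the runner. This is the open question
    // for using Beam as the compute abstraction.

    p.run().waitUntilFinish();
  }
}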
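Similarly, for the hoodie-client-core decoupling proposed under Approach 1, a minimal sketch of what the engine-agnostic layer could look like; every name below is invented for illustration and is not an actual Hudi API. Write/commit logic would depend only on these interfaces, with hoodie-client-spark wrapping an RDD and hoodie-client-flink a DataStream behind them.

import java.io.Serializable;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical engine-neutral context for "hoodie-client-core".
 * Each hoodie-client-<engine> module supplies one implementation.
 */
public interface HoodieEngineContext extends Serializable {

  /** Engine-neutral handle to a distributed collection of records. */
  interface HoodieData<T> {
    <O> HoodieData<O> map(Function<T, O> fn);

    /** Bring results back to the client, e.g. write statuses after a commit. */
    List<T> collectAsList();
  }

  /** Distribute driver-side input, e.g. the records of one upsert batch. */
  <T extends Serializable> HoodieData<T> parallelize(List<T> data, int parallelism);
}

With something along these lines, DeltaStreamer-style utilities could stay engine-agnostic, and only the thin per-engine modules would touch Spark or Flink types.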
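Finally, on the Flink "dynamic table" pointer: a minimal sketch of where a Hudi table sink might surface, assuming a recent Flink Table API (flink-table-api-java-bridge). The 'hudi' connector string, the schema, and the source_events table are all hypothetical here, purely to show where such an integration would plug in.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HudiDynamicTableSketch {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // Declare a dynamic table backed by a (hypothetical) Hudi connector.
    tEnv.executeSql(
        "CREATE TABLE hudi_sink ("
            + "  uuid STRING, name STRING, ts TIMESTAMP(3),"
            + "  PRIMARY KEY (uuid) NOT ENFORCED"
            + ") WITH ('connector' = 'hudi', 'path' = 'file:///tmp/hudi_table')");

    // A continuous query over an upstream dynamic table, maintained
    // incrementally as rows arrive -- a natural fit for Hudi's
    // upsert-centric write model.
    tEnv.executeSql("INSERT INTO hudi_sink SELECT uuid, name, ts FROM source_events");
  }
}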