Hi guys, Currently, I am busy with HUDI-203[1] and other things.
I agree with Vinoth that we should try to find a new solution to remove the dependency on the Spark RDD cache. It's an excellent way to start this big piece of work.

[1]: https://issues.apache.org/jira/browse/HUDI-203

vbal...@apache.org <vbal...@apache.org> wrote on Tue, Sep 17, 2019 at 3:49 AM:

> +1 This is a pretty large undertaking. While the community is getting their hands dirty and ramping up on Hudi internals, it would be productive if Vinoth shepherds this.
>
> Balaji.V
>
> On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth Chandar <vin...@apache.org> wrote:
>
> sg. :)
>
> I will wait for others on this thread as well to chime in.
>
> On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala <taher...@gmail.com> wrote:
>
> > Vinoth, I think right now, given your experience with the project, you should be scoping out what needs to be done to take us there. So +1 for giving you more work :)
> >
> > We want to reach a point where we can start scoping out the addition of Flink and Beam components. Then I think we will make tremendous progress.
> >
> > On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar <vin...@apache.org> wrote:
> >
> > > I still feel the key thing here is reimplementing HoodieBloomIndex without needing spark caching.
> > >
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global) documents the spark DAG in detail.
> > >
> > > If everyone feels it's best for me to scope the work out, then happy to do it!
> > >
> > > On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <taher...@gmail.com> wrote:
> > >
> > > > Guys, I think we are slowing down on this again. We need to start planning small tasks towards this. VC, can you please help fast-track this?
> > > >
> > > > Regards,
> > > > Taher Koitawala
> > > >
> > > > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <vin...@apache.org> wrote:
> > > >
> > > > > Look forward to the analysis.
> > > > > A key class to read would be HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > > > >
> > > > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <yanghua1...@gmail.com> wrote:
> > > > >
> > > > > > > Currently Spark Streaming micro batching fits well with Hudi, since it amortizes the cost of indexing, workload profiling etc. 1 spark micro batch = 1 hudi commit
> > > > > > > With the per-record model in Flink, I am not sure how useful it will be to support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will be inefficient..
> > > > > >
> > > > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient. With Flink streaming, we can also implement the "batch" and "micro-batch" models when processing data. For example:
> > > > > >
> > > > > > - aggregation: use the flexible window mechanism;
> > > > > > - non-aggregation: use the Flink state API to cache a batch of data
> > > > > >
> > > > > > > On first focussing on decoupling of Spark and Hudi alone, yes a full summary of how Spark is being used in a wiki page is a good start IMO. We can then hash out what can be generalized and what cannot be and needs to be left in hudi-client-spark vs hudi-client-core
> > > > > >
> > > > > > agree
> > > > > >
> > > > > > Vinoth Chandar <vin...@apache.org> wrote on Wed, Aug 14, 2019 at 8:35 AM:
> > > > > >
> > > > > > > > We should only stick to Flink Streaming. Furthermore if there is a requirement for batch then users should use Spark or then we will anyway have a beam integration coming up.
> > > > > > > Currently Spark Streaming micro batching fits well with Hudi, since it amortizes the cost of indexing, workload profiling etc. 1 spark micro batch = 1 hudi commit
> > > > > > > With the per-record model in Flink, I am not sure how useful it will be to support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will be inefficient..
> > > > > > >
> > > > > > > On first focussing on decoupling of Spark and Hudi alone, yes a full summary of how Spark is being used in a wiki page is a good start IMO. We can then hash out what can be generalized and what cannot be and needs to be left in hudi-client-spark vs hudi-client-core
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Nick and Taher,
> > > > > > > >
> > > > > > > > I just want to answer Nishith's question. Reference his old description here:
> > > > > > > >
> > > > > > > > > You can do a parallel investigation while we are deciding on the module structure. You could be looking at all the patterns in Hudi's Spark APIs usage (RDD/DataSource/SparkContext) and see if such support can be achieved in theory with Flink. If not, what is the workaround. Documenting such patterns would be valuable when multiple engineers are working on it.
> > > > > > > > > For e.g., Hudi relies on (a) custom partitioning logic for upserts, (b) caching RDDs to avoid reruns of costly stages, (c) a Spark upsert task knowing its spark partition/task/attempt ids
> > > > > > > >
> > > > > > > > And just like the title of this thread, we are going to try to decouple Hudi and Spark. That means we can run the whole of Hudi without depending on Spark. So we need to analyze all the usage of Spark in Hudi.
> > > > > > > >
> > > > > > > > Here we are not discussing the integration of Hudi and Flink in the application layer. Instead, I want Hudi to be decoupled from Spark and allow other engines (such as Flink) to replace Spark.
> > > > > > > >
> > > > > > > > It can be divided into long-term and short-term goals, as Nishith stated in a recent email.
> > > > > > > >
> > > > > > > > I mentioned the Flink Batch API here because Hudi can connect with many different Sources/Sinks. Some file-based reads are not appropriate for Flink Streaming.
> > > > > > > >
> > > > > > > > Therefore, this is a comprehensive survey of the use of Spark in Hudi.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Vino
> > > > > > > >
> > > > > > > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 13, 2019 at 5:43 PM:
> > > > > > > >
> > > > > > > > > Hi Vino,
> > > > > > > > > According to what I've seen, Hudi has a lot of spark components flowing through it, like TaskContexts, JavaSparkContexts etc. The main classes I guess we should focus on are HoodieTable and the Hoodie write clients.
> > > > > > > > >
> > > > > > > > > Also Vino, I don't think we should be providing a Flink DataSet implementation. We should only stick to Flink Streaming. Furthermore, if there is a requirement for batch then users should use Spark, or we will anyway have a beam integration coming up.
> > > > > > > > >
> > > > > > > > > As for the cache, how about we write our own stateful Flink function and use RocksDBStateBackend with some state TTL?
> > > > > > > > >
> > > > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > After doing some research, let me share my information:
> > > > > > > > > >
> > > > > > > > > > - Limitation of computing engine capabilities: Hudi uses Spark's RDD#persist, and Flink currently has no API to cache datasets. Maybe we can only choose between using external storage and not using a cache? For the use of other APIs, the two currently offer almost equivalent capabilities.
> > > > > > > > > > - The abstraction of the computing engine is different: considering the different usage scenarios of the computing engine in Hudi, and since Flink has not yet implemented stream/batch unification, we may use both Flink's DataSet API (batch processing) and DataStream API (stream processing).
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Vino
> > > > > > > > > >
> > > > > > > > > > nishith agarwal <n3.nas...@gmail.com> wrote on Thu, Aug 8, 2019 at 12:57 AM:
> > > > > > > > > >
> > > > > > > > > > > Nick,
> > > > > > > > > > >
> > > > > > > > > > > You bring up a good point about the non-trivial programming model differences between these different technologies. From a theoretical perspective, I'd say considering a higher level abstraction makes sense. I think we have to decouple some objectives and concerns here.
> > > > > > > > > > >
> > > > > > > > > > > a) The immediate desire is to have Hudi be able to run on a Flink (or non-spark) engine. This naturally begs the question of decoupling Hudi concepts from direct Spark dependencies.
> > > > > > > > > > >
> > > > > > > > > > > b) If we do want to initiate the above effort, would it make sense to just have a higher level abstraction, building on other technologies like beam (euphoria etc) and provide single, clean API's that may be more maintainable from a code perspective. But at the same time this will introduce challenges on how to maintain efficiency and optimized runtime dags for Hudi (since the code would move away from point integrations and whenever this happens, tuning natively for specific engines becomes more and more difficult).
> > > > > > > > > > >
> > > > > > > > > > > My general opinion is that, as the community grows over time with more folks having an in-depth understanding of Hudi, going from current_state -> (a) -> (b) might be the most reliable and adoptable path for this project.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Nishith
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng <n...@semanticbeeng.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > There are some non-trivial differences in programming model and runtime semantics between Beam, Spark and Flink.
> > > > > > > > > > > >
> > > > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > > > > > > > > > > >
> > > > > > > > > > > > Nishith, Vino - thoughts?
> > > > > > > > > > > >
> > > > > > > > > > > > Does it make sense to consider a higher level abstraction / DSL instead of maintaining different code with the same functionality but different programming models?
> > > > > > > > > > > >
> > > > > > > > > > > > https://beam.apache.org/documentation/sdks/java/euphoria/
> > > > > > > > > > > >
> > > > > > > > > > > > Nick
> > > > > > > > > > > >
> > > > > > > > > > > > On August 6, 2019 at 4:04 PM nishith agarwal <n3.nas...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > +1 for Approach 1 Point integration with each framework.
> > > > > > > > > > > >
> > > > > > > > > > > > Pros for point integration
> > > > > > > > > > > >
> > > > > > > > > > > > - Hudi community is already familiar with spark and spark based actions/shuffles etc. Since both modules can be decoupled, this enables us to have a steady release for Hudi for 1 execution engine (spark) while we hone our skills and iterate on making flink dag optimized, performant with the right configuration.
> > > > > > > > > > > > - This might be a stepping stone towards rewriting the entire code base being agnostic of spark/flink. This approach will help us fix tests, intricacies and help make the code base ready for a larger rework.
> > > > > > > > > > > > - Seems like the easiest way to add flink support
> > > > > > > > > > > >
> > > > > > > > > > > > Cons
> > > > > > > > > > > >
> > > > > > > > > > > > - More code paths to maintain and reason about, since the spark and flink integrations will naturally diverge over time.
> > > > > > > > > > > >
> > > > > > > > > > > > Theoretically, I do like the idea of being able to run the hudi dag on beam more than point integrations, where there is one API/logic to reason about. But practically, that may not be the right direction.
> > > > > > > > > > > >
> > > > > > > > > > > > Pros
> > > > > > > > > > > >
> > > > > > > > > > > > - Lesser cognitive burden in maintaining, evolving and releasing the project with one API to reason with.
> > > > > > > > > > > > - Theoretically, going forward, assuming beam is adopted as a standard programming paradigm for stream/batch, this would enable consumers to leverage the power of hudi more easily.
> > > > > > > > > > > >
> > > > > > > > > > > > Cons
> > > > > > > > > > > >
> > > > > > > > > > > > - Massive rewrite of the code base.
> > > > > > > > > > > >   Additionally, since we would have moved away from directly using spark APIs, there is a bigger risk of regression. We would have to be very thorough with all the intricacies and ensure the same stability of new releases.
> > > > > > > > > > > > - Managing future features (which may be very spark driven) will either clash or pause or will need to be reworked.
> > > > > > > > > > > > - Tuning jobs for Spark/Flink type execution frameworks individually might be difficult and will get more difficult over time as the project evolves, where some beam integrations with spark/flink may not work as expected.
> > > > > > > > > > > > - Also, as pointed out above, we probably need to support the hoodie-spark module as first-class.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Nishith
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala <taher...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Vinoth,
> > > > > > > > > > > > Are there some tasks I can take up to ramp up on the code?
> > > > > > > > > > > > Want to get more used to the code and understand the existing implementation better.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Taher Koitawala
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Let's see if others have any thoughts as well. We can plan to fix the approach by EOW.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi guys,
> > > > > > > > > > > >
> > > > > > > > > > > > Also, +1 for Approach 1 like Taher.
> > > > > > > > > > > >
> > > > > > > > > > > > > If we can do a comprehensive analysis of this model and come up with means to refactor this cleanly, this would be promising.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, when we get the conclusion, we could start this work.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Vino
> > > > > > > > > > > >
> > > > > > > > > > > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 6, 2019 at 12:28 AM:
> > > > > > > > > > > >
> > > > > > > > > > > > +1 for Approach 1 Point integration with each framework
> > > > > > > > > > > >
> > > > > > > > > > > > Approach 2 has a problem as you said "Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > > > > > > > > > > So in the end, this may not be the panacea that it seems to be"
> > > > > > > > > > > >
> > > > > > > > > > > > We have seen various pipelines in the beam dag being expressed differently than we had them in our original use case. Also, switching between the spark and Flink runners in beam has various impacts on the pipelines, e.g. some features available in Flink are not available on the spark runner.
> > > > > > > > > > > >
> > > > > > > > > > > > Refer to this capability matrix -> https://beam.apache.org/documentation/runners/capability-matrix/
> > > > > > > > > > > >
> > > > > > > > > > > > Hence my vote for Approach 1: let's decouple and build the abstraction for each framework. That is a much better option. We will also have more control over each framework's implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Would like to highlight that there are two distinct approaches here with different tradeoffs. Think of this as my braindump, as I have been thinking about this quite a bit in the past.
> > > > > > > > > > > >
> > > > > > > > > > > > *Approach 1 : Point integration with each framework*
> > > > > > > > > > > >
> > > > > > > > > > > > > We may need a pure client module named for example hoodie-client-core(common)
> > > > > > > > > > > > > Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam
> > > > > > > > > > > >
> > > > > > > > > > > > (+) This is the safest to do IMO, since we can isolate the current Spark execution (hoodie-spark, hoodie-client-spark) from the changes for flink, while it stabilizes over few releases.
> > > > > > > > > > > > (-) Downside is that the utilities needs to be redone : hoodie-utilities-spark and hoodie-utilities-flink and hoodie-utilities-core ? hoodie-cli?
> > > > > > > > > > > >
> > > > > > > > > > > > If we can do a comprehensive analysis of this model and come up with
> > > > > > > > > > > > means to refactor this cleanly, this would be promising.
> > > > > > > > > > > >
> > > > > > > > > > > > *Approach 2: Beam as the compute abstraction*
> > > > > > > > > > > >
> > > > > > > > > > > > Another more drastic approach is to remove Spark as the compute abstraction for writing data and replace it with Beam.
> > > > > > > > > > > >
> > > > > > > > > > > > (+) All of the code remains more or less similar and there is one compute API to reason about.
> > > > > > > > > > > >
> > > > > > > > > > > > (-) The (very big) assumption here is that we are able to tune the spark runtime the same way using Beam : custom partitioners, support for all RDD operations we invoke, caching etc etc.
> > > > > > > > > > > > (-) It will be a massive rewrite and testing of such a large rewrite would also be really challenging, since we need to pay attention to all intricate details to ensure the spark users today experience no regressions/side-effects
> > > > > > > > > > > > (-) Note that we still need to probably support the hoodie-spark module and maybe a similar first-class integration with flink, for native flink/spark pipeline authoring. Users of say DeltaStreamer need to pass in Spark or Flink configs anyway.. Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may not be the panacea that it seems to be.
> > > > > > > > > > > >
> > > > > > > > > > > > One goal for the HIP is to get us all to agree as a community which one to pick, with sufficient investigation, testing, benchmarking..
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > +1 for both Beam and Flink
> > > > > > > > > > > >
> > > > > > > > > > > > > First step here is to probably draw out current hierarchy and figure out what the abstraction points are..
> > > > > > > > > > > > > In my opinion, the runtime (spark, flink) should be done at the hoodie-client level and just used by hoodie-utilities seamlessly..
> > > > > > > > > > > >
> > > > > > > > > > > > +1 for Vinoth's opinion, it should be the first step.
> > > > > > > > > > > >
> > > > > > > > > > > > No matter which computing framework we hope Hudi will integrate with, we need to decouple the Hudi client and Spark.
> > > > > > > > > > > >
> > > > > > > > > > > > We may need a pure client module named for example hoodie-client-core(common)
> > > > > > > > > > > >
> > > > > > > > > > > > Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam
> > > > > > > > > > > >
> > > > > > > > > > > > Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at 10:45 AM:
> > > > > > > > > > > >
> > > > > > > > > > > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > So the way to go about this is to file a HIP. Chalk all the classes out and start moving towards a pure client.
> > > > > > > > > > > >
> > > > > > > > > > > > Secondly, should we want to try beam?
> > > > > > > > > > > >
> > > > > > > > > > > > I think there is too much going on here and I'm not able to follow. If we want to try out beam all along, I don't think it makes sense to do anything on Flink then.
> > > > > > > > > > > >
> > > > > > > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > +1 My money is on this approach.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The existing abstractions from Beam seem enough for the use cases as I imagine them.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Flink also has "dynamic table", "table source" and "table sink" which seem very useful abstractions where Hudi might fit nicely.
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > > > > > > > > > > >
> > > > > > > > > > > > > Attached a screen shot.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This seems to fit with the original premise of Hudi as well.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Am exploring this avenue with a use case that involves "temporal joins on streams" which I need for feature extraction.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Anyone who is interested in this, or has concrete enough needs and use cases, please let me know.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best to go from an agreed upon set of 2-3 use cases.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers
> > > > > > > > > > > > >
> > > > > > > > > > > > > Nick
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Also, we do have some Beam experts on the mailing list.. Can you please weigh in on the viability of using Beam as the intermediate abstraction here between Spark/Flink? Hudi uses RDD apis like groupBy, mapToPair, sortAndRepartition, reduceByKey, countByKey and also does custom partitioning a lot.
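
For readers following the "Approach 1" idea in this thread (a pure hoodie-client-core module plus per-engine modules), here is a rough Java sketch of what such a split could look like. Everything in it is an assumption made for illustration: `ComputeEngine`, `LocalEngine` and `WorkloadProfiler` are hypothetical names, not Hudi classes, and the in-memory engine only stands in for a Spark- or Flink-backed implementation. It covers just two of the operations mentioned above (a map and a countByKey-style aggregation) and does not attempt caching or custom partitioning.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical engine-neutral contract that would live in a core module. */
interface ComputeEngine {
    /** Apply fn to every element; a real engine may run this in parallel. */
    <I, O> List<O> map(List<I> input, Function<I, O> fn);

    /** Count elements per key, similar in spirit to an RDD countByKey. */
    <I, K> Map<K, Long> countByKey(List<I> input, Function<I, K> keyFn);
}

/** In-memory stand-in for an engine-specific module (Spark, Flink, ...). */
class LocalEngine implements ComputeEngine {
    @Override
    public <I, O> List<O> map(List<I> input, Function<I, O> fn) {
        List<O> out = new ArrayList<>(input.size());
        for (I item : input) {
            out.add(fn.apply(item));
        }
        return out;
    }

    @Override
    public <I, K> Map<K, Long> countByKey(List<I> input, Function<I, K> keyFn) {
        Map<K, Long> counts = new LinkedHashMap<>();
        for (I item : input) {
            counts.merge(keyFn.apply(item), 1L, Long::sum);
        }
        return counts;
    }
}

/** Core-side logic written only against ComputeEngine, never against engine types. */
class WorkloadProfiler {
    private final ComputeEngine engine;

    WorkloadProfiler(ComputeEngine engine) {
        this.engine = engine;
    }

    /** Records are "partitionPath/recordKey" strings; returns the record count per partition path. */
    Map<String, Long> recordsPerPartition(List<String> records) {
        return engine.countByKey(records, r -> r.substring(0, r.indexOf('/')));
    }
}
```

Under this sketch, supporting another engine would mean adding another `ComputeEngine` implementation in its own module, while code such as `WorkloadProfiler` stays untouched. Whether operations like RDD caching and custom partitioning can be expressed behind such an interface is exactly the open question discussed above.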