+1 for Spark runners based on the different APIs (RDD/Dataset), and for keeping the Spark version as a deployment dependency.
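(To make the "deployment dependency" part concrete, a rough sketch only; the property name, artifact suffix and versions below are placeholders, not the actual runner pom:)

    <!-- Spark stays a provided dependency; the version is chosen at build/deploy time. -->
    <properties>
      <spark.version>1.6.3</spark.version> <!-- e.g. override with -Dspark.version=2.1.0 -->
    </properties>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId> <!-- the Scala suffix must match the cluster as well -->
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>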
The RDD API is stable and mature enough, so it makes sense to have it on master; the Dataset API still has some work to do, and from our own experience it has only just reached performance comparable to the RDD API. The community is clearly heading in the Dataset API direction, but the RDD API is still a viable option for most use cases.

Just one quick question: today on master, can we swap Spark 1.x for Spark 2.x, compile, and use the Spark runner?

Thanks,
Abbass

On 15/03/2017 17:57, Amit Sela wrote:
> So you're suggesting we copy-paste the current runner and adapt whatever is
> necessary so it runs with Spark 2 ?
> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.
>
> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API.
> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once the Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.
>
> That's why I believe we still need to leave things as they are, but start
> working on the Dataset API runner.
> Otherwise, we'll have the current runner, another RDD API runner with Spark
> 2, and a third one for the Dataset API. I don't want to maintain all of
> them. It's a mess.
>
> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>>> However, I do feel that we should use the Dataset API, starting with
>>> batch support first. WDYT ?
>>
>> Well, this is the exact current status quo, and it will take us some
>> time to have something as complete as what we have with the Spark 1
>> runner for Spark 2.
>>
>> The other proposal has two advantages:
>>
>> One is that we can leverage the existing implementation (with the
>> needed adjustments) to run Beam pipelines on Spark 2; in the end, final
>> users don't care so much whether pipelines are translated via RDD/DStream
>> or Dataset, they just want to know that with Beam they can run their
>> code in their favorite data processing framework.
>>
>> The other advantage is that we can base the work on the latest Spark
>> version and advance simultaneously on translators for both APIs, and
>> once we consider that the Dataset one is mature enough we can stop
>> maintaining the RDD one and make it the official one.
>>
>> The only missing piece is backporting new developments on the RDD-based
>> translator from the Spark 2 version into Spark 1, but maybe this won't
>> be so hard if we consider what you said, that at this point we are
>> getting closer to having streaming right (of course you are the most
>> appropriate person to decide if we are in sufficiently good shape to
>> make this, so backporting things won't be so hard).
>>
>> Finally, I agree with you; I would prefer a nice and full-featured
>> translator based on the Structured Streaming API, but the question is
>> how much time this will take to be in shape and the impact on final
>> users who are already requesting this. This is the reason why I think
>> the more conservative approach (keeping the RDD translator around) and
>> moving incrementally makes sense.
>>
>> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>> I feel that as we're getting closer to supporting streaming with the
>>> Spark 1 runner, and having Structured Streaming advance in Spark 2, we
>>> could start work on a Spark 2 runner in a separate branch.
>>>
>>> However, I do feel that we should use the Dataset API, starting with
>>> batch support first. WDYT ?
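To make the RDD-vs-Dataset point concrete, here is a rough, untested Java sketch of the same element-wise transform translated through both APIs; the class and method names are made up for illustration and are not the runner's actual translation code:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    public class TranslationSketch {

      // RDD-based translation of an element-wise transform (works on Spark 1.x and 2.x).
      static JavaRDD<String> viaRdd(JavaRDD<String> input) {
        return input.map(String::toUpperCase);
      }

      // Dataset-based translation of the same transform (Spark 2.x only);
      // the Encoder is the extra piece a Dataset translator has to thread through.
      static Dataset<String> viaDataset(Dataset<String> input) {
        return input.map((MapFunction<String, String>) String::toUpperCase, Encoders.STRING());
      }
    }

Either way the user-facing Beam pipeline is identical; only the underlying translation changes.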
>>>
>>> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>>> So you propose to have the Spark 2 branch a clone of the current one
>>>>> with adaptations around Context->Session, Accumulator->AccumulatorV2
>>>>> etc. while still using the RDD API ?
>>>>
>>>> Yes, this is exactly what I have in mind.
>>>>
>>>>> I think that having another Spark runner is great if it has value,
>>>>> otherwise, let's just bump the version.
>>>>
>>>> There is value because most people are already starting to move to
>>>> Spark 2 and all Big Data distribution providers support it now, as
>>>> well as the cloud-based distributions (Dataproc and EMR), unlike the
>>>> last time we had this discussion.
>>>>
>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2
>>>>> and follow with Dataset API support feature-by-feature as it advances,
>>>>> but I think most Spark installations today still run 1.X, or am I wrong ?
>>>>
>>>> No, you are right; that's why I didn't even mention removing the
>>>> Spark 1 runner. I know that having to support things for both versions
>>>> can add additional work for us, but maybe the best approach would be
>>>> to continue the work only in the Spark 2 runner (both refining the
>>>> RDD-based translator and starting to create the Dataset one there, so
>>>> that they co-exist until the Dataset API is mature enough) and keep the
>>>> Spark 1 runner only for bug-fixes for the users who are still using it
>>>> (this way we don't have to keep backporting stuff). Do you see any other
>>>> particular issue?
>>>>
>>>> Ismaël
>>>>
>>>> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>> So you propose to have the Spark 2 branch a clone of the current one
>>>>> with adaptations around Context->Session, Accumulator->AccumulatorV2
>>>>> etc. while still using the RDD API ?
>>>>>
>>>>> I think that having another Spark runner is great if it has value,
>>>>> otherwise, let's just bump the version.
>>>>> My idea of having another runner for Spark was not to support more
>>>>> versions - we should always support the most popular version in terms
>>>>> of compatibility - the idea was to try and make Beam work with
>>>>> Structured Streaming, which is still not fully mature, so that's why
>>>>> we're not heavily investing there.
>>>>>
>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2
>>>>> and follow with Dataset API support feature-by-feature as it advances,
>>>>> but I think most Spark installations today still run 1.X, or am I wrong ?
>>>>>
>>>>> On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>> BIG +1 JB,
>>>>>>
>>>>>> If we can just jump the version number with minor changes, staying as
>>>>>> close as possible to the current implementation for Spark 1, we can go
>>>>>> faster and offer in principle the exact same support but for version 2.
>>>>>>
>>>>>> I know that the advanced streaming stuff based on the Dataset API
>>>>>> won't be there, but with this common canvas the community can iterate
>>>>>> to create a Dataset-based translator at the same time. In particular,
>>>>>> I consider the most important thing is that the spark 2 branch should
>>>>>> not live for a long time; it should be merged into master really fast
>>>>>> for the benefit of everybody.
>>>>>>
>>>>>> Ismaël
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>> Hi Amit,
>>>>>>>
>>>>>>> What do you think of the following:
>>>>>>>
>>>>>>> - in the meantime, until you reintroduce the Spark 2 branch, what
>>>>>>> about "extending" the version in the current Spark runner ? Still
>>>>>>> using RDD/DStream, I think we can support Spark 2.x even if we don't
>>>>>>> yet leverage the new provided features.
>>>>>>>
>>>>>>> Thoughts ?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 03/15/2017 07:39 PM, Amit Sela wrote:
>>>>>>>> Hi Cody,
>>>>>>>>
>>>>>>>> I will re-introduce this branch soon as part of the work on BEAM-913
>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-913>.
>>>>>>>> For now, and from previous experience with the mentioned branch, the
>>>>>>>> batch implementation should be straightforward.
>>>>>>>> The only issue is with streaming support - in the current runner
>>>>>>>> (Spark 1.x) we have experimental support for windows/triggers and
>>>>>>>> we're working towards full streaming support.
>>>>>>>> With Spark 2.x, there is no "general-purpose" stateful operator for
>>>>>>>> the Dataset API, so I was waiting to see if the new operator
>>>>>>>> <https://github.com/apache/spark/pull/17179> planned for the next
>>>>>>>> version could help with that.
>>>>>>>>
>>>>>>>> To summarize, I will introduce a skeleton for the Spark 2 runner with
>>>>>>>> batch support as soon as I can as a separate branch.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Amit
>>>>>>>>
>>>>>>>> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>> Is there anybody who's currently working on a Spark 2.x runner? An
>>>>>>>>> old PR for the Spark 2.x runner was closed a few days ago, so I
>>>>>>>>> wonder what's the status now, and is there a roadmap for this?
>>>>>>>>> Thanks~
>>>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbono...@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
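For reference, a rough, untested Java sketch of the kind of Spark 1 -> Spark 2 adaptations discussed above (Context -> Session, Accumulator -> AccumulatorV2); the app name, accumulator name and class name are placeholders, not actual runner code:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.util.LongAccumulator;
    import java.util.Arrays;

    public class Spark2AdaptationSketch {
      public static void main(String[] args) {
        // Spark 2.x entry point: SparkSession instead of constructing a SparkContext directly.
        SparkSession session = SparkSession.builder()
            .master("local[2]")
            .appName("spark2-adaptation-sketch")
            .getOrCreate();

        // The Java RDD API is still available by wrapping the underlying SparkContext,
        // which is what keeps an RDD-based translator viable on Spark 2.x.
        JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

        // Spark 1.x used sc.accumulator(...); Spark 2.x uses AccumulatorV2 subclasses
        // such as LongAccumulator, registered on the SparkContext.
        LongAccumulator elementsSeen = session.sparkContext().longAccumulator("elementsSeen");

        jsc.parallelize(Arrays.asList(1, 2, 3))
            .foreach(x -> elementsSeen.add(1L));

        System.out.println("elements seen: " + elementsSeen.value());
        session.stop();
      }
    }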