I'm personally in favor of maintaining a single branch, e.g., spark-runner, which supports both Spark 1.6 & 2.1. Since there's currently no DataFrame support in the Spark 1.x runner, there should be no conflicts if we put two versions of Spark into one runner.

I'm also +1 for adding adapters in the branch to support both Spark versions. Also, we can have two translators, say, a 1.x translator which translates into RDDs & DStreams and a 2.x translator which translates into Datasets.
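To make the adapter idea concrete, here is a rough sketch of the kind of seam I have in mind (all names here are hypothetical, it only illustrates the shape; each class would live in its own file/module, imports shown once for brevity):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    // Hypothetical common entry point the shared RDD/DStream translation
    // code would program against.
    public interface SparkEnvironment {
      JavaSparkContext javaSparkContext();
      void stop();
    }

    // In the spark1 module (compiled against Spark 1.6.x):
    public class Spark1Environment implements SparkEnvironment {
      private final JavaSparkContext jsc;
      public Spark1Environment(SparkConf conf) {
        this.jsc = new JavaSparkContext(conf); // 1.x entry point
      }
      @Override public JavaSparkContext javaSparkContext() { return jsc; }
      @Override public void stop() { jsc.stop(); }
    }

    // In the spark2 module (compiled against Spark 2.x): SparkSession is
    // the new entry point, but the SparkContext is still reachable for
    // the RDD-based translator.
    public class Spark2Environment implements SparkEnvironment {
      private final SparkSession session;
      public Spark2Environment(SparkConf conf) {
        this.session = SparkSession.builder().config(conf).getOrCreate();
      }
      @Override public JavaSparkContext javaSparkContext() {
        return new JavaSparkContext(session.sparkContext());
      }
      @Override public void stop() { session.stop(); }
    }

The two implementations would each live in their own module, compiled against their own Spark version (selected through Maven profiles, as JB describes below), while the RDD/DStream translation code is shared and only ever sees the interface.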
On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi guys,
>
> sorry, due to the time zone shift, I answer a bit late ;)
>
> I think we can have the same runner dealing with the two major Spark versions, introducing some adapters. For instance, in CarbonData, we created some adapters to work with Spark 1.5, Spark 1.6 and Spark 2.1. The dependencies come from Maven profiles. Of course, it's easier there as it's more "user" code.
>
> My proposal is just that it's worth a try ;)
>
> I just created a branch to experiment a bit and have more details.
>
> Regards
> JB
>
> On 03/16/2017 02:31 AM, Amit Sela wrote:
>
>> I answered inline to Abbass' comment, but I think he hit something - how about we have a branch with those adaptations? Same RDD implementation, but depending on the latest 2.x version with the minimal changes required. I'd be happy to do that, or guide anyone who wants to (I did most of it on my branch for Spark 2 anyway), but since it's a branch and not on master (I don't believe it "deserves" a place on master), it would always be a bit behind since we would have to rebase and merge once in a while.
>>
>> How does that sound?
>>
>> On Wed, Mar 15, 2017 at 7:49 PM amarouni <amaro...@talend.com> wrote:
>>
>>> +1 for Spark runners based on different APIs (RDD/Dataset) and keeping the Spark versions as a deployment dependency.
>>>
>>> The RDD API is stable & mature enough, so it makes sense to have it on master; the Dataset API still has some work to do, and from our own experience it has only just reached performance comparable to the RDD API. The community is clearly heading in the Dataset API direction, but the RDD API is still a viable option for most use cases.
>>>
>>> Just one quick question: today on master, can we swap Spark 1.x for Spark 2.x and compile and use the Spark Runner?
>>>
>> Good question!
>> I think this is the root cause of this problem - Spark 2 not only introduced a new API, but also broke a few existing ones, such as: context is now session, Accumulators are AccumulatorV2 - and that's just what I recall right now.
>> I don't think it's too hard to adapt those, and anyone who wants to could see how I did it on my branch:
>> https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9
>>
>>> Thanks,
>>>
>>> Abbass,
>>>
>>> On 15/03/2017 17:57, Amit Sela wrote:
>>>
>>>> So you're suggesting we copy-paste the current runner and adapt whatever is necessary so it runs with Spark 2? This also means any bug-fix / improvement would have to be maintained in two runners, and I wouldn't wanna do that.
>>>>
>>>> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API. Since the RDD API is mature, it should be the runner in master (not preventing another runner once the Dataset API is mature enough), and the version (1.6.3 or 2.x) should be determined by the common installation.
>>>>
>>>> That's why I believe we still need to leave things as they are, but start working on the Dataset API runner. Otherwise, we'll have the current runner, another RDD API runner with Spark 2, and a third one for the Dataset API. I don't want to maintain all of them. It's a mess.
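Regarding the breaking changes Amit lists above (context is now session, Accumulators are AccumulatorV2) - those look adaptable behind small interfaces too. A rough sketch, again with made-up names, for the accumulator side:

    import java.io.Serializable;
    import org.apache.spark.Accumulator;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.util.LongAccumulator;

    // Hypothetical counter abstraction the shared runner code would use.
    public interface RunnerCounter extends Serializable {
      void add(long n);
      long value();
    }

    // spark1 module: backed by the old Accumulator API.
    public class Spark1Counter implements RunnerCounter {
      private final Accumulator<Double> acc;
      public Spark1Counter(JavaSparkContext jsc) {
        this.acc = jsc.doubleAccumulator(0.0); // 1.x accumulator
      }
      @Override public void add(long n) { acc.add((double) n); }
      @Override public long value() { return acc.value().longValue(); }
    }

    // spark2 module: backed by the AccumulatorV2-based LongAccumulator.
    public class Spark2Counter implements RunnerCounter {
      private final LongAccumulator acc;
      public Spark2Counter(SparkSession session) {
        this.acc = session.sparkContext().longAccumulator();
      }
      @Override public void add(long n) { acc.add(n); }
      @Override public long value() { return acc.sum(); }
    }

Which implementation the translators get would again be decided by the module/profile in use.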
>>>>
>>>> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>
>>>>>> However, I do feel that we should use the Dataset API, starting with batch support first. WDYT?
>>>>>
>>>>> Well, this is the exact current status quo, and it will take us some time to have something for Spark 2 as complete as what we have with the Spark 1 runner.
>>>>>
>>>>> The other proposal has two advantages:
>>>>>
>>>>> One is that we can leverage the existing implementation (with the needed adjustments) to run Beam pipelines on Spark 2; in the end, final users don't care so much whether pipelines are translated via RDD/DStream or Dataset, they just want to know that with Beam they can run their code in their favorite data processing framework.
>>>>>
>>>>> The other advantage is that we can base the work on the latest Spark version and advance simultaneously in translators for both APIs, and once we consider the Dataset one mature enough we can stop maintaining the RDD one and make the Dataset translator the official one.
>>>>>
>>>>> The only missing piece is backporting new developments on the RDD-based translator from the Spark 2 version into Spark 1, but maybe this won't be so hard if we consider what you said, that at this point we are getting close to having streaming right (of course you are the most appropriate person to decide if we are in sufficiently good shape to do this, so that backporting things won't be so hard).
>>>>>
>>>>> Finally, I agree with you, I would prefer a nice and full-featured translator based on the Structured Streaming API, but the question is how much time this will take to be in shape, and the impact on final users who are already requesting this. This is the reason why I think the more conservative approach (keeping around the RDD translator) and moving incrementally makes sense.
>>>>>
>>>>> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>
>>>>>> I feel that as we're getting closer to supporting streaming with the Spark 1 runner, and having Structured Streaming advance in Spark 2, we could start work on a Spark 2 runner in a separate branch.
>>>>>>
>>>>>> However, I do feel that we should use the Dataset API, starting with batch support first. WDYT?
>>>>>>
>>>>>> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>
>>>>>>>> So you propose to have the Spark 2 branch a clone of the current one with adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while still using the RDD API?
>>>>>>>
>>>>>>> Yes, this is exactly what I have in mind.
>>>>>>>
>>>>>>>> I think that having another Spark runner is great if it has value, otherwise, let's just bump the version.
>>>>>>>
>>>>>>> There is value because most people are already starting to move to Spark 2, and all Big Data distribution providers support it now, as well as the Cloud-based distributions (Dataproc and EMR), unlike the last time we had this discussion.
>>>>>>>
>>>>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2 and follow with Dataset API support feature-by-feature as it advances, but I think most Spark installations today still run 1.X, or am I wrong?
>>>>>>>
>>>>>>> No, you are right; that's why I didn't even mention removing the Spark 1 runner. I know that having to support things for both versions can add additional work for us, but maybe the best approach would be to continue the work only in the Spark 2 runner (both refining the RDD-based translator and starting to create the Dataset one there, so they co-exist until the Dataset API is mature enough) and keep the Spark 1 runner only for bug-fixes for the users who are still using it (like this we don't have to keep backporting stuff). Do you see any other particular issue?
>>>>>>>
>>>>>>> Ismaël
>>>>>>>
>>>>>>> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So you propose to have the Spark 2 branch a clone of the current one with adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while still using the RDD API?
>>>>>>>>
>>>>>>>> I think that having another Spark runner is great if it has value, otherwise, let's just bump the version. My idea of having another runner for Spark was not to support more versions - we should always support the most popular version in terms of compatibility - the idea was to try and make Beam work with Structured Streaming, which is still not fully mature, so that's why we're not heavily investing there.
>>>>>>>>
>>>>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2 and follow with Dataset API support feature-by-feature as it advances, but I think most Spark installations today still run 1.X, or am I wrong?
>>>>>>>>
>>>>>>>> On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> BIG +1 JB,
>>>>>>>>>
>>>>>>>>> If we can just bump the version number with minor changes, staying as close as possible to the current implementation for Spark 1, we can go faster and offer in principle the exact same support, but for version 2.
>>>>>>>>>
>>>>>>>>> I know that the advanced streaming stuff based on the Dataset API won't be there, but with this common canvas the community can iterate to create a Dataset-based translator at the same time. In particular, I consider the most important thing is that the Spark 2 branch should not live for a long time; it should be merged into master really fast for the benefit of everybody.
>>>>>>>>>
>>>>>>>>> Ismaël
>>>>>>>>>
>>>>>>>>> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Amit,
>>>>>>>>>>
>>>>>>>>>> What do you think of the following:
>>>>>>>>>>
>>>>>>>>>> - in the meantime, while you reintroduce the Spark 2 branch, what about "extending" the version in the current Spark runner?
>>>>>>>>>> Still using RDD/DStream, I think we can support Spark 2.x even if we don't yet leverage the newly provided features.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On 03/15/2017 07:39 PM, Amit Sela wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Cody,
>>>>>>>>>>>
>>>>>>>>>>> I will re-introduce this branch soon as part of the work on BEAM-913 <https://issues.apache.org/jira/browse/BEAM-913>.
>>>>>>>>>>> For now, and from previous experience with the mentioned branch, the batch implementation should be straightforward.
>>>>>>>>>>> The only issue is with streaming support - in the current runner (Spark 1.x) we have experimental support for windows/triggers, and we're working towards full streaming support.
>>>>>>>>>>> With Spark 2.x, there is no "general-purpose" stateful operator for the Dataset API, so I was waiting to see if the new operator <https://github.com/apache/spark/pull/17179> planned for the next version could help with that.
>>>>>>>>>>>
>>>>>>>>>>> To summarize, I will introduce a skeleton for the Spark 2 runner with batch support, as soon as I can, as a separate branch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Amit
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>> Is there anybody currently working on a Spark 2.x runner? An old PR for the Spark 2.x runner was closed a few days ago, so I wonder what's the status now, and is there a roadmap for this?
>>>>>>>>>>>> Thanks~
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>> jbono...@apache.org
>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com