I feel that, as we're getting closer to supporting streaming in the Spark 1 runner, and with Structured Streaming advancing in Spark 2, we could start work on a Spark 2 runner in a separate branch.
However, I do feel that we should use the Dataset API, starting with batch support first. WDYT?

On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:

> > So you propose to have the Spark 2 branch a clone of the current one with
> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> > still using the RDD API ?
>
> Yes, this is exactly what I have in mind.
>
> > I think that having another Spark runner is great if it has value,
> > otherwise, let's just bump the version.
>
> There is value because most people are already starting to move to
> Spark 2, and all Big Data distribution providers support it now, as
> well as the cloud-based distributions (Dataproc and EMR), unlike the
> last time we had this discussion.
>
> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
> > follow with Dataset API support feature-by-feature as it advances, but I
> > think most Spark installations today still run 1.X, or am I wrong ?
>
> No, you are right; that's why I didn't even mention removing the
> Spark 1 runner. I know that having to support things for both versions
> can add additional work for us, but maybe the best approach would be
> to continue the work only in the Spark 2 runner (both refining the
> RDD-based translator and starting to create the Dataset one there, so
> the two co-exist until the Dataset API is mature enough) and keep the
> Spark 1 runner only for bug-fixes for the users who are still using it
> (that way we don't have to keep backporting stuff). Do you see any
> other particular issue?
>
> Ismaël
>
> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
> > So you propose to have the Spark 2 branch a clone of the current one with
> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> > still using the RDD API ?
> >
> > I think that having another Spark runner is great if it has value,
> > otherwise, let's just bump the version.
> > My idea of having another runner for Spark was not to support more versions
> > - we should always support the most popular version in terms of
> > compatibility - the idea was to try and make Beam work with Structured
> > Streaming, which is still not fully mature, so that's why we're not
> > heavily investing there.
> >
> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
> > follow with Dataset API support feature-by-feature as it advances, but I
> > think most Spark installations today still run 1.X, or am I wrong ?
> >
> > On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> >> BIG +1 JB,
> >>
> >> If we can just jump the version number with minor changes, staying as
> >> close as possible to the current implementation for Spark 1, we can go
> >> faster and offer in principle the exact same support, but for version 2.
> >>
> >> I know that the advanced streaming stuff based on the Dataset API
> >> won't be there, but with this common canvas the community can iterate
> >> to create a Dataset-based translator at the same time. In particular, I
> >> consider the most important thing is that the Spark 2 branch should
> >> not live for a long time; it should be merged into master really fast
> >> for the benefit of everybody.
> >>
> >> Ismaël
> >>
> >>
> >> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >> > Hi Amit,
> >> >
> >> > What do you think of the following:
> >> >
> >> > - in the meantime, while you reintroduce the Spark 2 branch, what about
> >> > "extending" the version in the current Spark runner ? Still using
> >> > RDD/DStream, I think we can support Spark 2.x even if we don't yet
> >> > leverage the new provided features.
> >> >
> >> > Thoughts ?
> >> >
> >> > Regards
> >> > JB
> >> >
> >> >
> >> > On 03/15/2017 07:39 PM, Amit Sela wrote:
> >> >>
> >> >> Hi Cody,
> >> >>
> >> >> I will re-introduce this branch soon as part of the work on BEAM-913
> >> >> <https://issues.apache.org/jira/browse/BEAM-913>.
> >> >> For now, and from previous experience with the mentioned branch, the
> >> >> batch implementation should be straightforward.
> >> >> The only issue is with streaming support - in the current runner
> >> >> (Spark 1.x) we have experimental support for windows/triggers, and
> >> >> we're working towards full streaming support.
> >> >> With Spark 2.x there is no "general-purpose" stateful operator for the
> >> >> Dataset API, so I was waiting to see if the new operator
> >> >> <https://github.com/apache/spark/pull/17179> planned for the next
> >> >> version could help with that.
> >> >>
> >> >> To summarize, I will introduce a skeleton for the Spark 2 runner with
> >> >> batch support, as a separate branch, as soon as I can.
> >> >>
> >> >> Thanks,
> >> >> Amit
> >> >>
> >> >> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> Hi guys,
> >> >>> Is there anybody currently working on a Spark 2.x runner? An old PR
> >> >>> for the Spark 2.x runner was closed a few days ago, so I wonder what
> >> >>> the status is now, and is there a roadmap for this?
> >> >>> Thanks~
> >> >>>
> >> >>
> >> >
> >> > --
> >> > Jean-Baptiste Onofré
> >> > jbono...@apache.org
> >> > http://blog.nanthrax.net
> >> > Talend - http://www.talend.com
> >> >