I feel that, as we're getting closer to supporting streaming in the Spark 1 runner, and with Structured Streaming advancing in Spark 2, we could start work on a Spark 2 runner in a separate branch.
However, I do feel that we should use the Dataset API, starting with batch support first. WDYT?

On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:

> > So you propose to have the Spark 2 branch a clone of the current one with
> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> > still using the RDD API ?
>
> Yes, this is exactly what I have in mind.
>
> > I think that having another Spark runner is great if it has value,
> > otherwise, let's just bump the version.
>
> There is value because most people are already starting to move to
> Spark 2, and all Big Data distribution providers support it now, as
> well as the cloud-based distributions (Dataproc and EMR), unlike the
> last time we had this discussion.
>
> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
> > follow with Dataset API support feature-by-feature as it advances, but I
> > think most Spark installations today still run 1.X, or am I wrong ?
>
> No, you are right; that's why I didn't even mention removing the
> Spark 1 runner. I know that having to support things for both versions
> can add additional work for us, but maybe the best approach would be
> to continue the work only in the Spark 2 runner (both refining the
> RDD-based translator and starting to create the Dataset one there, so
> the two co-exist until the Dataset API is mature enough) and keep the
> Spark 1 runner only for bug-fixes for the users who are still using it
> (that way we don't have to keep backporting stuff). Do you see any
> other particular issue?
>
> Ismaël
>
> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
> > So you propose to have the Spark 2 branch a clone of the current one with
> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> > still using the RDD API ?
> >
> > I think that having another Spark runner is great if it has value,
> > otherwise, let's just bump the version.
> > My idea of having another runner for Spark was not to support more versions
> > - we should always support the most popular version in terms of
> > compatibility - the idea was to try and make Beam work with Structured
> > Streaming, which is still not fully mature, so that's why we're not
> > heavily investing there.
> >
> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
> > follow with Dataset API support feature-by-feature as it advances, but I
> > think most Spark installations today still run 1.X, or am I wrong ?
> >
> > On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> >> BIG +1 JB,
> >>
> >> If we can just jump the version number with minor changes, staying as
> >> close as possible to the current implementation for Spark 1, we can go
> >> faster and offer in principle the exact same support, but for version 2.
> >>
> >> I know that the advanced streaming stuff based on the Dataset API
> >> won't be there, but with this common canvas the community can iterate
> >> to create a Dataset-based translator at the same time. In particular, I
> >> consider the most important thing is that the Spark 2 branch should
> >> not live for a long time; it should be merged into master really fast
> >> for the benefit of everybody.
> >>
> >> Ismaël
> >>
> >>
> >> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >> > Hi Amit,
> >> >
> >> > What do you think of the following:
> >> >
> >> > - in the meantime, while you reintroduce the Spark 2 branch, what about
> >> > "extending" the version in the current Spark runner ? Still using
> >> > RDD/DStream, I think we can support Spark 2.x even if we don't yet
> >> > leverage the new provided features.
> >> >
> >> > Thoughts ?
> >> >
> >> > Regards
> >> > JB
> >> >
> >> >
> >> > On 03/15/2017 07:39 PM, Amit Sela wrote:
> >> >>
> >> >> Hi Cody,
> >> >>
> >> >> I will re-introduce this branch soon as part of the work on BEAM-913
> >> >> <https://issues.apache.org/jira/browse/BEAM-913>.
> >> >> For now, and from previous experience with the mentioned branch, the
> >> >> batch implementation should be straightforward.
> >> >> The only issue is with streaming support - in the current runner
> >> >> (Spark 1.x) we have experimental support for windows/triggers, and
> >> >> we're working towards full streaming support.
> >> >> With Spark 2.x there is no "general-purpose" stateful operator for the
> >> >> Dataset API, so I was waiting to see if the new operator
> >> >> <https://github.com/apache/spark/pull/17179> planned for the next
> >> >> version could help with that.
> >> >>
> >> >> To summarize, I will introduce a skeleton for the Spark 2 runner with
> >> >> batch support, as a separate branch, as soon as I can.
> >> >>
> >> >> Thanks,
> >> >> Amit
> >> >>
> >> >> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> Hi guys,
> >> >>> Is there anybody currently working on a Spark 2.x runner? An old PR
> >> >>> for the Spark 2.x runner was closed a few days ago, so I wonder what
> >> >>> the status is now, and is there a roadmap for this?
> >> >>> Thanks~
> >> >>>
> >> >>
> >> >
> >> > --
> >> > Jean-Baptiste Onofré
> >> > jbono...@apache.org
> >> > http://blog.nanthrax.net
> >> > Talend - http://www.talend.com
> >> >