Hi guys,

Yes, I started to experiment with the profiles a bit, and Amit and I plan to discuss it over the weekend.

Give me some time to move forward a bit and I will get back to you with more details.

Regards
JB

On 03/16/2017 05:15 PM, amarouni wrote:
Yeah, maintaining 2 RDD branches (master + a 2.x branch) is doable but will add more maintenance/merge work.

The Maven profiles solution is worth investigating, with Spark 1.6 RDD as the default profile and an additional Spark 2.x profile.

As JB mentioned CarbonData, I had a quick look and it looks like a good solution:
https://github.com/apache/incubator-carbondata/blob/master/pom.xml#L347

What do you think ?

Abbass,

On 16/03/2017 07:00, Cody Innowhere wrote:
I'm personally in favor of maintaining a single branch, e.g. spark-runner, which supports both Spark 1.6 & 2.1.
Since there's currently no DataFrame support in the Spark 1.x runner, there should be no conflicts if we put two versions of Spark into one runner.

I'm also +1 for adding adapters in the branch to support both Spark
versions.

Also, we can have two translators, say, a 1.x translator which translates into RDDs & DStreams and a 2.x translator which translates into Datasets.
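
To make the idea concrete, here is a very rough sketch of what that split could look like (all names below are invented for illustration, none of this is existing Beam code):

    // Hypothetical sketch only: none of these types exist in Beam today.
    import org.apache.beam.sdk.Pipeline;

    public class SparkTranslators {

      /** Common contract both version-specific translators would implement. */
      interface PipelineTranslator {
        void translate(Pipeline pipeline);
      }

      /** Spark 1.x: translate PTransforms into RDD / DStream operations. */
      static class Rdd1xTranslator implements PipelineTranslator {
        @Override
        public void translate(Pipeline pipeline) {
          // RDD / DStream translation would go here.
        }
      }

      /** Spark 2.x: translate PTransforms into Dataset operations. */
      static class Dataset2xTranslator implements PipelineTranslator {
        @Override
        public void translate(Pipeline pipeline) {
          // Dataset translation would go here.
        }
      }

      /** Pick the translator matching the Spark version available at runtime. */
      static PipelineTranslator forSparkVersion(String sparkVersion) {
        return sparkVersion.startsWith("2.") ? new Dataset2xTranslator() : new Rdd1xTranslator();
      }
    }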

On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi guys,

sorry, due to the time zone shift, I'm answering a bit late ;)

I think we can have the same runner dealing with the two major Spark versions by introducing some adapters. For instance, in CarbonData, we created some adapters to work with Spark 1.5, Spark 1.6 and Spark 2.1. The dependencies come from Maven profiles. Of course, it's easier there as it's more "user" code.

My proposal is just that it's worth trying ;)

I just created a branch to experiment a bit and have more details.

Regards
JB


On 03/16/2017 02:31 AM, Amit Sela wrote:

I answered Abbass' comment inline, but I think he hit on something - how about we have a branch with those adaptations? The same RDD implementation, but depending on the latest 2.x version, with the minimal changes required.
I'd be happy to do that, or guide anyone who wants to (I did most of it on my branch for Spark 2 anyway), but since it's a branch and not on master (I don't believe it "deserves" a place on master), it would always be a bit behind, since we would have to rebase and merge once in a while.

How does that sound ?

On Wed, Mar 15, 2017 at 7:49 PM amarouni <amaro...@talend.com> wrote:

+1 for Spark runners based on the different APIs (RDD/Dataset) and keeping the Spark version as a deployment dependency.

The RDD API is stable & mature enough, so it makes sense to have it on master; the Dataset API still has some work to do, and from our own experience it has only just reached performance comparable to the RDD API. The community is clearly heading in the Dataset API direction, but the RDD API is still a viable option for most use cases.

Just one quick question: today on master, can we swap Spark 1.x for Spark 2.x and still compile and use the Spark runner?

Good question!
I think this is the root cause of this problem - Spark 2 not only introduced a new API, but also broke a few existing ones: the context is now a session, Accumulators are now AccumulatorV2, and that's what I recall right now.
I don't think it's too hard to adapt those, and anyone who wants to can see how I did it on my branch:
https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9
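
To give an idea of the shape of that change (a rough illustration only, not code taken from the branch): in Spark 1.x you get an Accumulator straight from the context, while in 2.x you subclass AccumulatorV2 and register it explicitly.

    // Rough illustration of the Accumulator -> AccumulatorV2 break; not code from the branch.
    import org.apache.spark.util.AccumulatorV2;

    /** Spark 2.x style: a simple long-counting accumulator. */
    class CounterAccumulator extends AccumulatorV2<Long, Long> {
      private long count = 0L;

      @Override public boolean isZero() { return count == 0L; }
      @Override public AccumulatorV2<Long, Long> copy() {
        CounterAccumulator copy = new CounterAccumulator();
        copy.count = this.count;
        return copy;
      }
      @Override public void reset() { count = 0L; }
      @Override public void add(Long v) { count += v; }
      @Override public void merge(AccumulatorV2<Long, Long> other) { count += other.value(); }
      @Override public Long value() { return count; }
    }

    // Registration also changes:
    //   Spark 1.x:  Accumulator<Integer> acc = jsc.accumulator(0);
    //   Spark 2.x:  CounterAccumulator acc = new CounterAccumulator();
    //               sparkContext.register(acc, "my-counter");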



Thanks,

Abbass,


On 15/03/2017 17:57, Amit Sela wrote:

So you're suggesting we copy-paste the current runner and adapt whatever is necessary so it runs with Spark 2?
This also means any bug-fix / improvement would have to be maintained in two runners, and I wouldn't want to do that.

I don't like to think in terms of Spark 1/2 but in terms of the RDD/Dataset API. Since the RDD API is mature, it should be the runner in master (not preventing another runner once the Dataset API is mature enough), and the version (1.6.3 or 2.x) should be determined by the common installation.

That's why I believe we still need to leave things as they are, but start working on the Dataset API runner.
Otherwise, we'll have the current runner, another RDD API runner with Spark 2, and a third one for the Dataset API. I don't want to maintain all of them. It's a mess.

On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía <ieme...@gmail.com> wrote:

However, I do feel that we should use the Dataset API, starting with batch support first. WDYT ?

Well, this is exactly the current status quo, and it will take us some time to have something for Spark 2 as complete as what we have with the Spark 1 runner.

The other proposal has two advantages:

One is that we can leverage the existing implementation (with the needed adjustments) to run Beam pipelines on Spark 2; in the end, final users don't care so much whether pipelines are translated via RDD/DStream or Dataset, they just want to know that with Beam they can run their code on their favorite data processing framework.

The other advantage is that we can base the work on the latest Spark version and advance simultaneously on translators for both APIs, and once we consider the Dataset one mature enough we can stop maintaining the RDD one and make the Dataset translator the official one.

The only missing piece is backporting new developments on the RDD-based translator from the Spark 2 version into the Spark 1 one, but maybe this won't be so hard if we consider what you said, that at this point we are getting closer to getting streaming right (of course you are the most appropriate person to decide whether we are in good enough shape for this, so that backporting things won't be so hard).

Finally, I agree with you: I would prefer a nice, full-featured translator based on the Structured Streaming API, but the question is how much time this will take to be in shape, and the impact on final users who are already requesting this. This is the reason why I think the more conservative approach (keeping the RDD translator around) and moving incrementally makes sense.

On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela <amitsel...@gmail.com> wrote:
I feel that as we're getting closer to supporting streaming with the Spark 1 runner, and having Structured Streaming advance in Spark 2, we could start work on a Spark 2 runner in a separate branch.

However, I do feel that we should use the Dataset API, starting with batch support first. WDYT ?

On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:
So you propose to have the Spark 2 branch be a clone of the current one with adaptations around Context->Session, Accumulator->AccumulatorV2, etc., while still using the RDD API ?

Yes this is exactly what I have in mind.

I think that having another Spark runner is great if it has value,
otherwise, let's just bump the version.

There is value, because most people are already starting to move to Spark 2 and all Big Data distribution providers support it now, as well as the cloud-based distributions (Dataproc and EMR), unlike the last time we had this discussion.

We could think of starting to migrate the Spark 1 runner to Spark 2 and follow with Dataset API support feature-by-feature as it advances, but I think most Spark installations today still run 1.X, or am I wrong ?

No, you are right; that's why I didn't even mention removing the Spark 1 runner. I know that having to support things for both versions can add additional work for us, but maybe the best approach would be to continue the work only in the Spark 2 runner (both refining the RDD-based translator and starting to create the Dataset one there, so they co-exist until the Dataset API is mature enough) and keep the Spark 1 runner only for bug fixes for the users who are still using it (this way we don't have to keep backporting stuff). Do you see any other particular issue?

Ismaël

On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
So you propose to have the Spark 2 branch be a clone of the current one with adaptations around Context->Session, Accumulator->AccumulatorV2, etc., while still using the RDD API ?

I think that having another Spark runner is great if it has value; otherwise, let's just bump the version.
My idea of having another runner for Spark was not to support more versions - we should always support the most popular version in terms of compatibility - the idea was to try and make Beam work with Structured Streaming, which is still not fully mature, so that's why we're not heavily investing there.

We could think of starting to migrate the Spark 1 runner to Spark 2 and follow with Dataset API support feature-by-feature as it advances, but I think most Spark installations today still run 1.X, or am I wrong ?
On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
BIG +1 JB,
If we can just bump the version number with minor changes, staying as close as possible to the current implementation for Spark 1, we can go faster and offer in principle the exact same support, but for version 2.

I know that the advanced streaming stuff based on the Dataset API won't be there, but with this common canvas the community can iterate to create a Dataset-based translator at the same time. In particular, I consider the most important thing is that the Spark 2 branch should not live for a long time; it should be merged into master really fast for the benefit of everybody.
Ismaël


On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi Amit,

What do you think of the following:

- in the meantime, while you reintroduce the Spark 2 branch, what about "extending" the version in the current Spark runner ? Still using RDD/DStream, I think we can support Spark 2.x even if we don't yet leverage the new features it provides.

Thoughts ?

Regards
JB


On 03/15/2017 07:39 PM, Amit Sela wrote:

Hi Cody,

I will re-introduce this branch soon as part of the work on BEAM-913 <https://issues.apache.org/jira/browse/BEAM-913>.
For now, and from previous experience with the mentioned branch, the batch implementation should be straightforward.
The only issue is with streaming support - in the current runner (Spark 1.x) we have experimental support for windows/triggers and we're working towards full streaming support.
With Spark 2.x, there is no "general-purpose" stateful operator for the Dataset API, so I was waiting to see if the new operator <https://github.com/apache/spark/pull/17179> planned for the next version could help with that.
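
For reference, assuming that operator lands roughly as proposed in the PR, the shape we could build on would be something like this (a hedged sketch; names and signatures may differ in the released API):

    // Sketch of a per-key running count kept across micro-batches with the
    // proposed stateful operator (API as proposed at the time, subject to change).
    import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.KeyValueGroupedDataset;
    import org.apache.spark.sql.streaming.GroupState;

    public class StatefulCountSketch {
      /** Keep a running count per key across triggers. */
      static Dataset<Long> runningCounts(KeyValueGroupedDataset<String, String> grouped) {
        MapGroupsWithStateFunction<String, String, Long, Long> countFn =
            (key, values, state) -> {
              long count = state.exists() ? state.get() : 0L;
              while (values.hasNext()) {
                values.next();
                count++;
              }
              state.update(count); // carried over to the next trigger
              return count;
            };
        return grouped.mapGroupsWithState(countFn, Encoders.LONG(), Encoders.LONG());
      }
    }

Something in that spirit is roughly what we'd need to back Beam's stateful processing on the Dataset API.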

To summarize, I will introduce a skeleton for the Spark 2 runner with batch support as soon as I can, as a separate branch.

Thanks,
Amit

On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com> wrote:
Hi guys,
Is there anybody who's currently working on the Spark 2.x runner? An old PR for the Spark 2.x runner was closed a few days ago, so I wonder what the status is now, and is there a roadmap for this?
Thanks~

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
