Yeah, maintaining two RDD branches (master + a 2.x branch) is doable, but it
will add extra merge/maintenance work.

The Maven profiles solution is worth investigating, with the Spark 1.6 RDD
runner as the default profile and an additional Spark 2.x profile.

Since JB mentioned CarbonData, I had a quick look and it looks like a good
solution:
https://github.com/apache/incubator-carbondata/blob/master/pom.xml#L347
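To make the adapter idea more concrete, here is a rough sketch of what a
profile-selected adapter could look like for the context -> session and
Accumulator -> AccumulatorV2 changes Amit mentions below. All the names here
are hypothetical (this is not code from CarbonData or Beam), and each
implementation would live in a profile-specific source folder so that it
only compiles against its own spark.version:

public interface MetricsAdapter {
  // Version-agnostic counter the translator codes against.
  void inc(long n);
  long get();
}

// Spark 1.x flavor (default spark-1.6 profile): plain JavaSparkContext and
// the old Accumulator API.
class Spark1MetricsAdapter implements MetricsAdapter {
  private final org.apache.spark.Accumulator<Integer> acc;

  Spark1MetricsAdapter(org.apache.spark.api.java.JavaSparkContext jsc) {
    this.acc = jsc.accumulator(0);
  }

  @Override public void inc(long n) { acc.add((int) n); }
  @Override public long get() { return acc.value().longValue(); }
}

// Spark 2.x flavor (spark-2.x profile): SparkSession wraps the context and
// accumulators are AccumulatorV2.
class Spark2MetricsAdapter implements MetricsAdapter {
  private final org.apache.spark.util.LongAccumulator acc;

  Spark2MetricsAdapter(org.apache.spark.sql.SparkSession session) {
    this.acc = session.sparkContext().longAccumulator("beam-metrics");
  }

  @Override public void inc(long n) { acc.add(n); }
  @Override public long get() { return acc.value(); }
}

The translator would then only ever see MetricsAdapter, and the Maven
profile would decide which implementation (and which spark.version) ends up
in the build.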

What do you think ?

Abbass,

On 16/03/2017 07:00, Cody Innowhere wrote:
> I'm personally in favor of maintaining one single branch, e.g.,
> spark-runner, which supports both Spark 1.6 & 2.1.
> Since there's currently no DataFrame support in the Spark 1.x runner, there
> should be no conflicts if we put the two versions of Spark into one runner.
>
> I'm also +1 for adding adapters in the branch to support both Spark
> versions.
>
> Also, we can have two translators: a 1.x translator which translates into
> RDDs & DStreams, and a 2.x translator which translates into Datasets.
>
> On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi guys,
>>
>> sorry, due to the time zone shift, I'm answering a bit late ;)
>>
>> I think we can have the same runner dealing with the two major Spark
>> versions by introducing some adapters. For instance, in CarbonData, we
>> created some adapters to work with Spark 1.5, Spark 1.6 and Spark 2.1. The
>> dependencies come from Maven profiles. Of course, it's easier there as it's
>> more "user" code.
>>
>> My proposal is just that it's worth a try ;)
>>
>> I just created a branch to experiment a bit and have more details.
>>
>> Regards
>> JB
>>
>>
>> On 03/16/2017 02:31 AM, Amit Sela wrote:
>>
>>> I answered inline to Abbass' comment, but I think he hit on something -
>>> how about we have a branch with those adaptations ? The same RDD
>>> implementation, but depending on the latest 2.x version, with the minimal
>>> changes required.
>>> I'd be happy to do that, or to guide anyone who wants to (I did most of it
>>> on my branch for Spark 2 anyway), but since it's a branch and not on master
>>> (I don't believe it "deserves" a place on master), it would always be a bit
>>> behind, since we would have to rebase and merge once in a while.
>>>
>>> How does that sound ?
>>>
>>> On Wed, Mar 15, 2017 at 7:49 PM amarouni <amaro...@talend.com> wrote:
>>>
>>>> +1 for Spark runners based on the different APIs (RDD/Dataset) and
>>>> keeping the Spark version as a deployment dependency.
>>>>
>>>> The RDD API is stable & mature enough, so it makes sense to have it on
>>>> master; the Dataset API still has some work to do and, from our own
>>>> experience, it has only just reached performance comparable to the RDD
>>>> API. The community is clearly heading in the Dataset API direction, but
>>>> the RDD API is still a viable option for most use cases.
>>>>
>>>> Just one quick question: today on master, can we swap Spark 1.x for
>>>> Spark 2.x and still compile and use the Spark runner ?
>>>>
>>> Good question!
>>> I think this is the root cause of this problem - Spark 2 not only
>>> introduced a new API, but also broke a few existing ones: the context is
>>> now a session, and Accumulators are now AccumulatorV2 - that's what I
>>> recall right now.
>>> I don't think it's too hard to adapt to those, and anyone who wants to
>>> can see how I did it on my branch:
>>> https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9
>>>
>>>
>>>
>>>> Thanks,
>>>>
>>>> Abbass,
>>>>
>>>>
>>>> On 15/03/2017 17:57, Amit Sela wrote:
>>>>
>>>>> So you're suggesting we copy-paste the current runner and adapt
>>>>> whatever is necessary so it runs with Spark 2 ?
>>>>> This also means any bug-fix / improvement would have to be maintained
>>>>> in two runners, and I wouldn't want to do that.
>>>>>
>>>>> I don't like to think in terms of Spark 1/2 but in terms of the
>>>>> RDD/Dataset API.
>>>>> Since the RDD API is mature, it should be the runner in master (not
>>>>> preventing another runner once the Dataset API is mature enough) and the
>>>>> version (1.6.3 or 2.x) should be determined by the common installation.
>>>>>
>>>>> That's why I believe we still need to leave things as they are, but
>>>>> start working on the Dataset API runner.
>>>>> Otherwise, we'll have the current runner, another RDD API runner with
>>>>> Spark 2, and a third one for the Dataset API. I don't want to maintain
>>>>> all of them. It's a mess.
>>>>>
>>>>> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>
>>>>>>> However, I do feel that we should use the Dataset API, starting with
>>>>>>> batch support first. WDYT ?
>>>>>>
>>>>>> Well, this is the exact current status quo, and it will take us some
>>>>>> time to have something as complete as what we have with the Spark 1
>>>>>> runner for Spark 2.
>>>>>>
>>>>>> The other proposal has two advantages:
>>>>>>
>>>>>> One is that we can leverage the existing implementation (with the
>>>>>> needed adjustments) to run Beam pipelines on Spark 2; in the end, final
>>>>>> users don't care so much whether pipelines are translated via
>>>>>> RDD/DStream or Dataset, they just want to know that with Beam they can
>>>>>> run their code on their favorite data processing framework.
>>>>>>
>>>>>> The other advantage is that we can base the work on the latest Spark
>>>>>> version and advance simultaneously on translators for both APIs, and
>>>>>> once we consider the Dataset one mature enough we can stop maintaining
>>>>>> the RDD one and make the Dataset translator the official one.
>>>>>>
>>>>>> The only missing piece is backporting new developments on the
>>>>>> RDD-based translator from the Spark 2 version into Spark 1, but maybe
>>>>>> this won't be so hard if we consider what you said, that at this point
>>>>>> we are getting closer to having streaming right (and of course you are
>>>>>> the most appropriate person to decide if we are in sufficiently good
>>>>>> shape to do this).
>>>>>>
>>>>>> Finally, I agree with you: I would prefer a nice and full-featured
>>>>>> translator based on the Structured Streaming API, but the question is
>>>>>> how much time this will take to be in shape, and the impact on final
>>>>>> users who are already requesting this. This is the reason why I think
>>>>>> the more conservative approach (keeping the RDD translator around) and
>>>>>> moving incrementally makes sense.
>>>>>>
>>>>>> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>
>>>>>>> I feel that as we're getting closer to supporting streaming with the
>>>>>>> Spark 1 runner, and having Structured Streaming advance in Spark 2, we
>>>>>>> could start work on a Spark 2 runner in a separate branch.
>>>>>>>
>>>>>>> However, I do feel that we should use the Dataset API, starting with
>>>>>>> batch support first. WDYT ?
>>>>>>>
>>>>>>> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>>
>>>>>>>>> So you propose to have the Spark 2 branch a clone of the current one
>>>>>>>>> with adaptations around Context->Session, Accumulator->AccumulatorV2
>>>>>>>>> etc. while still using the RDD API ?
>>>>>>>>
>>>>>>>> Yes this is exactly what I have in mind.
>>>>>>>>
>>>>>>>>> I think that having another Spark runner is great if it has value,
>>>>>>>>> otherwise, let's just bump the version.
>>>>>>>>
>>>>>>>> There is value because most people are already starting to move to
>>>>>>>> Spark 2, and all the Big Data distribution providers support it now,
>>>>>>>> as do the Cloud-based distributions (Dataproc and EMR), unlike the
>>>>>>>> last time we had this discussion.
>>>>>>>>
>>>>>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2
>>>>>>>>> and follow with Dataset API support feature-by-feature as it
>>>>>>>>> advances, but I think most Spark installations today still run 1.x,
>>>>>>>>> or am I wrong ?
>>>>>>>>
>>>>>>>> No, you are right; that's why I didn't even mention removing the
>>>>>>>> Spark 1 runner. I know that having to support things for both versions
>>>>>>>> can add additional work for us, but maybe the best approach would be
>>>>>>>> to continue the work only in the Spark 2 runner (both refining the
>>>>>>>> RDD-based translator and starting to create the Dataset one there, the
>>>>>>>> two co-existing until the Dataset API is mature enough) and keep the
>>>>>>>> Spark 1 runner only for bug-fixes for the users who are still using it
>>>>>>>> (this way we don't have to keep backporting stuff). Do you see any
>>>>>>>> other particular issue?
>>>>>>>>
>>>>>>>> Ismaël
>>>>>>>>
>>>>>>>> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>>> So you propose to have the Spark 2 branch a clone of the current one
>>>>>>>>> with adaptations around Context->Session, Accumulator->AccumulatorV2
>>>>>>>>> etc. while still using the RDD API ?
>>>>>>>>>
>>>>>>>>> I think that having another Spark runner is great if it has value,
>>>>>>>>> otherwise, let's just bump the version.
>>>>>>>>> My idea of having another runner for Spark was not to support more
>>>>>>>>> versions - we should always support the most popular version in terms
>>>>>>>>> of compatibility - the idea was to try and make Beam work with
>>>>>>>>> Structured Streaming, which is still not fully mature, so that's why
>>>>>>>>> we're not heavily investing there.
>>>>>>>>>
>>>>>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2
>>>>>>>>> and follow with Dataset API support feature-by-feature as it
>>>>>>>>> advances, but I think most Spark installations today still run 1.x,
>>>>>>>>> or am I wrong ?
>>>>>>>>> On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>>>>> BIG +1 JB,
>>>>>>>>>>
>>>>>>>>>> If we can just bump the version number with minor changes, staying
>>>>>>>>>> as close as possible to the current implementation for Spark 1, we
>>>>>>>>>> can go faster and offer in principle the exact same support, but for
>>>>>>>>>> version 2.
>>>>>>>>>>
>>>>>>>>>> I know that the advanced streaming stuff based on the Dataset API
>>>>>>>>>> won't be there, but with this common canvas the community can
>>>>>>>>>> iterate to create a Dataset-based translator at the same time. In
>>>>>>>>>> particular, I consider the most important thing to be that the Spark
>>>>>>>>>> 2 branch should not live for a long time; it should be merged into
>>>>>>>>>> master really fast, for the benefit of everybody.
>>>>>>>>>> Ismaël
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>> Hi Amit,
>>>>>>>>>>>
>>>>>>>>>>> What do you think of the following:
>>>>>>>>>>>
>>>>>>>>>>> - in the meantime, while you reintroduce the Spark 2 branch, what
>>>>>>>>>>> about "extending" the supported versions in the current Spark
>>>>>>>>>>> runner ? Still using RDD/DStream, I think we can support Spark 2.x
>>>>>>>>>>> even if we don't yet leverage the newly provided features.
>>>>>>>>>>>
>>>>>>>>>>> Thoughts ?
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 03/15/2017 07:39 PM, Amit Sela wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Cody,
>>>>>>>>>>>>
>>>>>>>>>>>> I will re-introduce this branch soon as part of the work on
>>>>>>>>>>>> BEAM-913 <https://issues.apache.org/jira/browse/BEAM-913>.
>>>>>>>>>>>> For now, and from previous experience with the mentioned branch,
>>>>>>>>>>>> the batch implementation should be straightforward.
>>>>>>>>>>>> The only issue is with streaming support - in the current runner
>>>>>>>>>>>> (Spark 1.x) we have experimental support for windows/triggers, and
>>>>>>>>>>>> we're working towards full streaming support.
>>>>>>>>>>>> With Spark 2.x, there is no "general-purpose" stateful operator
>>>>>>>>>>>> for the Dataset API, so I was waiting to see if the new operator
>>>>>>>>>>>> <https://github.com/apache/spark/pull/17179> planned for the next
>>>>>>>>>>>> version could help with that.
>>>>>>>>>>>>
>>>>>>>>>>>> To summarize, I will introduce a skeleton for the Spark 2 runner
>>>>>>>>>>>> with batch support, as a separate branch, as soon as I can.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Amit
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com> wrote:
>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there anybody who's currently working on a Spark 2.x runner?
>>>>>>>>>>>>> An old PR for a Spark 2.x runner was closed a few days ago, so I
>>>>>>>>>>>>> wonder what the status is now, and is there a roadmap for this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks~
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>> jbono...@apache.org
>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>> Talend - http://www.talend.com
>>>>>>>>>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>

