I'm personally in favor of maintaining a single branch, e.g., spark-runner, which supports both Spark 1.6 & 2.1. Since there's currently no DataFrame support in the Spark 1.x runner, there should be no conflicts if we put two versions of Spark into one runner.

I'm also +1 for adding adapters in the branch to support both Spark versions. Also, we can have two translators, say, a 1.x translator which translates into RDDs & DStreams and a 2.x translator which translates into Datasets.
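To make the adapter idea concrete, here is a rough sketch of the kind of seam I have in mind (all names here are hypothetical, it only illustrates the shape; each class would live in its own file/module, imports shown once for brevity):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    // Hypothetical common entry point the shared RDD/DStream translation
    // code would program against.
    public interface SparkEnvironment {
      JavaSparkContext javaSparkContext();
      void stop();
    }

    // In the spark1 module (compiled against Spark 1.6.x):
    public class Spark1Environment implements SparkEnvironment {
      private final JavaSparkContext jsc;
      public Spark1Environment(SparkConf conf) {
        this.jsc = new JavaSparkContext(conf); // 1.x entry point
      }
      @Override public JavaSparkContext javaSparkContext() { return jsc; }
      @Override public void stop() { jsc.stop(); }
    }

    // In the spark2 module (compiled against Spark 2.x): SparkSession is
    // the new entry point, but the SparkContext is still reachable for
    // the RDD-based translator.
    public class Spark2Environment implements SparkEnvironment {
      private final SparkSession session;
      public Spark2Environment(SparkConf conf) {
        this.session = SparkSession.builder().config(conf).getOrCreate();
      }
      @Override public JavaSparkContext javaSparkContext() {
        return new JavaSparkContext(session.sparkContext());
      }
      @Override public void stop() { session.stop(); }
    }

The two implementations would each live in their own module, compiled against their own Spark version (selected through Maven profiles, as JB describes below), while the RDD/DStream translation code is shared and only ever sees the interface.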
On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi guys,
>
> sorry, due to the time zone shift, I answer a bit late ;)
>
> I think we can have the same runner dealing with the two major Spark versions, introducing some adapters. For instance, in CarbonData, we created some adapters to work with Spark 1.5, Spark 1.6 and Spark 2.1. The dependencies come from Maven profiles. Of course, it's easier there as it's more "user" code.
>
> My proposal is just that it's worth a try ;)
>
> I just created a branch to experiment a bit and have more details.
>
> Regards
> JB
>
> On 03/16/2017 02:31 AM, Amit Sela wrote:
>
>> I answered inline to Abbass' comment, but I think he hit something - how about we have a branch with those adaptations? Same RDD implementation, but depending on the latest 2.x version with the minimal changes required. I'd be happy to do that, or guide anyone who wants to (I did most of it on my branch for Spark 2 anyway), but since it's a branch and not on master (I don't believe it "deserves" a place on master), it would always be a bit behind since we would have to rebase and merge once in a while.
>>
>> How does that sound?
>>
>> On Wed, Mar 15, 2017 at 7:49 PM amarouni <amaro...@talend.com> wrote:
>>
>>> +1 for Spark runners based on different APIs (RDD/Dataset) and keeping the Spark versions as a deployment dependency.
>>>
>>> The RDD API is stable & mature enough, so it makes sense to have it on master; the Dataset API still has some work to do, and from our own experience it has only just reached performance comparable to the RDD API. The community is clearly heading in the Dataset API direction, but the RDD API is still a viable option for most use cases.
>>>
>>> Just one quick question: today on master, can we swap Spark 1.x for Spark 2.x and compile and use the Spark Runner?
>>>
>> Good question!
>> I think this is the root cause of this problem - Spark 2 not only introduced a new API, but also broke a few existing ones, such as: context is now session, Accumulators are AccumulatorV2 - and that's just what I recall right now.
>> I don't think it's too hard to adapt those, and anyone who wants to could see how I did it on my branch:
>> https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9
>>
>>> Thanks,
>>>
>>> Abbass,
>>>
>>> On 15/03/2017 17:57, Amit Sela wrote:
>>>
>>>> So you're suggesting we copy-paste the current runner and adapt whatever is necessary so it runs with Spark 2? This also means any bug-fix / improvement would have to be maintained in two runners, and I wouldn't wanna do that.
>>>>
>>>> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API. Since the RDD API is mature, it should be the runner in master (not preventing another runner once the Dataset API is mature enough), and the version (1.6.3 or 2.x) should be determined by the common installation.
>>>>
>>>> That's why I believe we still need to leave things as they are, but start working on the Dataset API runner. Otherwise, we'll have the current runner, another RDD API runner with Spark 2, and a third one for the Dataset API. I don't want to maintain all of them. It's a mess.
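Regarding the breaking changes Amit lists above (context is now session, Accumulators are AccumulatorV2) - those look adaptable behind small interfaces too. A rough sketch, again with made-up names, for the accumulator side:

    import java.io.Serializable;
    import org.apache.spark.Accumulator;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.util.LongAccumulator;

    // Hypothetical counter abstraction the shared runner code would use.
    public interface RunnerCounter extends Serializable {
      void add(long n);
      long value();
    }

    // spark1 module: backed by the old Accumulator API.
    public class Spark1Counter implements RunnerCounter {
      private final Accumulator<Double> acc;
      public Spark1Counter(JavaSparkContext jsc) {
        this.acc = jsc.doubleAccumulator(0.0); // 1.x accumulator
      }
      @Override public void add(long n) { acc.add((double) n); }
      @Override public long value() { return acc.value().longValue(); }
    }

    // spark2 module: backed by the AccumulatorV2-based LongAccumulator.
    public class Spark2Counter implements RunnerCounter {
      private final LongAccumulator acc;
      public Spark2Counter(SparkSession session) {
        this.acc = session.sparkContext().longAccumulator();
      }
      @Override public void add(long n) { acc.add(n); }
      @Override public long value() { return acc.sum(); }
    }

Which implementation the translators get would again be decided by the module/profile in use.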
>>>>
>>>> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>
>>>>>> However, I do feel that we should use the Dataset API, starting with batch support first. WDYT?
>>>>>
>>>>> Well, this is the exact current status quo, and it will take us some time to have something for Spark 2 as complete as what we have with the Spark 1 runner.
>>>>>
>>>>> The other proposal has two advantages:
>>>>>
>>>>> One is that we can leverage the existing implementation (with the needed adjustments) to run Beam pipelines on Spark 2; in the end, final users don't care so much whether pipelines are translated via RDD/DStream or Dataset, they just want to know that with Beam they can run their code in their favorite data processing framework.
>>>>>
>>>>> The other advantage is that we can base the work on the latest Spark version and advance simultaneously in translators for both APIs, and once we consider the Dataset one mature enough we can stop maintaining the RDD one and make the Dataset translator the official one.
>>>>>
>>>>> The only missing piece is backporting new developments on the RDD-based translator from the Spark 2 version into Spark 1, but maybe this won't be so hard if we consider what you said, that at this point we are getting close to having streaming right (of course you are the most appropriate person to decide if we are in sufficiently good shape to do this, so that backporting things won't be so hard).
>>>>>
>>>>> Finally, I agree with you, I would prefer a nice and full-featured translator based on the Structured Streaming API, but the question is how much time this will take to be in shape, and the impact on final users who are already requesting this. This is the reason why I think the more conservative approach (keeping around the RDD translator) and moving incrementally makes sense.
>>>>>
>>>>> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>
>>>>>> I feel that as we're getting closer to supporting streaming with the Spark 1 runner, and having Structured Streaming advance in Spark 2, we could start work on a Spark 2 runner in a separate branch.
>>>>>>
>>>>>> However, I do feel that we should use the Dataset API, starting with batch support first. WDYT?
>>>>>>
>>>>>> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>
>>>>>>>> So you propose to have the Spark 2 branch a clone of the current one with adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while still using the RDD API?
>>>>>>>
>>>>>>> Yes, this is exactly what I have in mind.
>>>>>>>
>>>>>>>> I think that having another Spark runner is great if it has value, otherwise, let's just bump the version.
>>>>>>>
>>>>>>> There is value because most people are already starting to move to Spark 2, and all Big Data distribution providers support it now, as well as the Cloud-based distributions (Dataproc and EMR), unlike the last time we had this discussion.
>>>>>>>
>>>>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2 and follow with Dataset API support feature-by-feature as it advances, but I think most Spark installations today still run 1.X, or am I wrong?
>>>>>>>
>>>>>>> No, you are right; that's why I didn't even mention removing the Spark 1 runner. I know that having to support things for both versions can add additional work for us, but maybe the best approach would be to continue the work only in the Spark 2 runner (both refining the RDD-based translator and starting to create the Dataset one there, so they co-exist until the Dataset API is mature enough) and keep the Spark 1 runner only for bug-fixes for the users who are still using it (like this we don't have to keep backporting stuff). Do you see any other particular issue?
>>>>>>>
>>>>>>> Ismaël
>>>>>>>
>>>>>>> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So you propose to have the Spark 2 branch a clone of the current one with adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while still using the RDD API?
>>>>>>>>
>>>>>>>> I think that having another Spark runner is great if it has value, otherwise, let's just bump the version. My idea of having another runner for Spark was not to support more versions - we should always support the most popular version in terms of compatibility - the idea was to try and make Beam work with Structured Streaming, which is still not fully mature, so that's why we're not heavily investing there.
>>>>>>>>
>>>>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2 and follow with Dataset API support feature-by-feature as it advances, but I think most Spark installations today still run 1.X, or am I wrong?
>>>>>>>>
>>>>>>>> On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> BIG +1 JB,
>>>>>>>>>
>>>>>>>>> If we can just bump the version number with minor changes, staying as close as possible to the current implementation for Spark 1, we can go faster and offer in principle the exact same support, but for version 2.
>>>>>>>>>
>>>>>>>>> I know that the advanced streaming stuff based on the Dataset API won't be there, but with this common canvas the community can iterate to create a Dataset-based translator at the same time. In particular, I consider the most important thing is that the Spark 2 branch should not live for a long time; it should be merged into master really fast for the benefit of everybody.
>>>>>>>>>
>>>>>>>>> Ismaël
>>>>>>>>>
>>>>>>>>> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Amit,
>>>>>>>>>>
>>>>>>>>>> What do you think of the following:
>>>>>>>>>>
>>>>>>>>>> - in the meantime, while you reintroduce the Spark 2 branch, what about "extending" the version in the current Spark runner?
>>>>>>>>>> Still using RDD/DStream, I think we can support Spark 2.x even if we don't yet leverage the newly provided features.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On 03/15/2017 07:39 PM, Amit Sela wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Cody,
>>>>>>>>>>>
>>>>>>>>>>> I will re-introduce this branch soon as part of the work on BEAM-913 <https://issues.apache.org/jira/browse/BEAM-913>.
>>>>>>>>>>> For now, and from previous experience with the mentioned branch, the batch implementation should be straightforward.
>>>>>>>>>>> The only issue is with streaming support - in the current runner (Spark 1.x) we have experimental support for windows/triggers, and we're working towards full streaming support.
>>>>>>>>>>> With Spark 2.x, there is no "general-purpose" stateful operator for the Dataset API, so I was waiting to see if the new operator <https://github.com/apache/spark/pull/17179> planned for the next version could help with that.
>>>>>>>>>>>
>>>>>>>>>>> To summarize, I will introduce a skeleton for the Spark 2 runner with batch support, as soon as I can, as a separate branch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Amit
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>> Is there anybody currently working on a Spark 2.x runner? An old PR for the Spark 2.x runner was closed a few days ago, so I wonder what's the status now, and is there a roadmap for this?
>>>>>>>>>>>> Thanks~
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>> jbono...@apache.org
>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com