+1 for Spark runners based on the different APIs (RDD/Dataset), and for keeping the Spark version as a deployment dependency.
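(To make the "deployment dependency" part concrete, a rough sketch only; the property name, artifact suffix and versions below are placeholders, not the actual runner pom:)

    <!-- Spark stays a provided dependency; the version is chosen at build/deploy time. -->
    <properties>
      <spark.version>1.6.3</spark.version> <!-- e.g. override with -Dspark.version=2.1.0 -->
    </properties>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId> <!-- the Scala suffix must match the cluster as well -->
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>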
The RDD API is stable and mature enough, so it makes sense to have it on master; the Dataset API still has some work to do, and from our own experience it has only just reached performance comparable to the RDD API. The community is clearly heading in the Dataset API direction, but the RDD API is still a viable option for most use cases.

Just one quick question: today on master, can we swap Spark 1.x for Spark 2.x, compile, and use the Spark runner?

Thanks,
Abbass

On 15/03/2017 17:57, Amit Sela wrote:
> So you're suggesting we copy-paste the current runner and adapt whatever is
> necessary so it runs with Spark 2 ?
> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.
>
> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API.
> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once the Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.
>
> That's why I believe we still need to leave things as they are, but start
> working on the Dataset API runner.
> Otherwise, we'll have the current runner, another RDD API runner with Spark
> 2, and a third one for the Dataset API. I don't want to maintain all of
> them. It's a mess.
>
> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>>> However, I do feel that we should use the Dataset API, starting with
>>> batch support first. WDYT ?
>>
>> Well, this is the exact current status quo, and it will take us some
>> time to have something as complete as what we have with the Spark 1
>> runner for Spark 2.
>>
>> The other proposal has two advantages:
>>
>> One is that we can leverage the existing implementation (with the
>> needed adjustments) to run Beam pipelines on Spark 2; in the end, final
>> users don't care so much whether pipelines are translated via RDD/DStream
>> or Dataset, they just want to know that with Beam they can run their
>> code in their favorite data processing framework.
>>
>> The other advantage is that we can base the work on the latest Spark
>> version and advance simultaneously on translators for both APIs, and
>> once we consider that the Dataset one is mature enough we can stop
>> maintaining the RDD one and make it the official one.
>>
>> The only missing piece is backporting new developments on the RDD-based
>> translator from the Spark 2 version into Spark 1, but maybe this won't
>> be so hard if we consider what you said, that at this point we are
>> getting closer to having streaming right (of course you are the most
>> appropriate person to decide if we are in sufficiently good shape to
>> make this, so backporting things won't be so hard).
>>
>> Finally, I agree with you; I would prefer a nice and full-featured
>> translator based on the Structured Streaming API, but the question is
>> how much time this will take to be in shape and the impact on final
>> users who are already requesting this. This is the reason why I think
>> the more conservative approach (keeping the RDD translator around) and
>> moving incrementally makes sense.
>>
>> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>> I feel that as we're getting closer to supporting streaming with the
>>> Spark 1 runner, and having Structured Streaming advance in Spark 2, we
>>> could start work on a Spark 2 runner in a separate branch.
>>>
>>> However, I do feel that we should use the Dataset API, starting with
>>> batch support first. WDYT ?
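To make the RDD-vs-Dataset point concrete, here is a rough, untested Java sketch of the same element-wise transform translated through both APIs; the class and method names are made up for illustration and are not the runner's actual translation code:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    public class TranslationSketch {

      // RDD-based translation of an element-wise transform (works on Spark 1.x and 2.x).
      static JavaRDD<String> viaRdd(JavaRDD<String> input) {
        return input.map(String::toUpperCase);
      }

      // Dataset-based translation of the same transform (Spark 2.x only);
      // the Encoder is the extra piece a Dataset translator has to thread through.
      static Dataset<String> viaDataset(Dataset<String> input) {
        return input.map((MapFunction<String, String>) String::toUpperCase, Encoders.STRING());
      }
    }

Either way the user-facing Beam pipeline is identical; only the underlying translation changes.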
>>>
>>> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>>> So you propose to have the Spark 2 branch a clone of the current one
>>>>> with adaptations around Context->Session, Accumulator->AccumulatorV2
>>>>> etc. while still using the RDD API ?
>>>>
>>>> Yes, this is exactly what I have in mind.
>>>>
>>>>> I think that having another Spark runner is great if it has value,
>>>>> otherwise, let's just bump the version.
>>>>
>>>> There is value because most people are already starting to move to
>>>> Spark 2 and all Big Data distribution providers support it now, as
>>>> well as the cloud-based distributions (Dataproc and EMR), unlike the
>>>> last time we had this discussion.
>>>>
>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2
>>>>> and follow with Dataset API support feature-by-feature as it advances,
>>>>> but I think most Spark installations today still run 1.X, or am I wrong ?
>>>>
>>>> No, you are right; that's why I didn't even mention removing the
>>>> Spark 1 runner. I know that having to support things for both versions
>>>> can add additional work for us, but maybe the best approach would be
>>>> to continue the work only in the Spark 2 runner (both refining the
>>>> RDD-based translator and starting to create the Dataset one there, so
>>>> that they co-exist until the Dataset API is mature enough) and keep the
>>>> Spark 1 runner only for bug-fixes for the users who are still using it
>>>> (this way we don't have to keep backporting stuff). Do you see any other
>>>> particular issue?
>>>>
>>>> Ismaël
>>>>
>>>> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>> So you propose to have the Spark 2 branch a clone of the current one
>>>>> with adaptations around Context->Session, Accumulator->AccumulatorV2
>>>>> etc. while still using the RDD API ?
>>>>>
>>>>> I think that having another Spark runner is great if it has value,
>>>>> otherwise, let's just bump the version.
>>>>> My idea of having another runner for Spark was not to support more
>>>>> versions - we should always support the most popular version in terms
>>>>> of compatibility - the idea was to try and make Beam work with
>>>>> Structured Streaming, which is still not fully mature, so that's why
>>>>> we're not heavily investing there.
>>>>>
>>>>> We could think of starting to migrate the Spark 1 runner to Spark 2
>>>>> and follow with Dataset API support feature-by-feature as it advances,
>>>>> but I think most Spark installations today still run 1.X, or am I wrong ?
>>>>>
>>>>> On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>> BIG +1 JB,
>>>>>>
>>>>>> If we can just jump the version number with minor changes, staying as
>>>>>> close as possible to the current implementation for Spark 1, we can go
>>>>>> faster and offer in principle the exact same support but for version 2.
>>>>>>
>>>>>> I know that the advanced streaming stuff based on the Dataset API
>>>>>> won't be there, but with this common canvas the community can iterate
>>>>>> to create a Dataset-based translator at the same time. In particular,
>>>>>> I consider the most important thing is that the spark 2 branch should
>>>>>> not live for a long time; it should be merged into master really fast
>>>>>> for the benefit of everybody.
>>>>>>
>>>>>> Ismaël
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>> Hi Amit,
>>>>>>>
>>>>>>> What do you think of the following:
>>>>>>>
>>>>>>> - in the meantime, until you reintroduce the Spark 2 branch, what
>>>>>>> about "extending" the version in the current Spark runner ? Still
>>>>>>> using RDD/DStream, I think we can support Spark 2.x even if we don't
>>>>>>> yet leverage the new provided features.
>>>>>>>
>>>>>>> Thoughts ?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 03/15/2017 07:39 PM, Amit Sela wrote:
>>>>>>>> Hi Cody,
>>>>>>>>
>>>>>>>> I will re-introduce this branch soon as part of the work on BEAM-913
>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-913>.
>>>>>>>> For now, and from previous experience with the mentioned branch, the
>>>>>>>> batch implementation should be straightforward.
>>>>>>>> The only issue is with streaming support - in the current runner
>>>>>>>> (Spark 1.x) we have experimental support for windows/triggers and
>>>>>>>> we're working towards full streaming support.
>>>>>>>> With Spark 2.x, there is no "general-purpose" stateful operator for
>>>>>>>> the Dataset API, so I was waiting to see if the new operator
>>>>>>>> <https://github.com/apache/spark/pull/17179> planned for the next
>>>>>>>> version could help with that.
>>>>>>>>
>>>>>>>> To summarize, I will introduce a skeleton for the Spark 2 runner with
>>>>>>>> batch support as soon as I can as a separate branch.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Amit
>>>>>>>>
>>>>>>>> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>> Is there anybody who's currently working on a Spark 2.x runner? An
>>>>>>>>> old PR for the Spark 2.x runner was closed a few days ago, so I
>>>>>>>>> wonder what's the status now, and is there a roadmap for this?
>>>>>>>>> Thanks~
>>>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbono...@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
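For reference, a rough, untested Java sketch of the kind of Spark 1 -> Spark 2 adaptations discussed above (Context -> Session, Accumulator -> AccumulatorV2); the app name, accumulator name and class name are placeholders, not actual runner code:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.util.LongAccumulator;
    import java.util.Arrays;

    public class Spark2AdaptationSketch {
      public static void main(String[] args) {
        // Spark 2.x entry point: SparkSession instead of constructing a SparkContext directly.
        SparkSession session = SparkSession.builder()
            .master("local[2]")
            .appName("spark2-adaptation-sketch")
            .getOrCreate();

        // The Java RDD API is still available by wrapping the underlying SparkContext,
        // which is what keeps an RDD-based translator viable on Spark 2.x.
        JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

        // Spark 1.x used sc.accumulator(...); Spark 2.x uses AccumulatorV2 subclasses
        // such as LongAccumulator, registered on the SparkContext.
        LongAccumulator elementsSeen = session.sparkContext().longAccumulator("elementsSeen");

        jsc.parallelize(Arrays.asList(1, 2, 3))
            .foreach(x -> elementsSeen.add(1L));

        System.out.println("elements seen: " + elementsSeen.value());
        session.stop();
      }
    }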