Also, upgrading from Spark 1 to 2 is generally easier than changing JVM
versions. For folks using YARN or hosted environments it's pretty much
trivial, since you can effectively have distinct Spark clusters for each job.
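
As a rough illustration (a sketch only: the class names are the Beam Java SDK
and the existing Spark runner as shipped in Beam 2.x, and the file paths are
placeholders), the pipeline code itself never touches Spark APIs, so the same
program targets either Spark major version; which one you get is decided purely
by the runner/Spark jars bundled with the job, and on YARN each submission runs
as its own application, so jobs pinned to different Spark versions can coexist
on one cluster:

  import org.apache.beam.runners.spark.SparkPipelineOptions;
  import org.apache.beam.runners.spark.SparkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Count;
  import org.apache.beam.sdk.transforms.ToString;

  public class WordCountOnSpark {
    public static void main(String[] args) {
      // Nothing below is Spark-1-or-2 specific; the Spark version comes from
      // whichever runner/Spark dependencies are on the classpath at runtime.
      SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
          .withValidation().as(SparkPipelineOptions.class);
      options.setRunner(SparkRunner.class);

      Pipeline p = Pipeline.create(options);
      p.apply(TextIO.read().from("/tmp/input.txt"))   // placeholder path
       .apply(Count.perElement())                     // word -> count
       .apply(ToString.kvs())                         // KV -> "word: count"
       .apply(TextIO.write().to("/tmp/counts"));      // placeholder path
      p.run().waitUntilFinish();
    }
  }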

On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> I'm +1 on dropping Spark 1. There are a lot of exciting improvements in
> Spark 2, and trying to write efficient code that runs on both Spark 1 and
> Spark 2 is super painful in the long term. It would be one thing if there
> were a lot of people available to work on the Spark runners, but it seems
> like our energy would be better spent focusing on the future.
>
> I don't know a lot of folks who are stuck on Spark 1, and the few that I
> know are planning to migrate in the next few months anyways.
>
> Note: this is a non-binding vote as I'm not a committer or PMC member.
>
> On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Having both Spark1 and Spark2 modules would benefit a wider user base.
>>
>> I would vote for that.
>>
>> Cheers
>>
>> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>> > Hi Robert,
>> >
>> > Thanks for your feedback !
>> >
>> > From a user perspective, with the current state of the PR, the same
>> > pipelines can run on both Spark 1.x and 2.x: the only difference is the
>> > set of dependencies.
>> >
>> > I'm calling the vote to get exactly this kind of feedback: if we consider
>> > that Spark 1.x still needs to be supported, no problem, I will improve the
>> > PR to have three modules (common, spark1, spark2) and let users pick the
>> > desired version.
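
For what it's worth, here is a purely hypothetical sketch of that split (none
of these type names exist in Beam; it only illustrates the shape of a shared
core plus per-version runner modules, which also matches the "two runners
sharing what's easy to share" idea Robert raises further down in this thread):

  // module "common": version-neutral translation, no version-specific calls
  final class TranslatedPlan { /* placeholder for the translated pipeline */ }

  abstract class AbstractSparkRunner {
    final void run(Object beamPipeline) {
      execute(translate(beamPipeline));
    }
    private TranslatedPlan translate(Object beamPipeline) {
      // shared RDD-level translation would live here
      return new TranslatedPlan();
    }
    protected abstract void execute(TranslatedPlan plan); // version-specific
  }

  // module "spark1": compiled against Spark 1.x only
  final class Spark1Runner extends AbstractSparkRunner {
    @Override protected void execute(TranslatedPlan plan) {
      /* submit via Spark 1.x APIs */
    }
  }

  // module "spark2": compiled against Spark 2.x only
  final class Spark2Runner extends AbstractSparkRunner {
    @Override protected void execute(TranslatedPlan plan) {
      /* submit via Spark 2.x APIs */
    }
  }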
>> >
>> > Let's wait a bit for other feedback; I will update the PR accordingly.
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>> >
>> >> I'm generally a -0.5 on this change, or at least on doing so hastily.
>> >>
>> >> As with dropping Java 7 support, I think we should at least announce in
>> >> the release notes that we're considering dropping support in the
>> >> subsequent release, as this dev list likely does not reach a substantial
>> >> portion of the user base.
>> >>
>> >> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
>> >> cluster? I get the feeling it's not nearly as transparent as upgrading
>> >> Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters,
>> >> or is a new cluster (and/or upgrading all pipelines) required (e.g.
>> >> for those who operate Spark clusters shared among their many users)?
>> >>
>> >> Looks like the latest release of Spark 1.x was about a year ago,
>> >> overlapping a bit with the 2.x series, which is coming up on 1.5 years
>> >> old, so I could see a lot of people still using 1.x even if 2.x is
>> >> clearly the future. But it sure doesn't seem very backwards
>> >> compatible.
>> >>
>> >> Mostly I'm not comfortable with dropping 1.x in the same release as
>> >> adding support for 2.x, leaving no transition period, but I could be
>> >> convinced if this transition is mostly a no-op or no one's still using
>> >> 1.x. If there are non-trivial code complexity issues, I would perhaps
>> >> revisit the idea of having a single Spark runner that chooses the
>> >> backend implicitly, in favor of simply having two runners that share
>> >> the code that's easy to share and diverge otherwise (which seems like
>> >> it would be much simpler both to implement and to explain to users). I
>> >> would even be OK with letting the Spark 1.x runner be somewhat stagnant
>> >> (e.g. few or no new features) until we decide we can kill it off.
>> >>
>> >> On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> >> wrote:
>> >>
>> >>> Hi all,
>> >>>
>> >>> as you might know, we are working on Spark 2.x support in the Spark
>> >>> runner.
>> >>>
>> >>> I'm working on a PR about that:
>> >>>
>> >>> https://github.com/apache/beam/pull/3808
>> >>>
>> >>> Today, we have something working with both Spark 1.x and 2.x from a
>> >>> code standpoint, but I have to deal with dependencies. It's the first
>> >>> step of the update, as I'm still using RDDs; the second step would be
>> >>> to support DataFrames (but for that, I would need PCollection elements
>> >>> with schemas, which is another topic that Eugene, Reuven and I are
>> >>> discussing).
>> >>>
>> >>> However, as all major distributions now ship Spark 2.x, I don't think
>> >>> it's required anymore to support Spark 1.x.
>> >>>
>> >>> If we agree, I will update and clean up the PR to only support and
>> >>> focus on Spark 2.x.
>> >>>
>> >>> So, that's why I'm calling for a vote:
>> >>>
>> >>>    [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>> >>>    [ ] 0 (I don't care ;))
>> >>>    [ ] -1, I would like to keep supporting Spark 1.x, and so have
>> >>>    support for both Spark 1.x and 2.x (please provide a specific comment)
>> >>>
>> >>> This vote is open for 48 hours (I have the commits ready and am just
>> >>> waiting for the end of the vote to push to the PR).
>> >>>
>> >>> Thanks !
>> >>> Regards
>> >>> JB
>> >>> --
>> >>> Jean-Baptiste Onofré
>> >>> jbono...@apache.org
>> >>> http://blog.nanthrax.net
>> >>> Talend - http://www.talend.com
>> >>>
>> >>
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>> >
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau
