+1 for the move to Spark 2, modulo notifying users and deciding on support:
I agree that having compatibility with both versions of Spark is desirable, but I am not sure it is worth the effort. Apart from the reasons mentioned by Holden and Pei, I will add that the burden of maintaining both simultaneously could be bigger than the return, and that most Big Data/Cloud distributions have already moved to Spark 2, so it makes sense to prioritize new users over legacy ones, in particular considering that Beam is a 'recent' project.

We can announce the end of Spark 1 support in the release notes of Beam 2.2 and decide whether to keep it in maintenance mode; in that case we would backport or fix any reported issue related to the Spark 1 runner on the 2.2.x branch for, say, a year, but we would not add new functionality. Or we can simply decide not to support it anymore and encourage users to move to Spark 2.

On Thu, Nov 9, 2017 at 6:59 AM, Pei HE <pei...@gmail.com> wrote:
> +1 on moving forward with Spark 2.x only.
> Spark 1 users can still use the already released Spark runners, and we can support them with minor version releases for future bug fixes.
>
> I don't see how important it is to make future Beam releases available to Spark 1 users. If they choose not to upgrade their Spark clusters, maybe they don't need the newest Beam releases either.
>
> I think it is more important to 1) be able to leverage new features in Spark 2.x, and 2) extend the user base to Spark 2.
> --
> Pei
>
> On Thu, Nov 9, 2017 at 1:45 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> That's a good point about Oozie only supporting Spark 1 or 2 at a time on a cluster -- but do we know of people using Oozie and Spark 1 who would still be on Spark 1 by the time of the next Beam release? The last Spark 1 release was a year ago (and the last non-maintenance release almost 20 months ago).
>>
>> On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick <nerdyn...@gmail.com> wrote:
>>
>> > I don't know if ditching Spark 1 outright right now would be a great move, given that a lot of the main supporting applications around Spark haven't fully moved to Spark 2 yet, let alone support a cluster running both. Oozie, for example, is still pre-stable-release for its Spark 1 support and can't handle a cluster with mixed Spark versions. I think doing as suggested above with the common, spark1, spark2 packaging might be best during this carry-over phase. Maybe even flagging Spark 1 as deprecated and maintenance-only would be enough.
>> >
>> > On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>> >
>> > > Also, upgrading Spark 1 to 2 is generally easier than changing JVM versions. For folks using YARN or the hosted environments it's pretty much trivial since you can effectively have distinct Spark clusters for each job.
>> > >
>> > > On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>> > >
>> > > > I'm +1 on dropping Spark 1. There are a lot of exciting improvements in Spark 2, and trying to write efficient code that runs on both Spark 1 and Spark 2 is super painful in the long term. It would be one thing if there were a lot of people available to work on the Spark runners, but it seems we'd be better off focusing our energy on the future.
>> > > >
>> > > > I don't know a lot of folks who are stuck on Spark 1, and the few that I know of are planning to migrate in the next few months anyway.
>> > > >
>> > > > Note: this is a non-binding vote as I'm not a committer or PMC member.
>> > > >
>> > > > On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > > >
>> > > >> Having both Spark1 and Spark2 modules would benefit a wider user base.
>> > > >>
>> > > >> I would vote for that.
>> > > >>
>> > > >> Cheers
>> > > >>
>> > > >> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> > > >>
>> > > >> > Hi Robert,
>> > > >> >
>> > > >> > Thanks for your feedback!
>> > > >> >
>> > > >> > From a user perspective, with the current state of the PR, the same pipelines can run on both Spark 1.x and 2.x: the only difference is the dependency set (see the sketch after the thread).
>> > > >> >
>> > > >> > I'm calling the vote to get exactly this kind of feedback: if we consider that Spark 1.x still needs to be supported, no problem, I will improve the PR to have three modules (common, spark1, spark2) and let users pick the desired version.
>> > > >> >
>> > > >> > Let's wait a bit for other feedback, and I will update the PR accordingly.
>> > > >> >
>> > > >> > Regards
>> > > >> > JB
>> > > >> >
>> > > >> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>> > > >> >
>> > > >> >> I'm generally a -0.5 on this change, or at least on doing it so hastily.
>> > > >> >>
>> > > >> >> As with dropping Java 7 support, I think we should at least announce in the release notes that we're considering dropping support in the subsequent release, as this dev list likely does not reach a substantial portion of the user base.
>> > > >> >>
>> > > >> >> How much work is it to move from a Spark 1.x cluster to a Spark 2.x cluster? I get the feeling it's not nearly as transparent as upgrading Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters, or is a new cluster (and/or upgrading all pipelines) required (e.g. for those who operate Spark clusters shared among their many users)?
>> > > >> >>
>> > > >> >> Looks like the latest release of Spark 1.x was about a year ago, overlapping a bit with the 2.x series, which is coming up on 1.5 years old, so I could see a lot of people still using 1.x even if 2.x is clearly the future. But it sure doesn't seem very backwards compatible.
>> > > >> >>
>> > > >> >> Mostly I'm not comfortable with dropping 1.x in the same release as adding support for 2.x, giving no transition period, but I could be convinced if this transition is mostly a no-op or no one is still using 1.x. If there are non-trivial code complexity issues, I would perhaps revisit the idea of having a single Spark runner that chooses the backend implicitly, in favor of simply having two runners which share the code that's easy to share and diverge otherwise (which seems much simpler both to implement and to explain to users). I would be OK with even letting the Spark 1.x runner go somewhat stagnant (e.g. few or no new features) until we decide we can kill it off.
>> > > >> >>
>> > > >> >> On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> > > >> >>
>> > > >> >>> Hi all,
>> > > >> >>>
>> > > >> >>> As you might know, we are working on Spark 2.x support in the Spark runner.
>> > > >> >>>
>> > > >> >>> I'm working on a PR about that:
>> > > >> >>>
>> > > >> >>> https://github.com/apache/beam/pull/3808
>> > > >> >>>
>> > > >> >>> Today, we have something working with both Spark 1.x and 2.x from a code standpoint, but I have to deal with the dependencies. It's the first step of the update, as I'm still using RDDs; the second step would be to support DataFrames (but for that, I would need PCollection elements with schemas, which is another topic that Eugene, Reuven and I are discussing).
>> > > >> >>>
>> > > >> >>> However, as all major distributions now ship Spark 2.x, I don't think it's required anymore to support Spark 1.x.
>> > > >> >>>
>> > > >> >>> If we agree, I will update and clean up the PR to focus on and support only Spark 2.x.
>> > > >> >>>
>> > > >> >>> So, that's why I'm calling for a vote:
>> > > >> >>>
>> > > >> >>> [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>> > > >> >>> [ ] 0 (I don't care ;))
>> > > >> >>> [ ] -1, I would like to still support Spark 1.x, and so have support for both Spark 1.x and 2.x (please provide a specific comment)
>> > > >> >>>
>> > > >> >>> This vote is open for 48 hours (I have the commits ready, just waiting for the end of the vote to push them to the PR).
>> > > >> >>>
>> > > >> >>> Thanks!
>> > > >> >>> Regards
>> > > >> >>> JB
>> > > >> >>> --
>> > > >> >>> Jean-Baptiste Onofré
>> > > >> >>> jbono...@apache.org
>> > > >> >>> http://blog.nanthrax.net
>> > > >> >>> Talend - http://www.talend.com
>> > > >> >
>> > > >> > --
>> > > >> > Jean-Baptiste Onofré
>> > > >> > jbono...@apache.org
>> > > >> > http://blog.nanthrax.net
>> > > >> > Talend - http://www.talend.com
>> > > >
>> > > > --
>> > > > Twitter: https://twitter.com/holdenkarau
>> > >
>> > > --
>> > > Twitter: https://twitter.com/holdenkarau
>> >
>> > --
>> > Nick Verbeck - NerdyNick
>> > ----------------------------------------------------
>> > NerdyNick.com
>> > TrailsOffroad.com
>> > NoKnownBoundaries.com
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
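To illustrate JB's point above that the pipeline code itself does not change between the Spark 1.x and 2.x modules, here is a minimal, hypothetical sketch (not taken from the PR; the class name and sample data are invented, while the imported classes are the Beam Java SDK and Spark runner APIs) of a Beam pipeline launched on the Spark runner. Under the proposed common/spark1/spark2 split, only the module the user depends on and the target cluster would differ, not this code.

```java
import java.util.Arrays;

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;

// Hypothetical class name for illustration only.
public class SparkRunnerSketch {
  public static void main(String[] args) {
    // Read pipeline options (e.g. --sparkMaster) from the command line
    // and select the Spark runner explicitly.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);

    // The pipeline definition is identical whether the spark1 or spark2
    // module is on the classpath; only the dependency changes.
    Pipeline p = Pipeline.create(options);
    p.apply(Create.of(Arrays.asList("spark", "beam", "spark")))
     .apply(Count.perElement());

    p.run().waitUntilFinish();
  }
}
```

With the desired Spark module on the classpath, such a job would be submitted against the matching cluster in the usual way (e.g. via spark-submit); picking Spark 1.x or 2.x would simply be a matter of which module the user depends on.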