Yeah, I'd also favor maintaining docs with strictly temporary relevance on
JIRA when possible. The wiki is like this weird backwater I only rarely
visit.

Don't we typically do this kind of stuff with an umbrella issue on JIRA?
Tom, wouldn't that work well for you?

Nick

On Wed, Dec 23, 2015 at 5:06 AM Sean Owen <so...@cloudera.com> wrote:

> I think this will be hard to maintain; we already have JIRA as the de
> facto central place to store discussions and prioritize work, and the
> 2.x stuff is already a JIRA. The wiki doesn't really hurt; it just
> probably will never be looked at again. Let's point people to JIRA in
> all cases.
>
> On Tue, Dec 22, 2015 at 11:52 PM, Reynold Xin <r...@databricks.com> wrote:
> > I started a wiki page:
> >
> > https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions
> >
> >
> > On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves <tgraves...@yahoo.com>
> > wrote:
> >>
> >> Do we have a summary of all the discussions and what is planned for 2.0
> >> then? Perhaps we should put it on the wiki for reference.
> >>
> >> Tom
> >>
> >>
> >> On Tuesday, December 22, 2015 12:12 AM, Reynold Xin
> >> <r...@databricks.com> wrote:
> >>
> >>
> >> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.
> >>
> >> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com>
> >> wrote:
> >>
> >> I’m starting a new thread since the other one got intermixed with
> >> feature requests. Please refrain from making feature requests in this
> >> thread. Not that we shouldn’t be adding features, but we can always add
> >> features in 1.7, 2.1, 2.2, ...
> >>
> >> First - I want to propose a premise for how to think about Spark 2.0 and
> >> major releases in Spark, based on discussion with several members of the
> >> community: a major release should be low overhead and minimally
> >> disruptive to the Spark community. A major release should not be very
> >> different from a minor release and should not be gated on new features.
> >> The main purpose of a major release is an opportunity to fix things that
> >> are broken in the current API and remove certain deprecated APIs
> >> (examples follow).
> >>
> >> For this reason, I would *not* propose doing major releases to break
> >> substantial APIs or perform large re-architecting that prevents users
> >> from upgrading. Spark has always had a culture of evolving its
> >> architecture incrementally and making changes - and I don't think we
> >> want to change this model. In fact, we’ve released many architectural
> >> changes on the 1.X line.
> >>
> >> If the community likes the above model, then to me it seems reasonable
> >> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
> >> immediately after Spark 1.7. That would be 18 or 21 months after Spark
> >> 1.0. A cadence of major releases every 2 years seems doable within the
> >> above model.
> >>
> >> Under this model, here is a list of example things I would propose doing
> >> in Spark 2.0, separated into APIs and Operation/Deployment:
> >>
> >>
> >> APIs
> >>
> >> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> >> Spark 1.x.
> >>
> >> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> >> applications can use their own version of Akka (SPARK-5293). We have
> >> gotten a lot of complaints about user applications being unable to use
> >> Akka due to Spark’s dependency on it.
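> >>
> >> As a rough illustration (a sketch against the 1.x streaming API; the
> >> receiver class name here is made up), the Akka-based actor receiver is
> >> the kind of API this affects, since it ties user actor code to the
> >> Akka version Spark ships:
> >>
> >>   import akka.actor.{Actor, Props}
> >>   import org.apache.spark.streaming.StreamingContext
> >>   import org.apache.spark.streaming.receiver.ActorHelper
> >>
> >>   // A user-defined actor that feeds received records into a DStream.
> >>   class WordReceiver extends Actor with ActorHelper {
> >>     def receive = { case s: String => store(s) }
> >>   }
> >>
> >>   // actorStream pins the application to Spark's own Akka version.
> >>   def attach(ssc: StreamingContext) =
> >>     ssc.actorStream[String](Props[WordReceiver], "words")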
> >>
> >> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
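> >>
> >> For example (a minimal sketch, assuming the 1.x Java pair RDD API):
> >> the Java left outer join exposes Guava's Optional in its return type,
> >> which forces Spark's Guava version onto the user's classpath:
> >>
> >>   import com.google.common.base.Optional
> >>   import org.apache.spark.api.java.JavaPairRDD
> >>
> >>   // In 1.x, unmatched right-side values surface as Guava Optionals.
> >>   def leftJoin[K, V, W](left: JavaPairRDD[K, V],
> >>                         right: JavaPairRDD[K, W])
> >>       : JavaPairRDD[K, (V, Optional[W])] =
> >>     left.leftOuterJoin(right)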
> >>
> >> 4. Better class package structure for low level developer APIs. In
> >> particular, we have added various DeveloperApi classes (mostly
> >> listener-related) over the years. Some packages include only one or two
> >> public classes but a lot of private classes. A better structure would be
> >> to isolate public classes in a few public packages, and these public
> >> packages should contain minimal private classes for low level developer
> >> APIs.
> >>
> >> 5. Consolidate the task metric and accumulator APIs. Despite some
> >> subtle differences, the two are very similar but have completely
> >> different code paths.
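> >>
> >> For example, a named accumulator counts things across tasks much like
> >> the internal task metrics do (a minimal sketch using the 1.x
> >> accumulator API; countEmpty is a made-up helper):
> >>
> >>   import org.apache.spark.SparkContext
> >>
> >>   // Counts empty lines with a named accumulator; task metrics track
> >>   // similar per-task counts through an entirely separate code path.
> >>   def countEmpty(sc: SparkContext, lines: Seq[String]): Long = {
> >>     val empty = sc.accumulator(0L, "empty lines")
> >>     sc.parallelize(lines).foreach { l => if (l.isEmpty) empty += 1 }
> >>     empty.value
> >>   }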
> >>
> >> 6. Possibly make Catalyst, Dataset, and DataFrame more general by
> >> moving them to other package(s). They are already used beyond SQL, e.g.
> >> in ML pipelines, and will be used by streaming as well.
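> >>
> >> For instance (a small sketch against the 1.x ML API), a feature
> >> transformer consumes and produces DataFrames with no SQL involved:
> >>
> >>   import org.apache.spark.ml.feature.Tokenizer
> >>   import org.apache.spark.sql.DataFrame
> >>
> >>   // An ML pipeline stage that maps a text column to a words column.
> >>   def tokenize(df: DataFrame): DataFrame =
> >>     new Tokenizer().setInputCol("text").setOutputCol("words")
> >>       .transform(df)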
> >>
> >>
> >> Operation/Deployment
> >>
> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> >> but it has reached end-of-life.
> >>
> >> 2. Remove Hadoop 1 support.
> >>
> >> 3. Assembly-free distribution of Spark: don’t require building an
> >> enormous assembly jar in order to run Spark.
