It would also be good to fix API breakages introduced as part of 1.0 (where functionality is now missing), to overhaul and remove all deprecated configs/features/combinations, and to make the public API changes that have been deferred from minor releases.
Regards,
Mridul

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature requests in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from
> a minor release and should not be gated on new features. The main purpose
> of a major release is an opportunity to fix things that are broken in the
> current API and to remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial APIs or to perform large re-architecting that prevents users
> from upgrading. Spark has always had a culture of evolving its
> architecture incrementally and making changes - and I don't think we want
> to change this model. In fact, we’ve released many architectural changes
> on the 1.x line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. That would be 18 or 21 months after Spark 1.0. A cadence
> of major releases every two years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s own
> dependency on it.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional) - see the
> sketch after this message.
>
> 4. Better class/package structure for low-level developer APIs. In
> particular, we have added various DeveloperApi classes (mostly
> listener-related) over the years. Some packages contain only one or two
> public classes but a lot of private ones. A better structure is to isolate
> public classes in a few public packages, and those public packages should
> contain minimal private classes.
>
> 5. Consolidate the task metric and accumulator APIs. Despite some subtle
> differences, the two are very similar but have completely different code
> paths.
>
> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming as well.
>
>
> Operation/Deployment
>
> 1. Make Scala 2.11 the default build. We should still support Scala 2.10,
> but it has reached end of life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an
> enormous assembly jar in order to run Spark.
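To make item 3 in the APIs list concrete, here is a minimal sketch (the class name and the local master URL are just placeholders) of how Guava's Optional currently leaks into the public Java API: in Spark 1.x the return type of JavaPairRDD.leftOuterJoin is expressed in terms of com.google.common.base.Optional, so every caller has to compile against Spark's bundled Guava just to name the result.

    import java.util.Arrays;
    import java.util.List;

    import com.google.common.base.Optional;   // Guava type exposed by Spark 1.x's Java API
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class GuavaOptionalLeak {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "guava-optional-demo");

        JavaPairRDD<String, Integer> left = sc.parallelizePairs(
            Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2)));
        JavaPairRDD<String, Integer> right = sc.parallelizePairs(
            Arrays.asList(new Tuple2<>("a", 10)));

        // Guava's Optional appears in the public return type, so user code must
        // reference com.google.common.base.Optional to hold the joined RDD.
        JavaPairRDD<String, Tuple2<Integer, Optional<Integer>>> joined =
            left.leftOuterJoin(right);

        List<Tuple2<String, Tuple2<Integer, Optional<Integer>>>> rows = joined.collect();
        for (Tuple2<String, Tuple2<Integer, Optional<Integer>>> row : rows) {
          System.out.println(row._1() + " -> " + row._2());
        }

        sc.stop();
      }
    }

Dropping Guava from the signature would let code like this compile without any Guava classes on the user's classpath, and without caring which Guava version Spark ships.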