Oh and another question - should Spark 2.0 support Java 7?

On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> Another +1 to Reynold's proposal.
>
> Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see heavier usage (e.g. I wouldn't be surprised if mapPartitionsWithContext was baked into a number of apps) and merit a little extra consideration.
>
> Maybe also obvious, but I think a migration guide with API equivalents and the like would be incredibly useful in easing the transition.
>
> -Sandy
>
> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Echoing Shivaram here. I don't think it makes a lot of sense to add more features to the 1.x line. We should still do critical bug fixes though.
>>
>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes.
>>>
>>> In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x.
>>>
>>> Shivaram
>>>
>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>> > +1 on a lightweight 2.0
>>> >
>>> > What is the thinking around the 1.x line after Spark 2.0 is released? If not terminated, how will we determine what goes into each major version line? Will 1.x only be for stability fixes?
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>> >>
>>> >> I also feel the same as Reynold. I agree we should minimize API breaks and focus on fixing things around the edges that were mistakes (e.g. exposing Guava and Akka) rather than any overhaul that could fragment the community. Ideally a major release is a lightweight process we can do every couple of years, with minimal impact for users.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>> >>>
>>> >>> > For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.
>>> >>>
>>> >>> +1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>>> >>>
>>> >>> > 3. Assembly-free distribution of Spark: don't require building an enormous assembly jar in order to run Spark.
>>> >>>
>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <r...@databricks.com> wrote:
>>> >>>>
>>> >>>> I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>>> >>>>
>>> >>>> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>>> >>>>
>>> >>>> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.
>>> >>>>
>>> >>>> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>>> >>>>
>>> >>>> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>>> >>>>
>>> >>>> APIs
>>> >>>>
>>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>>> >>>>
>>> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.
>>> >>>>
>>> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>> >>>>
>>> >>>> 4. Better class package structure for low-level developer APIs. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low-level developer APIs.
>>> >>>>
>>> >>>> 5. Consolidate the task metric and accumulator APIs. Although they have some subtle differences, these two are very similar but have completely different code paths.
>>> >>>>
>>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.
>>> >>>>
>>> >>>> Operation/Deployment
>>> >>>>
>>> >>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>>> >>>>
>>> >>>> 2. Remove Hadoop 1 support.
>>> >>>>
>>> >>>> 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
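To make the migration-guide idea Sandy raises above concrete, an entry for a deprecated method such as mapPartitionsWithContext might pair the old call with its nearest equivalent. The sketch below is only an illustration against the Spark 1.x Scala API (the RDD, object name, and values are made up for the example); the eventual guide may well recommend a different replacement.

```scala
import org.apache.spark.{SparkContext, TaskContext}

// Hypothetical migration-guide entry: replacing the deprecated
// mapPartitionsWithContext with mapPartitions + TaskContext.get().
object MapPartitionsMigration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "migration-sketch")
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Spark 1.x (deprecated): the TaskContext is passed to the closure explicitly.
    // val tagged = rdd.mapPartitionsWithContext { (ctx, iter) =>
    //   iter.map(x => (ctx.partitionId(), x))
    // }

    // Equivalent without the deprecated API: look the TaskContext up
    // via its thread-local accessor inside a plain mapPartitions.
    val tagged = rdd.mapPartitions { iter =>
      val ctx = TaskContext.get()
      iter.map(x => (ctx.partitionId(), x))
    }

    tagged.take(5).foreach(println)
    sc.stop()
  }
}
```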
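The Guava point (item 3 in the API list) refers to Guava types leaking through the public Java API. As a rough, hypothetical sketch of what that looks like to a caller on 1.x (object and variable names invented here, driving the Java wrapper from Scala for brevity), the missing side of an outer join comes back as com.google.common.base.Optional, coupling user code to whatever Guava version Spark bundles:

```scala
import scala.collection.JavaConverters._

import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaPairRDD

// Sketch of the Guava leak in the 1.x Java API.
object GuavaOptionalLeak {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "guava-leak-sketch")

    val names  = JavaPairRDD.fromRDD(sc.parallelize(Seq(1 -> "alice", 2 -> "bob")))
    val scores = JavaPairRDD.fromRDD(sc.parallelize(Seq(1 -> 42)))

    // Value type is (String, com.google.common.base.Optional[Int]), so unwrapping
    // the join result ties the application to Spark's bundled Guava.
    val joined = names.leftOuterJoin(scores)

    joined.collect().asScala.foreach { case (id, (name, maybeScore)) =>
      val score = if (maybeScore.isPresent) maybeScore.get.toString else "none"
      println(s"$id $name $score")
    }

    sc.stop()
  }
}
```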