Another +1 to Reynold's proposal. Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see heavier usage (e.g. I wouldn't be surprised if mapPartitionsWithContext was baked into a number of apps) and merit a little extra consideration.
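For an API like mapPartitionsWithContext, the migration is usually pretty mechanical. Here's a rough sketch of the public-API equivalent, assuming TaskContext.get() (which I believe has been available since 1.2); the helper name is made up for illustration:

  import org.apache.spark.TaskContext
  import org.apache.spark.rdd.RDD

  // Deprecated developer API:
  //   rdd.mapPartitionsWithContext { (ctx, iter) => ... }
  // Rough equivalent with stable public APIs: grab the TaskContext inside the closure.
  def tagWithPartitionId(rdd: RDD[String]): RDD[(Int, String)] =
    rdd.mapPartitions { iter =>
      val ctx = TaskContext.get()   // replaces the explicit ctx argument
      iter.map(record => (ctx.partitionId(), record))
    }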
Maybe also obvious, but I think a migration guide with API equivalents and the like would be incredibly useful in easing the transition.

-Sandy

On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <r...@databricks.com> wrote:

> Echoing Shivaram here. I don't think it makes a lot of sense to add more features to the 1.x line. We should still do critical bug fixes though.
>
> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>
>> +1
>>
>> On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes.
>>
>> In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x.
>>
>> Shivaram
>>
>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
>> > +1 on a lightweight 2.0
>> >
>> > What is the thinking around the 1.x line after Spark 2.0 is released? If not terminated, how will we determine what goes into each major version line? Will 1.x only be for stability fixes?
>> >
>> > Thanks,
>> > Kostas
>> >
>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>> >>
>> >> I also feel the same as Reynold. I agree we should minimize API breaks and focus on fixing things around the edges that were mistakes (e.g. exposing Guava and Akka) rather than any overhaul that could fragment the community. Ideally a major release is a lightweight process we can do every couple of years, with minimal impact for users.
>> >>
>> >> - Patrick
>> >>
>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> >>>
>> >>> > For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.
>> >>>
>> >>> +1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>> >>>
>> >>> > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>> >>>
>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>> >>>
>> >>> Nick
>> >>>
>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <r...@databricks.com> wrote:
>> >>>>
>> >>>> I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>> >>>>
>> >>>> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community.
>> >>>> A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>> >>>>
>> >>>> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.
>> >>>>
>> >>>> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>> >>>>
>> >>>> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>> >>>>
>> >>>> APIs
>> >>>>
>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>> >>>>
>> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.
>> >>>>
>> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>> >>>>
>> >>>> 4. Better class package structure for low-level developer APIs. In particular, we have a number of DeveloperApi classes (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low-level developer APIs.
>> >>>>
>> >>>> 5. Consolidate the task metric and accumulator APIs. Although they have some subtle differences, these two are very similar but have completely different code paths.
>> >>>>
>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.
>> >>>>
>> >>>> Operation/Deployment
>> >>>>
>> >>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>> >>>>
>> >>>> 2. Remove Hadoop 1 support.
>> >>>>
>> >>>> 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
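One more data point on item 5 in Reynold's list (task metrics vs. accumulators): the overlap is easy to see today, since very similar bookkeeping goes through two completely separate paths. A rough sketch against the current 1.x APIs (names like recordCount and compareBookkeeping are just for illustration):

  import org.apache.spark.SparkContext
  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  def compareBookkeeping(sc: SparkContext): Unit = {
    // Path 1: a user-defined accumulator, updated inside the closure.
    val recordCount = sc.accumulator(0L, "records seen")

    // Path 2: built-in task metrics, observed through a SparkListener.
    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {   // metrics can be null for failed tasks
          println(s"task ran for ${metrics.executorRunTime} ms")
        }
      }
    })

    sc.parallelize(1 to 1000).foreach(_ => recordCount += 1L)
    println(s"records seen: ${recordCount.value}")
  }

Consolidating those two seems like exactly the kind of cleanup a major release is for.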