Oh and another question - should Spark 2.0 support Java 7?

On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> Another +1 to Reynold's proposal.
>
> Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see heavier usage (e.g. I wouldn't be surprised if mapPartitionsWithContext was baked into a number of apps) and merit a little extra consideration.
>
> Maybe also obvious, but I think a migration guide with API equivalents and the like would be incredibly useful in easing the transition.
>
> -Sandy
>
> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Echoing Shivaram here. I don't think it makes a lot of sense to add more features to the 1.x line. We should still do critical bug fixes though.
>>
>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes.
>>>
>>> In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x.
>>>
>>> Shivaram
>>>
>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>> > +1 on a lightweight 2.0
>>> >
>>> > What is the thinking around the 1.x line after Spark 2.0 is released? If not terminated, how will we determine what goes into each major version line? Will 1.x only be for stability fixes?
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>> >>
>>> >> I also feel the same as Reynold. I agree we should minimize API breaks and focus on fixing things around the edges that were mistakes (e.g. exposing Guava and Akka) rather than any overhaul that could fragment the community. Ideally a major release is a lightweight process we can do every couple of years, with minimal impact for users.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>> >>>
>>> >>> > For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.
>>> >>>
>>> >>> +1 for this. The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>>> >>>
>>> >>> > 3. Assembly-free distribution of Spark: don't require building an enormous assembly jar in order to run Spark.
>>> >>>
>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <r...@databricks.com> wrote:
>>> >>>>
>>> >>>> I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>>> >>>>
>>> >>>> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>>> >>>>
>>> >>>> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.
>>> >>>>
>>> >>>> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>>> >>>>
>>> >>>> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>>> >>>>
>>> >>>> APIs
>>> >>>>
>>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>>> >>>>
>>> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.
>>> >>>>
>>> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>> >>>>
>>> >>>> 4. Better class package structure for low-level developer APIs. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low-level developer APIs.
>>> >>>>
>>> >>>> 5. Consolidate the task metric and accumulator APIs. Although they have some subtle differences, these two are very similar but have completely different code paths.
>>> >>>>
>>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.
>>> >>>>
>>> >>>> Operation/Deployment
>>> >>>>
>>> >>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>>> >>>>
>>> >>>> 2. Remove Hadoop 1 support.
>>> >>>>
>>> >>>> 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
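To make the migration-guide idea Sandy raises above concrete, an entry for a deprecated method such as mapPartitionsWithContext might pair the old call with its nearest equivalent. The sketch below is only an illustration against the Spark 1.x Scala API (the RDD, object name, and values are made up for the example); the eventual guide may well recommend a different replacement.

```scala
import org.apache.spark.{SparkContext, TaskContext}

// Hypothetical migration-guide entry: replacing the deprecated
// mapPartitionsWithContext with mapPartitions + TaskContext.get().
object MapPartitionsMigration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "migration-sketch")
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Spark 1.x (deprecated): the TaskContext is passed to the closure explicitly.
    // val tagged = rdd.mapPartitionsWithContext { (ctx, iter) =>
    //   iter.map(x => (ctx.partitionId(), x))
    // }

    // Equivalent without the deprecated API: look the TaskContext up
    // via its thread-local accessor inside a plain mapPartitions.
    val tagged = rdd.mapPartitions { iter =>
      val ctx = TaskContext.get()
      iter.map(x => (ctx.partitionId(), x))
    }

    tagged.take(5).foreach(println)
    sc.stop()
  }
}
```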
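The Guava point (item 3 in the API list) refers to Guava types leaking through the public Java API. As a rough, hypothetical sketch of what that looks like to a caller on 1.x (object and variable names invented here, driving the Java wrapper from Scala for brevity), the missing side of an outer join comes back as com.google.common.base.Optional, coupling user code to whatever Guava version Spark bundles:

```scala
import scala.collection.JavaConverters._

import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaPairRDD

// Sketch of the Guava leak in the 1.x Java API.
object GuavaOptionalLeak {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "guava-leak-sketch")

    val names  = JavaPairRDD.fromRDD(sc.parallelize(Seq(1 -> "alice", 2 -> "bob")))
    val scores = JavaPairRDD.fromRDD(sc.parallelize(Seq(1 -> 42)))

    // Value type is (String, com.google.common.base.Optional[Int]), so unwrapping
    // the join result ties the application to Spark's bundled Guava.
    val joined = names.leftOuterJoin(scores)

    joined.collect().asScala.foreach { case (id, (name, maybeScore)) =>
      val score = if (maybeScore.isPresent) maybeScore.get.toString else "none"
      println(s"$id $name $score")
    }

    sc.stop()
  }
}
```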