Agree. If it is deprecated, get rid of it in 2.0. If the deprecation was a mistake, let's fix that.

Suds

Sent from my iPhone
On Nov 10, 2015, at 5:04 PM, Reynold Xin <r...@databricks.com> wrote:

Maybe a better idea is to un-deprecate an API if it is too important to not be removed.

I don't think we can drop Java 7 support. It's way too soon.

On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Really, Sandy? "Extra consideration" even for an already-deprecated API? If we're not going to remove these with a major version change, then just when will we remove them?
>
> On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>> Another +1 to Reynold's proposal.
>>
>> Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see heavier usage (e.g. I wouldn't be surprised if mapPartitionsWithContext was baked into a number of apps) and merit a little extra consideration.
>>
>> Maybe also obvious, but I think a migration guide with API equivalents and the like would be incredibly useful in easing the transition.
>>
>> -Sandy
>>
>> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <r...@databricks.com> wrote:
>>> Echoing Shivaram here. I don't think it makes a lot of sense to add more features to the 1.x line. We should still do critical bug fixes though.
>>>
>>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>>>> +1
>>>>
>>>> On a related note I think making it lightweight will ensure that we stay on the current release schedule and don't unnecessarily delay 2.0 to wait for new features / big architectural changes.
>>>>
>>>> In terms of fixes to 1.x, I think our current policy of back-porting fixes to older releases would still apply. I don't think developing new features on both 1.x and 2.x makes a lot of sense as we would like users to switch to 2.x.
>>>>
>>>> Shivaram
>>>>
>>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>>> > +1 on a lightweight 2.0
>>>> >
>>>> > What is the thinking around the 1.x line after Spark 2.0 is released? If not terminated, how will we determine what goes into each major version line? Will 1.x only be for stability fixes?
>>>> >
>>>> > Thanks,
>>>> > Kostas
>>>> >
>>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>> >> I also feel the same as Reynold. I agree we should minimize API breaks and focus on fixing things around the edge that were mistakes (e.g. exposing Guava and Akka) rather than any overhaul that could fragment the community. Ideally a major release is a lightweight process we can do every couple of years, with minimal impact for users.
>>>> >>
>>>> >> - Patrick
>>>> >>
>>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>> >>> > For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.
>>>> >>>
>>>> >>> +1 for this.
>>>> >>> The Python community went through a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.
>>>> >>>
>>>> >>> > 3. Assembly-free distribution of Spark: don't require building an enormous assembly jar in order to run Spark.
>>>> >>>
>>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free distribution means.
>>>> >>>
>>>> >>> Nick
>>>> >>>
>>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <r...@databricks.com> wrote:
>>>> >>>> I'm starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn't be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>>>> >>>>
>>>> >>>> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>>>> >>>>
>>>> >>>> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we've released many architectural changes on the 1.X line.
>>>> >>>>
>>>> >>>> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>>>> >>>>
>>>> >>>> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>>>> >>>>
>>>> >>>> APIs
>>>> >>>>
>>>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>>>> >>>>
>>>> >>>> 2. Remove Akka from Spark's API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark's dependency on Akka.
>>>> >>>>
>>>> >>>> 3. Remove Guava from Spark's public API (JavaRDD Optional).
>>>> >>>>
>>>> >>>> 4. Better class package structure for low-level developer APIs. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes.
>>>> >>>> A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low-level developer APIs.
>>>> >>>>
>>>> >>>> 5. Consolidate the task metric and accumulator APIs. Although they have some subtle differences, the two are very similar but have completely different code paths.
>>>> >>>>
>>>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.
>>>> >>>>
>>>> >>>> Operation/Deployment
>>>> >>>>
>>>> >>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>>>> >>>>
>>>> >>>> 2. Remove Hadoop 1 support.
>>>> >>>>
>>>> >>>> 3. Assembly-free distribution of Spark: don't require building an enormous assembly jar in order to run Spark.
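
To make Sandy's migration-guide suggestion concrete, here is a minimal sketch of the kind of entry such a guide could contain, using the deprecated mapPartitionsWithContext he mentions as the example. The replacement shown (fetching the context inside the closure via TaskContext.get()) is the route the 1.x deprecation notice points at; the job itself is an illustrative assumption, not code taken from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object MapPartitionsMigration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("migration-sketch").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 100, 4)

    // Spark 1.x, deprecated: the TaskContext is handed to the closure directly.
    // val tagged = rdd.mapPartitionsWithContext { (ctx, iter) =>
    //   iter.map(x => (ctx.partitionId(), x))
    // }

    // Equivalent code that survives the removal in 2.0: obtain the context
    // from inside the closure via TaskContext.get().
    val tagged = rdd.mapPartitions { iter =>
      val ctx = TaskContext.get()
      iter.map(x => (ctx.partitionId(), x))
    }

    tagged.take(5).foreach(println)
    sc.stop()
  }
}
```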
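On API item 5 (consolidating task metrics and accumulators), the user-facing half of that overlap is the accumulator API below. This is a self-contained sketch of 1.x accumulator usage, included only to show the surface the consolidation would touch; the counter name and sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("accumulator-sketch").setMaster("local[*]"))

    // A named accumulator: a driver-visible counter that each task increments.
    // Task metrics track very similar per-task counts through a separate code
    // path, which is the duplication item 5 proposes to consolidate.
    val emptyLines = sc.accumulator(0L, "empty lines")

    val lines = sc.parallelize(Seq("spark", "", "2.0", "", "thread"))
    val lengths = lines.map { line =>
      if (line.isEmpty) emptyLines += 1L
      line.length
    }

    lengths.count()  // accumulator values are only reliable after an action runs
    println(s"empty lines seen: ${emptyLines.value}")
    sc.stop()
  }
}
```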
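On Operation/Deployment item 1, the default-build change mostly shows up for application authors in the Scala binary-version suffix of the published artifacts. A brief build.sbt sketch, with illustrative version numbers that are assumptions rather than anything stated in the thread:

```scala
// build.sbt -- version numbers are illustrative only
scalaVersion := "2.11.7"

// The %% operator appends the Scala binary suffix, so this resolves to
// spark-core_2.11 once 2.11 builds are published as the default.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
```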