Heh... ok, I was intentionally pushing those bullet points to be extreme to
find where people would start pushing back, and I'll agree that we do
probably want some new features in 2.0 -- but I think we've got good
agreement that new features aren't really the main point of doing a 2.0
release.

I don't really have a concrete example of an anticipatory change, and
that's actually kind of the problem with trying to anticipate what we'll
need in the way of new public API and the like: Until what we already have
is clearly inadequate, it's hard to concretely imagine how things really
should be.  At this point I don't have anything specific where I can say "I
really want to do __ with Spark in the future, and I think it should be
changed in this way in 2.0 to allow me to do that."  I'm just wondering
whether we want to even entertain those kinds of change requests if people
have them, or whether we can just delay making those kinds of decisions
until it is really obvious that what we have doesn't work and that there is
clearly something better that should be done.

On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin <r...@databricks.com> wrote:

> Mark,
>
> I think we are in agreement, although I wouldn't go to the extreme and say
> "a release with no new features might even be best."
>
> Can you elaborate "anticipatory changes"? A concrete example or so would
> be helpful.
>
> On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> I'm liking the way this is shaping up, and I'd summarize it this way (let
>> me know if I'm misunderstanding or misrepresenting anything):
>>
>>    - New features are not at all the focus of Spark 2.0 -- in fact, a
>>    release with no new features might even be best.
>>    - Remove deprecated API that we agree really should be deprecated.
>>    - Fix/change publicly-visible things that anyone who has spent any
>>    time looking at them already knows are mistakes or should be done
>>    better, but that can't be changed within 1.x.
>>
>> Do we want to attempt anticipatory changes at all?  In other words, are
>> there things we want to do in 2.x for which we already know that we'll want
>> to make publicly-visible changes or that, if we don't add or change it now,
>> will fall into the "everybody knows it shouldn't be that way" category when
>> it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
>> try at all to anticipate what is needed -- working from the premise that
>> being forced into a 3.x release earlier than we expect would be less
>> painful than trying to back out a mistake made at the outset of 2.0 while
>> trying to guess what we'll need.
>>
>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> I’m starting a new thread since the other one got intermixed with
>>> feature requests. Please refrain from making feature requests in this
>>> thread. Not that we shouldn’t be adding features, but we can always add
>>> features in 1.7, 2.1, 2.2, ...
>>>
>>> First - I want to propose a premise for how to think about Spark 2.0 and
>>> major releases in Spark, based on discussion with several members of the
>>> community: a major release should be low overhead and minimally disruptive
>>> to the Spark community. A major release should not be very different from a
>>> minor release and should not be gated based on new features. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs (examples follow).
>>>
>>> For this reason, I would *not* propose doing major releases to break
>>> substantial APIs or perform large re-architecting that prevents users from
>>> upgrading. Spark has always had a culture of evolving architecture
>>> incrementally and making changes - and I don't think we want to change this
>>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>>
>>> If the community likes the above model, then to me it seems reasonable
>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>> cadence of major releases every 2 years seems doable within the above model.
>>>
>>> Under this model, here is a list of example things I would propose doing
>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>
>>>
>>> APIs
>>>
>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>> Spark 1.x.
>>>
>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>> applications can use any version of Akka they want (SPARK-5293). We have
>>> gotten a lot of complaints about user applications being unable to use
>>> their own version of Akka because of Spark’s dependency on it.
>>>
>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>
>>> 4. Better class package structure for low-level developer APIs. In
>>> particular, we have a number of DeveloperApi classes (mostly various
>>> listener-related classes) that were added over the years. Some packages
>>> include only one or two public classes but a lot of private classes. A
>>> better structure would be to isolate the public classes in a few public
>>> packages, and keep private classes out of those packages as much as
>>> possible.
>>>
>>> 5. Consolidate the task metric and accumulator APIs. Despite some subtle
>>> differences, the two are very similar but go through completely different
>>> code paths.
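>>>
>>> To make the overlap concrete, here is a rough sketch (the object and
>>> variable names are just for illustration): the accumulator is incremented
>>> explicitly from user code, while task metrics such as records read are
>>> maintained by a separate internal code path and only surface through
>>> SparkListener events.
>>>
>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>
>>>   object AccumulatorVsMetrics {
>>>     def main(args: Array[String]): Unit = {
>>>       val sc = new SparkContext(
>>>         new SparkConf().setAppName("acc-vs-metrics").setMaster("local[2]"))
>>>
>>>       // User-facing accumulator: the count is driven explicitly by user code.
>>>       val badRecords = sc.accumulator(0, "bad records")
>>>       sc.parallelize(Seq("1", "2", "oops", "4")).foreach { s =>
>>>         if (scala.util.Try(s.toInt).isFailure) badRecords += 1
>>>       }
>>>       println(s"bad records = ${badRecords.value}")
>>>
>>>       // Task metrics (records read, shuffle bytes, etc.) are tracked on a
>>>       // completely different internal path and are only observable through
>>>       // SparkListener events such as onTaskEnd.
>>>       sc.stop()
>>>     }
>>>   }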
>>>
>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>> moving them to other package(s). They are already used beyond SQL, e.g. in
>>> ML pipelines, and will be used by streaming also.
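>>>
>>> As a small illustration (names are just for the example), an ML pipeline
>>> stage already consumes a DataFrame even though the type still lives under
>>> the sql package:
>>>
>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>   import org.apache.spark.sql.DataFrame
>>>
>>>   object PipelineSketch {
>>>     // This ML code has nothing to do with SQL, yet it imports DataFrame
>>>     // from org.apache.spark.sql because that is where the class lives today.
>>>     def train(training: DataFrame) =
>>>       new LogisticRegression().setMaxIter(10).fit(training)
>>>   }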
>>>
>>>
>>> Operation/Deployment
>>>
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>> but it has reached end-of-life.
>>>
>>> 2. Remove Hadoop 1 support.
>>>
>>> 3. Assembly-free distribution of Spark: don’t require building an
>>> enormous assembly jar in order to run Spark.
>>>
>>>
>>
>
