To take a stab at a concrete, anticipatory example, I can go back to something I mentioned previously. It's not really a good example, since I don't mean to imply that I believe its premises are true, but try to go with it... If we were to decide that real-time, event-based streaming is something we really think we'll want to do in the 2.x cycle, and that the current API (after having deprecations removed and clear mistakes/inadequacies remedied) isn't adequate to support it, would we want to "take our best shot" at defining a new API at the outset of 2.0? Another way of looking at it is whether API changes in 2.0 should be entirely backward-looking, fixing problems that we've already identified, or whether there is room for some forward-looking changes that are intended to open new directions for Spark development.
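To make that slightly more tangible, here is a rough sketch of the kind of forward-looking API I have in mind. To be clear, everything in it is hypothetical: none of these names exist in Spark, and it assumes nothing about what a real design would look like.

    // Purely hypothetical sketch of an event-based streaming API -- the kind
    // of "anticipatory" public surface we might try to define at the outset
    // of 2.0. None of these types exist in Spark; all names are made up.
    trait Event[T] {
      def value: T
      def eventTime: Long // event time in epoch millis, not processing time
    }

    trait EventStream[T] {
      def map[U](f: T => U): EventStream[U]
      def filter(p: T => Boolean): EventStream[T]
      // group events into windows by event time rather than arrival time
      def window(durationMs: Long): EventStream[Seq[T]]
      def foreachEvent(f: Event[T] => Unit): Unit
    }

The point wouldn't be this particular design, but whether we want to commit to any such surface in 2.0 before the inadequacy of the current API has actually been demonstrated.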
On Tue, Nov 10, 2015 at 7:04 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> Heh... ok, I was intentionally pushing those bullet points to be extreme to find where people would start pushing back, and I'll agree that we probably do want some new features in 2.0 -- but I think we've got good agreement that new features aren't really the main point of doing a 2.0 release.
>
> I don't really have a concrete example of an anticipatory change, and that's actually kind of the problem with trying to anticipate what we'll need in the way of new public API and the like: until what we already have is clearly inadequate, it's hard to concretely imagine how things really should be. At this point I don't have anything specific where I can say "I really want to do __ with Spark in the future, and I think it should be changed in this way in 2.0 to allow me to do that." I'm just wondering whether we want to even entertain those kinds of change requests if people have them, or whether we can just delay making those kinds of decisions until it is really obvious that what we have doesn't work and that there is clearly something better that should be done.
>
> On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Mark,
>>
>> I think we are in agreement, although I wouldn't go to the extreme and say "a release with no new features might even be best."
>>
>> Can you elaborate on "anticipatory changes"? A concrete example or so would be helpful.
>>
>> On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> I'm liking the way this is shaping up, and I'd summarize it this way (let me know if I'm misunderstanding or misrepresenting anything):
>>>
>>> - New features are not at all the focus of Spark 2.0 -- in fact, a release with no new features might even be best.
>>> - Remove deprecated API that we agree really should be deprecated.
>>> - Fix/change publicly-visible things that anyone who has spent any time looking at them already knows are mistakes or should be done better, but that can't be changed within 1.x.
>>>
>>> Do we want to attempt anticipatory changes at all? In other words, are there things we want to do in 2.x for which we already know that we'll want to make publicly-visible changes, or that, if we don't add or change them now, will fall into the "everybody knows it shouldn't be that way" category when it comes time to discuss the Spark 3.0 release? I'd be fine if we don't try at all to anticipate what is needed -- working from the premise that being forced into a 3.x release earlier than we expect would be less painful than trying to back out a mistake made at the outset of 2.0 while guessing what we'll need.
>>>
>>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> I'm starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn't be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>>>>
>>>> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated on new features.
>>>> The main purpose of a major release is to provide an opportunity to fix things that are broken in the current API and to remove certain deprecated APIs (examples follow).
>>>>
>>>> For this reason, I would *not* propose doing major releases to break substantial APIs or to perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving its architecture incrementally and making changes - and I don't think we want to change this model. In fact, we've released many architectural changes on the 1.x line.
>>>>
>>>> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. That would be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>>>>
>>>> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>>>>
>>>> APIs
>>>>
>>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>>>>
>>>> 2. Remove Akka from Spark's API dependency (in streaming), so user applications can use their own Akka version (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark's dependency on it. (A sketch of the 1.x coupling follows below the quoted thread.)
>>>>
>>>> 3. Remove Guava from Spark's public API (the Optional in JavaRDD; see the signature excerpt below the quoted thread).
>>>>
>>>> 4. Better class/package structure for low-level developer APIs. In particular, we have some DeveloperApi classes (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private ones. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes.
>>>>
>>>> 5. Consolidate the task metric and accumulator APIs. Although they have some subtle differences, the two are very similar but have completely different code paths. (A sketch of the overlap follows below the quoted thread.)
>>>>
>>>> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming as well.
>>>>
>>>> Operation/Deployment
>>>>
>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>>>>
>>>> 2. Remove Hadoop 1 support.
>>>>
>>>> 3. Assembly-free distribution of Spark: don't require building an enormous assembly jar in order to run Spark.
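For item 2 under APIs, it may help to see where the coupling comes from. In the 1.x streaming API, custom receivers can be implemented as Akka actors, so user code compiles directly against the Akka version Spark ships. A minimal sketch against the 1.x API (assuming a Spark 1.x streaming classpath):

    // Spark 1.x exposes Akka in its public streaming API: a receiver is an
    // Akka actor mixed with Spark's ActorHelper, so the application's Akka
    // version must match the one Spark ships.
    import akka.actor.{Actor, Props}
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.ReceiverInputDStream
    import org.apache.spark.streaming.receiver.ActorHelper

    class LineReceiver extends Actor with ActorHelper {
      def receive = {
        case line: String => store(line) // hand each event to Spark
      }
    }

    def lines(ssc: StreamingContext): ReceiverInputDStream[String] =
      ssc.actorStream[String](Props[LineReceiver], "LineReceiver")

Moving this out of the public API would let applications bring whatever Akka version they like.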
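Item 3 refers to Guava's Optional appearing in the public Java API, most visibly in the outer joins on JavaPairRDD. Roughly the 1.x shape (a signature excerpt for illustration, not the actual class):

    // Approximate shape of the Guava leak in the 1.x Java API: the return
    // type of leftOuterJoin exposes com.google.common.base.Optional, so any
    // Java application doing an outer join is pinned to Spark's Guava.
    import com.google.common.base.Optional
    import org.apache.spark.api.java.JavaPairRDD

    trait PairOps[K, V] {
      def leftOuterJoin[W](
          other: JavaPairRDD[K, W]): JavaPairRDD[K, (V, Optional[W])]
    }

Removing Guava here would presumably mean replacing that Optional with a Spark-owned (or standard-library) equivalent in 2.0.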
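On item 5, the overlap is easy to demonstrate from user code: an accumulator that counts records is re-implementing what task metrics (e.g. records read) already track internally, just through a completely separate code path. A minimal 1.x-style sketch, in spark-shell terms (where sc is the predefined SparkContext):

    // 1.x accumulator API: counting records by hand duplicates bookkeeping
    // that task metrics already do internally, via a different code path.
    val recordCount = sc.accumulator(0L, "records processed")

    sc.parallelize(1 to 1000).foreach { _ =>
      recordCount += 1L // incremented on executors, merged back on the driver
    }

    println(s"records processed = ${recordCount.value}") // 1000

A consolidated API would let both user counters and Spark's own metrics flow through one mechanism.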