Mark, I think we are in agreement, although I wouldn't go to the extreme and say "a release with no new features might even be best."
Can you elaborate on "anticipatory changes"? A concrete example or two would be helpful.

On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> I'm liking the way this is shaping up, and I'd summarize it this way (let me know if I'm misunderstanding or misrepresenting anything):
>
>   - New features are not at all the focus of Spark 2.0 -- in fact, a release with no new features might even be best.
>   - Remove deprecated API that we agree really should be deprecated.
>   - Fix/change publicly visible things that anyone who has spent any time looking at them already knows are mistakes or should be done better, but that can't be changed within 1.x.
>
> Do we want to attempt anticipatory changes at all? In other words, are there things we want to do in 2.x for which we already know that we'll want to make publicly visible changes, or that, if we don't add or change them now, will fall into the "everybody knows it shouldn't be that way" category when it comes time to discuss the Spark 3.0 release? I'd be fine if we don't try to anticipate what is needed at all -- working from the premise that being forced into a 3.x release earlier than we expect would be less painful than trying to back out a mistake made at the outset of 2.0 while guessing what we'll need.
>
> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> I'm starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn't be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated on new features. The main purpose of a major release is the opportunity to fix things that are broken in the current API and to remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving its architecture incrementally and making changes - and I don't think we want to change this model. In fact, we've released many architectural changes on the 1.x line.
>>
>> If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. That would be 18 or 21 months after Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
>>
>> 2. Remove Akka from Spark's API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark's dependency on Akka; see the build sketch below.
>>
>> 3. Remove Guava from Spark's public API (JavaRDD Optional); see the sketch below.
>> 4. Better class package structure for low-level developer APIs. In particular, we have some DeveloperApi classes (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private ones. A better structure would isolate the public classes into a few public packages, and those public packages should contain as few private classes as possible.
>>
>> 5. Consolidate the task metric and accumulator APIs. Despite some subtle differences, the two are very similar but have completely different code paths; see the sketch below.
>>
>> 6. Possibly make Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming as well.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don't require building an enormous assembly jar in order to run Spark.
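
To make item 2 in the APIs list concrete, here is a hypothetical build.sbt sketch of the conflict users run into today. The artifact coordinates and versions are illustrative only, not taken from any particular report in this thread.

// build.sbt (illustrative sketch, not a recommended configuration).
// A streaming application that wants to use Akka directly also pulls in
// Spark's own Akka dependency transitively, and the two copies can clash
// on the classpath at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided",
  // The application's own actor system -- version chosen by the app, not by Spark.
  "com.typesafe.akka" %% "akka-actor" % "2.3.11"
)
// Today users work around the clash with exclusions or shading; removing Akka
// from Spark's API (SPARK-5293) would make such workarounds unnecessary.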
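
For item 3, a minimal sketch of how Guava leaks into public signatures, written against the Spark 1.x Java API from Scala (the object name, master URL, and sample data are made up for illustration):

import com.google.common.base.Optional   // Guava type exposed by Spark's Java API
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaPairRDD

object GuavaOptionalExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "guava-optional-example")
    val left  = JavaPairRDD.fromRDD(sc.parallelize(Seq(("a", 1), ("b", 2))))
    val right = JavaPairRDD.fromRDD(sc.parallelize(Seq(("a", "x"))))

    // In Spark 1.x, leftOuterJoin on the Java API exposes Guava's Optional in its
    // return type, so every caller is pinned to the Guava version Spark ships with.
    val joined: JavaPairRDD[String, (Int, Optional[String])] = left.leftOuterJoin(right)
    println(joined.collect())

    sc.stop()
  }
}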
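
And for item 5, a small sketch of the overlap using the 1.x accumulator API (the input path is just a placeholder): the user-defined accumulator re-counts something the built-in task input metrics already track, through an entirely separate code path.

import org.apache.spark.SparkContext

object AccumulatorOverlapExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "accumulator-overlap-example")

    // A user-defined accumulator counting processed records...
    val recordsSeen = sc.accumulator(0L, "records seen")

    // ...duplicates what the built-in task metrics (records read, bytes read, ...)
    // already report in the web UI, but via a completely different code path.
    sc.textFile("README.md").foreach { _ => recordsSeen += 1L }  // placeholder input path

    println(s"records seen via accumulator: ${recordsSeen.value}")
    sc.stop()
  }
}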