There was a proposal to make schedulers pluggable in the context of adding one
that leverages Apache Tez: IIRC it was abandoned, but the JIRA might
be a good starting point.

Regards
Mridul
On Dec 3, 2015 2:59 PM, "Rad Gruchalski" <ra...@gruchalski.com> wrote:

> There was talk in this thread about removing the fine-grained Mesos
> scheduler. I think it would be a loss to lose it completely; however, I
> understand that it might be a burden to keep it under development for Mesos
> only.
> I've been thinking about it for a while: it would be great if the
> schedulers were pluggable. If Spark 2 could offer a way of registering a
> scheduling mechanism, then the Mesos fine-grained scheduler could be moved
> to a separate project and, possibly, maintained by a separate community.
> This would also enable people to add more schedulers in the future -
> Kubernetes comes to mind, but Docker Swarm would also become an option.
> This would allow the ecosystem to grow a bit.
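
To make the idea concrete: nothing like this exists in Spark today, and every name below is hypothetical, but the registration could be as small as a service-provider trait that SparkContext consults when it sees a master URL it doesn't recognize. A rough Scala sketch:

    // Hypothetical sketch only -- not an existing Spark API. TaskScheduler and
    // SchedulerBackend stand in for whatever core ultimately exposes to plugins.
    trait ExternalSchedulerProvider {
      // e.g. return true for "mesos-fine://..." or "kubernetes://..."
      def canHandle(masterUrl: String): Boolean

      // Build the scheduler pair Spark core needs to run jobs on this backend.
      def createScheduler(sc: SparkContext,
                          masterUrl: String): (TaskScheduler, SchedulerBackend)
    }

    // Implementations could live in separate repos (Mesos fine-grained, Kubernetes,
    // Docker Swarm, ...) and be discovered via java.util.ServiceLoader.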
>
> I’d be very interested in working on such a feature.
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com
> de.linkedin.com/in/radgruchalski/
>
>
> *Confidentiality:* This communication is intended for the above-named
> person and may be confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Thursday, 3 December 2015 at 21:28, Koert Kuipers wrote:
>
> Spark 1.x has been supporting Scala 2.11 for 3 or 4 releases now. Seems to
> me you already provide a clear upgrade path: get on Scala 2.11 before
> upgrading to Spark 2.x.
>
> From the Scala team when Scala 2.10.6 came out:
> We strongly encourage you to upgrade to the latest stable version of Scala
> 2.11.x, as the 2.10.x series is no longer actively maintained.
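
For what it's worth, the user-side change is usually just the build definition; a minimal sbt sketch, with the version numbers only as examples (Spark 1.x already publishes _2.11 artifacts, so %% picks them up):

    // build.sbt -- move the application to Scala 2.11 before moving to Spark 2.x.
    scalaVersion := "2.11.7"

    // %% resolves the Scala-2.11 build of Spark (spark-core_2.11); "provided"
    // because the cluster supplies Spark at runtime.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"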
>
>
>
>
>
> On Thu, Dec 3, 2015 at 1:03 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
> Reynold's post from Nov. 25:
>
> I don't think we should drop support for Scala 2.10, or make it harder in
> terms of operations for people to upgrade.
>
> If there are further objections, I'm going to remove the 1.7 version
> and retarget things to 2.0 on JIRA.
>
>
> On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:
>
> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
> think that's premature. If there's no 1.7.0 then we've lost info about
> what it would contain. It's trivial at any later point to merge the
> versions. And, since things change and there's not a pressing need to
> decide one way or the other, it seems fine to at least keep collecting this
> info, just as we keep versions like "1.4.3" that may never be released. I'd
> like to add it back.
>
> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
> > Maintaining both a 1.7 and a 2.0 is too much work for the project, which
> > is over-stretched now. This means that after 1.6 it's just small
> > maintenance releases in 1.x and no substantial features or evolution.
> > This means that the "in progress" APIs in 1.x will stay that way
> > unless one updates to 2.x. It's not unreasonable, but it means the update
> > to the 2.x line isn't going to be all that optional for users.
> >
> > Scala 2.10 is already EOL, right? Note that supporting it in 2.x means
> > supporting it for a couple more years. 2.10 is still used today, but that's the
> > point of the current stable 1.x release in general: if you want to
> > stick to current dependencies, stick to the current release. Although
> > I think that's the right way to think about support across major
> > versions in general, I can see that 2.x is more of a required update
> > for those following the project's fixes and releases. Hence it may indeed
> > be important to just keep supporting 2.10.
> >
> > I can't see supporting 2.12 at the same time (right?). Is that a
> > concern? It will be long since GA by the time 2.x is first released.
> >
> > There's another fairly coherent worldview where development continues
> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> > 2.0 is delayed somewhat into next year, and by that time supporting
> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
> > currently deployed versions.
> >
> > I can't say I have a strong view but I personally hadn't imagined 2.x
> > would start now.
> >
> >
> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <r...@databricks.com>
> wrote:
> >> I don't think we should drop support for Scala 2.10, or make it harder in
> >> terms of operations for people to upgrade.
> >>
> >> If there are further objections, I'm going to remove the 1.7 version
> >> and retarget things to 2.0 on JIRA.
> >>
> >>
> >> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sandy.r...@cloudera.com>
> >> wrote:
> >>>
> >>> I see.  My concern is / was that cluster operators will be reluctant to
> >>> upgrade to 2.0, meaning that developers using those clusters need to
> stay on
> >>> 1.x, and, if they want to move to DataFrames, essentially need to port
> their
> >>> app twice.
> >>>
> >>> I misunderstood and thought part of the proposal was to drop support
> for
> >>> 2.10 though.  If your broad point is that there aren't changes in 2.0
> that
> >>> will make it less palatable to cluster administrators than releases in
> the
> >>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
> >>>
> >>> -Sandy
> >>>
> >>>
> >>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <
> matei.zaha...@gmail.com>
> >>> wrote:
> >>>>
> >>>> What are the other breaking changes in 2.0 though? Note that we're not
> >>>> removing Scala 2.10, we're just making the default build be against
> Scala
> >>>> 2.11 instead of 2.10. There seem to be very few changes that people
> would
> >>>> worry about. If people are going to update their apps, I think it's
> better
> >>>> to make the other small changes in 2.0 at the same time than to
> update once
> >>>> for Dataset and another time for 2.0.
> >>>>
> >>>> BTW just refer to Reynold's original post for the other proposed API
> >>>> changes.
> >>>>
> >>>> Matei
> >>>>
> >>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
> >>>>
> >>>> I think that Kostas' logic still holds.  The majority of Spark users, and
> >>>> likely an even vaster majority of people running large jobs, are still on
> >>>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
> >>>> to upgrade to the stable version of the Dataset / DataFrame API so they
> >>>> don't need to do so twice.  Requiring that they absorb all the other ways
> >>>> that Spark breaks compatibility in the move to 2.0 makes it much more
> >>>> difficult for them to make this transition.
> >>>>
> >>>> Using the same set of APIs also means that it will be easier to
> backport
> >>>> critical fixes to the 1.x line.
> >>>>
> >>>> It's not clear to me that avoiding breakage of an experimental API in
> the
> >>>> 1.x line outweighs these issues.
> >>>>
> >>>> -Sandy
> >>>>
> >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com>
> >>>> wrote:
> >>>>>
> >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The
> >>>>> reason is that I already know we have to break some part of the
> >>>>> DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map
> >>>>> should return a Dataset rather than an RDD). In that case, I'd rather break this
> >>>>> sooner (in one release) than later (in two releases), so the damage is
> >>>>> smaller.
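
Roughly the signature change being referred to, from memory; the 2.0 side is only the proposal, not a settled API:

    // Spark 1.x: mapping a DataFrame falls back to a plain RDD and loses the optimizer:
    //   def map[R: ClassTag](f: Row => R): RDD[R]
    // so, for a hypothetical df:
    //   val names: RDD[String] = df.map(_.getString(0))
    //
    // Proposed shape for 2.0: the same call stays in the Dataset world, keeping
    // Catalyst/Tungsten in play:
    //   def map[U: Encoder](f: Row => U): Dataset[U]
    //   val names: Dataset[String] = df.map(_.getString(0))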
> >>>>>
> >>>>> I don't think whether we call Dataset/DataFrame experimental or not
> >>>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
> >>>>> then mark it as stable in 2.1. Despite being "experimental", there have
> >>>>> been no breaking changes to DataFrame from 1.3 to 1.6.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <
> m...@clearstorydata.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
> >>>>>> fixing.  We're on the same page now.
> >>>>>>
> >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <
> kos...@cloudera.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> A 1.6.x release will only fix bugs - we typically don't change
> APIs in
> >>>>>>> z releases. The Dataset API is experimental and so we might be
> changing the
> >>>>>>> APIs before we declare it stable. This is why I think it is
> important to
> >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before
> moving to
> >>>>>>> Spark 2.0. This will benefit users that would like to use the new
> Dataset
> >>>>>>> APIs but can't move to Spark 2.0 because of the backwards
> incompatible
> >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
> >>>>>>>
> >>>>>>> Kostas
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra
> >>>>>>> <m...@clearstorydata.com> wrote:
> >>>>>>>>
> >>>>>>>> Why does stabilization of those two features require a 1.7 release
> >>>>>>>> instead of 1.6.1?
> >>>>>>>>
> >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis
> >>>>>>>> <kos...@cloudera.com> wrote:
> >>>>>>>>>
> >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we
> >>>>>>>>> can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to
> >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will allow us to
> >>>>>>>>> stabilize a few of the new features that were added in 1.6:
> >>>>>>>>>
> >>>>>>>>> 1) the experimental Datasets API
> >>>>>>>>> 2) the new unified memory manager.
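
For reference, the 1.6 unified memory manager is controlled by a handful of new settings; a minimal Scala sketch, with default values quoted from memory, so double-check the 1.6 docs:

    // Sketch only; the same keys can also go in spark-defaults.conf.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.memory.fraction", "0.75")        // heap share for execution + storage
      .set("spark.memory.storageFraction", "0.5")  // portion of that protected from eviction
      .set("spark.memory.useLegacyMode", "false")  // "true" falls back to the pre-1.6 model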
> >>>>>>>>>
> >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy
> transition
> >>>>>>>>> but there will be users that won't be able to seamlessly upgrade
> given what
> >>>>>>>>> we have discussed as in scope for 2.0. For these users, having a
> 1.x release
> >>>>>>>>> with these new features/APIs stabilized will be very beneficial.
> This might
> >>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a
> bad thing.
> >>>>>>>>>
> >>>>>>>>> Any thoughts on this timeline?
> >>>>>>>>>
> >>>>>>>>> Kostas Sakellis
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Agree, more features/APIs/optimizations need to be added in DF/DS.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
> >>>>>>>>>> provide to developers; maybe the fundamental APIs are enough, like
> >>>>>>>>>> ShuffledRDD etc.  But PairRDDFunctions is probably not in this category, as we
> >>>>>>>>>> can do the same thing easily with DF/DS, with even better performance.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com]
> >>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
> >>>>>>>>>> To: Stephen Boesch
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Cc: dev@spark.apache.org
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
> >>>>>>>>>> argues for retaining the RDD API but not as the first thing presented to new
> >>>>>>>>>> Spark developers: "Here's how to use groupBy with DataFrames.... Until the
> >>>>>>>>>> optimizer is more fully developed, that won't always get you the best
> >>>>>>>>>> performance that can be obtained.  In these particular circumstances, ...,
> >>>>>>>>>> you may want to use the low-level RDD API while setting
> >>>>>>>>>> preservesPartitioning to true.  Like this...."
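
A rough sketch of the pattern being described, with hypothetical names (records, keyOf), and assuming the data has already been partitioned on the key so the grouping can happen per-partition without another shuffle:

    import org.apache.spark.HashPartitioner

    // records and keyOf are hypothetical; the point is the explicit partitioner.
    val pairs = records.map(r => (keyOf(r), r)).partitionBy(new HashPartitioner(200))

    // Every key now lives entirely in one partition, so grouping can be done locally.
    // preservesPartitioning = true records that the keys did not move, letting a
    // later reduceByKey/join on the same partitioner skip the shuffle.
    val grouped = pairs.mapPartitions(
      iter => iter.toSeq.groupBy(_._1).iterator,   // materializes the partition; fine for a sketch
      preservesPartitioning = true)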
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <
> java...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> My understanding is that RDDs presently have more support for
> >>>>>>>>>> complete control of partitioning, which is a key consideration at scale.
> >>>>>>>>>> While partitioning control is still piecemeal in DF/DS, it would seem
> >>>>>>>>>> premature to make RDDs a second-tier approach to Spark development.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> An example is the use of groupBy when we know that the source
> >>>>>>>>>> relation (/RDD) is already partitioned on the grouping expressions.  AFAIK
> >>>>>>>>>> Spark SQL still does not allow that knowledge to be applied to the
> >>>>>>>>>> optimizer, so a full shuffle will be performed. However, with the native RDD
> >>>>>>>>>> API we can use preservesPartitioning=true.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <
> m...@clearstorydata.com>:
> >>>>>>>>>>
> >>>>>>>>>> The place of the RDD API in 2.0 is also something I've been
> >>>>>>>>>> wondering about.  I think it may be going too far to deprecate
> it, but
> >>>>>>>>>> changing emphasis is something that we might consider.  The RDD
> API came
> >>>>>>>>>> well before DataFrames and DataSets, so programming guides,
> introductory
> >>>>>>>>>> how-to articles and the like have, to this point, also tended
> to emphasize
> >>>>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking
> is that with
> >>>>>>>>>> 2.0 maybe we should overhaul all the documentation to
> de-emphasize and
> >>>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would
> be
> >>>>>>>>>> introduced and fully addressed before RDDs.  They would be
> presented as the
> >>>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in
> contrast, would
> >>>>>>>>>> be presented later as a kind of lower-level,
> closer-to-the-metal API that
> >>>>>>>>>> can be used in atypical, more specialized contexts where
> DataFrames or
> >>>>>>>>>> DataSets don't fully fit.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <
> hao.ch...@intel.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I am not sure what the best practice is for this specific problem, but
> >>>>>>>>>> it’s really worth thinking about in 2.0, as it is a painful issue for
> >>>>>>>>>> lots of users.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
> >>>>>>>>>> make it an internal API only?), as lots of its functionality overlaps
> >>>>>>>>>> with DataFrame or Dataset?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hao
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Kostas Sakellis [mailto:kos...@cloudera.com]
> >>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
> >>>>>>>>>> To: Nicholas Chammas
> >>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com;
> dev@spark.apache.org;
> >>>>>>>>>> Reynold Xin
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I know we want to keep breaking changes to a minimum, but I'm hoping
> >>>>>>>>>> that with Spark 2.0 we can also look at better classpath isolation for user
> >>>>>>>>>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> >>>>>>>>>> setting it to true by default, and not allowing any Spark transitive dependencies
> >>>>>>>>>> to leak into user code. For backwards compatibility we can have a whitelist
> >>>>>>>>>> if we want, but I think it'd be good if we start requiring user apps to
> >>>>>>>>>> explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is
> >>>>>>>>>> also moving in this direction.
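
For context, the two settings named above already exist today, just defaulted to false; a minimal Scala sketch of turning them on, with everything else unchanged:

    // Sketch only; in practice these usually go in spark-defaults.conf or on
    // spark-submit via --conf, since the driver setting must be known at launch.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.driver.userClassPathFirst", "true")    // user jars win on the driver
      .set("spark.executor.userClassPathFirst", "true")  // user jars win on executors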
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Kostas
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas
> >>>>>>>>>> <nicholas.cham...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> With regards to machine learning, it would be great to move useful
> >>>>>>>>>> features from MLlib to ML and deprecate the former. The current structure of
> >>>>>>>>>> two separate machine learning packages seems somewhat confusing.
> >>>>>>>>>>
> >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
> >>>>>>>>>> RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with
> >>>>>>>>>> Tungsten.
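
To illustrate the two-package confusion: the same algorithm currently lives in both packages with different data models. A minimal sketch, with the inputs (labeledPoints, trainingDF) purely hypothetical:

    // RDD-based API ("MLlib"): trains on an RDD[LabeledPoint]
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    val oldModel = new LogisticRegressionWithLBFGS().run(labeledPoints)   // labeledPoints: hypothetical RDD

    // DataFrame-based API ("ML"): trains on a DataFrame with label/features columns
    import org.apache.spark.ml.classification.LogisticRegression
    val newModel = new LogisticRegression().fit(trainingDF)               // trainingDF: hypothetical DataFrame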
> >>>>>>>>>>
> >>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
> >>>>>>>>>> some things in 2.0 without removing or replacing them
> immediately. That way
> >>>>>>>>>> 2.0 doesn’t have to wait for everything that we want to
> deprecate to be
> >>>>>>>>>> replaced all at once.
> >>>>>>>>>>
> >>>>>>>>>> Nick
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander
> >>>>>>>>>> <alexander.ula...@hpe.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Parameter Server is a new feature and thus does not match the goal
> >>>>>>>>>> of 2.0, which is “to fix things that are broken in the current API and remove
> >>>>>>>>>> certain deprecated APIs”. At the same time, I would be happy to have that
> >>>>>>>>>> feature.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> With regards to machine learning, it would be great to move useful
> >>>>>>>>>> features from MLlib to ML and deprecate the former. The current structure of
> >>>>>>>>>> two separate machine learning packages seems somewhat confusing.
> >>>>>>>>>>
> >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
> >>>>>>>>>> RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with
> >>>>>>>>>> Tungsten.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Best regards, Alexander
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
> >>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
> >>>>>>>>>> To: wi...@qq.com
> >>>>>>>>>> Cc: dev@spark.apache.org
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Regarding the Parameter Server specifically, I think the current agreement
> >>>>>>>>>> is that PS shall exist as a third-party library instead of a component of
> >>>>>>>>>> the core code base, isn’t it?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> Nan Zhu
> >>>>>>>>>>
> >>>>>>>>>> http://codingcat.me
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
> >>>>>>>>>>
> >>>>>>>>>> Who has ideas about machine learning? Spark is missing some features
> >>>>>>>>>> for machine learning, for example a parameter server.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I like the idea of popping out Tachyon to an optional component
> too
> >>>>>>>>>> to reduce the number of dependencies. In the future, it might
> even be useful
> >>>>>>>>>> to do this for Hadoop, but it requires too many API changes to
> be worth
> >>>>>>>>>> doing now.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually,
> >>>>>>>>>> but I don't think we need to block 2.0 on that because it can be added later
> >>>>>>>>>> too. Has anyone investigated what it would take to run on it? I imagine
> >>>>>>>>>> we don't need many code changes, just maybe some REPL stuff.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Needless to say, I'm all for the idea of making "major"
> >>>>>>>>>> releases as undisruptive as possible in the model Reynold proposed. Keeping
> >>>>>>>>>> everyone working with the same set of releases is super important.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Matei
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <
> r...@databricks.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> to the Spark community. A major release should not be very
> >>>>>>>>>> different from a
> >>>>>>>>>>
> >>>>>>>>>> minor release and should not be gated based on new features. The
> >>>>>>>>>> main
> >>>>>>>>>>
> >>>>>>>>>> purpose of a major release is an opportunity to fix things that
> are
> >>>>>>>>>> broken
> >>>>>>>>>>
> >>>>>>>>>> in the current API and remove certain deprecated APIs (examples
> >>>>>>>>>> follow).
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Agree with this stance. Generally, a major release might also
> be a
> >>>>>>>>>>
> >>>>>>>>>> time to replace some big old API or implementation with a new
> one,
> >>>>>>>>>> but
> >>>>>>>>>>
> >>>>>>>>>> I don't see obvious candidates.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later,
> unless
> >>>>>>>>>>
> >>>>>>>>>> there's a fairly good reason to continue adding features in 1.x
> to
> >>>>>>>>>> a
> >>>>>>>>>>
> >>>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala
> >>>>>>>>>> 2.10, but it has reached end-of-life.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version,
> 2.11
> >>>>>>>>>> will
> >>>>>>>>>>
> >>>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
> >>>>>>>>>> propose
> >>>>>>>>>>
> >>>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2. Remove Hadoop 1 support.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'd go further and drop support for <2.2 for sure (2.0 and 2.1 were
> >>>>>>>>>> sort of 'alpha' and 'beta' releases), and even <2.6.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'm sure we'll think of a number of other small things --
> shading a
> >>>>>>>>>>
> >>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
> >>>>>>>>>>
> >>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
> >>>>>>>>>> this?)
> >>>>>>>>>>
> >>>>>>>>>> Pop out any Docker stuff to another repo?
> >>>>>>>>>>
> >>>>>>>>>> Continue that same effort for EC2?
> >>>>>>>>>>
> >>>>>>>>>> Farming out some of the "external" integrations to another repo
> (?
> >>>>>>>>>>
> >>>>>>>>>> controversial)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> See also anything marked version "2+" in JIRA.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>
>
