There was talk in this thread about removing the fine-grained Mesos scheduler. I think it would be a loss to lose it completely; however, I understand that it might be a burden to keep it under development for Mesos only. Having thought about it for a while, I believe it would be great if the schedulers were pluggable. If Spark 2 could offer a way of registering a scheduling mechanism, then the Mesos fine-grained scheduler could be moved to a separate project and, possibly, maintained by a separate community. This would also enable people to add more schedulers in the future - Kubernetes comes to mind, but Docker Swarm would also become an option. This would allow the ecosystem to grow a bit.
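To make the idea a bit more concrete, here is a rough sketch of what such a registration point might look like. To be clear, everything below is hypothetical - the factory trait and the ServiceLoader-based lookup are made up for illustration and are not existing Spark APIs (SchedulerBackend itself is currently an internal interface):

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._
    import org.apache.spark.{SparkContext, SparkException}
    import org.apache.spark.scheduler.SchedulerBackend

    // Hypothetical plugin surface. An external project (e.g. a Mesos
    // fine-grained module) would implement this and ship it on the classpath.
    trait ExternalSchedulerBackendFactory {
      // Master URL schemes this backend claims, e.g. "mesos", "k8s".
      def masterSchemes: Seq[String]
      // Build the backend for the given context and master URL.
      def createBackend(sc: SparkContext, masterUrl: String): SchedulerBackend
    }

    // Spark core could then resolve the backend at startup via the standard
    // java.util.ServiceLoader mechanism instead of hard-coding each manager.
    def resolveBackend(sc: SparkContext, masterUrl: String): SchedulerBackend = {
      val factories =
        ServiceLoader.load(classOf[ExternalSchedulerBackendFactory]).asScala
      factories
        .find(f => f.masterSchemes.exists(s => masterUrl.startsWith(s + "://")))
        .map(_.createBackend(sc, masterUrl))
        .getOrElse(throw new SparkException(
          s"No pluggable scheduler backend found for master URL: $masterUrl"))
    }

With something like this in place, the fine-grained Mesos backend would just be one more implementation discovered at runtime.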
I'd be very interested in working on such a feature.

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Thursday, 3 December 2015 at 21:28, Koert Kuipers wrote:
> spark 1.x has been supporting scala 2.11 for 3 or 4 releases now. it seems to me you already provide a clear upgrade path: get on scala 2.11 before upgrading to spark 2.x.
>
> from the scala team when scala 2.10.6 came out: "We strongly encourage you to upgrade to the latest stable version of Scala 2.11.x, as the 2.10.x series is no longer actively maintained."
>
> On Thu, Dec 3, 2015 at 1:03 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > Reynold's post from Nov. 25:
> >
> > > I don't think we should drop support for Scala 2.10, or make it harder in terms of operations for people to upgrade.
> > >
> > > If there are further objections, I'm going to remove the 1.7 version and retarget things to 2.0 on JIRA.
> >
> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:
> > > Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I think that's premature. If there's a 1.7.0, then we've lost info about what it would contain. It's trivial at any later point to merge the versions. And, since things change and there's not a pressing need to decide one way or the other, it seems fine to at least collect this info, like we have things like "1.4.3" that may never be released. I'd like to add it back.
> > >
> > > On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
> > > > Maintaining both a 1.7 and a 2.0 is too much work for the project, which is over-stretched as it is. This means that after 1.6 there are just small maintenance releases in 1.x and no substantial features or evolution. It also means that the "in progress" APIs in 1.x will stay that way unless one updates to 2.x. That's not unreasonable, but it means the update to the 2.x line isn't going to be all that optional for users.
> > > >
> > > > Scala 2.10 is already EOL, right? Supporting it in 2.x means supporting it for a couple more years, note. 2.10 is still used today, but that's the point of the current stable 1.x release in general: if you want to stick to current dependencies, stick to the current release. Although I think that's the right way to think about support across major versions in general, I can see that 2.x is more of a required update for those following the project's fixes and releases. Hence it may indeed be important to just keep supporting 2.10.
> > > >
> > > > I can't see supporting 2.12 at the same time (right?). Is that a concern? It will be long since GA by the time 2.x is first released.
> > > >
> > > > There's another fairly coherent worldview where development continues in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> > > > 2.0 is delayed somewhat into next year, and by that time supporting 2.11+2.12 and Java 8 looks more feasible and more in tune with currently deployed versions.
> > > >
> > > > I can't say I have a strong view, but I personally hadn't imagined 2.x would start now.
> > > >
> > > > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <r...@databricks.com> wrote:
> > > >> I don't think we should drop support for Scala 2.10, or make it harder in terms of operations for people to upgrade.
> > > >>
> > > >> If there are further objections, I'm going to remove the 1.7 version and retarget things to 2.0 on JIRA.
> > > >>
> > > >> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> > > >>>
> > > >>> I see. My concern is/was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x and, if they want to move to DataFrames, essentially need to port their app twice.
> > > >>>
> > > >>> I misunderstood and thought part of the proposal was to drop support for 2.10, though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.
> > > >>>
> > > >>> -Sandy
> > > >>>
> > > >>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > >>>>
> > > >>>> What are the other breaking changes in 2.0, though? Note that we're not removing Scala 2.10; we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.
> > > >>>>
> > > >>>> BTW, just refer to Reynold's original post for the other proposed API changes.
> > > >>>>
> > > >>>> Matei
> > > >>>>
> > > >>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> > > >>>>
> > > >>>> I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running Spark jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition.
> > > >>>>
> > > >>>> Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line.
> > > >>>>
> > > >>>> It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues.
> > > >>>>
> > > >>>> -Sandy
> > > >>>>
> > > >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com> wrote:
> > > >>>>>
> > > >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return Dataset rather than RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.
> > > >>>>>
> > > >>>>> I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark it as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.
> > > >>>>>
> > > >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > > >>>>>>
> > > >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.
> > > >>>>>>
> > > >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
> > > >>>>>>>
> > > >>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental, and so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards-incompatible changes, like removal of deprecated APIs, Scala 2.11, etc.
> > > >>>>>>>
> > > >>>>>>> Kostas
> > > >>>>>>>
> > > >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>> Why does stabilization of those two features require a 1.7 release instead of 1.6.1?
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6:
> > > >>>>>>>>>
> > > >>>>>>>>> 1) the experimental Datasets API
> > > >>>>>>>>> 2) the new unified memory manager
> > > >>>>>>>>>
> > > >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition, but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0.
> > > >>>>>>>>> For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release, but that is not necessarily a bad thing.
> > > >>>>>>>>>
> > > >>>>>>>>> Any thoughts on this timeline?
> > > >>>>>>>>>
> > > >>>>>>>>> Kostas Sakellis
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Agreed, more features/APIs/optimizations need to be added to DF/DS.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to provide to developers; maybe the fundamental APIs are enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.
> > > >>>>>>>>>>
> > > >>>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com]
> > > >>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
> > > >>>>>>>>>> To: Stephen Boesch
> > > >>>>>>>>>> Cc: dev@spark.apache.org
> > > >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API, but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames.... Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this...."
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> My understanding is that RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark development.
> > > >>>>>>>>>>
> > > >>>>>>>>>> An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL still does not allow that knowledge to be applied by the optimizer, so a full shuffle will be performed. However, with the native RDD API we can use preservesPartitioning=true.
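> > > >>>>>>>>>> To make that concrete, a minimal sketch (assuming an existing SparkContext named sc; the data and partition count are arbitrary):
> > > >>>>>>>>>>
> > > >>>>>>>>>>     import org.apache.spark.HashPartitioner
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // One shuffle here establishes the partitioning.
> > > >>>>>>>>>>     val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
> > > >>>>>>>>>>       .partitionBy(new HashPartitioner(8))
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // mapPartitions would normally drop the partitioner;
> > > >>>>>>>>>>     // preservesPartitioning = true tells Spark the keys are unchanged.
> > > >>>>>>>>>>     val doubled = pairs.mapPartitions(
> > > >>>>>>>>>>       iter => iter.map { case (k, v) => (k, v * 2) },
> > > >>>>>>>>>>       preservesPartitioning = true)
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // Same partitioner as the input, so this aggregation is shuffle-free.
> > > >>>>>>>>>>     val sums = doubled.reduceByKey(new HashPartitioner(8), _ + _)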
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
> > > >>>>>>>>>>
> > > >>>>>>>>>> The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and Datasets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs - or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and Datasets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or Datasets don't fully fit.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I am not sure what the best practice for this specific problem is, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.
> > > >>>>>>>>>>
> > > >>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? Lots of its functionality overlaps with DataFrame or Dataset.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hao
> > > >>>>>>>>>>
> > > >>>>>>>>>> From: Kostas Sakellis [mailto:kos...@cloudera.com]
> > > >>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
> > > >>>>>>>>>> To: Nicholas Chammas
> > > >>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
> > > >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> > > >>>>>>>>>>
> > > >>>>>>>>>> I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation for user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it to true by default, and not allow any Spark transitive dependencies to leak into user code.
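> > > >>>>>>>>>> For reference, a minimal sketch of opting in to that behavior today - the two settings already exist, and making them the default is the proposal (the app name below is just a placeholder):
> > > >>>>>>>>>>
> > > >>>>>>>>>>     import org.apache.spark.{SparkConf, SparkContext}
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // Prefer user-supplied jars over Spark's own transitive
> > > >>>>>>>>>>     // dependencies when resolving classes. (The driver-side
> > > >>>>>>>>>>     // setting is honored in cluster deploy mode.)
> > > >>>>>>>>>>     val conf = new SparkConf()
> > > >>>>>>>>>>       .setAppName("classpath-isolation-demo")
> > > >>>>>>>>>>       .set("spark.driver.userClassPathFirst", "true")
> > > >>>>>>>>>>       .set("spark.executor.userClassPathFirst", "true")
> > > >>>>>>>>>>     val sc = new SparkContext(conf)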
> > > >>>>>>>>>> For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Kostas
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
> > > >>>>>>>>>>
> > > >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDDs in GraphX and switch to DataFrames. This will allow GraphX to evolve with Tungsten.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Nick
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature.
> > > >>>>>>>>>>
> > > >>>>>>>>>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
> > > >>>>>>>>>>
> > > >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDDs in GraphX and switch to DataFrames. This will allow GraphX to evolve with Tungsten.
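> > > >>>>>>>>>> The confusion is easy to see side by side - the same model family lives in two packages today. A small sketch (both imports are real as of 1.x; the training data is a placeholder, so the run/fit calls are left commented out):
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // Older RDD-based API in org.apache.spark.mllib:
> > > >>>>>>>>>>     import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
> > > >>>>>>>>>>     import org.apache.spark.mllib.regression.LabeledPoint
> > > >>>>>>>>>>     // val oldModel = new LogisticRegressionWithLBFGS().run(trainingRdd)  // RDD[LabeledPoint]
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // Newer DataFrame-based Pipeline API in org.apache.spark.ml:
> > > >>>>>>>>>>     import org.apache.spark.ml.classification.LogisticRegression
> > > >>>>>>>>>>     // val newModel = new LogisticRegression().fit(trainingDf)  // DataFrame of (label, features)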
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best regards, Alexander
> > > >>>>>>>>>>
> > > >>>>>>>>>> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
> > > >>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
> > > >>>>>>>>>> To: wi...@qq.com
> > > >>>>>>>>>> Cc: dev@spark.apache.org
> > > >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> > > >>>>>>>>>>
> > > >>>>>>>>>> Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best,
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> Nan Zhu
> > > >>>>>>>>>> http://codingcat.me
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Who has ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I like the idea of popping out Tachyon to an optional component too, to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that, because it can be added later too. Has anyone investigated what it would take to run on it? I imagine we don't need many code changes, just maybe some REPL stuff.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Needless to say, I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Matei
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> > to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features.
> > > >>>>>>>>>> > The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
> > > >>>>>>>>>>
> > > >>>>>>>>>> Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I wouldn't mind turning attention to 2.x sooner rather than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.
> > > >>>>>>>>>>
> > > >>>>>>>>>> > 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
> > > >>>>>>>>>>
> > > >>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for two more years.
> > > >>>>>>>>>>
> > > >>>>>>>>>> > 2. Remove Hadoop 1 support.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'd go further and drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'm sure we'll think of a number of other small things - shading a bunch of stuff? Reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop, etc.?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
> > > >>>>>>>>>> Pop out any Docker stuff to another repo?
> > > >>>>>>>>>> Continue that same effort for EC2?
> > > >>>>>>>>>> Farming out some of the "external" integrations to another repo (? controversial)
> > > >>>>>>>>>>
> > > >>>>>>>>>> See also anything marked version "2+" in JIRA.