There was talk in this thread about removing the fine-grained Mesos scheduler. I think it would be a loss to lose it completely; however, I understand that it might be a burden to keep it under development for Mesos only. Having thought about it for a while, I believe it would be great if the schedulers were pluggable. If Spark 2 could offer a way of registering a scheduling mechanism, then the Mesos fine-grained scheduler could be moved to a separate project and, possibly, maintained by a separate community. This would also enable people to add more schedulers in the future - Kubernetes comes to mind, but Docker Swarm would also become an option. This would allow the ecosystem to grow a bit.
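To make the idea a bit more concrete, here is a rough sketch of what such a registration point might look like. To be clear, everything below is hypothetical - the factory trait and the ServiceLoader-based lookup are made up for illustration and are not existing Spark APIs (SchedulerBackend itself is currently an internal interface):

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._
    import org.apache.spark.{SparkContext, SparkException}
    import org.apache.spark.scheduler.SchedulerBackend

    // Hypothetical plugin surface. An external project (e.g. a Mesos
    // fine-grained module) would implement this and ship it on the classpath.
    trait ExternalSchedulerBackendFactory {
      // Master URL schemes this backend claims, e.g. "mesos", "k8s".
      def masterSchemes: Seq[String]
      // Build the backend for the given context and master URL.
      def createBackend(sc: SparkContext, masterUrl: String): SchedulerBackend
    }

    // Spark core could then resolve the backend at startup via the standard
    // java.util.ServiceLoader mechanism instead of hard-coding each manager.
    def resolveBackend(sc: SparkContext, masterUrl: String): SchedulerBackend = {
      val factories =
        ServiceLoader.load(classOf[ExternalSchedulerBackendFactory]).asScala
      factories
        .find(f => f.masterSchemes.exists(s => masterUrl.startsWith(s + "://")))
        .map(_.createBackend(sc, masterUrl))
        .getOrElse(throw new SparkException(
          s"No pluggable scheduler backend found for master URL: $masterUrl"))
    }

With something like this in place, the fine-grained Mesos backend would just be one more implementation discovered at runtime.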
I'd be very interested in working on such a feature.

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Thursday, 3 December 2015 at 21:28, Koert Kuipers wrote:
> spark 1.x has been supporting scala 2.11 for 3 or 4 releases now. it seems to me you already provide a clear upgrade path: get on scala 2.11 before upgrading to spark 2.x.
>
> from the scala team when scala 2.10.6 came out: "We strongly encourage you to upgrade to the latest stable version of Scala 2.11.x, as the 2.10.x series is no longer actively maintained."
>
> On Thu, Dec 3, 2015 at 1:03 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > Reynold's post from Nov. 25:
> >
> > > I don't think we should drop support for Scala 2.10, or make it harder in terms of operations for people to upgrade.
> > >
> > > If there are further objections, I'm going to remove the 1.7 version and retarget things to 2.0 on JIRA.
> >
> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:
> > > Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I think that's premature. If there's a 1.7.0, then we've lost info about what it would contain. It's trivial at any later point to merge the versions. And, since things change and there's not a pressing need to decide one way or the other, it seems fine to at least collect this info, like we have things like "1.4.3" that may never be released. I'd like to add it back.
> > >
> > > On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
> > > > Maintaining both a 1.7 and a 2.0 is too much work for the project, which is over-stretched as it is. This means that after 1.6 there are just small maintenance releases in 1.x and no substantial features or evolution. It also means that the "in progress" APIs in 1.x will stay that way unless one updates to 2.x. That's not unreasonable, but it means the update to the 2.x line isn't going to be all that optional for users.
> > > >
> > > > Scala 2.10 is already EOL, right? Supporting it in 2.x means supporting it for a couple more years, note. 2.10 is still used today, but that's the point of the current stable 1.x release in general: if you want to stick to current dependencies, stick to the current release. Although I think that's the right way to think about support across major versions in general, I can see that 2.x is more of a required update for those following the project's fixes and releases. Hence it may indeed be important to just keep supporting 2.10.
> > > >
> > > > I can't see supporting 2.12 at the same time (right?). Is that a concern? It will be long since GA by the time 2.x is first released.
> > > >
> > > > There's another fairly coherent worldview where development continues in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> > > > 2.0 is delayed somewhat into next year, and by that time supporting 2.11+2.12 and Java 8 looks more feasible and more in tune with currently deployed versions.
> > > >
> > > > I can't say I have a strong view, but I personally hadn't imagined 2.x would start now.
> > > >
> > > > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <r...@databricks.com> wrote:
> > > >> I don't think we should drop support for Scala 2.10, or make it harder in terms of operations for people to upgrade.
> > > >>
> > > >> If there are further objections, I'm going to remove the 1.7 version and retarget things to 2.0 on JIRA.
> > > >>
> > > >> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> > > >>>
> > > >>> I see. My concern is/was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x and, if they want to move to DataFrames, essentially need to port their app twice.
> > > >>>
> > > >>> I misunderstood and thought part of the proposal was to drop support for 2.10, though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.
> > > >>>
> > > >>> -Sandy
> > > >>>
> > > >>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > >>>>
> > > >>>> What are the other breaking changes in 2.0, though? Note that we're not removing Scala 2.10; we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.
> > > >>>>
> > > >>>> BTW, just refer to Reynold's original post for the other proposed API changes.
> > > >>>>
> > > >>>> Matei
> > > >>>>
> > > >>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> > > >>>>
> > > >>>> I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running Spark jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition.
> > > >>>>
> > > >>>> Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line.
> > > >>>>
> > > >>>> It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues.
> > > >>>>
> > > >>>> -Sandy
> > > >>>>
> > > >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com> wrote:
> > > >>>>>
> > > >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return Dataset rather than RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.
> > > >>>>>
> > > >>>>> I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark it as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.
> > > >>>>>
> > > >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > > >>>>>>
> > > >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.
> > > >>>>>>
> > > >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
> > > >>>>>>>
> > > >>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental, and so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards-incompatible changes, like removal of deprecated APIs, Scala 2.11, etc.
> > > >>>>>>>
> > > >>>>>>> Kostas
> > > >>>>>>>
> > > >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>> Why does stabilization of those two features require a 1.7 release instead of 1.6.1?
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6:
> > > >>>>>>>>>
> > > >>>>>>>>> 1) the experimental Datasets API
> > > >>>>>>>>> 2) the new unified memory manager
> > > >>>>>>>>>
> > > >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition, but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0.
> > > >>>>>>>>> For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release, but that is not necessarily a bad thing.
> > > >>>>>>>>>
> > > >>>>>>>>> Any thoughts on this timeline?
> > > >>>>>>>>>
> > > >>>>>>>>> Kostas Sakellis
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Agreed, more features/APIs/optimizations need to be added to DF/DS.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to provide to developers; maybe the fundamental APIs are enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.
> > > >>>>>>>>>>
> > > >>>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com]
> > > >>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
> > > >>>>>>>>>> To: Stephen Boesch
> > > >>>>>>>>>> Cc: dev@spark.apache.org
> > > >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API, but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames.... Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this...."
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> My understanding is that RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark development.
> > > >>>>>>>>>>
> > > >>>>>>>>>> An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL still does not allow that knowledge to be applied by the optimizer, so a full shuffle will be performed. However, with the native RDD API we can use preservesPartitioning=true.
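> > > >>>>>>>>>> To make that concrete, a minimal sketch (assuming an existing SparkContext named sc; the data and partition count are arbitrary):
> > > >>>>>>>>>>
> > > >>>>>>>>>>     import org.apache.spark.HashPartitioner
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // One shuffle here establishes the partitioning.
> > > >>>>>>>>>>     val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
> > > >>>>>>>>>>       .partitionBy(new HashPartitioner(8))
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // mapPartitions would normally drop the partitioner;
> > > >>>>>>>>>>     // preservesPartitioning = true tells Spark the keys are unchanged.
> > > >>>>>>>>>>     val doubled = pairs.mapPartitions(
> > > >>>>>>>>>>       iter => iter.map { case (k, v) => (k, v * 2) },
> > > >>>>>>>>>>       preservesPartitioning = true)
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // Same partitioner as the input, so this aggregation is shuffle-free.
> > > >>>>>>>>>>     val sums = doubled.reduceByKey(new HashPartitioner(8), _ + _)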
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
> > > >>>>>>>>>>
> > > >>>>>>>>>> The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and Datasets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs - or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and Datasets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or Datasets don't fully fit.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I am not sure what the best practice for this specific problem is, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.
> > > >>>>>>>>>>
> > > >>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? Lots of its functionality overlaps with DataFrame or Dataset.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hao
> > > >>>>>>>>>>
> > > >>>>>>>>>> From: Kostas Sakellis [mailto:kos...@cloudera.com]
> > > >>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
> > > >>>>>>>>>> To: Nicholas Chammas
> > > >>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
> > > >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> > > >>>>>>>>>>
> > > >>>>>>>>>> I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation for user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it to true by default, and not allow any Spark transitive dependencies to leak into user code.
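> > > >>>>>>>>>> For reference, a minimal sketch of opting in to that behavior today - the two settings already exist, and making them the default is the proposal (the app name below is just a placeholder):
> > > >>>>>>>>>>
> > > >>>>>>>>>>     import org.apache.spark.{SparkConf, SparkContext}
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // Prefer user-supplied jars over Spark's own transitive
> > > >>>>>>>>>>     // dependencies when resolving classes. (The driver-side
> > > >>>>>>>>>>     // setting is honored in cluster deploy mode.)
> > > >>>>>>>>>>     val conf = new SparkConf()
> > > >>>>>>>>>>       .setAppName("classpath-isolation-demo")
> > > >>>>>>>>>>       .set("spark.driver.userClassPathFirst", "true")
> > > >>>>>>>>>>       .set("spark.executor.userClassPathFirst", "true")
> > > >>>>>>>>>>     val sc = new SparkContext(conf)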
> > > >>>>>>>>>> For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Kostas
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
> > > >>>>>>>>>>
> > > >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDDs in GraphX and switch to DataFrames. This will allow GraphX to evolve with Tungsten.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Nick
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature.
> > > >>>>>>>>>>
> > > >>>>>>>>>> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
> > > >>>>>>>>>>
> > > >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDDs in GraphX and switch to DataFrames. This will allow GraphX to evolve with Tungsten.
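> > > >>>>>>>>>> The confusion is easy to see side by side - the same model family lives in two packages today. A small sketch (both imports are real as of 1.x; the training data is a placeholder, so the run/fit calls are left commented out):
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // Older RDD-based API in org.apache.spark.mllib:
> > > >>>>>>>>>>     import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
> > > >>>>>>>>>>     import org.apache.spark.mllib.regression.LabeledPoint
> > > >>>>>>>>>>     // val oldModel = new LogisticRegressionWithLBFGS().run(trainingRdd)  // RDD[LabeledPoint]
> > > >>>>>>>>>>
> > > >>>>>>>>>>     // Newer DataFrame-based Pipeline API in org.apache.spark.ml:
> > > >>>>>>>>>>     import org.apache.spark.ml.classification.LogisticRegression
> > > >>>>>>>>>>     // val newModel = new LogisticRegression().fit(trainingDf)  // DataFrame of (label, features)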
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best regards, Alexander
> > > >>>>>>>>>>
> > > >>>>>>>>>> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
> > > >>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
> > > >>>>>>>>>> To: wi...@qq.com
> > > >>>>>>>>>> Cc: dev@spark.apache.org
> > > >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> > > >>>>>>>>>>
> > > >>>>>>>>>> Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best,
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> Nan Zhu
> > > >>>>>>>>>> http://codingcat.me
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Who has ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I like the idea of popping out Tachyon to an optional component too, to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that, because it can be added later too. Has anyone investigated what it would take to run on it? I imagine we don't need many code changes, just maybe some REPL stuff.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Needless to say, I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Matei
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> > to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features.
> > > >>>>>>>>>> > The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
> > > >>>>>>>>>>
> > > >>>>>>>>>> Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I wouldn't mind turning attention to 2.x sooner rather than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.
> > > >>>>>>>>>>
> > > >>>>>>>>>> > 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.
> > > >>>>>>>>>>
> > > >>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for two more years.
> > > >>>>>>>>>>
> > > >>>>>>>>>> > 2. Remove Hadoop 1 support.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'd go further and drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases) and even <2.6.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'm sure we'll think of a number of other small things - shading a bunch of stuff? Reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop, etc.?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
> > > >>>>>>>>>> Pop out any Docker stuff to another repo?
> > > >>>>>>>>>> Continue that same effort for EC2?
> > > >>>>>>>>>> Farming out some of the "external" integrations to another repo (? controversial)
> > > >>>>>>>>>>
> > > >>>>>>>>>> See also anything marked version "2+" in JIRA.