I don't think there are any plans for Scala 2.12 support yet. We can always add Scala 2.12 support later.
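
As an aside on the cross-building question below, here is a minimal sketch of what building one application against several Scala versions looks like in sbt. This is a hypothetical user build, not Spark's actual build, and the 2.12 entry is illustrative only, since no Spark artifacts are published for 2.12 at this point:

    // build.sbt -- hypothetical user application, shown only to illustrate
    // what a three-way Scala cross-build would look like in sbt.
    name := "my-spark-app"

    scalaVersion := "2.11.7"

    // `sbt +compile` / `sbt +package` builds the project once per listed version.
    // 2.12.0 is included purely for illustration; there are no Spark artifacts
    // for it yet.
    crossScalaVersions := Seq("2.10.6", "2.11.7", "2.12.0")

    // %% appends the Scala binary version (_2.10, _2.11, ...) to the artifact name.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"
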
On Thu, Nov 26, 2015 at 12:59 PM, Koert Kuipers <ko...@tresata.com> wrote:

> I also thought the idea was to drop 2.10. Do we want to cross build for 3 scala versions?
>
> On Nov 25, 2015 3:54 AM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>
>> I see. My concern is / was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x, and, if they want to move to DataFrames, essentially need to port their app twice.
>>
>> I misunderstood and thought part of the proposal was to drop support for 2.10 though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.
>>
>> -Sandy
>>
>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>>> What are the other breaking changes in 2.0 though? Note that we're not removing Scala 2.10, we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.
>>>
>>> BTW just refer to Reynold's original post for the other proposed API changes.
>>>
>>> Matei
>>>
>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>
>>> I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running large jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition.
>>>
>>> Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line.
>>>
>>> It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues.
>>>
>>> -Sandy
>>>
>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return a Dataset rather than an RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.
>>>>
>>>> I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark them as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.
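
To make the breaking change Reynold describes concrete, here is a sketch of the signature difference. The 1.x behaviour below is real; the 2.0 line shows only the proposed shape and is left commented out, since it does not compile against 1.x. Names and sample data are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SQLContext}

    object MapReturnTypeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("map-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df: DataFrame = sc.parallelize(Seq(("a", 1), ("bb", 2), ("ccc", 3))).toDF("word", "n")

        // Spark 1.x: map on a DataFrame falls back to the untyped RDD API,
        // so the result leaves the Catalyst-optimized world.
        val lengths1x: RDD[Int] = df.map(row => row.getString(0).length)
        lengths1x.collect().foreach(println)

        // Proposed 2.0 shape (does not compile on 1.x): map would return a
        // typed Dataset instead, keeping the computation in the optimized API.
        // val lengths20: Dataset[Int] = df.map(row => row.getString(0).length)

        sc.stop()
      }
    }
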
>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>
>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.
>>>>>
>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>>>>
>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental, so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards-incompatible changes, like removal of deprecated APIs, Scala 2.11, etc.
>>>>>>
>>>>>> Kostas
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>
>>>>>>> Why does stabilization of those two features require a 1.7 release instead of 1.6.1?
>>>>>>>
>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6:
>>>>>>>>
>>>>>>>> 1) the experimental Datasets API
>>>>>>>> 2) the new unified memory manager
>>>>>>>>
>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition, but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0. For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release, but that is not necessarily a bad thing.
>>>>>>>>
>>>>>>>> Any thoughts on this timeline?
>>>>>>>>
>>>>>>>> Kostas Sakellis
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>>>>>>
>>>>>>>>> Agree, more features/APIs/optimizations need to be added in DF/DS.
>>>>>>>>>
>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to provide to developers; maybe the fundamental API is enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.
>>>>>>>>>
>>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com]
>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
>>>>>>>>> To: Stephen Boesch
>>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames.... Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this...."
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> My understanding is that RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark dev.
>>>>>>>>>
>>>>>>>>> An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL still does not allow that knowledge to be applied to the optimizer, so a full shuffle will be performed. However, in the native RDD we can use preservesPartitioning=true.
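
A small sketch of the partitioner-aware RDD pattern Stephen and Mark are referring to (the data and partition counts are invented; whether Spark SQL can exploit existing partitioning is their claim above, not something this snippet demonstrates):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionAwareSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-sketch").setMaster("local[*]"))

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

        // Partition once by key and cache; the RDD now remembers its Partitioner.
        val byKey = pairs.partitionBy(new HashPartitioner(8)).cache()

        // mapPartitions with preservesPartitioning = true promises that keys were
        // not changed, so the known partitioning is kept for downstream operators.
        val incremented = byKey.mapPartitions(
          iter => iter.map { case (k, v) => (k, v + 1) },
          preservesPartitioning = true)

        // Because the partitioner matches the existing one, this aggregation runs
        // within the current partitions -- no additional shuffle is planned.
        val sums = incremented.reduceByKey(new HashPartitioner(8), _ + _)

        println(sums.partitioner)          // Some(HashPartitioner@...)
        sums.collect().foreach(println)
        sc.stop()
      }
    }
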
>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
>>>>>>>>>
>>>>>>>>> The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and DataSets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or DataSets don't fully fit.
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>>>>>>>
>>>>>>>>> I am not sure what the best practice is for this specific problem, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.
>>>>>>>>>
>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? A lot of its functionality overlaps with DataFrame or DataSet.
>>>>>>>>>
>>>>>>>>> Hao
>>>>>>>>>
>>>>>>>>> From: Kostas Sakellis [mailto:kos...@cloudera.com]
>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
>>>>>>>>> To: Nicholas Chammas
>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>> I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allow any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.
>>>>>>>>>
>>>>>>>>> Kostas
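
The settings Kostas mentions already exist in 1.x (experimental, off by default). A minimal sketch of opting in today follows; in practice the driver-side flag is usually passed via spark-submit --conf or spark-defaults.conf, because the driver JVM is already running by the time application code builds its SparkConf:

    import org.apache.spark.{SparkConf, SparkContext}

    object IsolatedClasspathSketch {
      def main(args: Array[String]): Unit = {
        // Both settings default to false in 1.x; the proposal is essentially to
        // make this behaviour the default in 2.0. Setting the driver-side flag
        // here is illustrative only (see the note above about spark-submit).
        val conf = new SparkConf()
          .setAppName("isolated-classpath-sketch")
          .set("spark.driver.userClassPathFirst", "true")
          .set("spark.executor.userClassPathFirst", "true")

        val sc = new SparkContext(conf)
        // ... application code that ships its own dependency versions ...
        sc.stop()
      }
    }
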
>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>>
>>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX to evolve with Tungsten.
>>>>>>>>>
>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
>>>>>>>>>
>>>>>>>>> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature.
>>>>>>>>>
>>>>>>>>> With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>
>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX to evolve with Tungsten.
>>>>>>>>>
>>>>>>>>> Best regards, Alexander
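
To make the "two machine learning packages" point concrete, here is a sketch of training the same simple model through both APIs on 1.6-era code (the tiny dataset and names are invented for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Older RDD-based package
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Newer DataFrame-based "Pipelines" package
    import org.apache.spark.ml.classification.LogisticRegression

    object TwoMLPackagesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("two-ml-packages").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)

        val rddData = sc.parallelize(Seq(
          LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
          LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

        // spark.mllib: works directly on RDD[LabeledPoint]
        val oldModel = new LogisticRegressionWithLBFGS().run(rddData)

        // spark.ml: works on a DataFrame with "label" and "features" columns
        val dfData = sqlContext.createDataFrame(rddData)
        val newModel = new LogisticRegression().fit(dfData)

        println(s"mllib weights:    ${oldModel.weights}")
        println(s"ml coefficients:  ${newModel.coefficients}")
        sc.stop()
      }
    }
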
>>>>>>>>> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
>>>>>>>>> To: wi...@qq.com
>>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>> Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nan Zhu
>>>>>>>>> http://codingcat.me
>>>>>>>>>
>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>>>>>>>>
>>>>>>>>> Who has ideas about machine learning? Spark is missing some features for machine learning, for example, the parameter server.
>>>>>>>>>
>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I like the idea of popping out Tachyon to an optional component too, to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.
>>>>>>>>>
>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on it? I imagine we don't need many code changes, just maybe some REPL stuff.
>>>>>>>>>
>>>>>>>>> Needless to say, I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>>>>>>>>>
>>>>>>>>> Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.
>>>>>>>>>
>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>>>>>>
>>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.
>>>>>>>>>
>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>>>
>>>>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>>>
>>>>>>>>> I'd go further and drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases), and even <2.6.
>>>>>>>>>
>>>>>>>>> I'm sure we'll think of a number of other small things -- shading a bunch of stuff? Reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop, etc.?
>>>>>>>>>
>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>>> Continue that same effort for EC2?
>>>>>>>>> Farming out some of the "external" integrations to another repo? (controversial)
>>>>>>>>>
>>>>>>>>> See also anything marked version "2+" in JIRA.
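
On the shading idea, an illustrative sketch only: Spark's actual build does its relocations through the Maven shade plugin, but the concept, expressed here with the sbt-assembly plugin (0.14+) and an invented target package, looks roughly like this:

    // build.sbt fragment (requires the sbt-assembly plugin); the source and
    // target package names are examples, not Spark's actual relocations.
    assemblyShadeRules in assembly := Seq(
      // Rewrite a commonly conflicting dependency into a private namespace so the
      // copy bundled with the framework cannot collide with the version a user
      // application brings along.
      ShadeRule.rename("com.google.common.**" -> "myorg.shaded.guava.@1").inAll
    )
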