Why does stabilization of those two features require a 1.7 release instead of 1.6.1?
On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:

> We have veered off the topic of Spark 2.0 a little bit here. Yes, we can
> talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to
> propose that we have one more 1.x release after Spark 1.6. This will allow
> us to stabilize a few of the new features that were added in 1.6:
>
> 1) the experimental Datasets API
> 2) the new unified memory manager
>
> I understand our goal for Spark 2.0 is to offer an easy transition, but
> there will be users who won't be able to seamlessly upgrade, given what we
> have discussed as in scope for 2.0. For these users, having a 1.x release
> with these new features/APIs stabilized will be very beneficial. This
> might make Spark 1.7 a lighter release, but that is not necessarily a bad
> thing.
>
> Any thoughts on this timeline?
>
> Kostas Sakellis

On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Agreed, more features/APIs/optimizations need to be added to DF/DS.
>
> We need to think about what kind of RDD APIs we have to provide to
> developers; maybe the fundamental APIs are enough, like ShuffledRDD etc.
> But PairRDDFunctions is probably not in this category, as we can do the
> same things easily with DF/DS, and often with better performance.

On Fri, Nov 13, 2015 at 11:23 AM, Mark Hamstra <m...@clearstorydata.com> wrote:

> Hmmm... to me, that seems like precisely the kind of thing that argues for
> retaining the RDD API, but not as the first thing presented to new Spark
> developers: "Here's how to use groupBy with DataFrames.... Until the
> optimizer is more fully developed, that won't always get you the best
> performance that can be obtained. In these particular circumstances, ...,
> you may want to use the low-level RDD API while setting
> preservesPartitioning to true. Like this...."

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:

> My understanding is that RDDs presently have more support for complete
> control of partitioning, which is a key consideration at scale. While
> partitioning control is still piecemeal in DF/DS, it would seem premature
> to make RDDs a second-tier approach to Spark development.
>
> An example is the use of groupBy when we know that the source relation
> (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL
> still does not allow that knowledge to be applied by the optimizer, so a
> full shuffle will be performed. However, with the native RDD API we can
> use preservesPartitioning=true.

On Thu, Nov 12, 2015 at 5:42 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> The place of the RDD API in 2.0 is also something I've been wondering
> about. I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider. The RDD API came well before
> DataFrames and Datasets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs,
> or at least to deal with them early. What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs. In this scheme, DataFrames and Datasets would be
> introduced and fully addressed before RDDs. They would be presented as the
> normal/default/standard way to do things in Spark. RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames
> or Datasets don't fully fit.
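To make the groupBy/preservesPartitioning point above concrete, here is a
minimal sketch of the RDD-side pattern being discussed. The data, partition
count, and application name are illustrative, and a local master is assumed:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitioningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("partitioning-sketch").setMaster("local[4]"))

        // Partition the pair RDD by key once, up front.
        val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
          .partitionBy(new HashPartitioner(4))

        // preservesPartitioning = true promises Spark that the keys were not
        // changed, so the existing partitioner carries through the map.
        val scaled = byKey.mapPartitions(
          iter => iter.map { case (k, v) => (k, v * 10) },
          preservesPartitioning = true)

        // Because `scaled` still carries the same HashPartitioner, this
        // aggregation needs no further shuffle. An equivalent DataFrame
        // groupBy cannot (as of 1.x) be told about the existing partitioning
        // and would plan a full exchange.
        scaled.reduceByKey(_ + _).collect().foreach(println)

        sc.stop()
      }
    }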
On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> I am not sure what the best practice for this specific problem is, but
> it's really worth thinking about in 2.0, as it is a painful issue for
> lots of users.
>
> By the way, is it also an opportunity to deprecate the RDD API (or make
> it internal only)? Lots of its functionality overlaps with DataFrame or
> Dataset.
>
> Hao

On Fri, Nov 13, 2015 at 5:27 AM, Kostas Sakellis <kos...@cloudera.com> wrote:

> I know we want to keep breaking changes to a minimum, but I'm hoping that
> with Spark 2.0 we can also look at better classpath isolation for user
> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> set it to true by default, and not allow any Spark transitive dependencies
> to leak into user code. For backwards compatibility we can have a
> whitelist if we want, but it would be good if we started requiring user
> apps to explicitly pull in all their dependencies. From what I can tell,
> Hadoop 3 is also moving in this direction.
>
> Kostas
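For reference, both flags Kostas mentions already exist and can be enabled
per application today. A minimal sketch follows; the application name is a
placeholder, and whether driver-side isolation takes effect can depend on
the deploy mode:

    import org.apache.spark.{SparkConf, SparkContext}

    object IsolationSketch {
      def main(args: Array[String]): Unit = {
        // Prefer classes from the user's jars over Spark's own copies, on
        // both the driver and the executors. These currently default to
        // false; the proposal is to flip that default in 2.0.
        val conf = new SparkConf()
          .setAppName("isolation-sketch")
          .set("spark.driver.userClassPathFirst", "true")
          .set("spark.executor.userClassPathFirst", "true")

        val sc = new SparkContext(conf)
        // ... application code ...
        sc.stop()
      }
    }

The same settings can also be passed on the command line, e.g. with
spark-submit --conf spark.executor.userClassPathFirst=true.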
On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn't have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick

On Thu, Nov 12, 2015 at 12:45 PM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

> The Parameter Server is a new feature and thus does not match the goal of
> 2.0, which is "to fix things that are broken in the current API and remove
> certain deprecated APIs". At the same time, I would be happy to have that
> feature.
>
> With regards to machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. The current structure
> of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDDs in
> GraphX and switch to DataFrames. This will allow GraphX to evolve with
> Tungsten.
>
> Best regards, Alexander

On Thursday, November 12, 2015 at 7:28 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:

> Speaking specifically of the Parameter Server, I think the current
> agreement is that the PS shall exist as a third-party library instead of
> as a component of the core code base, isn't it?
>
> Best,
>
> Nan Zhu
> http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

> Who has ideas about machine learning? Spark is missing some features for
> machine learning, for example a parameter server.

On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> I like the idea of popping Tachyon out into an optional component too, to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that, because it can be added later
> too. Has anyone investigated what it would take to run on it? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
> Needless to say, I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
> Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> to the Spark community. A major release should not be very different
>> from a minor release and should not be gated based on new features. The
>> main purpose of a major release is an opportunity to fix things that are
>> broken in the current API and remove certain deprecated APIs (examples
>> follow).
>
> Agreed with this stance. Generally, a major release might also be a time
> to replace some big old API or implementation with a new one, but I don't
> see obvious candidates.
>
> I wouldn't mind turning attention to 2.x sooner rather than later, unless
> there's a fairly good reason to continue adding features in 1.x to a 1.7
> release. The scope as of 1.6 is already pretty darned big.
>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has reached end-of-life.
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be
> quite stable, and 2.10 will have been EOL for a while. I'd propose
> dropping 2.10; otherwise it's supported for two more years.
>
>> 2. Remove Hadoop 1 support.
>
> I'd go further and drop support for anything below 2.2 for sure (2.0 and
> 2.1 were sort of "alpha" and "beta" releases), and maybe even below 2.6.
>
> I'm sure we'll think of a number of other small things: shading a bunch
> of stuff? Reviewing and updating dependencies in light of simpler, more
> recent dependencies to support from Hadoop, etc.?
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
> Popping any Docker stuff out to another repo?
> Continuing that same effort for EC2?
> Farming out some of the "external" integrations to another repo?
> (controversial)
>
> See also anything marked version "2+" in JIRA.
>> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >> >> >> >> >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >> >> >> >> >> >> >> >> >> > >