My understanding is that RDDs presently offer more complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DataFrames/Datasets, it seems premature to make RDDs a second-tier approach to Spark development.
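For instance, with a pair RDD the partitioner can be pinned once and then carried through later stages. A minimal sketch -- the data and partition count are made up and `sc` is assumed to be a live SparkContext, but `partitionBy`, `mapValues`, and `preservesPartitioning` are the real core-API hooks:

```scala
import org.apache.spark.HashPartitioner

// Hash-partition the pair RDD once, up front.
val pairs = sc
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(8))

// mapValues keeps the partitioner, so this reduceByKey can
// aggregate within partitions without another shuffle.
val totals = pairs.mapValues(_ * 2).reduceByKey(_ + _)

// With lower-level operators we assert it ourselves: this flag
// promises Spark that the function does not change the keys.
val passedThrough =
  pairs.mapPartitions(_.map(identity), preservesPartitioning = true)
```

There is no equivalent hook today for telling the DataFrame optimizer that a relation is already partitioned on the grouping keys, which is the gap I mean.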
An example is the use of groupBy when we know that the source relation (or RDD) is already partitioned on the grouping expressions. As far as I know, Spark SQL still does not let that knowledge reach the optimizer, so a full shuffle will be performed. In the native RDD API, however, we can use preservesPartitioning=true.

2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:

The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and Datasets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and Datasets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or Datasets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

I am not sure what the best practice for this specific problem is, but it's really worth thinking about for 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or internal API only)? Lots of its functionality overlaps with DataFrame or Dataset.
Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation for user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allowing any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it would be good to start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDDs in GraphX and switch to DataFrames. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.

Nick

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature.
With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDDs in GraphX and switch to DataFrames. This will allow GraphX to evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: wi...@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

Best,

--
Nan Zhu
http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

Who has ideas about machine learning? Spark is missing some features for machine learning -- for example, a parameter server.

On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

I like the idea of popping out Tachyon to an optional component too, to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on it? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:

> to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner rather than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.

> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10; otherwise it's supported for 2 more years.

> 2. Remove Hadoop 1 support.

I'd go further and drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases), and even <2.6.

I'm sure we'll think of a number of other small things -- shading a bunch of stuff? Reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop, etc.?

Farming out Tachyon to a module? (I felt like someone proposed this?)
Popping out any Docker stuff to another repo?
Continuing that same effort for EC2?
Farming out some of the "external" integrations to another repo? (controversial)

See also anything marked version "2+" in JIRA.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
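The classpath-isolation behavior Kostas proposes as a 2.0 default can already be opted into per application today. A sketch, assuming a normal driver program -- the app name is hypothetical, but the two conf keys are the existing (experimental) Spark settings he references:

```scala
import org.apache.spark.SparkConf

// Prefer user-supplied jars over Spark's own copies of shared
// dependencies, on both the driver and the executors.
val conf = new SparkConf()
  .setAppName("isolation-demo") // hypothetical app name
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
```

Making this the default, with no transitive leakage at all, is the part that would be new in 2.0.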