Re: A proposal for Spark 2.0

Kostas Sakellis Fri, 13 Nov 2015 11:41:22 -0800

We have veered off the topic of Spark 2.0 a little bit here - yes we can
talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to
propose we have one more 1.x release after Spark 1.6. This will allow us to
stabilize a few of the new features that were added in 1.6:


1) the experimental Datasets API
2) the new unified memory manager.

I understand our goal for Spark 2.0 is to offer an easy transition but
there will be users that won't be able to seamlessly upgrade given what we
have discussed as in scope for 2.0. For these users, having a 1.x release
with these new features/APIs stabilized will be very beneficial. This might
make Spark 1.7 a lighter release but that is not necessarily a bad thing.

Any thoughts on this timeline?

Kostas Sakellis



On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Agree, more features/apis/optimization need to be added in DF/DS.
>
>
>
> I mean, we need to think about what kind of RDD APIs we have to provide to
> developer, maybe the fundamental API is enough, like, the ShuffledRDD
> etc..  But PairRDDFunctions probably not in this category, as we can do the
> same thing easily with DF/DS, even better performance.
>
>
>
> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
> *Sent:* Friday, November 13, 2015 11:23 AM
> *To:* Stephen Boesch
>
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Hmmm... to me, that seems like precisely the kind of thing that argues for
> retaining the RDD API but not as the first thing presented to new Spark
> developers: "Here's how to use groupBy with DataFrames.... Until the
> optimizer is more fully developed, that won't always get you the best
> performance that can be obtained.  In these particular circumstances, ...,
> you may want to use the low-level RDD API while setting
> preservesPartitioning to true.  Like this...."
>
>
>
> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> My understanding is that  the RDD's presently have more support for
> complete control of partitioning which is a key consideration at scale.
> While partitioning control is still piecemeal in  DF/DS  it would seem
> premature to make RDD's a second-tier approach to spark dev.
>
>
>
> An example is the use of groupBy when we know that the source relation
> (/RDD) is already partitioned on the grouping expressions.  AFAIK the spark
> sql still does not allow that knowledge to be applied to the optimizer - so
> a full shuffle will be performed. However in the native RDD we can use
> preservesPartitioning=true.
>
>
>
> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
>
> The place of the RDD API in 2.0 is also something I've been wondering
> about.  I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider.  The RDD API came well before
> DataFrames and DataSets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs --
> or at least to deal with them early.  What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs.  In this scheme, DataFrames and DataSets would be
> introduced and fully addressed before RDDs.  They would be presented as the
> normal/default/standard way to do things in Spark.  RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames or
> DataSets don't fully fit.
>
>
>
> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> I am not sure what the best practice for this specific problem, but it’s
> really worth to think about it in 2.0, as it is a painful issue for lots of
> users.
>
>
>
> By the way, is it also an opportunity to deprecate the RDD API (or
> internal API only?)? As lots of its functionality overlapping with
> DataFrame or DataSet.
>
>
>
> Hao
>
>
>
> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
> *Sent:* Friday, November 13, 2015 5:27 AM
> *To:* Nicholas Chammas
> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
> Reynold Xin
>
>
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> I know we want to keep breaking changes to a minimum but I'm hoping that
> with Spark 2.0 we can also look at better classpath isolation with user
> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> setting it true by default, and not allow any spark transitive dependencies
> to leak into user code. For backwards compatibility we can have a whitelist
> if we want but I'd be good if we start requiring user apps to explicitly
> pull in all their dependencies. From what I can tell, Hadoop 3 is also
> moving in this direction.
>
>
>
> Kostas
>
>
>
> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn’t have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick
>
> 
>
>
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Parameter Server is a new feature and thus does not match the goal of 2.0
> is “to fix things that are broken in the current API and remove certain
> deprecated APIs”. At the same time I would be happy to have that feature.
>
>
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
> *Sent:* Thursday, November 12, 2015 7:28 AM
> *To:* wi...@qq.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Being specific to Parameter Server, I think the current agreement is that
> PS shall exist as a third-party library instead of a component of the core
> code base, isn’t?
>
>
>
> Best,
>
>
>
> --
>
> Nan Zhu
>
> http://codingcat.me
>
>
>
> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>
> Who has the idea of machine learning? Spark missing some features for
> machine learning, For example, the parameter server.
>
>
>
>
>
> 在 2015年11月12日，05:32，Matei Zaharia <matei.zaha...@gmail.com> 写道：
>
>
>
> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
>
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
>
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
>
>
> Matei
>
>
>
> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>
>
>
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
>
> to the Spark community. A major release should not be very different from a
>
> minor release and should not be gated based on new features. The main
>
> purpose of a major release is an opportunity to fix things that are broken
>
> in the current API and remove certain deprecated APIs (examples follow).
>
>
>
> Agree with this stance. Generally, a major release might also be a
>
> time to replace some big old API or implementation with a new one, but
>
> I don't see obvious candidates.
>
>
>
> I wouldn't mind turning attention to 2.x sooner than later, unless
>
> there's a fairly good reason to continue adding features in 1.x to a
>
> 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>
>
>
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>
> it has been end-of-life.
>
>
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>
> dropping 2.10. Otherwise it's supported for 2 more years.
>
>
>
>
>
> 2. Remove Hadoop 1 support.
>
>
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>
> sort of 'alpha' and 'beta' releases) and even <2.6.
>
>
>
> I'm sure we'll think of a number of other small things -- shading a
>
> bunch of stuff? reviewing and updating dependencies in light of
>
> simpler, more recent dependencies to support from Hadoop etc?
>
>
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
>
> Pop out any Docker stuff to another repo?
>
> Continue that same effort for EC2?
>
> Farming out some of the "external" integrations to another repo (?
>
> controversial)
>
>
>
> See also anything marked version "2+" in JIRA.
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>
>
>
>
>
>
>
>

Re: A proposal for Spark 2.0

Reply via email to