Why does stabilization of those two features require a 1.7 release instead of 1.6.1?
On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:

> We have veered off the topic of Spark 2.0 a little bit here. Yes, we can
> talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to
> propose that we have one more 1.x release after Spark 1.6. This will allow
> us to stabilize a few of the new features that were added in 1.6:
>
> 1) the experimental Datasets API
> 2) the new unified memory manager
>
> I understand our goal for Spark 2.0 is to offer an easy transition, but
> there will be users who won't be able to seamlessly upgrade, given what we
> have discussed as in scope for 2.0. For these users, having a 1.x release
> with these new features/APIs stabilized will be very beneficial. This
> might make Spark 1.7 a lighter release, but that is not necessarily a bad
> thing.
>
> Any thoughts on this timeline?
>
> Kostas Sakellis

On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Agreed, more features/APIs/optimizations need to be added to DF/DS.
>
> We need to think about what kind of RDD APIs we have to provide to
> developers; maybe the fundamental APIs are enough, like ShuffledRDD etc.
> But PairRDDFunctions is probably not in this category, as we can do the
> same things easily with DF/DS, and often with better performance.

On Fri, Nov 13, 2015 at 11:23 AM, Mark Hamstra <m...@clearstorydata.com> wrote:

> Hmmm... to me, that seems like precisely the kind of thing that argues for
> retaining the RDD API, but not as the first thing presented to new Spark
> developers: "Here's how to use groupBy with DataFrames.... Until the
> optimizer is more fully developed, that won't always get you the best
> performance that can be obtained. In these particular circumstances, ...,
> you may want to use the low-level RDD API while setting
> preservesPartitioning to true. Like this...."

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:

> My understanding is that RDDs presently have more support for complete
> control of partitioning, which is a key consideration at scale. While
> partitioning control is still piecemeal in DF/DS, it would seem premature
> to make RDDs a second-tier approach to Spark development.
>
> An example is the use of groupBy when we know that the source relation
> (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL
> still does not allow that knowledge to be applied by the optimizer, so a
> full shuffle will be performed. However, with the native RDD API we can
> use preservesPartitioning=true.

On Thu, Nov 12, 2015 at 5:42 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> The place of the RDD API in 2.0 is also something I've been wondering
> about. I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider. The RDD API came well before
> DataFrames and Datasets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs,
> or at least to deal with them early. What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs. In this scheme, DataFrames and Datasets would be
> introduced and fully addressed before RDDs. They would be presented as the
> normal/default/standard way to do things in Spark. RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames
> or Datasets don't fully fit.
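To make the groupBy/preservesPartitioning point above concrete, here is a
minimal sketch of the RDD-side pattern being discussed. The data, partition
count, and application name are illustrative, and a local master is assumed:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitioningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("partitioning-sketch").setMaster("local[4]"))

        // Partition the pair RDD by key once, up front.
        val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
          .partitionBy(new HashPartitioner(4))

        // preservesPartitioning = true promises Spark that the keys were not
        // changed, so the existing partitioner carries through the map.
        val scaled = byKey.mapPartitions(
          iter => iter.map { case (k, v) => (k, v * 10) },
          preservesPartitioning = true)

        // Because `scaled` still carries the same HashPartitioner, this
        // aggregation needs no further shuffle. An equivalent DataFrame
        // groupBy cannot (as of 1.x) be told about the existing partitioning
        // and would plan a full exchange.
        scaled.reduceByKey(_ + _).collect().foreach(println)

        sc.stop()
      }
    }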
On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> I am not sure what the best practice for this specific problem is, but
> it's really worth thinking about in 2.0, as it is a painful issue for
> lots of users.
>
> By the way, is it also an opportunity to deprecate the RDD API (or make
> it internal only)? Lots of its functionality overlaps with DataFrame or
> Dataset.
>
> Hao

On Fri, Nov 13, 2015 at 5:27 AM, Kostas Sakellis <kos...@cloudera.com> wrote:

> I know we want to keep breaking changes to a minimum, but I'm hoping that
> with Spark 2.0 we can also look at better classpath isolation for user
> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> set it to true by default, and not allow any Spark transitive dependencies
> to leak into user code. For backwards compatibility we can have a
> whitelist if we want, but it would be good if we started requiring user
> apps to explicitly pull in all their dependencies. From what I can tell,
> Hadoop 3 is also moving in this direction.
>
> Kostas
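For reference, both flags Kostas mentions already exist and can be enabled
per application today. A minimal sketch follows; the application name is a
placeholder, and whether driver-side isolation takes effect can depend on
the deploy mode:

    import org.apache.spark.{SparkConf, SparkContext}

    object IsolationSketch {
      def main(args: Array[String]): Unit = {
        // Prefer classes from the user's jars over Spark's own copies, on
        // both the driver and the executors. These currently default to
        // false; the proposal is to flip that default in 2.0.
        val conf = new SparkConf()
          .setAppName("isolation-sketch")
          .set("spark.driver.userClassPathFirst", "true")
          .set("spark.executor.userClassPathFirst", "true")

        val sc = new SparkContext(conf)
        // ... application code ...
        sc.stop()
      }
    }

The same settings can also be passed on the command line, e.g. with
spark-submit --conf spark.executor.userClassPathFirst=true.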
On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn't have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick

On Thu, Nov 12, 2015 at 12:45 PM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

> The Parameter Server is a new feature and thus does not match the goal of
> 2.0, which is "to fix things that are broken in the current API and remove
> certain deprecated APIs". At the same time, I would be happy to have that
> feature.
>
> With regards to machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. The current structure
> of two separate machine learning packages seems somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDDs in
> GraphX and switch to DataFrames. This will allow GraphX to evolve with
> Tungsten.
>
> Best regards, Alexander

On Thursday, November 12, 2015 at 7:28 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:

> Speaking specifically of the Parameter Server, I think the current
> agreement is that the PS shall exist as a third-party library instead of
> as a component of the core code base, isn't it?
>
> Best,
>
> Nan Zhu
> http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

> Who has ideas about machine learning? Spark is missing some features for
> machine learning, for example a parameter server.

On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> I like the idea of popping Tachyon out into an optional component too, to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that, because it can be added later
> too. Has anyone investigated what it would take to run on it? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
> Needless to say, I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
> Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> to the Spark community. A major release should not be very different
>> from a minor release and should not be gated based on new features. The
>> main purpose of a major release is an opportunity to fix things that are
>> broken in the current API and remove certain deprecated APIs (examples
>> follow).
>
> Agreed with this stance. Generally, a major release might also be a time
> to replace some big old API or implementation with a new one, but I don't
> see obvious candidates.
>
> I wouldn't mind turning attention to 2.x sooner rather than later, unless
> there's a fairly good reason to continue adding features in 1.x to a 1.7
> release. The scope as of 1.6 is already pretty darned big.
>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has reached end-of-life.
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be
> quite stable, and 2.10 will have been EOL for a while. I'd propose
> dropping 2.10; otherwise it's supported for two more years.
>
>> 2. Remove Hadoop 1 support.
>
> I'd go further and drop support for anything below 2.2 for sure (2.0 and
> 2.1 were sort of "alpha" and "beta" releases), and maybe even below 2.6.
>
> I'm sure we'll think of a number of other small things: shading a bunch
> of stuff? Reviewing and updating dependencies in light of simpler, more
> recent dependencies to support from Hadoop, etc.?
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
> Popping any Docker stuff out to another repo?
> Continuing that same effort for EC2?
> Farming out some of the "external" integrations to another repo?
> (controversial)
>
> See also anything marked version "2+" in JIRA.
>> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >> >> >> >> >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >> >> >> >> >> >> >> >> >> > >