I don't think there are any plans for Scala 2.12 support yet. We can always add Scala 2.12 support later.
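
As an aside on the cross-building question below, here is a minimal sketch of what building one application against several Scala versions looks like in sbt. This is a hypothetical user build, not Spark's actual build, and the 2.12 entry is illustrative only, since no Spark artifacts are published for 2.12 at this point:

    // build.sbt -- hypothetical user application, shown only to illustrate
    // what a three-way Scala cross-build would look like in sbt.
    name := "my-spark-app"

    scalaVersion := "2.11.7"

    // `sbt +compile` / `sbt +package` builds the project once per listed version.
    // 2.12.0 is included purely for illustration; there are no Spark artifacts
    // for it yet.
    crossScalaVersions := Seq("2.10.6", "2.11.7", "2.12.0")

    // %% appends the Scala binary version (_2.10, _2.11, ...) to the artifact name.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"
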
On Thu, Nov 26, 2015 at 12:59 PM, Koert Kuipers <ko...@tresata.com> wrote:

> I also thought the idea was to drop 2.10. Do we want to cross build for 3 scala versions?
>
> On Nov 25, 2015 3:54 AM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>
>> I see. My concern is / was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x, and, if they want to move to DataFrames, essentially need to port their app twice.
>>
>> I misunderstood and thought part of the proposal was to drop support for 2.10 though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.
>>
>> -Sandy
>>
>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>>> What are the other breaking changes in 2.0 though? Note that we're not removing Scala 2.10, we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.
>>>
>>> BTW just refer to Reynold's original post for the other proposed API changes.
>>>
>>> Matei
>>>
>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>
>>> I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running large jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition.
>>>
>>> Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line.
>>>
>>> It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues.
>>>
>>> -Sandy
>>>
>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> I actually think the next one (after 1.6) should be Spark 2.0. The reason is that I already know we have to break some part of the DataFrame/Dataset API as part of the Dataset design (e.g. DataFrame.map should return a Dataset rather than an RDD). In that case, I'd rather break this sooner (in one release) than later (in two releases), so the damage is smaller.
>>>>
>>>> I don't think whether we call Dataset/DataFrame experimental or not matters too much for 2.0. We can still call Dataset experimental in 2.0 and then mark them as stable in 2.1. Despite being "experimental", there have been no breaking changes to DataFrame from 1.3 to 1.6.
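
To make the breaking change Reynold describes concrete, here is a sketch of the signature difference. The 1.x behaviour below is real; the 2.0 line shows only the proposed shape and is left commented out, since it does not compile against 1.x. Names and sample data are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SQLContext}

    object MapReturnTypeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("map-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df: DataFrame = sc.parallelize(Seq(("a", 1), ("bb", 2), ("ccc", 3))).toDF("word", "n")

        // Spark 1.x: map on a DataFrame falls back to the untyped RDD API,
        // so the result leaves the Catalyst-optimized world.
        val lengths1x: RDD[Int] = df.map(row => row.getString(0).length)
        lengths1x.collect().foreach(println)

        // Proposed 2.0 shape (does not compile on 1.x): map would return a
        // typed Dataset instead, keeping the computation in the optimized API.
        // val lengths20: Dataset[Int] = df.map(row => row.getString(0).length)

        sc.stop()
      }
    }
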
>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>
>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug fixing. We're on the same page now.
>>>>>
>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>>>>
>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z releases. The Dataset API is experimental, so we might be changing the APIs before we declare it stable. This is why I think it is important to first stabilize the Dataset API with a Spark 1.7 release before moving to Spark 2.0. This will benefit users that would like to use the new Dataset APIs but can't move to Spark 2.0 because of the backwards-incompatible changes, like removal of deprecated APIs, Scala 2.11, etc.
>>>>>>
>>>>>> Kostas
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>
>>>>>>> Why does stabilization of those two features require a 1.7 release instead of 1.6.1?
>>>>>>>
>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes, we can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to propose we have one more 1.x release after Spark 1.6. This will allow us to stabilize a few of the new features that were added in 1.6:
>>>>>>>>
>>>>>>>> 1) the experimental Datasets API
>>>>>>>> 2) the new unified memory manager
>>>>>>>>
>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition, but there will be users that won't be able to seamlessly upgrade given what we have discussed as in scope for 2.0. For these users, having a 1.x release with these new features/APIs stabilized will be very beneficial. This might make Spark 1.7 a lighter release, but that is not necessarily a bad thing.
>>>>>>>>
>>>>>>>> Any thoughts on this timeline?
>>>>>>>>
>>>>>>>> Kostas Sakellis
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>>>>>>
>>>>>>>>> Agree, more features/APIs/optimizations need to be added in DF/DS.
>>>>>>>>>
>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to provide to developers; maybe the fundamental API is enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.
>>>>>>>>>
>>>>>>>>> From: Mark Hamstra [mailto:m...@clearstorydata.com]
>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
>>>>>>>>> To: Stephen Boesch
>>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames.... Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained. In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true. Like this...."
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> My understanding is that RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark dev.
>>>>>>>>>
>>>>>>>>> An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL still does not allow that knowledge to be applied to the optimizer, so a full shuffle will be performed. However, in the native RDD we can use preservesPartitioning=true.
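
A small sketch of the partitioner-aware RDD pattern Stephen and Mark are referring to (the data and partition counts are invented; whether Spark SQL can exploit existing partitioning is their claim above, not something this snippet demonstrates):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionAwareSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-sketch").setMaster("local[*]"))

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

        // Partition once by key and cache; the RDD now remembers its Partitioner.
        val byKey = pairs.partitionBy(new HashPartitioner(8)).cache()

        // mapPartitions with preservesPartitioning = true promises that keys were
        // not changed, so the known partitioning is kept for downstream operators.
        val incremented = byKey.mapPartitions(
          iter => iter.map { case (k, v) => (k, v + 1) },
          preservesPartitioning = true)

        // Because the partitioner matches the existing one, this aggregation runs
        // within the current partitions -- no additional shuffle is planned.
        val sums = incremented.reduceByKey(new HashPartitioner(8), _ + _)

        println(sums.partitioner)          // Some(HashPartitioner@...)
        sums.collect().foreach(println)
        sc.stop()
      }
    }
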
>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
>>>>>>>>>
>>>>>>>>> The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early. What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs. In this scheme, DataFrames and DataSets would be introduced and fully addressed before RDDs. They would be presented as the normal/default/standard way to do things in Spark. RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or DataSets don't fully fit.
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>>>>>>>
>>>>>>>>> I am not sure what the best practice is for this specific problem, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users.
>>>>>>>>>
>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? A lot of its functionality overlaps with DataFrame or DataSet.
>>>>>>>>>
>>>>>>>>> Hao
>>>>>>>>>
>>>>>>>>> From: Kostas Sakellis [mailto:kos...@cloudera.com]
>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
>>>>>>>>> To: Nicholas Chammas
>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>> I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allow any Spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want, but it'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.
>>>>>>>>>
>>>>>>>>> Kostas
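
The settings Kostas mentions already exist in 1.x (experimental, off by default). A minimal sketch of opting in today follows; in practice the driver-side flag is usually passed via spark-submit --conf or spark-defaults.conf, because the driver JVM is already running by the time application code builds its SparkConf:

    import org.apache.spark.{SparkConf, SparkContext}

    object IsolatedClasspathSketch {
      def main(args: Array[String]): Unit = {
        // Both settings default to false in 1.x; the proposal is essentially to
        // make this behaviour the default in 2.0. Setting the driver-side flag
        // here is illustrative only (see the note above about spark-submit).
        val conf = new SparkConf()
          .setAppName("isolated-classpath-sketch")
          .set("spark.driver.userClassPathFirst", "true")
          .set("spark.executor.userClassPathFirst", "true")

        val sc = new SparkContext(conf)
        // ... application code that ships its own dependency versions ...
        sc.stop()
      }
    }
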
>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>>
>>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX to evolve with Tungsten.
>>>>>>>>>
>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn't have to wait for everything that we want to deprecate to be replaced all at once.
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
>>>>>>>>>
>>>>>>>>> Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature.
>>>>>>>>>
>>>>>>>>> With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>
>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX to evolve with Tungsten.
>>>>>>>>>
>>>>>>>>> Best regards, Alexander
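
To make the "two machine learning packages" point concrete, here is a sketch of training the same simple model through both APIs on 1.6-era code (the tiny dataset and names are invented for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Older RDD-based package
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Newer DataFrame-based "Pipelines" package
    import org.apache.spark.ml.classification.LogisticRegression

    object TwoMLPackagesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("two-ml-packages").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)

        val rddData = sc.parallelize(Seq(
          LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
          LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

        // spark.mllib: works directly on RDD[LabeledPoint]
        val oldModel = new LogisticRegressionWithLBFGS().run(rddData)

        // spark.ml: works on a DataFrame with "label" and "features" columns
        val dfData = sqlContext.createDataFrame(rddData)
        val newModel = new LogisticRegression().fit(dfData)

        println(s"mllib weights:    ${oldModel.weights}")
        println(s"ml coefficients:  ${newModel.coefficients}")
        sc.stop()
      }
    }
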
>>>>>>>>> From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
>>>>>>>>> To: wi...@qq.com
>>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>> Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nan Zhu
>>>>>>>>> http://codingcat.me
>>>>>>>>>
>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>>>>>>>>
>>>>>>>>> Who has ideas about machine learning? Spark is missing some features for machine learning, for example, the parameter server.
>>>>>>>>>
>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I like the idea of popping out Tachyon to an optional component too, to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.
>>>>>>>>>
>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on it? I imagine we don't need many code changes, just maybe some REPL stuff.
>>>>>>>>>
>>>>>>>>> Needless to say, I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
>>>>>>>>>
>>>>>>>>> Agree with this stance. Generally, a major release might also be a time to replace some big old API or implementation with a new one, but I don't see obvious candidates.
>>>>>>>>>
>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless there's a fairly good reason to continue adding features in 1.x to a 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>>>>>>
>>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.
>>>>>>>>>
>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will be quite stable, and 2.10 will have been EOL for a while. I'd propose dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>>>
>>>>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>>>
>>>>>>>>> I'd go further and drop support for <2.2 for sure (2.0 and 2.1 were sort of 'alpha' and 'beta' releases), and even <2.6.
>>>>>>>>>
>>>>>>>>> I'm sure we'll think of a number of other small things -- shading a bunch of stuff? Reviewing and updating dependencies in light of simpler, more recent dependencies to support from Hadoop, etc.?
>>>>>>>>>
>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>>> Continue that same effort for EC2?
>>>>>>>>> Farming out some of the "external" integrations to another repo? (controversial)
>>>>>>>>>
>>>>>>>>> See also anything marked version "2+" in JIRA.
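
On the shading idea, an illustrative sketch only: Spark's actual build does its relocations through the Maven shade plugin, but the concept, expressed here with the sbt-assembly plugin (0.14+) and an invented target package, looks roughly like this:

    // build.sbt fragment (requires the sbt-assembly plugin); the source and
    // target package names are examples, not Spark's actual relocations.
    assemblyShadeRules in assembly := Seq(
      // Rewrite a commonly conflicting dependency into a private namespace so the
      // copy bundled with the framework cannot collide with the version a user
      // application brings along.
      ShadeRule.rename("com.google.common.**" -> "myorg.shaded.guava.@1").inAll
    )
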