Re: A proposal for Spark 2.0

Kostas Sakellis Thu, 12 Nov 2015 13:28:15 -0800

I know we want to keep breaking changes to a minimum, but I'm hoping that
with Spark 2.0 we can also look at better classpath isolation for user
programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
setting it to true by default, and not allow any Spark transitive dependencies
to leak into user code. For backwards compatibility we can have a whitelist
if we want, but it'd be good if we started requiring user apps to explicitly
pull in all their dependencies. From what I can tell, Hadoop 3 is also
moving in this direction.
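
The lookup order proposed above can be sketched as a toy model in Python.
This is only an illustration under stated assumptions, not Spark's class
loader: the function name, the prefix constant, the sample class names and
the jar versions are all hypothetical, and the whitelist semantics (a list of
package prefixes that user code may still load from Spark's classpath) is one
possible reading of the proposal.

```python
# Toy model of user-first class resolution; NOT Spark's actual class loader.
# Every name below (resolve, SPARK_API_PREFIX, the sample maps) is hypothetical.

SPARK_API_PREFIX = "org.apache.spark."


def resolve(name, user_classpath, spark_classpath, whitelist=()):
    """Return which side provides `name` to user code, or raise LookupError."""
    # 1. User classes win, mirroring userClassPathFirst=true.
    if name in user_classpath:
        return "user:" + user_classpath[name]
    # 2. Spark's own API stays visible to user code.
    # 3. Spark's transitive dependencies are visible only if whitelisted.
    visible = name.startswith(SPARK_API_PREFIX) or any(
        name.startswith(prefix) for prefix in whitelist
    )
    if visible and name in spark_classpath:
        return "spark:" + spark_classpath[name]
    # 4. Anything else has to be declared by the application itself.
    raise LookupError(name + " must be declared by the user application")


# Made-up sample classpaths: class name -> jar that provides it.
user = {"com.google.common.base.Optional": "guava-18.0"}
spark = {
    "org.apache.spark.SparkContext": "spark-core",
    "com.google.common.base.Optional": "guava-14.0.1",
    "org.json4s.JValue": "json4s-core",
}
```

In this model the user's own copy of a library shadows Spark's copy, Spark's
API is always reachable, and a dependency that only Spark ships is hidden
unless its package prefix is passed in `whitelist`.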


Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> With regard to machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. The current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regard to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn’t have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick
> 
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
>> Parameter Server is a new feature and thus does not match the goal of 2.0,
>> which is “to fix things that are broken in the current API and remove certain
>> deprecated APIs”. At the same time, I would be happy to have that feature.
>>
>>
>>
>> With regard to machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. The current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regard to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
>> *Sent:* Thursday, November 12, 2015 7:28 AM
>> *To:* wi...@qq.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Regarding Parameter Server specifically, I think the current agreement is
>> that PS shall exist as a third-party library instead of a component of the
>> core code base, isn't it?
>>
>>
>>
>> Best,
>>
>>
>>
>> --
>>
>> Nan Zhu
>>
>> http://codingcat.me
>>
>>
>>
>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>
>> Does anyone have ideas about machine learning? Spark is missing some
>> features for machine learning, for example the parameter server.
>>
>>
>>
>>
>>
>> On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>>
>>
>> I like the idea of popping out Tachyon to an optional component too to
>> reduce the number of dependencies. In the future, it might even be useful
>> to do this for Hadoop, but it requires too many API changes to be worth
>> doing now.
>>
>>
>>
>> Regarding Scala 2.12, we should definitely support it eventually, but I
>> don't think we need to block 2.0 on that because it can be added later too.
>> Has anyone investigated what it would take to run on there? I imagine we
>> don't need many code changes, just maybe some REPL stuff.
>>
>>
>>
>> Needless to say, but I'm all for the idea of making "major" releases as
>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>> working with the same set of releases is super important.
>>
>>
>>
>> Matei
>>
>>
>>
>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>
>>
>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com>
>> wrote:
>>
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>>
>>
>> Agree with this stance. Generally, a major release might also be a
>> time to replace some big old API or implementation with a new one, but
>> I don't see obvious candidates.
>>
>>
>>
>> I wouldn't mind turning attention to 2.x sooner than later, unless
>> there's a fairly good reason to continue adding features in 1.x to a
>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>
>>
>>
>>
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>> it has been end-of-life.
>>
>>
>>
>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>> dropping 2.10. Otherwise it's supported for 2 more years.
>>
>>
>>
>>
>>
>> 2. Remove Hadoop 1 support.
>>
>>
>>
>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>
>>
>>
>> I'm sure we'll think of a number of other small things -- shading a
>> bunch of stuff? reviewing and updating dependencies in light of
>> simpler, more recent dependencies to support from Hadoop etc?
>>
>>
>>
>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>> Pop out any Docker stuff to another repo?
>> Continue that same effort for EC2?
>> Farming out some of the "external" integrations to another repo (?
>> controversial)
>>
>>
>>
>> See also anything marked version "2+" in JIRA.
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>
>> For additional commands, e-mail: dev-h...@spark.apache.org
>
