Re: Support for local disk columnar storage for DataFrames

2015-11-12 Thread Andrew Duffy
Relevant link:
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
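For readers following along, a minimal sketch of the pattern that link points at, i.e. using Parquet on local disk as a manual "checkpoint" for a DataFrame. The helper name and path are illustrative, not an existing Spark API:

    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Hypothetical helper: write the DataFrame out as Parquet and read it back,
    // so later queries start from the on-disk copy rather than the original
    // (possibly long) lineage.
    def parquetCheckpoint(sqlContext: SQLContext, df: DataFrame, path: String): DataFrame = {
      df.write.mode("overwrite").parquet(path)
      sqlContext.read.parquet(path)
    }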

On Wed, Nov 11, 2015 at 7:31 PM, Reynold Xin  wrote:

> Thanks for the email. Can you explain what the difference is between this
> and existing formats such as Parquet/ORC?
>
>
> On Wed, Nov 11, 2015 at 4:59 AM, Cristian O <
> cristian.b.op...@googlemail.com> wrote:
>
>> Hi,
>>
>> I was wondering if there's any planned support for local disk columnar
>> storage.
>>
>> This could be an extension of the in-memory columnar store, or possibly
>> something similar to the recently added local checkpointing for RDDs
>>
>> This could also have the added benefit of enabling iterative usage for
>> DataFrames by pruning the query plan through local checkpoints.
>>
>> A further enhancement would be to add update support to the columnar
>> format (in the immutable copy-on-write sense of course), by maintaining
>> references to unchanged row blocks and only copying and mutating the ones
>> that have changed.
>>
>> A use case here is streaming and merging updates in a large dataset that
>> can be efficiently stored internally in a columnar format, rather than
>> accessing a more inefficient external  data store like HDFS or Cassandra.
>>
>> Thanks,
>> Cristian
>>
>
>


Re: A proposal for Spark 2.0

2015-11-12 Thread witgo
Does anyone have ideas about machine learning? Spark is missing some features for machine 
learning, for example a parameter server.


> On Nov 12, 2015, at 05:32, Matei Zaharia wrote:
> 
> I like the idea of popping out Tachyon to an optional component too to reduce 
> the number of dependencies. In the future, it might even be useful to do this 
> for Hadoop, but it requires too many API changes to be worth doing now.
> 
> Regarding Scala 2.12, we should definitely support it eventually, but I don't 
> think we need to block 2.0 on that because it can be added later too. Has 
> anyone investigated what it would take to run on there? I imagine we don't 
> need many code changes, just maybe some REPL stuff.
> 
> Needless to say, but I'm all for the idea of making "major" releases as 
> undisruptive as possible in the model Reynold proposed. Keeping everyone 
> working with the same set of releases is super important.
> 
> Matei
> 
>> On Nov 11, 2015, at 4:58 AM, Sean Owen  wrote:
>> 
>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin  wrote:
>>> to the Spark community. A major release should not be very different from a
>>> minor release and should not be gated based on new features. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs (examples follow).
>> 
>> Agree with this stance. Generally, a major release might also be a
>> time to replace some big old API or implementation with a new one, but
>> I don't see obvious candidates.
>> 
>> I wouldn't mind turning attention to 2.x sooner than later, unless
>> there's a fairly good reason to continue adding features in 1.x to a
>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>> 
>> 
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>>> it has been end-of-life.
>> 
>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>> dropping 2.10. Otherwise it's supported for 2 more years.
>> 
>> 
>>> 2. Remove Hadoop 1 support.
>> 
>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>> sort of 'alpha' and 'beta' releases) and even <2.6.
>> 
>> I'm sure we'll think of a number of other small things -- shading a
>> bunch of stuff? reviewing and updating dependencies in light of
>> simpler, more recent dependencies to support from Hadoop etc?
>> 
>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>> Pop out any Docker stuff to another repo?
>> Continue that same effort for EC2?
>> Farming out some of the "external" integrations to another repo (?
>> controversial)
>> 
>> See also anything marked version "2+" in JIRA.
>> 







Re: A proposal for Spark 2.0

2015-11-12 Thread Nan Zhu
Being specific to Parameter Server, I think the current agreement is that PS 
shall exist as a third-party library instead of a component of the core code 
base, isn’t it?

Best,  

--  
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

> Who has the idea of machine learning? Spark missing some features for machine 
> learning, For example, the parameter server.
>  
>  
> > On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >  
> > I like the idea of popping out Tachyon to an optional component too to 
> > reduce the number of dependencies. In the future, it might even be useful 
> > to do this for Hadoop, but it requires too many API changes to be worth 
> > doing now.
> >  
> > Regarding Scala 2.12, we should definitely support it eventually, but I 
> > don't think we need to block 2.0 on that because it can be added later too. 
> > Has anyone investigated what it would take to run on there? I imagine we 
> > don't need many code changes, just maybe some REPL stuff.
> >  
> > Needless to say, but I'm all for the idea of making "major" releases as 
> > undisruptive as possible in the model Reynold proposed. Keeping everyone 
> > working with the same set of releases is super important.
> >  
> > Matei
> >  
> > > On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
> > >  
> > > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
> > > > to the Spark community. A major release should not be very different 
> > > > from a
> > > > minor release and should not be gated based on new features. The main
> > > > purpose of a major release is an opportunity to fix things that are 
> > > > broken
> > > > in the current API and remove certain deprecated APIs (examples follow).
> > > >  
> > >  
> > >  
> > > Agree with this stance. Generally, a major release might also be a
> > > time to replace some big old API or implementation with a new one, but
> > > I don't see obvious candidates.
> > >  
> > > I wouldn't mind turning attention to 2.x sooner than later, unless
> > > there's a fairly good reason to continue adding features in 1.x to a
> > > 1.7 release. The scope as of 1.6 is already pretty darned big.
> > >  
> > >  
> > > > 1. Scala 2.11 as the default build. We should still support Scala 2.10, 
> > > > but
> > > > it has been end-of-life.
> > > >  
> > >  
> > >  
> > > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> > > be quite stable, and 2.10 will have been EOL for a while. I'd propose
> > > dropping 2.10. Otherwise it's supported for 2 more years.
> > >  
> > >  
> > > > 2. Remove Hadoop 1 support.
> > >  
> > > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> > > sort of 'alpha' and 'beta' releases) and even <2.6.
> > >  
> > > I'm sure we'll think of a number of other small things -- shading a
> > > bunch of stuff? reviewing and updating dependencies in light of
> > > simpler, more recent dependencies to support from Hadoop etc?
> > >  
> > > Farming out Tachyon to a module? (I felt like someone proposed this?)
> > > Pop out any Docker stuff to another repo?
> > > Continue that same effort for EC2?
> > > Farming out some of the "external" integrations to another repo (?
> > > controversial)
> > >  
> > > See also anything marked version "2+" in JIRA.
> > >  
>  




Re: Support for local disk columnar storage for DataFrames

2015-11-12 Thread Cristian O
Sorry, apparently only replied to Reynold, meant to copy the list as well,
so I'm self replying and taking the opportunity to illustrate with an
example.

Basically I want to conceptually do this:

    val bigDf = sqlContext.sparkContext.parallelize(1 to 1000000)
      .map(i => (i, 1)).toDF("k", "v")
    val deltaDf = sqlContext.sparkContext.parallelize(Array(1, 5))
      .map(i => (i, 1)).toDF("k", "v")

    bigDf.cache()

    bigDf.registerTempTable("big")
    deltaDf.registerTempTable("delta")

    val newBigDf = sqlContext.sql("SELECT big.k, big.v + IF(delta.v is null, 0, delta.v) " +
      "FROM big LEFT JOIN delta on big.k = delta.k")

    newBigDf.cache()
    bigDf.unpersist()


This is essentially an update of keys "1" and "5" only, in a dataset of
1 million keys.

This can be achieved efficiently if the join would preserve the cached
blocks that have been unaffected, and only copy and mutate the 2 affected
blocks corresponding to the matching join keys.

Statistics can determine which blocks actually need mutating. Note also
that shuffling is not required assuming both dataframes are pre-partitioned
by the same key K.

In SQL this could actually be expressed as an UPDATE statement or, for more
generalized use, as a MERGE UPDATE:
https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx

While this may seem like a very special case optimization, it would
effectively implement UPDATE support for cached DataFrames, for both
optimal and non-optimal usage.

I appreciate there's quite a lot here, so thank you for taking the time to
consider it.

Cristian



On 12 November 2015 at 15:49, Cristian O 
wrote:

> Hi Reynold,
>
> Thanks for your reply.
>
> Parquet may very well be used as the underlying implementation, but this
> is about more than a particular storage representation.
>
> There are a few things here that are inter-related and open different
> possibilities, so it's hard to structure, but I'll give it a try:
>
> 1. Checkpointing DataFrames - while a DF can be saved locally as parquet,
> just using that as a checkpoint would currently require explicitly reading
> it back. A proper checkpoint implementation would just save (perhaps
> asynchronously) and prune the logical plan, while allowing the caller to
> continue using the same DF, now backed by the checkpoint.
>
> It's important to prune the logical plan to avoid all kinds of issues that
> may arise from unbounded expansion with iterative use-cases, like this one
> I encountered recently: https://issues.apache.org/jira/browse/SPARK-11596
>
> But really what I'm after here is:
>
> 2. Efficient updating of cached DataFrames - The main use case here is
> keeping a relatively large dataset cached and updating it iteratively from
> streaming. For example one would like to perform ad-hoc queries on an
> incrementally updated, cached DataFrame. I expect this is already becoming
> an increasingly common use case. Note that the dataset may require merging
> (like adding) or overriding values by key, so simply appending is not
> sufficient.
>
> This is very similar in concept to updateStateByKey for regular RDDs,
> i.e. an efficient copy-on-write mechanism, albeit perhaps at the CachedBatch
> level (the row blocks of the columnar representation).
>
> This can currently be simulated with UNION or (OUTER) JOINs; however, that is
> very inefficient as it requires copying and recaching the entire dataset
> and unpersisting the original one. There are also the aforementioned
> problems with unbounded logical plans (physical plans are fine).
>
> These two together, checkpointing and updating cached DataFrames, would
> give fault-tolerant efficient updating of DataFrames, meaning streaming
> apps can take advantage of the compact columnar representation and Tungsten
> optimisations.
>
> I'm not quite sure if something like this can be achieved by other means
> or has been investigated before, hence why I'm looking for feedback here.
>
> While one could use external data stores, they would have the added IO
> penalty, plus most of what's available at the moment is either HDFS
> (extremely inefficient for updates) or key-value stores that have 5-10x
> space overhead over columnar formats.
>
> Thanks,
> Cristian
>
>
>
>
>
>
> On 12 November 2015 at 03:31, Reynold Xin  wrote:
>
>> Thanks for the email. Can you explain what the difference is between this
>> and existing formats such as Parquet/ORC?
>>
>>
>> On Wed, Nov 11, 2015 at 4:59 AM, Cristian O <
>> cristian.b.op...@googlemail.com> wrote:
>>
>>> Hi,
>>>
>>> I was wondering if there's any planned support for local disk columnar
>>> storage.
>>>
>>> This could be an extension of the in-memory columnar store, or possibly
>>> something similar to the recently added local checkpointing for RDDs
>>>
>>> This could also have the added benefit of enabling iterative usage for
>>> DataFrames by pruning the query plan through local checkpoints.
>>>
>>> A further enhancement would be to add update support to the columnar
>>> format (in the immutable copy-on-write sense o

RE: A proposal for Spark 2.0

2015-11-12 Thread Ulanov, Alexander
Parameter Server is a new feature and thus does not match the goal of 2.0, which is 
“to fix things that are broken in the current API and remove certain deprecated 
APIs”. At the same time I would be happy to have that feature.

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.
With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to Dataframe. This will allow GraphX to evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: wi...@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS 
shall exist as a third-party library instead of a component of the core code 
base, isn’t?

Best,

--
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com 
wrote:
Who has the idea of machine learning? Spark missing some features for machine 
learning, For example, the parameter server.


On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

I like the idea of popping out Tachyon to an optional component too to reduce 
the number of dependencies. In the future, it might even be useful to do this 
for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't 
think we need to block 2.0 on that because it can be added later too. Has 
anyone investigated what it would take to run on there? I imagine we don't need 
many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as 
undisruptive as possible in the model Reynold proposed. Keeping everyone 
working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.


1. Scala 2.11 as the default build. We should still support Scala 2.10, but
it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
be quite stable, and 2.10 will have been EOL for a while. I'd propose
dropping 2.10. Otherwise it's supported for 2 more years.


2. Remove Hadoop 1 support.

I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
sort of 'alpha' and 'beta' releases) and even <2.6.

I'm sure we'll think of a number of other small things -- shading a
bunch of stuff? reviewing and updating dependencies in light of
simpler, more recent dependencies to support from Hadoop etc?

Farming out Tachyon to a module? (I felt like someone proposed this?)
Pop out any Docker stuff to another repo?
Continue that same effort for EC2?
Farming out some of the "external" integrations to another repo (?
controversial)

See also anything marked version "2+" in JIRA.




Re: A proposal for Spark 2.0

2015-11-12 Thread Nicholas Chammas
With regards to Machine learning, it would be great to move useful features
from MLlib to ML and deprecate the former. Current structure of two
separate machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in
GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some
things in 2.0 without removing or replacing them immediately. That way 2.0
doesn’t have to wait for everything that we want to deprecate to be
replaced all at once.

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander 
wrote:

> Parameter Server is a new feature and thus does not match the goal of 2.0
> is “to fix things that are broken in the current API and remove certain
> deprecated APIs”. At the same time I would be happy to have that feature.
>
>
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
> *Sent:* Thursday, November 12, 2015 7:28 AM
> *To:* wi...@qq.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Being specific to Parameter Server, I think the current agreement is that
> PS shall exist as a third-party library instead of a component of the core
> code base, isn’t?
>
>
>
> Best,
>
>
>
> --
>
> Nan Zhu
>
> http://codingcat.me
>
>
>
> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>
> Who has the idea of machine learning? Spark missing some features for
> machine learning, For example, the parameter server.
>
>
>
>
>
> On Nov 12, 2015, at 05:32, Matei Zaharia wrote:
>
>
>
> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
>
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
>
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
>
>
> Matei
>
>
>
> On Nov 11, 2015, at 4:58 AM, Sean Owen  wrote:
>
>
>
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin  wrote:
>
> to the Spark community. A major release should not be very different from a
>
> minor release and should not be gated based on new features. The main
>
> purpose of a major release is an opportunity to fix things that are broken
>
> in the current API and remove certain deprecated APIs (examples follow).
>
>
>
> Agree with this stance. Generally, a major release might also be a
>
> time to replace some big old API or implementation with a new one, but
>
> I don't see obvious candidates.
>
>
>
> I wouldn't mind turning attention to 2.x sooner than later, unless
>
> there's a fairly good reason to continue adding features in 1.x to a
>
> 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>
>
>
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>
> it has been end-of-life.
>
>
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>
> dropping 2.10. Otherwise it's supported for 2 more years.
>
>
>
>
>
> 2. Remove Hadoop 1 support.
>
>
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>
> sort of 'alpha' and 'beta' releases) and even <2.6.
>
>
>
> I'm sure we'll think of a number of other small things -- shading a
>
> bunch of stuff? reviewing and updating dependencies in light of
>
> simpler, more recent dependencies to support from Hadoop etc?
>
>
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
>
> Pop out any Docker stuff to another repo?
>
> Continue that same effort for EC2?
>
> Farming out some of the "external" integrations to another repo (?
>
> controversial)
>
>
>
> See also anything marked version "2+" in JIRA.
>
>
>
> -
>
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>
>
> -
>
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>
> For additional commands, e-mail: dev-h...@spark.

Re: Proposal for SQL join optimization

2015-11-12 Thread Zhan Zhang
Hi Xiao,

Performance-wise, without the manual tuning the query cannot finish, while with 
the tuning it finishes in minutes on TPC-H 100G data.

I have created https://issues.apache.org/jira/browse/SPARK-11704 and 
https://issues.apache.org/jira/browse/SPARK-11705 for these two issues, and we 
can move the discussion there.

Thanks.

Zhan Zhang

On Nov 11, 2015, at 6:16 PM, Xiao Li <gatorsm...@gmail.com> wrote:

Hi, Zhan,

That sounds really interesting! Please at me when you submit the PR. If 
possible, please also posted the performance difference.

Thanks,

Xiao Li


2015-11-11 14:45 GMT-08:00 Zhan Zhang <zzh...@hortonworks.com>:
Hi Folks,

I did some performance measurements based on TPC-H recently, and want to bring 
up two performance issues I observed. Both are related to Cartesian joins.

1. CartesianProduct implementation.

Currently CartesianProduct relies on RDD.cartesian, in which the computation is 
realized as follows

  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
  }

From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, 
which is really heavy and may never finish if n is large, especially when 
rdd2 comes from a ShuffleRDD.

We should have some optimization in CartesianProduct by caching rightResults. 
The problem is that, AFAIK, we don’t have a cleanup hook to unpersist rightResults. 
I think we should have a cleanup hook that runs after query execution.
With such a hook available, we can easily optimize this kind of Cartesian join, and I 
believe it may also benefit other query optimizations.
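As a rough sketch of the manual workaround described above (not the proposed built-in optimization; rdd1 and rdd2 stand in for the two sides of the join):

    // Materialize the right side once so it is not recomputed for every
    // partition of the left side during the cartesian product.
    val right = rdd2.cache()
    val product = rdd1.cartesian(right)
    product.count()      // run whatever action the query needs
    right.unpersist()    // the step a post-execution cleanup hook could automate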


2. Unnecessary CartesianProduct join.

When we have queries similar to the following (I don’t remember the exact form):

select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 = d.key3

there will be a Cartesian join between a and b. But if we simply change 
the table order, for example to a, c, b, d, the Cartesian join is 
eliminated.
Without such manual tuning, the query will never finish if a and c are big, but we 
should not rely on such manual optimization.
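For concreteness, the manually reordered query mentioned above would look something like this (table and key names as in the example):

    // Reordered so each newly joined table shares a join key with one already
    // joined, letting the planner avoid the a-b cartesian product.
    sqlContext.sql("""
      SELECT * FROM a, c, b, d
      WHERE a.key1 = c.key1 AND b.key2 = c.key2 AND c.key3 = d.key3
    """)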


Please provide your input. If they are both valid, I will open JIRAs for each.

Thanks.

Zhan Zhang






[build system] short jenkins downtime tomorrow morning, 11-13-2015 @ 7am PST

2015-11-12 Thread shane knapp
i will admit that it does seem like a bad idea to poke jenkins on
friday the 13th, but there's a release that fixes a lot of security
issues:

https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-11-11

i'll set jenkins to stop kicking off any new builds around 5am PST,
and will upgrade and restart jenkins around 7am PST.  barring anything
horrible happening, we should be back up and building by 730am.

...and this time, i promise not to touch any of the plugins.  :)

shane




Re: A proposal for Spark 2.0

2015-11-12 Thread Kostas Sakellis
I know we want to keep breaking changes to a minimum, but I'm hoping that
with Spark 2.0 we can also look at better classpath isolation for user
programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
setting it to true by default, and not allow any Spark transitive dependencies
to leak into user code. For backwards compatibility we can have a whitelist
if we want, but it'd be good if we start requiring user apps to explicitly
pull in all their dependencies. From what I can tell, Hadoop 3 is also
moving in this direction.
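For reference, the existing flags being referred to can already be turned on per application today; the proposal above is about making that the default. A minimal sketch:

    import org.apache.spark.SparkConf

    // Opt in to user-classpath-first class loading on current releases.
    val conf = new SparkConf()
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")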

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn’t have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick
> ​
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
>> Parameter Server is a new feature and thus does not match the goal of 2.0
>> is “to fix things that are broken in the current API and remove certain
>> deprecated APIs”. At the same time I would be happy to have that feature.
>>
>>
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
>> *Sent:* Thursday, November 12, 2015 7:28 AM
>> *To:* wi...@qq.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Being specific to Parameter Server, I think the current agreement is that
>> PS shall exist as a third-party library instead of a component of the core
>> code base, isn’t?
>>
>>
>>
>> Best,
>>
>>
>>
>> --
>>
>> Nan Zhu
>>
>> http://codingcat.me
>>
>>
>>
>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>
>> Who has the idea of machine learning? Spark missing some features for
>> machine learning, For example, the parameter server.
>>
>>
>>
>>
>>
>> On Nov 12, 2015, at 05:32, Matei Zaharia wrote:
>>
>>
>>
>> I like the idea of popping out Tachyon to an optional component too to
>> reduce the number of dependencies. In the future, it might even be useful
>> to do this for Hadoop, but it requires too many API changes to be worth
>> doing now.
>>
>>
>>
>> Regarding Scala 2.12, we should definitely support it eventually, but I
>> don't think we need to block 2.0 on that because it can be added later too.
>> Has anyone investigated what it would take to run on there? I imagine we
>> don't need many code changes, just maybe some REPL stuff.
>>
>>
>>
>> Needless to say, but I'm all for the idea of making "major" releases as
>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>> working with the same set of releases is super important.
>>
>>
>>
>> Matei
>>
>>
>>
>> On Nov 11, 2015, at 4:58 AM, Sean Owen  wrote:
>>
>>
>>
>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin 
>> wrote:
>>
>> to the Spark community. A major release should not be very different from
>> a
>>
>> minor release and should not be gated based on new features. The main
>>
>> purpose of a major release is an opportunity to fix things that are broken
>>
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>>
>>
>> Agree with this stance. Generally, a major release might also be a
>>
>> time to replace some big old API or implementation with a new one, but
>>
>> I don't see obvious candidates.
>>
>>
>>
>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>
>> there's a fairly good reason to continue adding features in 1.x to a
>>
>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>
>>
>>
>>
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but
>>
>> it has been end-of-life.
>>
>>
>>
>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>
>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>
>> dropping 2.10. Otherwise it's supported for 2 more years.
>>
>>
>>
>>
>>
>> 2. Remove Hadoop 1 support.
>>
>>
>>
>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>
>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>
>>
>>
>> I'm sure we'll think of a number of other small things --

RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
I am not sure what the best practice is for this specific problem, but it’s really 
worth thinking about in 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or make it an internal 
API only)? Lots of its functionality overlaps with DataFrame or DataSet.

Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with 
Spark 2.0 we can also look at better classpath isolation with user programs. I 
propose we build on spark.{driver|executor}.userClassPathFirst, setting it true 
by default, and not allow any spark transitive dependencies to leak into user 
code. For backwards compatibility we can have a whitelist if we want but I'd be 
good if we start requiring user apps to explicitly pull in all their 
dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to Dataframe. This will allow GraphX evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 
2.0 without removing or replacing them immediately. That way 2.0 doesn’t have 
to wait for everything that we want to deprecate to be replaced all at once.

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
Parameter Server is a new feature and thus does not match the goal of 2.0 is 
“to fix things that are broken in the current API and remove certain deprecated 
APIs”. At the same time I would be happy to have that feature.

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.
With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to Dataframe. This will allow GraphX evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: wi...@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS 
shall exist as a third-party library instead of a component of the core code 
base, isn’t?

Best,

--
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com 
wrote:
Who has the idea of machine learning? Spark missing some features for machine 
learning, For example, the parameter server.


On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com> wrote:

I like the idea of popping out Tachyon to an optional component too to reduce 
the number of dependencies. In the future, it might even be useful to do this 
for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't 
think we need to block 2.0 on that because it can be added later too. Has 
anyone investigated what it would take to run on there? I imagine we don't need 
many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as 
undisruptive as possible in the model Reynold proposed. Keeping everyone 
working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com> wrote:
to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.


1. Scala 2.11 as the default build. We should still support Scala 2.10, but
it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
be qui

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-12 Thread Jeff Zhang
I didn't notice that I can pass comma-separated paths to the existing API
(SparkContext#textFile), so a new API isn't necessary. Thanks all.
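For the archives, the existing behavior being referred to (comma-separated paths, plus the glob syntax mentioned further down the thread) looks roughly like this; the paths are illustrative:

    // SparkContext#textFile hands its argument to Hadoop's FileInputFormat,
    // which accepts a comma-separated list of paths and glob patterns.
    val rdd = sc.textFile("/data/2015/part-1.txt,/data/2015/part-2.txt,/data/archive/*.log")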



On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang  wrote:

> Hi Pradeep
>
> >>> Looks like what I was suggesting doesn't work. :/
> I guess you mean putting the comma-separated paths into one string and passing it
> to the existing API (SparkContext#textFile). That should not work. I suggest
> creating a new API, SparkContext#textFiles, that accepts an array of strings. I have
> already implemented a simple patch and it works.
>
>
>
>
> On Thu, Nov 12, 2015 at 10:17 AM, Pradeep Gollakota 
> wrote:
>
>> Looks like what I was suggesting doesn't work. :/
>>
>> On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang  wrote:
>>
>>> Yes, that's what I suggest. TextInputFormat supports multiple inputs, so
>>> on the Spark side we just need to provide an API for that.
>>>
>>> On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota >> > wrote:
>>>
 IIRC, TextInputFormat supports an input path that is a comma separated
 list. I haven't tried this, but I think you should just be able to do
 sc.textFile("file1,file2,...")

 On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang  wrote:

> I know these workarounds, but wouldn't it be more convenient and
> straightforward to use SparkContext#textFiles?
>
> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra  > wrote:
>
>> For more than a small number of files, you'd be better off using
>> SparkContext#union instead of RDD#union.  That will avoid building up a
>> lengthy lineage.
>>
>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
>> wrote:
>>
>>> Hey Jeff,
>>> Do you mean reading from multiple text files? In that case, as a
>>> workaround, you can use the RDD#union() (or ++) method to concatenate
>>> multiple rdds. For example:
>>>
>>> val lines1 = sc.textFile("file1")
>>> val lines2 = sc.textFile("file2")
>>>
>>> val rdd = lines1 union lines2
>>>
>>> regards,
>>> --Jakob
>>>
>>> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>>>
 Although users can use the HDFS glob syntax to supply multiple
 inputs, sometimes it is not convenient to do that. I'm not sure why
 there's no SparkContext#textFiles API; it should be easy to implement.
 I'd love to create a ticket and contribute it if there's no
 other consideration that I'm not aware of.

 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


Re: A proposal for Spark 2.0

2015-11-12 Thread Mark Hamstra
The place of the RDD API in 2.0 is also something I've been wondering
about.  I think it may be going too far to deprecate it, but changing
emphasis is something that we might consider.  The RDD API came well before
DataFrames and DataSets, so programming guides, introductory how-to
articles and the like have, to this point, also tended to emphasize RDDs --
or at least to deal with them early.  What I'm thinking is that with 2.0
maybe we should overhaul all the documentation to de-emphasize and
reposition RDDs.  In this scheme, DataFrames and DataSets would be
introduced and fully addressed before RDDs.  They would be presented as the
normal/default/standard way to do things in Spark.  RDDs, in contrast,
would be presented later as a kind of lower-level, closer-to-the-metal API
that can be used in atypical, more specialized contexts where DataFrames or
DataSets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao  wrote:

> I am not sure what the best practice for this specific problem, but it’s
> really worth to think about it in 2.0, as it is a painful issue for lots of
> users.
>
>
>
> By the way, is it also an opportunity to deprecate the RDD API (or
> internal API only?)? As lots of its functionality overlapping with
> DataFrame or DataSet.
>
>
>
> Hao
>
>
>
> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
> *Sent:* Friday, November 13, 2015 5:27 AM
> *To:* Nicholas Chammas
> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
> Reynold Xin
>
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> I know we want to keep breaking changes to a minimum but I'm hoping that
> with Spark 2.0 we can also look at better classpath isolation with user
> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> setting it true by default, and not allow any spark transitive dependencies
> to leak into user code. For backwards compatibility we can have a whitelist
> if we want but I'd be good if we start requiring user apps to explicitly
> pull in all their dependencies. From what I can tell, Hadoop 3 is also
> moving in this direction.
>
>
>
> Kostas
>
>
>
> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn’t have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick
>
> ​
>
>
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Parameter Server is a new feature and thus does not match the goal of 2.0
> is “to fix things that are broken in the current API and remove certain
> deprecated APIs”. At the same time I would be happy to have that feature.
>
>
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
> *Sent:* Thursday, November 12, 2015 7:28 AM
> *To:* wi...@qq.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Being specific to Parameter Server, I think the current agreement is that
> PS shall exist as a third-party library instead of a component of the core
> code base, isn’t?
>
>
>
> Best,
>
>
>
> --
>
> Nan Zhu
>
> http://codingcat.me
>
>
>
> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>
> Who has the idea of machine learning? Spark missing some features for
> machine learning, For example, the parameter server.
>
>
>
>
>
> On Nov 12, 2015, at 05:32, Matei Zaharia wrote:
>
>
>
> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
>
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
>
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of

Seems jenkins is down (or very slow)?

2015-11-12 Thread Yin Huai
Hi Guys,

Seems Jenkins is down or very slow? Does anyone else experience it or just
me?

Thanks,

Yin


Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Yin Huai
Seems it is back.

On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai  wrote:

> Hi Guys,
>
> Seems Jenkins is down or very slow? Does anyone else experience it or just
> me?
>
> Thanks,
>
> Yin
>


Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Ted Yu
I was able to access the following where response was fast:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45806/

Cheers

On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai  wrote:

> Hi Guys,
>
> Seems Jenkins is down or very slow? Does anyone else experience it or just
> me?
>
> Thanks,
>
> Yin
>


Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Fengdong Yu
I can access it directly from China.



> On Nov 13, 2015, at 10:28 AM, Ted Yu  wrote:
> 
> I was able to access the following where response was fast:
> 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN 
> 
> 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45806/ 
> 
> 
> Cheers
> 
> On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai  > wrote:
> Hi Guys,
> 
> Seems Jenkins is down or very slow? Does anyone else experience it or just me?
> 
> Thanks,
> 
> Yin
> 



Re: A proposal for Spark 2.0

2015-11-12 Thread Stephen Boesch
My understanding is that RDDs presently have more support for
complete control of partitioning, which is a key consideration at scale.
While partitioning control is still piecemeal in DF/DS, it would seem
premature to make RDDs a second-tier approach to Spark development.

An example is the use of groupBy when we know that the source relation
(/RDD) is already partitioned on the grouping expressions. AFAIK Spark
SQL still does not allow that knowledge to be applied by the optimizer, so
a full shuffle will be performed. However, in the native RDD API we can use
preservesPartitioning=true.
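A small, hedged sketch of the RDD-level pattern referred to above (the RDD contents, partition count, and transformation are illustrative):

    import org.apache.spark.HashPartitioner

    // Partition once by key, then transform with preservesPartitioning = true so
    // the partitioner is retained and the later key-based aggregation needs no shuffle.
    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))
    val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()

    val transformed = partitioned.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 2) },
      preservesPartitioning = true)

    // reduceByKey reuses the existing HashPartitioner, so no re-shuffle happens here.
    val sums = transformed.reduceByKey(_ + _)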

2015-11-12 17:42 GMT-08:00 Mark Hamstra :

> The place of the RDD API in 2.0 is also something I've been wondering
> about.  I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider.  The RDD API came well before
> DataFrames and DataSets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs --
> or at least to deal with them early.  What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs.  In this scheme, DataFrames and DataSets would be
> introduced and fully addressed before RDDs.  They would be presented as the
> normal/default/standard way to do things in Spark.  RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames or
> DataSets don't fully fit.
>
> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao  wrote:
>
>> I am not sure what the best practice for this specific problem, but it’s
>> really worth to think about it in 2.0, as it is a painful issue for lots of
>> users.
>>
>>
>>
>> By the way, is it also an opportunity to deprecate the RDD API (or
>> internal API only?)? As lots of its functionality overlapping with
>> DataFrame or DataSet.
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
>> *Sent:* Friday, November 13, 2015 5:27 AM
>> *To:* Nicholas Chammas
>> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
>> Reynold Xin
>>
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> I know we want to keep breaking changes to a minimum but I'm hoping that
>> with Spark 2.0 we can also look at better classpath isolation with user
>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>> setting it true by default, and not allow any spark transitive dependencies
>> to leak into user code. For backwards compatibility we can have a whitelist
>> if we want but I'd be good if we start requiring user apps to explicitly
>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>> moving in this direction.
>>
>>
>>
>> Kostas
>>
>>
>>
>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>> On that note of deprecating stuff, it might be good to deprecate some
>> things in 2.0 without removing or replacing them immediately. That way 2.0
>> doesn’t have to wait for everything that we want to deprecate to be
>> replaced all at once.
>>
>> Nick
>>
>> ​
>>
>>
>>
>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>> alexander.ula...@hpe.com> wrote:
>>
>> Parameter Server is a new feature and thus does not match the goal of 2.0
>> is “to fix things that are broken in the current API and remove certain
>> deprecated APIs”. At the same time I would be happy to have that feature.
>>
>>
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
>> *Sent:* Thursday, November 12, 2015 7:28 AM
>> *To:* wi...@qq.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Being specific to Parameter Server, I think the current agreement is that
>> PS shall exist as a third-party library instead of a component of the core
>> code base, isn’t?
>>
>>
>>
>> Best,
>>
>>
>>
>> --
>>
>> Nan Zhu
>>
>> http://codingcat.me
>>
>>
>>
>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>
>> Who has the idea of machine learning? Spark missing some features for
>> machine learning, For example, the parame

Re: A proposal for Spark 2.0

2015-11-12 Thread Mark Hamstra
Hmmm... to me, that seems like precisely the kind of thing that argues for
retaining the RDD API, but not as the first thing presented to new Spark
developers: "Here's how to use groupBy with DataFrames.... Until the
optimizer is more fully developed, that won't always get you the best
performance that can be obtained. In these particular circumstances, ...,
you may want to use the low-level RDD API while setting
preservesPartitioning to true. Like this...."

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch  wrote:

> My understanding is that  the RDD's presently have more support for
> complete control of partitioning which is a key consideration at scale.
> While partitioning control is still piecemeal in  DF/DS  it would seem
> premature to make RDD's a second-tier approach to spark dev.
>
> An example is the use of groupBy when we know that the source relation
> (/RDD) is already partitioned on the grouping expressions.  AFAIK the spark
> sql still does not allow that knowledge to be applied to the optimizer - so
> a full shuffle will be performed. However in the native RDD we can use
> preservesPartitioning=true.
>
> 2015-11-12 17:42 GMT-08:00 Mark Hamstra :
>
>> The place of the RDD API in 2.0 is also something I've been wondering
>> about.  I think it may be going too far to deprecate it, but changing
>> emphasis is something that we might consider.  The RDD API came well before
>> DataFrames and DataSets, so programming guides, introductory how-to
>> articles and the like have, to this point, also tended to emphasize RDDs --
>> or at least to deal with them early.  What I'm thinking is that with 2.0
>> maybe we should overhaul all the documentation to de-emphasize and
>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>> introduced and fully addressed before RDDs.  They would be presented as the
>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>> would be presented later as a kind of lower-level, closer-to-the-metal API
>> that can be used in atypical, more specialized contexts where DataFrames or
>> DataSets don't fully fit.
>>
>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao  wrote:
>>
>>> I am not sure what the best practice for this specific problem, but it’s
>>> really worth to think about it in 2.0, as it is a painful issue for lots of
>>> users.
>>>
>>>
>>>
>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>> internal API only?)? As lots of its functionality overlapping with
>>> DataFrame or DataSet.
>>>
>>>
>>>
>>> Hao
>>>
>>>
>>>
>>> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>> *To:* Nicholas Chammas
>>> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
>>> Reynold Xin
>>>
>>> *Subject:* Re: A proposal for Spark 2.0
>>>
>>>
>>>
>>> I know we want to keep breaking changes to a minimum but I'm hoping that
>>> with Spark 2.0 we can also look at better classpath isolation with user
>>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>>> setting it true by default, and not allow any spark transitive dependencies
>>> to leak into user code. For backwards compatibility we can have a whitelist
>>> if we want but I'd be good if we start requiring user apps to explicitly
>>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>>> moving in this direction.
>>>
>>>
>>>
>>> Kostas
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>> With regards to Machine learning, it would be great to move useful
>>> features from MLlib to ML and deprecate the former. Current structure of
>>> two separate machine learning packages seems to be somewhat confusing.
>>>
>>> With regards to GraphX, it would be great to deprecate the use of RDD in
>>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>>
>>> On that note of deprecating stuff, it might be good to deprecate some
>>> things in 2.0 without removing or replacing them immediately. That way 2.0
>>> doesn’t have to wait for everything that we want to deprecate to be
>>> replaced all at once.
>>>
>>> Nick
>>>
>>> ​
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>> alexander.ula...@hpe.com> wrote:
>>>
>>> Parameter Server is a new feature and thus does not match the goal of
>>> 2.0 is “to fix things that are broken in the current API and remove certain
>>> deprecated APIs”. At the same time I would be happy to have that feature.
>>>
>>>
>>>
>>> With regards to Machine learning, it would be great to move useful
>>> features from MLlib to ML and deprecate the former. Current structure of
>>> two separate machine learning packages seems to be somewhat confusing.
>>>
>>> With regards to GraphX, it would be great to deprecate the use of RDD in
>>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>>
>>>
>>>
>>> Best regards, Alexander
>>>
>>>
>>>
>

Re: RE: A proposal for Spark 2.0

2015-11-12 Thread Guoqiang Li
Yes, I agree with  Nan Zhu. I recommend these projects:
https://github.com/dmlc/ps-lite (Apache License 2)
https://github.com/Microsoft/multiverso (MIT License)


Alexander, You may also be interested in the demo(graph on parameter Server) 


https://github.com/witgo/zen/tree/ps_graphx/graphx/src/main/scala/com/github/cloudml/zen/graphx







-- Original --
From: "Ulanov, Alexander"
Date: Fri, Nov 13, 2015 01:44 AM
To: "Nan Zhu"; "Guoqiang Li"
Cc: "dev@spark.apache.org"; "Reynold Xin"
Subject: RE: A proposal for Spark 2.0



  
Parameter Server is a new feature and thus does not match the goal of 2.0 is 
??to fix things that are broken in the current API and remove certain 
deprecated APIs??.  At the same time I would be happy to have that feature.
 
 
 
With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine  learning packages seems to be somewhat confusing.
 
With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to Dataframe. This will allow GraphX evolve with Tungsten.
 
 
 
Best regards, Alexander
 
 
 
From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: wi...@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS
shall exist as a third-party library instead of a component of the core code
base, isn’t it?

Best,

--
Nan Zhu
http://codingcat.me

RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
Agree, more features/APIs/optimizations need to be added to DF/DS.

I mean, we need to think about what kind of RDD APIs we have to provide to
developers; maybe the fundamental APIs are enough, like ShuffledRDD etc.
But PairRDDFunctions is probably not in this category, as we can do the same thing
easily with DF/DS, with even better performance.
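As a rough sketch of that overlap (assuming Spark 1.x APIs; the object name and
sample data are invented for illustration), the same aggregation can be written
once against PairRDDFunctions and once against DataFrames:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object PairRddVsDataFrame {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-vs-df"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        // RDD route: PairRDDFunctions.reduceByKey on an RDD of key/value tuples.
        val rddTotals = pairs.reduceByKey(_ + _)

        // DataFrame route: the equivalent groupBy/sum goes through the Catalyst optimizer.
        val dfTotals = pairs.toDF("key", "value").groupBy("key").sum("value")

        rddTotals.collect().foreach(println)
        dfTotals.show()

        sc.stop()
      }
    }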

From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Friday, November 13, 2015 11:23 AM
To: Stephen Boesch
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Hmmm... to me, that seems like precisely the kind of thing that argues for
retaining the RDD API, but not as the first thing presented to new Spark
developers: "Here's how to use groupBy with DataFrames.... Until the optimizer
is more fully developed, that won't always get you the best performance that
can be obtained.  In these particular circumstances, ..., you may want to use
the low-level RDD API while setting preservesPartitioning to true.  Like
this...."

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
My understanding is that the RDDs presently have more support for complete
control of partitioning, which is a key consideration at scale. While
partitioning control is still piecemeal in DF/DS, it would seem premature to
make RDDs a second-tier approach to Spark development.

An example is the use of groupBy when we know that the source relation (/RDD)
is already partitioned on the grouping expressions. AFAIK Spark SQL still does
not allow that knowledge to be applied to the optimizer, so a full shuffle will
be performed. However, with the native RDD API we can use
preservesPartitioning=true.
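A minimal sketch of the RDD-side technique being described (assuming Spark 1.x
in spark-shell, where sc is predefined; the sample data is made up):

    import org.apache.spark.HashPartitioner

    // Hash-partition once on the key and cache; the partitioner is now known to Spark.
    val events = sc.parallelize(Seq(("user1", 10), ("user2", 5), ("user1", 7)))
    val byUser = events.partitionBy(new HashPartitioner(8)).cache()

    // Because byUser already has a matching partitioner, reduceByKey aggregates
    // within partitions and does not trigger another shuffle.
    val totals = byUser.reduceByKey(_ + _)

    // mapPartitions with preservesPartitioning = true keeps the partitioner metadata,
    // so later key-based operations can continue to avoid shuffles.
    val doubled = byUser.mapPartitions(
      iter => iter.map { case (user, n) => (user, n * 2) },
      preservesPartitioning = true)

A DataFrame groupBy over the same pre-partitioned data currently still plans a
full exchange, which is the gap described above.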

2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
The place of the RDD API in 2.0 is also something I've been wondering about.  I 
think it may be going too far to deprecate it, but changing emphasis is 
something that we might consider.  The RDD API came well before DataFrames and 
DataSets, so programming guides, introductory how-to articles and the like 
have, to this point, also tended to emphasize RDDs -- or at least to deal with 
them early.  What I'm thinking is that with 2.0 maybe we should overhaul all 
the documentation to de-emphasize and reposition RDDs.  In this scheme, 
DataFrames and DataSets would be introduced and fully addressed before RDDs.  
They would be presented as the normal/default/standard way to do things in 
Spark.  RDDs, in contrast, would be presented later as a kind of lower-level, 
closer-to-the-metal API that can be used in atypical, more specialized contexts 
where DataFrames or DataSets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
I am not sure what the best practice for this specific problem is, but it’s really
worth thinking about for 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or the internal API
only)? A lot of its functionality overlaps with DataFrame or DataSet.

Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; 
dev@spark.apache.org; Reynold Xin

Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum, but I'm hoping that with
Spark 2.0 we can also look at better classpath isolation with user programs. I
propose we build on spark.{driver|executor}.userClassPathFirst, setting it true
by default, and not allow any Spark transitive dependencies to leak into user
code. For backwards compatibility we can have a whitelist if we want, but it'd
be good if we start requiring user apps to explicitly pull in all their
dependencies. From what I can tell, Hadoop 3 is also moving in this direction.
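These properties already exist as experimental settings in current releases, so
here is a sketch of what opting in looks like today (the app name is made up;
this is not a statement about what the 2.0 defaults will be):

    import org.apache.spark.{SparkConf, SparkContext}

    // Opt in to user-classpath-first resolution. The executor setting can be supplied
    // here; the driver setting is normally supplied at launch time (spark-submit --conf
    // or spark-defaults.conf) so that it takes effect before driver classes are loaded.
    val conf = new SparkConf()
      .setAppName("isolated-classpath-app")
      .set("spark.executor.userClassPathFirst", "true")

    val sc = new SparkContext(conf)

At launch time the same pair can be passed with
--conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true.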

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.
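To make the confusion concrete, a small hypothetical sketch of the two parallel
APIs (logistic regression is just one example; trainingRdd and trainingDf are
placeholders, not defined here):

    // RDD-based API in org.apache.spark.mllib: models are built from RDD[LabeledPoint].
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    // val mllibModel = new LogisticRegressionWithLBFGS().run(trainingRdd)

    // DataFrame-based API in org.apache.spark.ml: estimators fit DataFrames and
    // compose into Pipelines.
    import org.apache.spark.ml.classification.LogisticRegression
    // val mlModel = new LogisticRegression().setMaxIter(10).fit(trainingDf)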

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX
and switch to DataFrames. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 
2.0 without removing or replacing them immediately. That way 2.0 doesn’t have 
to wait for everything that we want to deprecate to be replaced all at once.
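Mechanically, "deprecate without removing" could look like the following minimal
sketch (a hypothetical method, not an actual Spark API):

    object Example {
      // The old entry point keeps working throughout the 2.x line, but the compiler
      // warns users toward the replacement.
      @deprecated("Use the DataFrame-based replacement instead", "2.0.0")
      def legacyFancyOp(input: Seq[Int]): Int = input.sum
    }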

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
Parameter Server is a new feature and thus does not match the goal of 2.0, which is
“to fix things that are broken in the current API and remove certain deprecated
APIs”. At the same time I would be happy to have that feature.

With regards to Machine learning, it would be