Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Andrew Or
This seems like a legitimate blocker. We will cut another RC to include the
revert.

2014-11-16 17:29 GMT-08:00 Kousuke Saruta saru...@oss.nttdata.co.jp:

 Now I've finished reverting SPARK-4434 and have opened a PR.


 (2014/11/16 17:08), Josh Rosen wrote:

 -1

 I found a potential regression in 1.1.1 related to spark-submit and
 cluster
 deploy mode: https://issues.apache.org/jira/browse/SPARK-4434

 I think that this is worth fixing.

 On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian lian.cs@gmail.com
 wrote:

  +1

 Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues
 are
 fixed. Hive version inspection works as expected.


 On 11/15/14 8:25 AM, Zach Fry wrote:

  +0

 I expect to start testing on Monday but won't have enough results to
 change
 my vote from +0
 until Monday night or Tuesday morning.

 Thanks,
 Zach



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-1-RC1-tp9311p9370.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org











Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Debasish Das
Andrew,

I built the 1.1.1 branch and I am getting shuffle failures while doing a
flatMap followed by a groupBy. My cluster memory is less than the memory I
need: the flatMap does around 400 GB of shuffle, while cluster memory is
around 120 GB.

14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in stage 191.0 (TID
4084, istgbd020.hadoop.istg.verizon.com): FetchFailed(null, shuffleId=4,
mapId=-1, reduceId=22)

I searched the user list, and this issue has been reported there:

http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-partitionBy-FetchFailed-td14760.html

I want to make sure 1.1.1 does not have the same bug. -1 from me until we
figure out the root cause.

Thanks.

Deb

On Mon, Nov 17, 2014 at 10:33 AM, Andrew Or and...@databricks.com wrote:

 This seems like a legitimate blocker. We will cut another RC to include the
 revert.



Re: mvn or sbt for studying and developing Spark?

2014-11-17 Thread Michael Armbrust

 * I moved from sbt to maven in June specifically due to Andrew Or's
 describing mvn as the default build tool. Developers should keep in mind
 that Jenkins uses mvn, so we need to run mvn before submitting PRs - even
 if sbt were used for day-to-day dev work


To be clear, I think that the PR builder actually uses sbt
https://github.com/apache/spark/blob/master/dev/run-tests#L198 currently,
but there are master builds that make sure maven doesn't break (amongst
other things).


 * In addition, as Sean has alluded to, IntelliJ seems to comprehend
 the maven builds a bit more readily than sbt


Yeah, this is a very good point.  I have used `sbt/sbt gen-idea` in the
past, but I'm currently using the Maven integration of IntelliJ since it
seems more stable.


 * But for command line and day-to-day dev purposes, sbt sounds great to
 use. Those sound bites you provided about exposing built-in test databases
 for Hive and for displaying available test cases are sweet. Any
 easy/convenient way to see more of those kinds of facilities available
 through sbt?


The Spark SQL developer readme
https://github.com/apache/spark/tree/master/sql has a little bit of this,
but we really should have some documentation on using SBT as well.

 Integrating with those systems is generally easier if you are also working
 with Spark in Maven.  (And I wouldn't classify all of those Maven-built
 systems as legacy, Michael :)


Also a good point, though I've seen some pretty clever uses of sbt's
external project references to link spark into other projects.  I'll
certainly admit I have a bias towards new shiny things in general though,
so my definition of legacy is probably skewed :)


Re: mvn or sbt for studying and developing Spark?

2014-11-17 Thread Nicholas Chammas
The docs on using sbt are here:
https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt

They'll be published with 1.2.0 presumably.
On Mon, Nov 17, 2014 at 2:49 PM, Michael Armbrust mich...@databricks.com
wrote:

 


Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Kevin Markey

+0 (non-binding)

Compiled Spark, then recompiled and ran our application with 1.1.1 RC1 on
YARN, plain-vanilla Hadoop 2.3.0. No regressions.


However, there was a 12% to 22% increase in run time relative to the 1.0.0
release. (No other environment or configuration changes.) I would have
recommended +1 were it not for the added latency.


Not sure if the added latency is a function of 1.0-vs-1.1 or 1.0-vs-1.1.1
changes, as we've never tested with 1.1.0. But I thought I'd share the
results. (This is somewhat disappointing.)


Kevin Markey

On 11/17/2014 11:42 AM, Debasish Das wrote:




Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Patrick Wendell
Hey Kevin,

If you are upgrading from 1.0.X to 1.1.X, check out the upgrade notes
here [1] - it could be that default changes caused a regression for
your workload. Do you still see a regression if you restore the
configuration changes?

It's great to hear specifically about issues like this, so please fork
a new thread and describe your workload if you see a regression. The
main focus of a patch release vote like this is to test regressions
against the previous release on the same line (e.g. 1.1.1 vs 1.1.0)
though of course we still want to be cognizant of 1.0-to-1.1
regressions and make sure we can address them down the road.

[1] https://spark.apache.org/releases/spark-release-1-1-0.html
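For example, the 1.1.0 upgrade notes call out changed defaults for
spark.io.compression.codec (now snappy) and spark.broadcast.factory (now
TorrentBroadcastFactory). A quick experiment, sketched below as a
spark-defaults.conf fragment, is to restore the 1.0.x values and see whether
either explains the slowdown:

```
# Restore two 1.0.x defaults that changed in 1.1.0 (per the upgrade notes)
# to test whether the default changes account for the added run time.
spark.io.compression.codec   lzf
spark.broadcast.factory      org.apache.spark.broadcast.HttpBroadcastFactory
```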

On Mon, Nov 17, 2014 at 2:04 PM, Kevin Markey kevin.mar...@oracle.com wrote:



Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro,

MLlib v1.1 supports variance for regression and gini impurity and entropy
for classification.
http://spark.apache.org/docs/latest/mllib-decision-tree.html

If the information gain calculation can be performed by distributed
aggregation, then it might be possible to plug it into the existing
implementation. We want to perform such calculations (e.g., for the median)
for the gradient boosting models (coming up in the 1.2 release) using
absolute error and deviance as loss functions, but I don't think anyone is
planning to work on it yet. :-)

-Manish

On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta alexbare...@gmail.com
wrote:

 I see that, as of v1.1, MLlib supports regression and classification tree
 models. I assume this means that it uses a squared-error loss function for
 the first and a logistic cost function for the second. I don't see support
 for quantile regression via an absolute-error cost function. Or am I
 missing something?

 If, as it seems, this is missing, how would you recommend implementing it?

 Alex
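A quick illustration of why an absolute-error loss gives quantile regression
(plain Python, not MLlib code): the constant that minimizes mean absolute
error is the median, and more generally the "pinball" loss for quantile tau
is minimized at that quantile, whereas squared error is minimized at the mean.

```python
def pinball_loss(prediction, ys, tau):
    """Mean quantile (pinball) loss of a constant prediction over ys."""
    total = 0.0
    for y in ys:
        diff = y - prediction
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(ys)

def best_constant(ys, tau, candidates):
    """Candidate constant with the lowest pinball loss."""
    return min(candidates, key=lambda c: pinball_loss(c, ys, tau))

ys = [1.0, 2.0, 3.0, 4.0, 100.0]  # heavy-tailed sample
# tau = 0.5 is absolute error (up to a factor of 2): the optimum is the
# median, 3.0, while squared error would be minimized by the mean, 22.0.
print(best_constant(ys, 0.5, ys))  # -> 3.0
print(sum(ys) / len(ys))           # -> 22.0 (the squared-error optimum)
```

Other values of tau pick out other quantiles, which is the "quantile
regression via an absolute error cost function" being asked about.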



[VOTE][RESULT] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Andrew Or
This vote is canceled in favor of RC2, with the following blockers:

https://issues.apache.org/jira/browse/SPARK-4434
https://issues.apache.org/jira/browse/SPARK-3633

The latter involves a regression from 1.0.2 to 1.1.0, NOT from 1.1.0 to
1.1.1. For this reason, we are investigating that issue but may not
necessarily block the 1.1.1 release on it.


2014-11-17 10:42 GMT-08:00 Debasish Das debasish.da...@gmail.com:

 
 





Re: Quantile regression in tree models

2014-11-17 Thread Alessandro Baretta
Manish,

Thanks for pointing me to the relevant docs. It is unfortunate that
absolute error is not supported yet; I can't seem to find a JIRA for it.

Here's what the comments say in the current master branch:
/**
 * :: Experimental ::
 * A class that implements Stochastic Gradient Boosting
 * for regression and binary classification problems.
 *
 * The implementation is based upon:
 *   J.H. Friedman.  Stochastic Gradient Boosting.  1999.
 *
 * Notes:
 *  - This currently can be run with several loss functions.  However, only
 *    SquaredError is fully supported.  Specifically, the loss function
 *    should be used to compute the gradient (to re-label training instances
 *    on each iteration) and to weight weak hypotheses.  Currently, gradients
 *    are computed correctly for the available loss functions, but weak
 *    hypothesis weights are not computed correctly for LogLoss or
 *    AbsoluteError.  Running with those losses will likely behave
 *    reasonably, but lacks the same guarantees.
...
*/

By the looks of it, the GradientBoosting API would support an
absolute-error loss function to perform quantile regression, except for the
weak hypothesis weights. Does this refer to the weights of the leaves of
the trees?

Alex

On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde manish...@gmail.com wrote:






Using sampleByKey

2014-11-17 Thread Debasish Das
Hi,

I have an RDD whose key is a userId and whose value is (movieId, rating).

I want to sample 80% of the (movieId, rating) pairs each userId has seen
for training; the rest is for test.

val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}

val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}

val keyedTraining = keyedRatings.sample(true, 0.8, 1L)

val keyedTest = keyedRatings.subtract(keyedTraining)

blocks = sc.maxParallelism

println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")

println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")

println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")

My expectation was that the printlns would produce the exact same number of
keys for keyedRatings, keyedTraining, and keyedTest, but this is not the
case.

On MovieLens for example I am noticing the following:

Rating keys 3706

Training keys 3676

Test keys 3470

I also tried sampleByKey as follows:

val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}

val fractions = keyedRatings.map{x => (x._1, 0.8)}.collect.toMap

val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)

val keyedTest = keyedRatings.subtract(keyedTraining)

Still I get the results as:

Rating keys 3706

Training keys 3682

Test keys 3459

Any idea what's wrong here?

Are my assumptions about the behavior of sample/sampleByKey on a key-value
RDD correct? If this is a bug I can dig deeper.

Thanks.

Deb
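For what it's worth, the counts above are consistent with how these
operations are defined: each *element* is kept independently with some
probability, and nothing guarantees that at least one element per key
survives, so keys with few values can disappear from the sample entirely.
A plain-Python sketch of element-wise sampling (illustrative, not Spark
code):

```python
import random

def bernoulli_sample(pairs, fraction, seed):
    """Keep each (key, value) pair independently with probability `fraction`."""
    rng = random.Random(seed)
    return [kv for kv in pairs if rng.random() < fraction]

# Synthetic ratings: users 0..199, each with 1-3 ratings. A user with a
# single rating has a 20% chance of losing it entirely at fraction 0.8,
# in which case that key vanishes from the sample.
pairs = [(user, (movie, 5.0))
         for user in range(200) for movie in range(user % 3 + 1)]
sample = bernoulli_sample(pairs, 0.8, seed=1)

keys_all = {k for k, _ in pairs}
keys_sampled = {k for k, _ in sample}
print(len(keys_all), len(keys_sampled))  # second count is typically smaller
```

If every key must appear in both splits, one option is to group by key first
and split each key's list of values locally, rather than sampling elements
globally.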


matrix computation in spark

2014-11-17 Thread liaoyuxi
Hi,
Matrix computation is critical for the efficiency of algorithms like least
squares, the Kalman filter, and so on.
For now, the MLlib module offers limited linear algebra on matrices,
especially distributed matrices.

We have been working on establishing distributed matrix computation APIs
based on the data structures in MLlib.
The main idea is to partition the matrix into sub-blocks, based on the
strategy in the following paper:
http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
In our experiments, it is communication-optimal.
But operations like factorization may not be appropriate to carry out in
blocks.

Any suggestions and guidance are welcome.

Thanks,
Yuxi
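As a concrete sketch of the sub-block idea (plain Python standing in for a
distributed implementation, so this is illustrative only): each output block
(i, j) is the sum of block products A[i][k] * B[k][j], which is exactly the
per-block aggregation a distributed engine would perform after shuffling
block pairs together.

```python
def matmul(a, b):
    """Plain dense multiply of nested-list matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for i in range(rows)]

def split_blocks(m, bs):
    """Partition a square matrix into bs x bs blocks keyed by block index."""
    n = len(m) // bs
    return {(i, j): [row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]
            for i in range(n) for j in range(n)}

def block_add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def block_matmul(a, b, bs):
    """C[i][j] = sum over k of A[i][k] * B[k][j], computed block by block."""
    n = len(a) // bs
    ab, bb = split_blocks(a, bs), split_blocks(b, bs)
    out = {}
    for i in range(n):
        for j in range(n):
            acc = [[0] * bs for _ in range(bs)]
            for k in range(n):
                acc = block_add(acc, matmul(ab[(i, k)], bb[(k, j)]))
            out[(i, j)] = acc
    return out

def assemble(blocks, n, bs):
    """Stitch the block dictionary back into a full n x n matrix."""
    return [[blocks[(i // bs, j // bs)][i % bs][j % bs] for j in range(n)]
            for i in range(n)]

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [[2, 0, 1, 0], [0, 2, 0, 1], [1, 0, 2, 0], [0, 1, 0, 2]]
print(assemble(block_matmul(a, b, 2), 4, 2) == matmul(a, b))  # -> True
```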



Re: matrix computation in spark

2014-11-17 Thread Zongheng Yang
There's been some work at the AMPLab on a distributed matrix library on top
of Spark; see here [1]. In particular, the repo contains a couple
factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi liaoy...@huawei.com wrote:





Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro,

I think absolute error as a splitting criterion might be feasible with the
current architecture -- I think the sufficient statistics we collect
currently might be able to support it. Could you let us know scenarios
where absolute error has significantly outperformed squared error for
regression trees? Also, what is the use case that makes squared error
undesirable?

For gradient boosting, you are correct. The weak hypothesis weights refer
to the tree predictions in each of the branches. We plan to explain this in
the 1.2 documentation and maybe add some more clarifications to the Javadoc.

I will try to search for JIRAs or create new ones and update this thread.

-Manish
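To make the leaf-weight point concrete, here is a small sketch (plain
Python, an illustration rather than Spark's implementation): under
absolute-error loss, the optimal constant for each terminal region, i.e.
the per-leaf weight, is the median of the residuals in that leaf rather
than their mean.

```python
def median(xs):
    """Median of a non-empty list (midpoint of the two middle values if even)."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def leaf_values(residuals_by_leaf):
    """Optimal per-leaf constants under L1 loss: the leaf medians."""
    return {leaf: median(rs) for leaf, rs in residuals_by_leaf.items()}

# Residuals (y - current prediction) grouped by the leaf each example hits.
# Note how the outliers (5.0 and 30.0) do not drag the leaf values the way
# a mean (the squared-error optimum) would.
residuals = {"left": [-1.0, 0.0, 5.0], "right": [2.0, 2.0, 30.0]}
print(leaf_values(residuals))  # -> {'left': 0.0, 'right': 2.0}
```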

On Monday, November 17, 2014, Alessandro Baretta alexbare...@gmail.com
wrote:







Re: matrix computation in spark

2014-11-17 Thread 顾荣
Hey Yuxi,

We have also implemented a distributed matrix multiplication library in
PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We
implemented three distributed matrix multiplication algorithms on Spark. As
we see it, communication-optimal does not always mean total-optimal. Thus,
besides the CARMA matrix multiplication you mentioned, we also implemented
Block-splitting matrix multiplication and Broadcast matrix multiplication.
They are more efficient than CARMA matrix multiplication in some
situations, for example when a large matrix multiplies a small matrix.

Actually, we shared this work at the Spark Meetup@Beijing on October 26th
(http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/). The
slides can be downloaded from the archive here:
http://pan.baidu.com/s/1dDoyHX3#path=%252Fmeetup-3rd

Best,
Rong
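For readers who have not seen the broadcast strategy: when one operand is
small, shipping the whole small matrix to every partition of the large one
turns the multiply into an embarrassingly parallel map, with no shuffle of
the large matrix at all. A plain-Python sketch of the idea (illustrative,
not the marlin API):

```python
def row_times_matrix(row, b):
    """Multiply one row of the large matrix by the (small) broadcast matrix."""
    return [sum(x * b[t][j] for t, x in enumerate(row))
            for j in range(len(b[0]))]

def broadcast_multiply(a_partitions, b_small):
    """Each partition of A's rows is mapped independently against b_small,
    standing in for Spark mapping partitions with a broadcast variable."""
    return [[row_times_matrix(row, b_small) for row in part]
            for part in a_partitions]

# Large matrix A held as two partitions of rows; small matrix B "broadcast".
a_partitions = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
b_small = [[1, 0], [0, 1]]  # identity, so the product reproduces A's rows
print(broadcast_multiply(a_partitions, b_small))
# -> [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
```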





-- 
--
Rong Gu
Department of Computer Science and Technology
State Key Laboratory for Novel Software Technology
Nanjing University
Phone: +86 15850682791
Email: gurongwal...@gmail.com
Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/


Re: matrix computation in spark

2014-11-17 Thread liaoyuxi
Hi,
I checked the work in ml-matrix. For now, it doesn't include matrix
multiplication or LU decomposition. What's your plan? Can we contribute our
work for these parts?
Also, the number of row/column blocks is decided manually. As we mentioned,
the CARMA method in the paper is communication-optimal.

发件人: Zongheng Yang [mailto:zonghen...@gmail.com]
发送时间: 2014年11月18日 11:37
收件人: liaoyuxi; d...@spark.incubator.apache.org
抄送: Shivaram Venkataraman
主题: Re: matrix computation in spark

There's been some work at the AMPLab on a distributed matrix library on top of 
Spark; see here [1]. In particular, the repo contains a couple factorization 
algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi 
liaoy...@huawei.commailto:liaoy...@huawei.com wrote:
Hi,
Matrix computation is critical to the efficiency of algorithms such as least 
squares, the Kalman filter, and so on.
For now, the mllib module offers only limited linear algebra on matrices, 
especially for distributed matrices.

We have been working on establishing distributed matrix computation APIs based 
on data structures in MLlib.
The main idea is to partition the matrix into sub-blocks, based on the strategy 
in the following paper.
http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
In our experiment, it's communication-optimal.
But operations like factorization may not be appropriate to carry out in blocks.
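The block-partitioning idea described above can be sketched locally (not the MLlib API; a plain NumPy illustration with an assumed block size `bs`): C's (i, j) sub-block is accumulated from products of A's (i, k) and B's (k, j) sub-blocks, and in a distributed setting each such product would be a separate task.

```python
import numpy as np

def block_multiply(A, B, bs):
    """Multiply A (m x n) by B (n x p) via bs x bs sub-blocks.
    NumPy slicing clips ragged edge blocks automatically."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must match"
    C = np.zeros((m, p))
    for i in range(0, m, bs):          # row blocks of A / C
        for j in range(0, p, bs):      # column blocks of B / C
            for k in range(0, n, bs):  # blocks along the inner dimension
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 8))
assert np.allclose(block_multiply(A, B, 2), A @ B)
```

A communication-optimal scheme such as CARMA chooses how to split the three loop dimensions across machines; the arithmetic per sub-block is the same.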

Any suggestions and guidance are welcome.

Thanks,
Yuxi


Re: matrix computation in spark

2014-11-17 Thread Reza Zadeh
Hi Yuxi,

We are integrating the ml-matrix from the AMPlab repo into MLlib, tracked
by this JIRA: https://issues.apache.org/jira/browse/SPARK-3434

We already have matrix multiply but are missing LU decomposition. Could
you please track that JIRA; once the initial design is in, we can sync on
how to contribute LU decomposition.

Let's move the discussion to the JIRA.

Thanks!

On Mon, Nov 17, 2014 at 9:49 PM, 顾荣 gurongwal...@gmail.com wrote:

 Hey Yuxi,

 We have also implemented a distributed matrix multiplication library in
 PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We
 implemented three distributed matrix multiplication algorithms on Spark. As
 we see it, communication-optimal does not always mean total-optimal.
 Thus, besides the CARMA matrix multiplication you mentioned, we also
 implemented Block-splitting matrix multiplication and Broadcast matrix
 multiplication. They are more efficient than CARMA matrix multiplication in
 some situations, for example when a large matrix multiplies a small matrix.
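The broadcast scheme mentioned above can be sketched as follows (an illustration, not the Marlin API; a plain Python list of NumPy row blocks stands in for an RDD, and all names are assumed): the small right-hand matrix is shipped to every partition holding a row block of the large matrix, so the large matrix never needs to be shuffled.

```python
import numpy as np

def broadcast_multiply(row_blocks, B_small):
    """Sketch of broadcast matrix multiplication: B_small is 'broadcast'
    to each partition, and each row block of the large matrix multiplies
    it locally. Here row_blocks is a list of NumPy arrays standing in
    for an RDD of row blocks."""
    return [block @ B_small for block in row_blocks]

rng = np.random.default_rng(0)
A = rng.standard_normal((9, 4))
B = rng.standard_normal((4, 2))       # small matrix to broadcast
row_blocks = np.array_split(A, 3)     # "partitions" of the large matrix
C = np.vstack(broadcast_multiply(row_blocks, B))
assert np.allclose(C, A @ B)
```

This is why it beats CARMA when one operand is small: the only communication is one copy of the small matrix per partition.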

 Actually, we shared this work at the Spark Meetup@Beijing on October
 26th (
 http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/ ). The
 slides can be downloaded from the archive here:
 http://pan.baidu.com/s/1dDoyHX3#path=%252Fmeetup-3rd

 Best,
 Rong




 --
 --
 Rong Gu
 Department of Computer Science and Technology
 State Key Laboratory for Novel Software Technology
 Nanjing University
 Phone: +86 15850682791
 Email: gurongwal...@gmail.com
 Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/