Re: [VOTE] Release Apache Spark 1.1.1 (RC1)
This seems like a legitimate blocker. We will cut another RC to include the revert.

2014-11-16 17:29 GMT-08:00 Kousuke Saruta saru...@oss.nttdata.co.jp:
Now I've finished the revert for SPARK-4434 and opened a PR.

(2014/11/16 17:08), Josh Rosen wrote:
-1. I found a potential regression in 1.1.1 related to spark-submit and cluster deploy mode: https://issues.apache.org/jira/browse/SPARK-4434 I think that this is worth fixing.

On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian lian.cs@gmail.com wrote:
+1. Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues are fixed. Hive version inspection works as expected.

On 11/15/14 8:25 AM, Zach Fry wrote:
+0. I expect to start testing on Monday but won't have enough results to change my vote from +0 until Monday night or Tuesday morning. Thanks, Zach

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-1-RC1-tp9311p9370.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.1.1 (RC1)
Andrew, I put up the 1.1.1 branch and I am getting shuffle failures while doing a flatMap followed by a groupBy. My cluster memory is less than the memory I need: the flatMap does around 400 GB of shuffle, and cluster memory is around 120 GB.

14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in stage 191.0 (TID 4084, istgbd020.hadoop.istg.verizon.com): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=22)

I searched the user list and this issue has been reported there: http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-partitionBy-FetchFailed-td14760.html I wanted to make sure that 1.1.1 does not have the same bug. -1 from me till we figure out the root cause.

Thanks. Deb

On Mon, Nov 17, 2014 at 10:33 AM, Andrew Or and...@databricks.com wrote:
This seems like a legitimate blocker. We will cut another RC to include the revert.
Re: mvn or sbt for studying and developing Spark?
* I moved from sbt to maven in June, specifically due to Andrew Or's describing mvn as the default build tool. Developers should keep in mind that Jenkins uses mvn, so we need to run mvn before submitting PRs - even if sbt were used for day-to-day dev work.

To be clear, I think that the PR builder actually uses sbt (https://github.com/apache/spark/blob/master/dev/run-tests#L198) currently, but there are master builds that make sure maven doesn't break (amongst other things).

* In addition, as Sean has alluded to, IntelliJ seems to comprehend the maven builds a bit more readily than sbt.

Yeah, this is a very good point. I have used `sbt/sbt gen-idea` in the past, but I'm currently using the maven integration of IntelliJ since it seems more stable.

* But for command line and day-to-day dev purposes: sbt sounds great to use. Those sound bites you provided about exposing built-in test databases for Hive and for displaying available test cases are sweet. Any easy/convenient way to see more of those kinds of facilities available through sbt?

The Spark SQL developer readme (https://github.com/apache/spark/tree/master/sql) has a little bit of this, but we really should have some documentation on using sbt as well.

* Integrating with those systems is generally easier if you are also working with Spark in Maven. (And I wouldn't classify all of those Maven-built systems as legacy, Michael :)

Also a good point, though I've seen some pretty clever uses of sbt's external project references to link Spark into other projects. I'll certainly admit I have a bias towards new shiny things in general though, so my definition of legacy is probably skewed :)
Re: mvn or sbt for studying and developing Spark?
The docs on using sbt are here: https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt They'll be published with 1.2.0, presumably.

On Mon, Nov 17, 2014 at 2:49 PM, Michael Armbrust mich...@databricks.com wrote:
To be clear, I think that the PR builder actually uses sbt currently, but there are master builds that make sure maven doesn't break (amongst other things). The Spark SQL developer readme has a little bit of this, but we really should have some documentation on using sbt as well.
Re: [VOTE] Release Apache Spark 1.1.1 (RC1)
+0 (non-binding)

Compiled Spark, then recompiled and ran our application with 1.1.1 RC1 on Yarn, plain-vanilla Hadoop 2.3.0. No regressions. However, there is a 12% to 22% increase in run time relative to the 1.0.0 release (with no other environment or configuration changes). I would have recommended +1 were it not for the added latency. I'm not sure if the added latency is a function of 1.0-vs-1.1 or 1.0-vs-1.1.1 changes, as we've never tested with 1.1.0. But I thought I'd share the results. (This is somewhat disappointing.)

Kevin Markey

On 11/17/2014 11:42 AM, Debasish Das wrote:
Andrew, I put up the 1.1.1 branch and I am getting shuffle failures while doing a flatMap followed by a groupBy. My cluster memory is less than the memory I need and therefore the flatMap does around 400 GB of shuffle; memory is around 120 GB.
Re: [VOTE] Release Apache Spark 1.1.1 (RC1)
Hey Kevin,

If you are upgrading from 1.0.x to 1.1.x, check out the upgrade notes here [1] - it could be that default changes caused a regression for your workload. Do you still see a regression if you restore the configuration changes? It's great to hear specifically about issues like this, so please fork a new thread and describe your workload if you see a regression. The main focus of a patch release vote like this is to test for regressions against the previous release on the same line (e.g. 1.1.1 vs 1.1.0), though of course we still want to be cognizant of 1.0-to-1.1 regressions and make sure we can address them down the road.

[1] https://spark.apache.org/releases/spark-release-1-1-0.html

On Mon, Nov 17, 2014 at 2:04 PM, Kevin Markey kevin.mar...@oracle.com wrote:
+0 (non-binding). Compiled Spark, recompiled and ran application with 1.1.1 RC1 with Yarn, plain-vanilla Hadoop 2.3.0. No regressions. However, 12% to 22% increase in run time relative to 1.0.0 release.
Re: Quantile regression in tree models
Hi Alessandro,

MLlib v1.1 supports variance for regression, and Gini impurity and entropy for classification: http://spark.apache.org/docs/latest/mllib-decision-tree.html

If the information gain calculation can be performed by distributed aggregation, then it might be possible to plug it into the existing implementation. We want to perform such calculations (e.g., for the median) for the gradient boosting models (coming up in the 1.2 release) using absolute error and deviance as loss functions, but I don't think anyone is planning to work on it yet. :-)

-Manish

On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta alexbare...@gmail.com wrote:
I see that, as of v1.1, MLlib supports regression and classification tree models. I assume this means that it uses a squared-error loss function for the first and a logistic cost function for the second. I don't see support for quantile regression via an absolute-error cost function. Or am I missing something? If, as it seems, this is missing, how do you recommend to implement it?

Alex
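For readers following the thread: the absolute-error criterion Alessandro asks about is the tau = 1/2 case of the standard quantile ("pinball") loss from the statistics literature. This is background math, not a formula taken from MLlib:

```latex
% Quantile (pinball) loss at level \tau \in (0, 1):
L_\tau(y, \hat{y}) =
  \begin{cases}
    \tau \, (y - \hat{y})       & \text{if } y \ge \hat{y} \\
    (1 - \tau) \, (\hat{y} - y) & \text{if } y < \hat{y}
  \end{cases}
% For \tau = 1/2 this is \tfrac{1}{2}\,|y - \hat{y}|, so minimizing it
% predicts the conditional median; other values of \tau give other quantiles.
```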
[VOTE][RESULT] Release Apache Spark 1.1.1 (RC1)
This is canceled in favor of RC2, with the following blockers:

https://issues.apache.org/jira/browse/SPARK-4434
https://issues.apache.org/jira/browse/SPARK-3633

The latter involves a regression from 1.0.2 to 1.1.0, NOT from 1.1.0 to 1.1.1. For this reason, we are currently investigating the issue but may not necessarily block the 1.1.1 release on it.

2014-11-17 10:42 GMT-08:00 Debasish Das debasish.da...@gmail.com:
Andrew, I put up the 1.1.1 branch and I am getting shuffle failures while doing a flatMap followed by a groupBy.
Re: Quantile regression in tree models
Manish,

Thanks for pointing me to the relevant docs. It is unfortunate that absolute error is not supported yet. I can't seem to find a JIRA for it. Now, here's what the comments say in the current master branch:

/**
 * :: Experimental ::
 * A class that implements Stochastic Gradient Boosting
 * for regression and binary classification problems.
 *
 * The implementation is based upon:
 * J.H. Friedman. Stochastic Gradient Boosting. 1999.
 *
 * Notes:
 * - This currently can be run with several loss functions. However, only SquaredError is
 *   fully supported. Specifically, the loss function should be used to compute the gradient
 *   (to re-label training instances on each iteration) and to weight weak hypotheses.
 *   Currently, gradients are computed correctly for the available loss functions,
 *   but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
 *   Running with those losses will likely behave reasonably, but lacks the same guarantees.
 * ...
 */

By the looks of it, the GradientBoosting API would support an absolute-error-type loss function to perform quantile regression, except for the weak hypothesis weights. Does this refer to the weights of the leaves of the trees?

Alex

On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde manish...@gmail.com wrote:
Hi Alessandro, MLlib v1.1 supports variance for regression, and Gini impurity and entropy for classification.

On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta alexbare...@gmail.com wrote:
I see that, as of v1.1, MLlib supports regression and classification tree models.
Using sampleByKey
Hi,

I have an RDD whose key is a userId and whose value is (movieId, rating). I want to sample 80% of the (movieId, rating) pairs that each userId has seen for training; the rest is for test.

val indexedRating = sc.textFile(...).map { x => Rating(x(0), x(1), x(2)) }
val keyedRatings = indexedRating.map { x => (x.product, (x.user, x.rating)) }
val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
val keyedTest = keyedRatings.subtract(keyedTraining)
val blocks = sc.defaultParallelism
println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")

My expectation was that the printlns would report the same number of keys for keyedRatings, keyedTraining and keyedTest, but this is not the case. On MovieLens, for example, I am noticing the following:

Rating keys 3706
Training keys 3676
Test keys 3470

I also tried sampleByKey as follows:

val keyedRatings = indexedRating.map { x => (x.product, (x.user, x.rating)) }
val fractions = keyedRatings.map { x => (x._1, 0.8) }.collect.toMap
val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
val keyedTest = keyedRatings.subtract(keyedTraining)

Still I get the results as:

Rating keys 3706
Training keys 3682
Test keys 3459

Any idea what's wrong here? Are my assumptions about the behavior of sample/sampleByKey on a key-value RDD correct? If this is a bug I can dig deeper.

Thanks. Deb
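For what it's worth, the drop in key counts is what element-wise sampling predicts: `sample` keeps each element independently with the given probability, so a key with only a few values can lose all of them and disappear from the training set (and symmetrically vanish from the test set after `subtract`). A plain-Scala sketch, with `sampleKeyed` as a hypothetical stand-in for `RDD.sample` (this is not Spark's implementation), shows the effect:

```scala
import scala.util.Random

object SampleKeyLoss {
  // Stand-in for RDD.sample(withReplacement = false, fraction, seed):
  // keep each element independently with probability `fraction`.
  def sampleKeyed[K, V](data: Seq[(K, V)], fraction: Double, seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  def distinctKeys[K, V](data: Seq[(K, V)]): Int = data.map(_._1).distinct.size

  def main(args: Array[String]): Unit = {
    // 1000 keys with a single value each: at fraction 0.8 we expect roughly
    // 20% of the keys to lose their only element and vanish from the sample.
    val data = (1 to 1000).map(i => (i, i))
    val sampled = sampleKeyed(data, 0.8, seed = 1L)
    println(s"keys before: ${distinctKeys(data)}, after: ${distinctKeys(sampled)}")
  }
}
```

Note that `sampleByKey` stratifies the expected sample size per key but still samples element by element, so small keys can still lose all of their elements.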
matrix computation in spark
Hi,

Matrix computation is critical for the efficiency of algorithms like least squares, the Kalman filter, and so on. For now, the mllib module offers limited linear algebra on matrices, especially distributed matrices. We have been working on establishing distributed matrix computation APIs based on the data structures in MLlib. The main idea is to partition the matrix into sub-blocks, based on the strategy in the following paper: http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf In our experiments, it's communication-optimal. But operations like factorization may not be appropriate to carry out in blocks. Any suggestions and guidance are welcome.

Thanks, Yuxi
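To make the sub-block idea concrete, here is a minimal single-machine sketch (hypothetical illustration, not the proposed API, and it ignores the communication scheduling that CARMA optimizes): each (row-block, inner-block, column-block) product is an independent unit of work, which is what makes the scheme distributable.

```scala
object BlockMultiply {
  type Matrix = Array[Array[Double]]

  // Reference dense multiplication: C = A * B.
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val c = Array.fill(n, p)(0.0)
    for (i <- 0 until n; k <- 0 until m; j <- 0 until p)
      c(i)(j) += a(i)(k) * b(k)(j)
    c
  }

  // Same product computed over bs x bs sub-blocks; each block triple
  // (bi, bk, bj) touches only one block of A and one of B, so in a
  // distributed setting it could run as a separate task.
  def blockMultiply(a: Matrix, b: Matrix, bs: Int): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val c = Array.fill(n, p)(0.0)
    for (bi <- 0 until n by bs; bk <- 0 until m by bs; bj <- 0 until p by bs)
      for (i <- bi until math.min(bi + bs, n);
           k <- bk until math.min(bk + bs, m);
           j <- bj until math.min(bj + bs, p))
        c(i)(j) += a(i)(k) * b(k)(j)
    c
  }
}
```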
Re: matrix computation in spark
There's been some work at the AMPLab on a distributed matrix library on top of Spark; see here [1]. In particular, the repo contains a couple factorization algorithms. [1] https://github.com/amplab/ml-matrix Zongheng On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi liaoy...@huawei.com wrote: Hi, Matrix computation is critical for algorithm efficiency like least square, Kalman filter and so on. For now, the mllib module offers limited linear algebra on matrix, especially for distributed matrix. We have been working on establishing distributed matrix computation APIs based on data structures in MLlib. The main idea is to partition the matrix into sub-blocks, based on the strategy in the following paper. http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf In our experiment, it's communication-optimal. But operations like factorization may not be appropriate to carry out in blocks. Any suggestions and guidance are welcome. Thanks, Yuxi
Re: Quantile regression in tree models
Hi Alessandro,

I think absolute error as a splitting criterion might be feasible with the current architecture -- the sufficient statistics we collect currently might be able to support it. Could you let us know scenarios where absolute error has significantly outperformed squared error for regression trees? Also, what's your use case that makes squared error undesirable?

For gradient boosting, you are correct. The weak hypothesis weights refer to the tree predictions in each of the branches. We plan to explain this in the 1.2 documentation and maybe add some more clarifications to the Javadoc. I will try to search for JIRAs or create new ones and update this thread.

-Manish

On Monday, November 17, 2014, Alessandro Baretta alexbare...@gmail.com wrote:
Manish, Thanks for pointing me to the relevant docs. It is unfortunate that absolute error is not supported yet. I can't seem to find a JIRA for it. By the looks of it, the GradientBoosting API would support an absolute-error-type loss function to perform quantile regression, except for the weak hypothesis weights. Does this refer to the weights of the leaves of the trees?
Alex
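For context on the "weak hypothesis weights" under discussion, this is the standard formulation from the cited Friedman (1999) paper (background math, not a description of MLlib's code):

```latex
% Stage m of gradient boosting adds the new tree h_m with a weight \gamma_m:
F_m(x) = F_{m-1}(x) + \gamma_m \, h_m(x),
\qquad
\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{N} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma \, h_m(x_i)\bigr)
% For squared error this line search has a closed form; for absolute error
% the per-leaf optimum is a median of the residuals in that leaf, which is
% the piece the quoted Javadoc warns is not yet computed correctly.
```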
Re: matrix computation in spark
Hey Yuxi,

We have also implemented a distributed matrix multiplication library in PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin We implemented three distributed matrix multiplication algorithms on Spark. As we see it, communication-optimal does not always mean total-optimal. Thus, besides the CARMA matrix multiplication you mentioned, we also implemented Block-splitting matrix multiplication and Broadcast matrix multiplication. They are more efficient than the CARMA matrix multiplication in some situations, for example when a large matrix multiplies a small matrix.

Actually, we shared this work at the Spark Meetup@Beijing on October 26th (http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/). The slides can be downloaded from the archive here: http://pan.baidu.com/s/1dDoyHX3#path=%252Fmeetup-3rd

Best, Rong

2014-11-18 11:36 GMT+08:00 Zongheng Yang zonghen...@gmail.com:
There's been some work at the AMPLab on a distributed matrix library on top of Spark; see here [1]. In particular, the repo contains a couple of factorization algorithms.
[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi liaoy...@huawei.com wrote:
Hi, Matrix computation is critical for the efficiency of algorithms like least squares, the Kalman filter, and so on.

--
Rong Gu
Department of Computer Science and Technology
State Key Laboratory for Novel Software Technology
Nanjing University
Phone: +86 15850682791
Email: gurongwal...@gmail.com
Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/
Re: matrix computation in spark
Hi,

I checked the work in ml-matrix. For now, it doesn't include matrix multiply or LU decomposition. What's your plan? Can we contribute our work to these parts? Also, the number of row/column blocks is decided manually. As we mentioned, the CARMA method in the paper is communication-optimal.

From: Zongheng Yang [mailto:zonghen...@gmail.com]
Sent: 2014-11-18 11:37
To: liaoyuxi; d...@spark.incubator.apache.org
Cc: Shivaram Venkataraman
Subject: Re: matrix computation in spark

There's been some work at the AMPLab on a distributed matrix library on top of Spark; see here [1]. In particular, the repo contains a couple of factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi liaoy...@huawei.com wrote:
Hi, Matrix computation is critical for the efficiency of algorithms like least squares, the Kalman filter, and so on.
Re: matrix computation in spark
Hi Yuxi,

We are integrating the ml-matrix from the AMPLab repo into MLlib, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-3434

We already have matrix multiply but are missing LU decomposition. Could you please track that JIRA? Once the initial design is in, we can sync on how to contribute LU decomposition. Let's move the discussion to the JIRA. Thanks!

On Mon, Nov 17, 2014 at 9:49 PM, 顾荣 gurongwal...@gmail.com wrote:
Hey Yuxi, We have also implemented a distributed matrix multiplication library in PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin