Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Andrew Or
This seems like a legitimate blocker. We will cut another RC to include the revert. 2014-11-16 17:29 GMT-08:00 Kousuke Saruta saru...@oss.nttdata.co.jp: I've now finished the revert for SPARK-4434 and opened a PR. (2014/11/16 17:08), Josh Rosen wrote: -1 I found a potential regression in

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Debasish Das
Andrew, I put up 1.1.1 branch and I am getting shuffle failures while doing flatMap followed by groupBy...My cluster memory is less than the memory I need and therefore flatMap does around 400 GB of shuffle...memory is around 120 GB... 14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in

Re: mvn or sbt for studying and developing Spark?

2014-11-17 Thread Michael Armbrust
* I moved from sbt to maven in June specifically due to Andrew Or's describing mvn as the default build tool. Developers should keep in mind that Jenkins uses mvn, so we need to run mvn before submitting PRs, even if sbt is used for day-to-day dev work. To be clear, I think that the PR

Re: mvn or sbt for studying and developing Spark?

2014-11-17 Thread Nicholas Chammas
The docs on using sbt are here: https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt They'll be published with 1.2.0 presumably. On Mon, Nov 17, 2014 at 2:49 PM Michael Armbrust mich...@databricks.com wrote: * I moved from sbt to maven in June specifically due

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Kevin Markey
+0 (non-binding) Compiled Spark, recompiled and ran application with 1.1.1 RC1 with Yarn, plain-vanilla Hadoop 2.3.0. No regressions. However, 12% to 22% increase in run time relative to 1.0.0 release. (No other environment or configuration changes.) Would have recommended +1 were it not

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Patrick Wendell
Hey Kevin, If you are upgrading from 1.0.X to 1.1.X, check out the upgrade notes here [1] - it could be that default changes caused a regression for your workload. Do you still see a regression if you restore the configuration changes? It's great to hear specifically about issues like this, so

Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro, MLlib v1.1 supports variance for regression and gini impurity and entropy for classification. http://spark.apache.org/docs/latest/mllib-decision-tree.html If the information gain calculation can be performed by distributed aggregation then it might be possible to plug it into the

[VOTE][RESULT] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Andrew Or
This is canceled in favor of RC2 with the following blockers: https://issues.apache.org/jira/browse/SPARK-4434 https://issues.apache.org/jira/browse/SPARK-3633 The latter one involves a regression from 1.0.2 to 1.1.0, NOT from 1.1.0 to 1.1.1. For this reason, we are currently investigating this

Re: Quantile regression in tree models

2014-11-17 Thread Alessandro Baretta
Manish, Thanks for pointing me to the relevant docs. It is unfortunate that absolute error is not supported yet. I can't seem to find a Jira for it. Now, here's what the comments say in the current master branch: /** * :: Experimental :: * A class that implements Stochastic Gradient

Using sampleByKey

2014-11-17 Thread Debasish Das
Hi, I have an RDD whose key is a userId and whose value is (movieId, rating)... I want to sample 80% of the (movieId, rating) pairs that each userId has seen for training; the rest is for testing... val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))} val keyedRatings = indexedRating.map{x =
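The per-user split asked about here can be sketched in plain Scala collections, without Spark, to show the sampling logic only. This is a minimal sketch under stated assumptions: the `Rating` fields, the `PerUserSplit` object, and the `split` helper are hypothetical names, not taken from the message. On an RDD of (userId, (movieId, rating)) pairs, the analogous Spark call would be `sampleByKey(withReplacement = false, fractions, seed)`, which likewise samples each record independently, so each user's 80% is approximate; `sampleByKeyExact` trades extra passes over the data for exact per-key counts.

```scala
import scala.util.Random

// Hypothetical record type standing in for the Rating used in the message.
final case class Rating(user: Int, product: Int, rating: Double)

object PerUserSplit {
  /** Splits ratings into (train, test) by sampling each record independently,
    * using the fraction configured for that record's user.
    * Users missing from `fractions` contribute nothing to the training set.
    */
  def split(
      ratings: Seq[Rating],
      fractions: Map[Int, Double],
      seed: Long): (Vector[Rating], Vector[Rating]) = {
    val rng = new Random(seed)
    // Draw exactly one coin flip per rating, then partition on the stored flag.
    val tagged = ratings.toVector.map { r =>
      (r, rng.nextDouble() < fractions.getOrElse(r.user, 0.0))
    }
    val (train, test) = tagged.partition(_._2)
    (train.map(_._1), test.map(_._1))
  }
}
```

For the 80/20 case, `fractions` would map every distinct userId to 0.8; the test set is simply whatever the sampling step did not select.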

matrix computation in spark

2014-11-17 Thread liaoyuxi
Hi, Matrix computation is critical to the efficiency of algorithms such as least squares, the Kalman filter, and so on. For now, the mllib module offers limited linear algebra on matrices, especially for distributed matrices. We have been working on establishing distributed matrix computation APIs based on data

Re: matrix computation in spark

2014-11-17 Thread Zongheng Yang
There's been some work at the AMPLab on a distributed matrix library on top of Spark; see here [1]. In particular, the repo contains a couple factorization algorithms. [1] https://github.com/amplab/ml-matrix Zongheng On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi liaoy...@huawei.com wrote: Hi,

Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro, I think absolute error as splitting criterion might be feasible with the current architecture -- I think the sufficient statistics we collect currently might be able to support this. Could you let us know scenarios where absolute error has significantly outperformed squared error

Re: matrix computation in spark

2014-11-17 Thread 顾荣
Hey Yuxi, We have also implemented a distributed matrix multiplication library in PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We implemented three distributed matrix multiplication algorithms on Spark. As we see it, communication-optimal does not always mean the

Re: matrix computation in spark

2014-11-17 Thread liaoyuxi
Hi, I checked the work of ml-matrix. For now, it doesn’t include matrix multiply and LU decomposition. What’s your plan? Can we contribute our work to these parts? Otherwise, the number of row/column blocks is decided manually. As we mentioned, the CARMA method in the paper is communication-optimal.

Re: matrix computation in spark

2014-11-17 Thread Reza Zadeh
Hi Yuxi, We are integrating the ml-matrix from the AMPLab repo into MLlib, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-3434 We already have matrix multiply, but are missing LU decomposition. Could you please track that JIRA? Once the initial design is in, we can sync on how