Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-04 Thread Xiangrui Meng
Hi Deb, I've been working with David to add or enhance some features to breeze to make its performance comparable to bare-bone implementations. I'm going to update that PR this week with sparse support to KMeans. You are certainly welcome to update the GLM part. Make sure you are using the master

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-04 Thread Xiangrui Meng
Hi DB, I saw you released the L-BFGS code under com.dbtsai.lbfgs on maven central, so I assume that Robert (the author of RISO) is not going to maintain it. Is it correct? For the breeze implementation, do you mind sharing more details about the issues you have? I saw the hack you did to get

Re: ALS solve.solvePositive

2014-03-06 Thread Xiangrui Meng
If the matrix is very ill-conditioned, then A^T A becomes numerically rank deficient. However, if you use a reasonably large positive regularization constant (lambda), A^T A + lambda I should be still positive definite. What was the regularization constant (lambda) you set? Could you test whether

Re: ALS solve.solvePositive

2014-03-10 Thread Xiangrui Meng
. Deb On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com wrote: If the matrix is very ill-conditioned, then A^T A becomes numerically rank deficient. However, if you use a reasonably large positive regularization constant (lambda), A^T A + lambda I should be still positive

Moving MLlib JIRA tickets to Spark

2014-03-10 Thread Xiangrui Meng
Hi all, I'm going to move all MLlib JIRA tickets (https://spark-project.atlassian.net/browse/MLLIB) to Spark because we can migrate only one project to Apache JIRA. Please create new MLlib JIRA tickets under Spark in the future and set the component to MLlib. Thanks, Xiangrui

Re: Moving MLlib JIRA tickets to Spark

2014-03-10 Thread Xiangrui Meng
Done. The original urls should work as well, so you don't need to update the url in github. -Xiangrui On Mon, Mar 10, 2014 at 6:20 PM, Xiangrui Meng men...@gmail.com wrote: Hi all, I'm going to move all MLlib JIRA tickets (https://spark-project.atlassian.net/browse/MLLIB) to Spark because we

Re: ALS solve.solvePositive

2014-03-11 Thread Xiangrui Meng
Hi Deb, did you use ALS with implicit feedback? -Xiangrui On Mon, Mar 10, 2014 at 1:17 PM, Xiangrui Meng men...@gmail.com wrote: Choosing lambda = 0.1 shouldn't lead to the error you got. This is probably a bug. Do you mind sharing a small amount of data that can re-produce the error

Re: ALS solve.solvePositive

2014-03-19 Thread Xiangrui Meng
... On Mar 11, 2014 7:02 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, did you use ALS with implicit feedback? -Xiangrui On Mon, Mar 10, 2014 at 1:17 PM, Xiangrui Meng men...@gmail.com wrote: Choosing lambda = 0.1 shouldn't lead to the error you got. This is probably a bug. Do you mind

Re: ALS solve.solvePositive

2014-03-19 Thread Xiangrui Meng
to ALS improvements ? Are they all added to the master ? There are at least 3 PRs that Sean and you contributed recently related to ALS efficiency. A JIRA or gist will definitely help a lot. Thanks. Deb On Wed, Mar 19, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote: Another question

Re: ArrayIndexOutOfBoundsException in ALS.implicit

2014-03-28 Thread Xiangrui Meng
Hi bearrito, This is a known issue (https://spark-project.atlassian.net/browse/SPARK-1281) and it should be easy to fix by switching to a hash partitioner. CC'ed dev list in case someone volunteers to work on it. Best, Xiangrui On Thu, Mar 27, 2014 at 8:38 PM, bearrito

Re: ALS array index out of bound with 50 factors

2014-04-06 Thread Xiangrui Meng
Hi Deb, Are you using the master branch or a particular commit? Do you have negative or out-of-integer-range user or product ids? There is an issue with ALS' partitioning (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not sure whether that is the reason. Could you try to see

Re: Any suggestion about JIRA 1006 MLlib ALS gets stack overflow with too many iterations?

2014-04-06 Thread Xiangrui Meng
Btw, explicit ALS doesn't need persist because each intermediate factor is only used once. -Xiangrui On Sun, Apr 6, 2014 at 9:13 PM, Xiangrui Meng men...@gmail.com wrote: The persist used in implicit ALS doesn't help StackOverflow problem. Persist doesn't cut lineage. We need to call count

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Xiangrui Meng
fine and I can generate factors... With 10 iterations run fails with array index out of bound... 25m users and 3m products are within int limits Does it help if I can point the logs for both the runs to you ? I will debug it further today... On Apr 7, 2014 9:54 AM, Xiangrui Meng men

Re: feature selection and sparse vector support

2014-04-10 Thread Xiangrui Meng
Hi Ignacio, Please create a JIRA and send a PR for the information gain computation, so it is easy to track the progress. The sparse vector support for NaiveBayes is already implemented in branch-1.0 and master. You only need to provide an RDD of sparse vectors (created from Vectors.sparse).

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
+1 on Sean's comment. MLlib covers the basic algorithms but we definitely need to spend more time on how to make the design scalable. For example, think about current ProblemWithAlgorithm naming scheme. That being said, new algorithms are welcomed. I wish they are well-established and

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
at this. On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com wrote: +1 on Sean's comment. MLlib covers the basic algorithms but we definitely need to spend more time on how to make the design scalable. For example, think about current ProblemWithAlgorithm naming scheme

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
The markdown files are under spark/docs. You can submit a PR for changes. -Xiangrui On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com wrote: How do I get permissions to edit the wiki? On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote: Cannot agree more

Re: ArrayIndexOutOfBoundsException in ALS.implicit

2014-04-23 Thread Xiangrui Meng
Hi bearrito, this issue was fixed by Tor in https://github.com/apache/spark/pull/407. You can either try the master branch or wait for the 1.0 release. -Xiangrui On Fri, Mar 28, 2014 at 12:19 AM, Xiangrui Meng men...@gmail.com wrote: Hi bearrito, This is a known issue (https://spark

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
I don't think it is easy to make sparse faster than dense with this sparsity and feature dimension. You can try rcv1.binary, which should show the difference easily. David, the breeze operators used here are 1. DenseVector dot SparseVector 2. axpy DenseVector SparseVector However, the

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
is that in the benchmark code, after you call cache, you should also call count() to materialize the RDD. I saw in the result, the real difference is actually at the first step. Adding intercept is not a cheap operation for sparse vectors. Best, Xiangrui On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng men...@gmail.com

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
rcv1.binary which only has 0.15% non-zero elements to verify the hypothesis. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng men

Re: mllib vector templates

2014-05-05 Thread Xiangrui Meng
I fixed index type and value type to make things simple, especially when we need to provide Java and Python APIs. For raw features and feature transformations, we should allow generic types. -Xiangrui On Mon, May 5, 2014 at 3:40 PM, DB Tsai dbt...@stanford.edu wrote: David, Could we use Int,

Re: LabeledPoint dump LibSVM if SparseVector

2014-05-12 Thread Xiangrui Meng
Hi Deb, There is a saveAsLibSVMFile in MLUtils now. Also, I submitted a PR for standardizing text format of vectors and labeled point: https://github.com/apache/spark/pull/685 Best, Xiangrui On Sun, May 11, 2014 at 9:40 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I need to change

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-16 Thread Xiangrui Meng
, May 16, 2014 at 4:00 PM, Mridul Muralidharan mri...@gmail.com wrote: Effectively this is persist without fault tolerance. Failure of any node means complete lack of fault tolerance. I would be very skeptical of truncating lineage if it is not reliable. On 17-May-2014 3:49 am, Xiangrui Meng (JIRA

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Xiangrui Meng
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870 DB, could you add more info to that JIRA? Thanks! -Xiangrui On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote: Btw, I tried rdd.map { i => System.getProperty("java.class.path") }.collect() but didn't

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Xiangrui Meng
be great. On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote: I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870 DB, could you add more info to that JIRA? Thanks! -Xiangrui On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote: Btw, I tried

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Xiangrui Meng
, 2014 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote: I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870 DB, could you add more info to that JIRA? Thanks! -Xiangrui On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote: Btw, I tried rdd.map { i

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-20 Thread Xiangrui Meng
Talked with Sandy and DB offline. I think the best solution is sending the secondary jars to the distributed cache of all containers rather than just the master, and set the classpath to include spark jar, primary app jar, and secondary jars before executor starts. In this way, user only needs to

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Xiangrui Meng
:59 PM, Xiangrui Meng men...@gmail.com wrote: Talked with Sandy and DB offline. I think the best solution is sending the secondary jars to the distributed cache of all containers rather than just the master, and set the classpath to include spark jar, primary app jar, and secondary jars

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Xiangrui Meng
instantiating dynamic classes, but I think it's weird that this code would work on Spark standalone but not on YARN. -Sandy On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng men...@gmail.com wrote: I think adding jars dynamically should work as long as the primary jar and the secondary jars do

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-22 Thread Xiangrui Meng
Hi DB, I found it is a little hard to implement the solution I mentioned: Do not send the primary jar and secondary jars to executors' distributed cache. Instead, add them to spark.jars in SparkSubmit and serve them via http by calling sc.addJar in SparkContext. If you look at

Re: Contributions to MLlib

2014-05-22 Thread Xiangrui Meng
Hi Meethu, Thanks for asking! Scala is the native language in Spark. Implementing algorithms in Scala can utilize the full power of Spark Core. Also, Scala's syntax is very concise. Implementing ML algorithms using different languages would increase the maintenance cost. However, there are still

Re: LogisticRegression: Predicting continuous outcomes

2014-05-28 Thread Xiangrui Meng
Please find my comments inline. -Xiangrui On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: I'm looking to reuse the LogisticRegression model (with SGD) to predict a real-valued outcome variable. (I understand that logistic regression is generally applied to

Re: Standard preprocessing/scaling

2014-05-28 Thread Xiangrui Meng
RowMatrix has a method to compute column summary statistics. There is a trade-off here because centering may densify the data. A utility function that centers data would be useful for dense datasets. -Xiangrui On Wed, May 28, 2014 at 5:03 AM, dataginjaninja rickett.stepha...@gmail.com wrote: I
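The densification trade-off mentioned above can be illustrated with a small sketch. It uses plain Python dictionaries as a stand-in for sparse vectors and is not MLlib code; the data and the helper name `center` are made up for illustration.

```python
# Sparse rows stored as {column index: value}; absent entries are zeros.
rows = [{0: 3.0}, {2: 6.0}, {0: 3.0, 1: 9.0}]
num_cols = 3

# Column means, the kind of column summary statistic mentioned above.
means = [sum(r.get(j, 0.0) for r in rows) / len(rows) for j in range(num_cols)]

def center(row):
    """Subtract the column means, keeping only the nonzero results."""
    dense = [row.get(j, 0.0) - means[j] for j in range(num_cols)]
    return {j: v for j, v in enumerate(dense) if v != 0.0}

nnz_before = sum(len(r) for r in rows)          # stored entries before centering
nnz_after = sum(len(center(r)) for r in rows)   # stored entries after centering
```

In this toy dataset every zero entry becomes nonzero once a nonzero column mean is subtracted, so the number of stored entries grows from 4 to the full 3 x 3 = 9.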

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Xiangrui Meng
+1 Tested apps with standalone client mode and yarn cluster and client modes. Xiangrui On Wed, May 28, 2014 at 1:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark and poured data through

Which version does the binary compatibility test against by default?

2014-06-02 Thread Xiangrui Meng
Is there a way to specify the target version? -Xiangrui

Re: Constraint Solver for Spark

2014-06-05 Thread Xiangrui Meng
Hi Deb, Why do you want to make those methods public? If you only need to replace the solver for subproblems, you can try to make the solver pluggable. Now it supports least squares and non-negative least squares. You can define an interface for the subproblem solvers and maintain the IPM solver

Re: Constraint Solver for Spark

2014-06-06 Thread Xiangrui Meng
I don't quite understand why putting linear constraints can promote orthogonality. For the interfaces, if the subproblem is determined by Y^T Y and Y^T b for each iteration, then the least squares solver, the non-negative least squares solver, or your convex solver is simply a function (A, b) -

Re: Constraint Solver for Spark

2014-06-11 Thread Xiangrui Meng
be added to the classpath... if it can be then definitely we should add these in ALS.scala... Thanks. Deb On Thu, Jun 5, 2014 at 11:31 PM, Xiangrui Meng men...@gmail.com wrote: I don't quite understand why putting linear constraints can promote orthogonality. For the interfaces

Re: Constraint Solver for Spark

2014-06-11 Thread Xiangrui Meng
ranks are high... But seems like that's not possible without a broadcast step which might kill all the runtime gain... On Wed, Jun 11, 2014 at 12:21 AM, Xiangrui Meng men...@gmail.com wrote: For explicit feedback, ALS uses only observed ratings for computation. So XtXs are not the same

Re: Checkpointed RDD still causing StackOverflow

2014-06-23 Thread Xiangrui Meng
Calling checkpoint() alone doesn't cut the lineage. It only marks the RDD as to be checkpointed. The lineage is cut after the first time this RDD is materialized. You see StackOverflow because the lineage is still there. -Xiangrui On Sun, Jun 22, 2014 at 6:37 PM, dash b...@nd.edu wrote: Hi
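A toy model may help make the mark-then-materialize behaviour concrete. The `ToyRDD` class below is entirely hypothetical and is not Spark's API: it only imitates the idea that checkpoint() sets a flag and that the lineage is dropped the first time the data is computed. Unlike Spark, it walks the lineage iteratively, so it does not itself overflow the stack.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers its parent and a transformation."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data = data          # materialized values, if any
        self.parent = parent      # previous dataset in the lineage
        self.fn = fn              # function applied to the parent's values
        self.marked = False       # set by checkpoint(), acted on by count()

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def checkpoint(self):
        # Only marks the dataset; nothing is computed or truncated here.
        self.marked = True

    def _compute(self):
        # Walk back to the nearest materialized ancestor, then replay forward.
        chain, node = [], self
        while node.data is None:
            chain.append(node.fn)
            node = node.parent
        values = list(node.data)
        for fn in reversed(chain):
            values = [fn(v) for v in values]
        return values

    def count(self):
        values = self._compute()
        if self.marked:
            # First materialization after marking: keep the data, drop the lineage.
            self.data, self.parent, self.fn = values, None, None
            self.marked = False
        return len(values)

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


rdd = ToyRDD(data=[1, 2, 3])
for _ in range(1000):
    rdd = rdd.map(lambda v: v + 1)

rdd.checkpoint()
depth_after_mark = rdd.lineage_depth()    # lineage is still intact
n = rdd.count()                           # materializes and truncates
depth_after_count = rdd.lineage_depth()
```

In the toy model the lineage depth stays at 1000 after checkpoint() and only drops to 0 after count(), which mirrors the advice to materialize the RDD once it has been marked.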

Re: Constraint Solver for Spark

2014-07-02 Thread Xiangrui Meng
paper also shows very similar results compared to CVX: http://web.stanford.edu/~boyd/papers/pdf/prox_algs.pdf Thanks. Deb On Wed, Jun 11, 2014 at 3:21 PM, Xiangrui Meng men...@gmail.com wrote: Your idea is close to what implicit feedback does. You can check the paper, which is short

Re: Constraint Solver for Spark

2014-07-07 Thread Xiangrui Meng
Hey Deb, If your goal is to solve the subproblems in ALS, exploring sparsity doesn't give you much benefit because the data is small and dense. Porting either ECOS's or PDCO's implementation but using dense representation should be sufficient. Feel free to open a JIRA and we can move our

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-07 Thread Xiangrui Meng
+1 Ran mllib examples. On Sun, Jul 6, 2014 at 1:21 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac OS X. Matei On Jul 6, 2014, at 1:54 AM, Andrew Or and...@databricks.com wrote: +1, verified that the UI bug is in fact fixed in

Re: Contribution to MLlib

2014-07-09 Thread Xiangrui Meng
I don't know if anyone is working on it either. If that JIRA is not moved to Apache JIRA, feel free to create a new one and make a note that you are working on it. Thanks! -Xiangrui On Wed, Jul 9, 2014 at 4:56 AM, RJ Nowling rnowl...@gmail.com wrote: Hi Meethu, There is no code for a Gaussian

Re: libgfortran Dependency

2014-07-09 Thread Xiangrui Meng
It is documented in the official doc: http://spark.apache.org/docs/latest/mllib-guide.html On Wed, Jul 9, 2014 at 7:35 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: Hi, After testing Spark 1.0.1-RC2 on EC2 instances from the standard Ubuntu and Amazon Linux AMIs, I've noticed the MLlib's

[VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
Please vote on releasing the following candidate as Apache Spark version 0.9.2! The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b The release files, including signatures, digests, etc.

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
I start the voting with a +1. Ran tests on the release candidates and some basic operations in spark-shell and pyspark (local and standalone). -Xiangrui On Thu, Jul 17, 2014 at 3:16 AM, Xiangrui Meng men...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-18 Thread Xiangrui Meng
...@databricks.com wrote: +1 On Thursday, July 17, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac, verified CHANGES.txt is good, verified several of the bug fixes. Matei On Jul 17, 2014, at 11:12 AM, Xiangrui Meng men...@gmail.com wrote: I start

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-22 Thread Xiangrui Meng
Hi all, The vote has passed with 7 +1 votes (4 binding) and 0 -1 vote: +1: Xiangrui Meng* Matei Zaharia* DB Tsai Reynold Xin* Patrick Wendell* Andrew Or Sean McNamara I'm closing this vote and going to package v0.9.2 today. Thanks everyone for voting! Best, Xiangrui On Fri, Jul 18, 2014 at 9

Announcing Spark 0.9.2

2014-07-23 Thread Xiangrui Meng
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is a maintenance release with bug fixes across several areas of Spark, including Spark Core, PySpark, MLlib, Streaming, and GraphX. We recommend all 0.9.x users to upgrade to this stable release. Contributions to this release came

Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-28 Thread Xiangrui Meng
+1 Tested basic spark-shell and pyspark operations and MLlib examples on a Mac. On Mon, Jul 28, 2014 at 8:29 PM, Mubarak Seyed spark.devu...@gmail.com wrote: +1 (non-binding) Tested this on Mac OS X. On Mon, Jul 28, 2014 at 6:52 PM, Andrew Or and...@databricks.com wrote: +1 Tested on

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-02 Thread Xiangrui Meng
You can try enabling spark.files.userClassPathFirst. But I'm not sure whether it could solve your problem. -Xiangrui On Sat, Aug 2, 2014 at 10:13 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I have deployed spark stable 1.0.1 on the cluster but I have new code that I added in

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-02 Thread Xiangrui Meng
with the rest of the application code ? On Sat, Aug 2, 2014 at 10:46 AM, Xiangrui Meng men...@gmail.com wrote: You can try enabling spark.files.userClassPathFirst. But I'm not sure whether it could solve your problem. -Xiangrui On Sat, Aug 2, 2014 at 10:13 AM, Debasish Das debasish.da...@gmail.com

Re: -1s on pull requests?

2014-08-05 Thread Xiangrui Meng
I think the build number is included in the SparkQA message, for example: https://github.com/apache/spark/pull/1788 The build number 17941 is in the URL https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17941/consoleFull. Just need to be careful to match the number. Another

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-05 Thread Xiangrui Meng
, there might be bugs in it... Any suggestions will be appreciated Thanks. Deb On Sat, Aug 2, 2014 at 11:12 AM, Xiangrui Meng men...@gmail.com wrote: Yes, that should work. spark-mllib-1.1.0 should be compatible with spark-core-1.0.1. On Sat, Aug 2, 2014 at 10:54 AM, Debasish Das

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Xiangrui Meng
like that in make-distribution script ? Thanks. Deb On Tue, Aug 5, 2014 at 10:37 AM, Xiangrui Meng men...@gmail.com wrote: If you cannot change the Spark jar deployed on the cluster, an easy solution would be renaming ALS in your jar. If userClassPathFirst doesn't work, could you create

Re: Welcoming two new committers

2014-08-08 Thread Xiangrui Meng
Congrats, Joey & Andrew!! -Xiangrui On Fri, Aug 8, 2014 at 12:14 AM, Christopher Nguyen c...@adatao.com wrote: +1 Joey & Andrew :) -- Christopher T. Nguyen Co-founder & CEO, Adatao http://adatao.com [ah-'DAY-tao] linkedin.com/in/ctnguyen On Thu, Aug 7, 2014 at 10:39 PM, Joseph Gonzalez

Re: Lost executor on YARN ALS iterations

2014-08-20 Thread Xiangrui Meng
Hi Deb, I think this may be the same issue as described in https://issues.apache.org/jira/browse/SPARK-2121 . We know that the container got killed by YARN because it used much more memory than it requested. But we haven't figured out the root cause yet. +Sandy Best, Xiangrui On Tue, Aug 19,

Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Xiangrui Meng
+1. Tested some MLlib example code. For default changes, maybe it is useful to mention the default broadcast factory changed to torrent. On Wed, Sep 3, 2014 at 12:34 AM, Michael Armbrust mich...@databricks.com wrote: +1 On Wed, Sep 3, 2014 at 12:29 AM, Reynold Xin r...@databricks.com wrote:

Re: Is breeze thread safe in Spark?

2014-09-03 Thread Xiangrui Meng
RJ, could you provide a code example that can reproduce the bug you observed in local testing? Breeze's += is not thread-safe. But in a Spark job, calls to a resultHandler are synchronized:

Re: [mllib] Add multiplying large scale matrices

2014-09-08 Thread Xiangrui Meng
Sorry for my late reply! I'm also very interested in the implementation of distributed matrix multiplication. As Shivaram mentioned, the communication is the concern here. But maybe we can start with a reasonable implementation and then iterate on its performance. It would be great if eventually

Re: Adding abstraction in MLlib

2014-09-12 Thread Xiangrui Meng
Hi Egor, Thanks for the feedback! We are aware of some of the issues you mentioned and there are JIRAs created for them. Specifically, I'm pushing out the design on pipeline features and algorithm/model parameters this week. We can move our discussion to

Re: why does BernoulliSampler class use a lower and upper bound?

2014-09-15 Thread Xiangrui Meng
It is also used in RDD.randomSplit. -Xiangrui On Mon, Sep 15, 2014 at 4:23 PM, Erik Erlandson e...@redhat.com wrote: I'm climbing under the hood in there for SPARK-3250, and I see this: override def sample(items: Iterator[T]): Iterator[T] = { items.filter { item => val x =
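One way to see why a sampler with a lower and an upper bound is useful for splitting is the sketch below. It is a plain-Python illustration, not Spark's BernoulliSampler or RDD.randomSplit; the function names `range_sample` and `random_split` are hypothetical. It assumes each item gets one uniform draw and is kept when the draw falls inside the sampler's range, and that every split reuses the same seed.

```python
import random

def range_sample(items, lower, upper, seed):
    """Keep items whose per-item uniform draw x satisfies lower <= x < upper."""
    rng = random.Random(seed)
    kept = []
    for item in items:
        x = rng.random()
        if lower <= x < upper:
            kept.append(item)
    return kept

def random_split(items, weights, seed):
    """Split items into disjoint parts by giving each part a slice of [0, 1)."""
    total = float(sum(weights))
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    bounds[-1] = 1.0  # guard against rounding in the cumulative sum
    # Same seed for every part, so each item gets the same draw in every pass.
    return [range_sample(items, lo, hi, seed) for lo, hi in zip(bounds, bounds[1:])]

data = list(range(1000))
train, test = random_split(data, [0.8, 0.2], seed=42)
```

Because the ranges tile [0, 1) and the draws are identical across passes, every item lands in exactly one part, which a single acceptance probability could not guarantee.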

Re: Adding abstraction in MLlib

2014-09-17 Thread Xiangrui Meng
Hi Egor, I posted the design doc for pipeline and parameters on the JIRA. Now I'm trying to work out some details of ML datasets, which I will post later this week. Your feedback is welcome! Best, Xiangrui On Mon, Sep 15, 2014 at 12:44 AM, Reynold Xin r...@databricks.com wrote: Hi Egor,

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Xiangrui Meng
The test accuracy doesn't mean the total loss. All points between (-1, 1) can separate points -1 and +1 and give you 1.0 accuracy, but their corresponding losses are different. -Xiangrui On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi We have used LogisticRegression
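The point about accuracy versus loss can be checked numerically with a minimal sketch. It assumes a one-dimensional logistic model with a fixed slope and a movable threshold; the two data points follow the example in the message, while the chosen thresholds and the helper names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two training points: x = -1 with label 0 and x = +1 with label 1.
points = [(-1.0, 0), (1.0, 1)]

def accuracy(threshold):
    """Fraction of points classified correctly by the rule x > threshold."""
    return sum((x > threshold) == bool(y) for x, y in points) / len(points)

def log_loss(threshold, weight=1.0):
    """Mean logistic loss of the model p(y=1|x) = sigmoid(weight * (x - threshold))."""
    total = 0.0
    for x, y in points:
        p = sigmoid(weight * (x - threshold))
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(points)

# Every threshold in (-1, 1) separates the two points perfectly.
results = {t: (accuracy(t), log_loss(t)) for t in (-0.5, 0.0, 0.5)}
```

All three thresholds reach accuracy 1.0, yet the loss at threshold 0.0 is lower than at 0.5, so two optimizers can agree on accuracy while stopping at different weights.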

Re: Cluster tests failing

2014-09-30 Thread Xiangrui Meng
Try to build the assembly jar first. ClusterSuite uses local-cluster mode, which requires the assembly jar. -Xiangrui On Tue, Sep 30, 2014 at 8:23 AM, Debasish Das debasish.da...@gmail.com wrote: I have done mvn clean several times... Consistently all the mllib tests that are using

Re: Breeze Library usage in Spark

2014-10-03 Thread Xiangrui Meng
Did you add a different version of breeze to the classpath? In Spark 1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze version you used is different from the one that comes with Spark, you might see class not found. -Xiangrui On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch

Re: Standardized Distance Functions in MLlib

2014-10-08 Thread Xiangrui Meng
Hi Yu, We upgraded breeze to 0.10 yesterday. So we can call the distance functions you contributed to breeze easily. We don't want to maintain another copy of the implementation in MLlib to keep the maintenance cost low. Both spark and breeze are open-source projects. We should try our best to

Re: Issues with ALS positive definite

2014-10-16 Thread Xiangrui Meng
Do not use lambda=0.0. Use a small positive number instead. Cholesky factorization doesn't work on positive semi-definite systems with zero eigenvalues. -Xiangrui On Wed, Oct 15, 2014 at 5:05 PM, Debasish Das debasish.da...@gmail.com wrote: But do you expect the mllib code to fail if I run with 0.0 regularization ?
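The effect of the regularization constant on the factorization can be reproduced with a small hand-written Cholesky routine. This is a minimal sketch in plain Python, not the solver MLlib calls; the matrix `A` and the helper names `cholesky` and `gram` are made up for illustration.

```python
import math

def cholesky(m):
    """Return lower-triangular L with L L^T = m, or raise ValueError
    if m is not (numerically) positive definite."""
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def gram(a, lam):
    """Compute A^T A + lam * I for a row-major matrix a."""
    cols = len(a[0])
    return [[sum(row[i] * row[j] for row in a) + (lam if i == j else 0.0)
             for j in range(cols)] for i in range(cols)]

# Two identical columns make A^T A singular (one zero eigenvalue).
A = [[1.0, 1.0]] * 4

try:
    cholesky(gram(A, 0.0))
    failed_without_reg = False
except ValueError:
    failed_without_reg = True

# A small positive lambda shifts every eigenvalue up by lambda.
L = cholesky(gram(A, 0.1))
```

With lambda = 0.0 the second pivot is exactly zero and the factorization stops; with lambda = 0.1 both pivots are positive and the factor exists.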

Re: NNLS bug

2014-10-17 Thread Xiangrui Meng
Thanks for reporting the bug! I will take a look. -Xiangrui On Thu, Oct 16, 2014 at 11:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am validating the proximal algorithm for positive and bound constrained ALS and I came across the bug detailed in the JIRA while running ALS with

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-10-21 Thread Xiangrui Meng
Hi Ashutosh, The process you described is correct, with details documented in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . There is no outlier detection algorithm in MLlib. Before you start coding, please open a JIRA and let's discuss which algorithms are appropriate

Re: PR for Hierarchical Clustering Needs Review

2014-10-23 Thread Xiangrui Meng
Hi RJ, We are close to the v1.2 feature freeze deadline, so I'm busy with the pipeline feature and a couple of bugs. I will ask other developers to help review the PR. Thanks for working with Yu and helping the code review! Best, Xiangrui On Thu, Oct 23, 2014 at 2:58 AM, RJ Nowling

Re: matrix factorization cross validation

2014-10-29 Thread Xiangrui Meng
Let's narrow the context from matrix factorization to recommendation via ALS. It adds extra complexity if we treat it as a multi-class classification problem. ALS only outputs a single value for each prediction, which is hard to convert to probability distribution over the 5 rating levels.

Re: OOM when making bins in BinaryClassificationMetrics ?

2014-11-02 Thread Xiangrui Meng
Yes, if there are many distinct values, we need binning to compute the AUC curve. Usually, the scores are not evenly distributed, so we cannot simply truncate the digits. Estimating the quantiles for binning is necessary, similar to RangePartitioner:

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Xiangrui Meng
Was the user present in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Xiangrui Meng
+1 (binding) On Wed, Nov 5, 2014 at 7:52 PM, Mark Hamstra m...@clearstorydata.com wrote: +1 (binding) On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: +1 on this proposal. On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Will these

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
, Xiangrui Meng men...@gmail.com wrote: Was the user present in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
a issue... Any idea how to optimize this so that we can calculate MAP statistics on large samples of data ? On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote: ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map

Re: MLlib related query

2014-11-11 Thread Xiangrui Meng
Searched MLlib on Google Scholar and didn't find any:) MLlib implements well-recognized algorithms, each of which may correspond to a paper or several papers. Please find the reference in the code if you are interested. -Xiangrui On Sat, Nov 8, 2014 at 1:37 AM, Manu Kaul manohar.k...@gmail.com

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
`sampleByKey` with the same fraction per stratum acts the same as `sample`. The operation you want is perhaps `sampleByKeyExact` here. However, when you use stratified sampling, there should not be many strata. My question is why we need to split on each user's ratings. If a user is missing in
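The difference between the two operations can be sketched without Spark. The functions below are plain-Python imitations with hypothetical names, not the PairRDDFunctions implementation: the first keeps each record independently with its key's fraction, the second assumes the exact variant keeps ceil(fraction * count) records per key.

```python
import math
import random
from collections import defaultdict

def sample_by_key(pairs, fractions, seed):
    """Bernoulli sampling: keep each (key, value) with probability fractions[key]."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

def sample_by_key_exact(pairs, fractions, seed):
    """Keep exactly ceil(fractions[key] * n_key) values for every key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    out = []
    for k, values in groups.items():
        size = math.ceil(fractions[k] * len(values))
        out.extend((k, v) for v in rng.sample(values, size))
    return out

pairs = [("a", i) for i in range(100)] + [("b", i) for i in range(10)]
fractions = {"a": 0.5, "b": 0.5}

approx = sample_by_key(pairs, fractions, seed=7)
exact = sample_by_key_exact(pairs, fractions, seed=7)
exact_counts = {k: sum(1 for key, _ in exact if key == k) for k in fractions}
```

With one fraction shared by all keys, the Bernoulli version makes the same per-record decision a plain sample would, so per-key sizes fluctuate; the exact version pays for grouping but returns 50 and 5 records here regardless of the seed.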

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
in a labeled dataset ~ 100 ? On Tue, Nov 18, 2014 at 10:31 AM, Xiangrui Meng men...@gmail.com wrote: `sampleByKey` with the same fraction per stratum acts the same as `sample`. The operation you want is perhaps `sampleByKeyExact` here. However, when you use stratified sampling, there should

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-19 Thread Xiangrui Meng
+1. Checked version numbers and doc. Tested a few ML examples with Java 6 and verified some recently merged bug fixes. -Xiangrui On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or and...@databricks.com wrote: I will start with a +1 2014-11-19 14:51 GMT-08:00 Andrew Or and...@databricks.com: Please

Re: [mllib] useFeatureScaling likes hardcode in LogisticRegressionWithLBFGS and is not comprehensive for users.

2014-11-26 Thread Xiangrui Meng
Hi Yanbo, We scale the model coefficients back after training. So scaling in prediction is not necessary. We had some discussion about this. I'd like to treat feature scaling as part of the feature transformation, and recommend users to apply feature scaling before training. It is a cleaner
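Why scaling at prediction time becomes unnecessary can be checked with a few lines of arithmetic. The sketch assumes features are divided by a per-feature standard deviation without centering; the numbers and variable names are made up and this is not the MLlib code path.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical per-feature standard deviations and weights learned on scaled data.
sigma = [2.0, 0.5, 10.0]
w_scaled = [0.4, -1.5, 3.0]

# Weights mapped back to the original feature space: w_j = w_scaled_j / sigma_j.
w_original = [w / s for w, s in zip(w_scaled, sigma)]

x = [4.0, 1.0, 20.0]                            # a raw, unscaled feature vector
x_scaled = [xi / s for xi, s in zip(x, sigma)]

margin_scaled = dot(w_scaled, x_scaled)         # model applied to scaled features
margin_original = dot(w_original, x)            # rescaled model applied to raw features
```

Both margins agree up to floating-point rounding, since each term w_scaled_j * (x_j / sigma_j) equals (w_scaled_j / sigma_j) * x_j.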

Re: CrossValidator API in new spark.ml package

2014-12-15 Thread Xiangrui Meng
Yes, regularization path could be viewed as training multiple models at once. -Xiangrui On Sat, Dec 13, 2014 at 6:53 AM, DB Tsai dbt...@dbtsai.com wrote: Okay, I got it. In Estimator, fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Seq[M] can be overwritten to implement regularization

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-15 Thread Xiangrui Meng
, 2.6000e+01, 2.0770e+03, 4.e+00, 6.9350e+03]), 0)] I had overwritten the naive bayes example. Will chase the older versions down Cheers k/ On Wed, Dec 3, 2014 at 4:19 PM, Xiangrui Meng men...@gmail.com wrote: Krishna, could you send me some code

Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install

Re: IDF for ml pipeline

2015-02-03 Thread Xiangrui Meng
Yes, we need a wrapper under spark.ml. Feel free to create a JIRA for it. -Xiangrui On Mon, Feb 2, 2015 at 8:56 PM, masaki rikitoku rikima3...@gmail.com wrote: Hi all I am trying the ml pipeline for text classification now. Recently, I succeeded in executing the pipeline processing in ml

Re: DBSCAN for MLlib

2015-01-14 Thread Xiangrui Meng
Please find my comments on the JIRA page. -Xiangrui On Tue, Jan 13, 2015 at 1:49 PM, Muhammad Ali A'råby angelland...@yahoo.com.invalid wrote: I have to say, I have created a Jira task for it: [SPARK-5226] Add DBSCAN Clustering Algorithm to MLlib - ASF JIRA

Re: KNN for large data set

2015-01-21 Thread Xiangrui Meng
For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote: Hi all, Please help me
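
The idea of hashing first and then searching locally can be sketched as follows. This assumes random-hyperplane (sign) hashing, which is only one of several LSH families. The hyperplanes are fixed here for reproducibility, and all names and data are hypothetical.

```python
import math
from collections import defaultdict


def signature(point, hyperplanes):
    """One bit per hyperplane: 1 if the point lies on its non-negative side."""
    return tuple(
        1 if sum(p * h for p, h in zip(point, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(points, hyperplanes):
    """Group point indices into buckets keyed by their hash signature."""
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        buckets[signature(point, hyperplanes)].append(idx)
    return buckets


def approx_knn(query, points, buckets, hyperplanes, k):
    """Exact k-nearest search restricted to the query's own bucket."""
    candidates = buckets.get(signature(query, hyperplanes), [])
    return sorted(candidates, key=lambda i: math.dist(query, points[i]))[:k]


# Fixed hyperplanes for the example; in practice they would be drawn at random.
hyperplanes = [(1.0, 0.0), (0.0, 1.0)]
points = [(1.0, 1.0), (2.0, 3.0), (-1.0, 2.0), (-2.0, -1.0), (3.0, 0.5)]
buckets = build_index(points, hyperplanes)
neighbors = approx_knn((1.5, 1.5), points, buckets, hyperplanes, k=2)
```

Only points that share the query's bucket are compared, so the result is approximate: a true neighbor that hashes to a different bucket is missed unless several hash tables are combined.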

Re: Spectral clustering

2015-01-20 Thread Xiangrui Meng
Fan and Stephen (cc'ed) are working on this feature. They will update the JIRA page and report progress soon. -Xiangrui On Fri, Jan 16, 2015 at 12:04 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Hi, thinking of picking up this Jira ticket:

Re: Batch prediciton for ALS

2015-02-17 Thread Xiangrui Meng
It may be too late to merge it into 1.3. I'm going to make another pass on your PR today. -Xiangrui On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Will it be possible to merge this PR to 1.3 ? https://github.com/apache/spark/pull/3098 The batch prediction

Re: mllib.recommendation Design

2015-02-17 Thread Xiangrui Meng
The current ALS implementation allows pluggable solvers for NormalEquation, where we put CholeskySolver and the NNLS solver. Please check the current implementation and let us know how your constraint solver would fit. For a general matrix factorization package, let's make a JIRA and move our
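
To make the "pluggable solver" idea concrete, here is a minimal sketch of a Cholesky-based solve of the regularized normal equation (A^T A + lambda I) x = A^T b. It is written in plain Python for illustration, the function names are hypothetical, and it does not reproduce Spark's internal solver interface. A non-negative (NNLS) solver would take the same inputs.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_normal_equation(ata, atb, lam):
    """Solve (ata + lam * I) x = atb via Cholesky factorization."""
    n = len(ata)
    a = [[ata[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    l = cholesky(a)
    # Forward substitution: L y = atb
    y = [0.0] * n
    for i in range(n):
        y[i] = (atb[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


# Hypothetical 2x2 system: (ata + 1.0 * I) = [[4, 1], [1, 2]], atb = [6, 5].
solution = solve_normal_equation([[3.0, 1.0], [1.0, 1.0]], [6.0, 5.0], 1.0)
```

Because the solver only sees A^T A, A^T b, and lambda, a constrained solver can be swapped in behind the same function signature.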

Re: [ml] Lost persistence for fold in crossvalidation.

2015-02-17 Thread Xiangrui Meng
There are three different regParams defined in the grid and there are three folds. For simplicity, we didn't split the dataset into three and reuse them, but do the split for each fold. Then we need to cache 3*3 times. Note that the pipeline API is not yet optimized for performance. It would be
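
One reading of the 3*3 count is that every (fold, regParam) pair triggers its own fit, and each fit caches its training data. The sketch below only counts those fits. The fold-splitting helper and the regParam values are hypothetical and do not reflect how CrossValidator is implemented.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k folds over n rows."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, validation


reg_params = [0.01, 0.1, 1.0]  # hypothetical grid values
num_folds = 3

fits = []
for train, validation in k_fold_indices(12, num_folds):
    for reg in reg_params:
        # A real pipeline would fit a model on `train` with `reg` here.
        fits.append((reg, len(train), len(validation)))
```

With three folds and three regParams the inner body runs nine times, which matches the 3*3 figure in the reply.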

Re: Batch prediciton for ALS

2015-02-18 Thread Xiangrui Meng
a look at it again and try update with the new ALS... On Tue, Feb 17, 2015 at 3:22 PM, Xiangrui Meng men...@gmail.com wrote: It may be too late to merge it into 1.3. I'm going to make another pass on your PR today. -Xiangrui On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das debasish.da...@gmail.com

Re: Re-use scaling means and variances from StandardScalerModel

2015-01-09 Thread Xiangrui Meng
Feel free to create a JIRA for this issue. We might need to discuss what to put in the public constructors. In the meanwhile, you can use Java serialization to save/load the model: sc.parallelize(Seq(model), 1).saveAsObjectFile("/tmp/model") val model =

Re: multi-line comment style

2015-02-09 Thread Xiangrui Meng
I like the `/* .. */` style more. Because it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the

Re: multi-line comment style

2015-02-09 Thread Xiangrui Meng
(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0)) */ ~~~ So people can copy paste the R commands directly. Xiangrui On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng men...@gmail.com wrote: I like the `/* .. */` style more. Because it is easier for IDEs to recognize it as a block

Re: enum-like types in Spark

2015-03-17 Thread Xiangrui Meng
is why I think #4 is fine. But I figured I'd give my spiel, because every developer loves language wars :) Imran On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote: `case object` inside an `object` doesn't show up in Java. This is the minimal code I found to make

Re: enum-like types in Spark

2015-03-16 Thread Xiangrui Meng
. I doubt it really matters that much for Spark internals, which is why I think #4 is fine. But I figured I'd give my spiel, because every developer loves language wars :) Imran On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote
