Re: weights not changed with different reg param

2014-12-29 Thread Xiangrui Meng
Could you post your code? It sounds like a bug. One thing to check is whether you set regType, which is None by default. -Xiangrui On Tue, Dec 23, 2014 at 3:36 PM, Thomas Kwan thomas.k...@manage.com wrote: Hi there We are on mllib 1.1.1, and trying different regularization parameters. We

Re: MLLib beginner question

2014-12-29 Thread Xiangrui Meng
-- Skype: boci13, Hangout: boci.b...@gmail.com On Tue, Dec 23, 2014 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote: How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci

Re: SVDPlusPlus Recommender in MLLib

2014-12-29 Thread Xiangrui Meng
There is an SVD++ implementation in GraphX. It would be nice if you can compare its performance vs. Mahout. -Xiangrui On Wed, Dec 24, 2014 at 6:46 AM, Prafulla Wani prafulla.w...@gmail.com wrote: hi , Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is implemented in

Re: MLlib + Streaming

2014-12-23 Thread Xiangrui Meng
We have streaming linear regression (since v1.1) and k-means (v1.2) in MLlib. You can check the user guide: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering -Xiangrui On Tue,
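
For reference, a minimal sketch of that streaming API (against the Spark 1.1 docs; the stream names and feature dimension are assumptions):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    // trainingStream, testStream: DStream[LabeledPoint] parsed from a streaming source
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(Array.fill(3)(0.0))) // 3 features assumed
    model.trainOn(trainingStream)       // update the model on every training batch
    model.predictOn(testStream).print() // in 1.1 predictOn takes DStream[LabeledPoint];
                                        // later versions take DStream[Vector]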

Re: retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Xiangrui Meng
Sean's PR may be relevant to this issue (https://github.com/apache/spark/pull/3702). As a workaround, you can try to truncate the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643) before sending them to BinaryClassificationMetrics. This may not work well if the score distribution is very skewed. See
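
A minimal sketch of this workaround, assuming scoreAndLabels is an RDD[(Double, Double)] of (raw score, label):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // Round raw scores to 4 digits to shrink the number of distinct thresholds.
    val rounded = scoreAndLabels.map { case (score, label) =>
      (math.rint(score * 10000) / 10000.0, label)
    }
    val metrics = new BinaryClassificationMetrics(rounded)
    println(metrics.areaUnderROC())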

Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install

Re: Interpreting MLLib's linear regression o/p

2014-12-22 Thread Xiangrui Meng
Did you check the indices in the LIBSVM data and the master file? Do they match? -Xiangrui On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I use LIBSVM format to specify my input feature vector, which used 1-based index. When I run regression the o/p is 0-indexed

Re: MLLib beginner question

2014-12-22 Thread Xiangrui Meng
How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote: Hi! I want to try out spark mllib in my spark project, but I got a little problem. I have training data (external file), but the real data comes from another rdd.

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-18 Thread Xiangrui Meng
Hi Jay, Please try increasing executor memory (if the available memory is more than 2GB) and reducing numBlocks in ALS. The current implementation stores all subproblems in memory and hence the memory requirement is significant when k is large. You can also try reducing k and see whether the
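
A sketch of the two knobs mentioned here, with illustrative values, assuming ratings is an RDD[Rating]:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val model = new ALS()
      .setRank(10)       // a smaller k lowers the per-subproblem memory cost
      .setIterations(10)
      .setLambda(0.01)
      .setBlocks(50)     // numBlocks controls how the problem is partitioned
      .run(ratings)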

Re: printing mllib.linalg.vector

2014-12-15 Thread Xiangrui Meng
You can use the default toString method to get the string representation. If you want to customize it, check the indices/values fields. -Xiangrui On Fri, Dec 5, 2014 at 7:32 AM, debbie debbielarso...@hotmail.com wrote: Basic question: What is the best way to loop through one of these and print
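
Both options in a minimal sketch:

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

    val v: Vector = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))
    println(v)                 // default toString, e.g. (5,[1,3],[0.5,2.0])
    v match {                  // custom formatting via the indices/values fields
      case sv: SparseVector =>
        sv.indices.zip(sv.values).foreach { case (i, x) => println(s"$i -> $x") }
      case dv: DenseVector =>
        dv.values.zipWithIndex.foreach { case (x, i) => println(s"$i -> $x") }
    }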

Re: MLlib(Logistic Regression) + Spark Streaming.

2014-12-15 Thread Xiangrui Meng
If you want to train offline and predict online, you can use the current LR implementation to train a model and then apply model.predict on the dstream. -Xiangrui On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote: I am new to spark. Lets say i want to develop a machine
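
A minimal sketch of the train-offline/predict-online pattern, assuming trainingData is an RDD[LabeledPoint] and stream is a DStream[LabeledPoint]:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    // Train offline on a static RDD...
    val model = LogisticRegressionWithSGD.train(trainingData, 100)
    // ...then score each streaming batch with the (serializable) model.
    val predictions = stream.map(lp => (lp.label, model.predict(lp.features)))
    predictions.print()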

Re: MLLIb: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-15 Thread Xiangrui Meng
Is it possible that after filtering the feature dimension changed? This may happen if you use LIBSVM format but didn't specify the number of features. -Xiangrui On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I was able to run LinearRegressionWithSGD for a larger
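
A sketch of pinning the feature dimension when loading LIBSVM data, so that filtering cannot change it (the exact overload differs slightly across 1.x versions; the path and dimension are placeholders):

    import org.apache.spark.mllib.util.MLUtils

    val numFeatures = 1000 // hypothetical; set the dimension explicitly
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt", numFeatures)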

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Xiangrui Meng
Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai dbt...@dbtsai.com wrote: You just need to use the latest master code without any configuration to get performance improvement from
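
For example (hypothetical path):

    val points = sc.textFile("hdfs:///path/to/points.txt", 8) // at least 8 partitions
    println(points.partitions.length)                         // verify the count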

Re: Stack overflow Error while executing spark SQL

2014-12-15 Thread Xiangrui Meng
Could you post the full stacktrace? It seems to be some recursive call in parsing. -Xiangrui On Tue, Dec 9, 2014 at 7:44 PM, jishnu.prat...@wipro.com wrote: Hi I am getting Stack overflow Error Exception in main java.lang.stackoverflowerror

Re: Building Desktop application for ALS-MlLib/ Training ALS

2014-12-15 Thread Xiangrui Meng
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: Hi, I am a newbie in the spark and scala world I have been trying to implement Collaborative filtering using MlLib supplied out of the box with Spark and Scala I have 2 problems 1. The best

Re: ALS failure with size Integer.MAX_VALUE

2014-12-15 Thread Xiangrui Meng
of item blocks. And yes, I've been following the JIRA for the new ALS implementation. I'll try it out when it's ready for testing. . On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, You can try setting a small item blocks in this case. 1200 is definitely too large

Re: ALS failure with size Integer.MAX_VALUE

2014-12-02 Thread Xiangrui Meng
Hi Bharath, You can try setting a small number of item blocks in this case. 1200 is definitely too large for ALS. Please try 30 or even smaller. I'm not sure whether this could solve the problem because you have 100 items connected with 10^8 users. There is a JIRA for this issue:

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Xiangrui Meng
The training RMSE may increase due to regularization. Squared loss only represents part of the global loss. If you watch the sum of the squared loss and the regularization, it should be non-increasing. -Xiangrui On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen so...@cloudera.com wrote: I also modified

Re: Is spark streaming +MlLib for online learning?

2014-11-25 Thread Xiangrui Meng
In 1.2, we added streaming k-means: https://github.com/apache/spark/pull/2942 . -Xiangrui On Mon, Nov 24, 2014 at 5:25 PM, Joanne Contact joannenetw...@gmail.com wrote: Thank you Tobias! On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Tue, Nov 25, 2014 at

Re: K-means clustering

2014-11-25 Thread Xiangrui Meng
There is a simple example here: https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py . You can take advantage of sparsity by computing the distance via inner products: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2 -Xiangrui On Tue, Nov 25, 2014 at 2:39

Re: MlLib Collaborative filtering factors

2014-11-25 Thread Xiangrui Meng
It is data-dependent, and hence needs hyper-parameter tuning, e.g., grid search. The first batch is certainly expensive. But after you figure out a small range for each parameter that fits your data, the following batches should not be that expensive. There is an example from AMPCamp:

Re: why MatrixFactorizationModel private?

2014-11-25 Thread Xiangrui Meng
Besides API stability concerns, models constructed directly by users rather than returned by ALS may not work well. The userFeatures and productFeatures RDDs both carry partitioners so we can perform quick lookup for prediction. If you save userFeatures and productFeatures and load them back, it

Re: Python Logistic Regression error

2014-11-24 Thread Xiangrui Meng
The data is in LIBSVM format. So this line won't work: values = [float(s) for s in line.split(' ')] Please use the util function in MLUtils to load it as an RDD of LabeledPoint. http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point from pyspark.mllib.util import MLUtils

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Xiangrui Meng
Try building Spark with -Pnetlib-lgpl, which includes the JNI library in the Spark assembly jar. This is the simplest approach. If you want to include it as part of your project, make sure the library is inside the assembly jar or you specify it via `--jars` with spark-submit. -Xiangrui On Mon,
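
For example, with Maven:

    mvn -Pnetlib-lgpl -DskipTests clean package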

Re: MLIB KMeans Exception

2014-11-20 Thread Xiangrui Meng
How many features and how many partitions? You set kmeans_clusters to 1. If the feature dimension is large, it would be really expensive. You can check the WebUI and see task failures there. The stack trace you posted is from the driver. Btw, the total memory you have is 64GB * 10, so you can

Re: Client application that calls Spark and receives an MLlib model Scala Object and then predicts without Spark installed on hadoop

2014-11-15 Thread Xiangrui Meng
If Spark is not installed on the client side, you won't be able to deserialize the model. Instead of serializing the model object, you may serialize the model weights array and implement predict on the client side. -Xiangrui On Fri, Nov 14, 2014 at 2:54 PM, xiaoyan yu xiaoyan...@gmail.com wrote:

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev kat...@gmail.com wrote: Hi, I'm having trouble using both zipWithIndex and repartition. When I use them both, the following action will get stuck and won't return. I'm using spark 1.1.0. Those 2 lines

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
I think I understand where the bug is now. I created a JIRA (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR soon. -Xiangrui On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng men...@gmail.com wrote: This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
PR: https://github.com/apache/spark/pull/3291 . For now, here is a workaround:

    val a = sc.parallelize(1 to 10).zipWithIndex()
    a.partitions // call .partitions explicitly
    a.repartition(10).count()

Thanks for reporting the bug! -Xiangrui On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng men

Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-14 Thread Xiangrui Meng
If you use the Kryo serializer, you need to register mutable.BitSet and Rating: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102 The JIRA was marked resolved because chill resolved the problem in v0.4.0 and we have this
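
A sketch of a registrator along the lines of the linked example (Spark 1.0.x-era Kryo API):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator
    import org.apache.spark.mllib.recommendation.Rating
    import scala.collection.mutable

    class ALSRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Rating])
        kryo.register(classOf[mutable.BitSet])
      }
    }
    // conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // conf.set("spark.kryo.registrator", classOf[ALSRegistrator].getName)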

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread Xiangrui Meng
That means the -Pnetlib-lgpl option didn't work. Could you use sbt to build the assembly jar and see whether the .so file is inside the assembly jar? Which system and Java version are you using? -Xiangrui On Wed, Nov 12, 2014 at 2:22 PM, jpl jlefe...@soe.ucsc.edu wrote: Hi Xiangrui, thank you

Re: No module named pyspark - latest built

2014-11-12 Thread Xiangrui Meng
You need to use maven to include python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote: I have figured out that building the fat jar with sbt does not seem to include the pyspark scripts using the following

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Xiangrui Meng
-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's

Re: MLLib Decision Trees algorithm hangs, others fine

2014-11-11 Thread Xiangrui Meng
Could you provide more information? For example, Spark version, dataset size (number of instances/number of features), cluster size, and error messages from both the driver and the executor. -Xiangrui On Mon, Nov 10, 2014 at 11:28 AM, tsj tsj...@gmail.com wrote: Hello all, I have some text data

Re: scala.MatchError

2014-11-11 Thread Xiangrui Meng
I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, This is my Instrument java

Re: MLLIB usage: BLAS dependency warning

2014-11-11 Thread Xiangrui Meng
Could you try jar tf on the assembly jar and grep netlib-native_system-linux-x86_64.so? -Xiangrui On Tue, Nov 11, 2014 at 7:11 PM, jpl jlefe...@soe.ucsc.edu wrote: Hi, I am having trouble using the BLAS libs with the MLLib functions. I am using org.apache.spark.mllib.clustering.KMeans (on a
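
For example (the assembly jar name is a placeholder):

    jar tf your-spark-assembly.jar | grep netlib-native_system-linux-x86_64.so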

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
a issue... Any idea how to optimize this so that we can calculate MAP statistics on large samples of data ? On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote: ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Xiangrui Meng
local matrix-matrix multiplication or distributed? On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote: what is the best way to implement a sparse x sparse matrix multiplication with spark? -- View this message in context:

Re: Matrix multiplication in spark

2014-11-05 Thread Xiangrui Meng
We are working on distributed block matrices. The main JIRA is at: https://issues.apache.org/jira/browse/SPARK-3434 The goal is to support basic distributed linear algebra (dense first and then sparse). -Xiangrui On Wed, Nov 5, 2014 at 12:23 AM, ll duy.huynh@gmail.com wrote: @sowen.. i

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Xiangrui Meng
You can use breeze for local sparse-sparse matrix multiplication and then define an RDD of sub-matrices RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix) and then use join and aggregateByKey to implement this feature, which is the same as in MapReduce. -Xiangrui
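
A compact sketch of that scheme, assuming breeze's CSCMatrix supports * and + for sparse-sparse operands; the Block layout and names are illustrative:

    import breeze.linalg.CSCMatrix
    import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
    import org.apache.spark.rdd.RDD

    type Block = (Int, Int, CSCMatrix[Double]) // (blockRow, blockCol, sub-matrix)

    def multiply(A: RDD[Block], B: RDD[Block]): RDD[Block] = {
      val aByCol = A.map { case (i, k, m) => (k, (i, m)) } // key A by column block id
      val bByRow = B.map { case (k, j, m) => (k, (j, m)) } // key B by row block id
      aByCol.join(bByRow)
        .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) } // local product
        .reduceByKey(_ + _)                                        // sum partials
        .map { case ((i, j), m) => (i, j, m) }
    }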

Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread Xiangrui Meng
Which Spark version did you use? Could you check the WebUI and attach the error message on executors? -Xiangrui On Wed, Nov 5, 2014 at 8:23 AM, rok rokros...@gmail.com wrote: yes, the training set is fine, I've verified it. -- View this message in context:

Re: pass unique ID to mllib algorithms pyspark

2014-11-04 Thread Xiangrui Meng
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address this issue. We carry over extra columns with training and prediction and then leverage Spark SQL's execution plan optimization to decide which columns are really needed. For the current set of APIs, we can add `predictOnValues`

Re: is spark a good fit for sequential machine learning algorithms?

2014-11-03 Thread Xiangrui Meng
Many ML algorithms are sequential because they were not designed to be parallel. However, ML is not driven by algorithms in practice, but by data and applications. As datasets get bigger and bigger, some algorithms are revised to work in parallel, like SGD and matrix factorization. MLlib tries

Re: Prediction using Classification with text attributes in Apache Spark MLLib

2014-11-02 Thread Xiangrui Meng
This operation requires two transformers: 1) Indexer, which maps string features into categorical features, and 2) OneHotEncoder, which flattens categorical features into binary features. We are working on the new dataset implementation, so we can easily express those transformations. Sorry for the late reply!

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng
Did you cache the data and check the load balancing? How many features? Which API are you using, Scala, Java, or Python? -Xiangrui On Thu, Oct 30, 2014 at 9:13 AM, Jimmy ji...@sellpoints.com wrote: Watch the app manager it should tell you what's running and taking awhile... My guess it's a

Re: MLLib: libsvm - default value initialization

2014-10-30 Thread Xiangrui Meng
You can remove 0.5 from all non-zeros. -Xiangrui On Wed, Oct 29, 2014 at 9:20 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I have my sparse data in libsvm format. val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, mllib/data/sample_libsvm_data.txt) I am running Linear
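
A sketch of that shift on loaded LIBSVM data, assuming examples is an RDD[LabeledPoint] whose features are all sparse vectors:

    import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint

    val shifted = examples.map { lp =>
      val sv = lp.features.asInstanceOf[SparseVector]
      LabeledPoint(lp.label,
        Vectors.sparse(sv.size, sv.indices, sv.values.map(_ - 0.5)))
    }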

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng
trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count // println("Training Error = " + trainErr) println(Calendar.getInstance().getTime()) } } Thanks, Best, Peng On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng men...@gmail.com wrote: Did you cache the data

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Xiangrui Meng
FYI, there is a PR to make mllib.rdd.RDDFunctions public: https://github.com/apache/spark/pull/2907 -Xiangrui On Tue, Oct 28, 2014 at 5:18 AM, Yanbo Liang yanboha...@gmail.com wrote: Yes, it can import org.apache.spark.mllib.rdd.RDDFunctions but you can not use any method in this class or even

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-28 Thread Xiangrui Meng
-Ilya Ganelin On Mon, Oct 27, 2014 at 6:12 PM, Xiangrui Meng men...@gmail.com wrote: Could you save the data before ALS and try to reproduce the problem? You might try reducing the number of partitions and not using

Re: deploying a model built in mllib

2014-10-27 Thread Xiangrui Meng
We are working on the pipeline features, which would make this procedure much easier in MLlib. This is still a WIP and the main JIRA is at: https://issues.apache.org/jira/browse/SPARK-1856 Best, Xiangrui On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Hello, I

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Xiangrui Meng
Could you save the data before ALS and try to reproduce the problem? You might try reducing the number of partitions and not using Kryo serialization, just to narrow down the issue. -Xiangrui On Mon, Oct 27, 2014 at 1:29 PM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Burak. I always see this

Re: Read a TextFile(1 record contains 4 lines) into a RDD

2014-10-25 Thread Xiangrui Meng
If your file is not very large, try sc.wholeTextFiles(...).values.flatMap(_.split("\n").grouped(4).map(_.mkString("\n"))) -Xiangrui On Sat, Oct 25, 2014 at 12:57 AM, Parthus peng.wei@gmail.com wrote: Hi, It might be a naive question, but I still wish that somebody could help me handle it.

Re: MLLib libsvm format

2014-10-21 Thread Xiangrui Meng
Yes, the indices are one-based and **in ascending order**. -Xiangrui On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I have a question regarding the ordering of indices. The document says that the indices are one-based and in ascending order.

Re: create a Row Matrix

2014-10-21 Thread Xiangrui Meng
Please check out the example code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala -Xiangrui On Tue, Oct 21, 2014 at 5:34 AM, viola viola.wiersc...@siemens.com wrote: Hi, I am VERY new to spark and mllib and ran into a

Re: spark1.0 principal component analysis

2014-10-16 Thread Xiangrui Meng
computePrincipalComponents returns a local matrix X, whose columns are the principal components (ordered), while those column vectors are in the same feature space as the input feature vectors. -Xiangrui On Thu, Oct 16, 2014 at 2:39 AM, al123 ant.lay...@hotmail.co.uk wrote: Hi, I don't think

Re: TF-IDF in Spark 1.1.0

2014-10-14 Thread Xiangrui Meng
You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID, and join back the result after training. HashingTF can transform individual records: val docs: RDD[(String, Seq[String])] = ... val tf = new HashingTF() val
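
A sketch of the ID-pairing pattern described here (Spark 1.1 API; the sample documents are made up):

    import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val docs = sc.parallelize(Seq(
      ("doc1", Seq("spark", "mllib", "spark")),
      ("doc2", Seq("hashing", "tf"))))               // (docId, terms)
    val tf = new HashingTF()
    val tfVectors = docs.mapValues(terms => tf.transform(terms))
    tfVectors.cache()                                // reused for keys and values
    val idf = new IDF().fit(tfVectors.values)
    // keys and values derive from the same cached RDD, so zip keeps rows aligned
    val tfidfById = tfVectors.keys.zip(idf.transform(tfVectors.values))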

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
What is the feature dimension? I saw you used 100 partitions. How many cores does your cluster have? -Xiangrui On Tue, Oct 14, 2014 at 1:51 PM, Ray ray-w...@outlook.com wrote: Hi guys, An interesting thing, for the input dataset which has 1.5 million vectors, if set the KMeans's k_value = 100

Re: MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?

2014-10-14 Thread Xiangrui Meng
LBFGS is better. If your data is easily separable, LR might return values very close or equal to either 0.0 or 1.0. It is rare but it may happen. -Xiangrui On Tue, Oct 14, 2014 at 3:18 PM, Aris arisofala...@gmail.com wrote: Wow...I just tried LogisticRegressionWithLBFGS, and using
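
A sketch of getting raw scores instead of 0/1 labels, assuming training and test are RDD[LabeledPoint]:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val model = new LogisticRegressionWithLBFGS().run(training)
    model.clearThreshold() // predict now returns probabilities, not 0/1 labels
    val scores = test.map(lp => (model.predict(lp.features), lp.label))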

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
Just ran a test on mnist8m (8m x 784) with k = 100 and numIter = 50. It worked fine. Ray, the error log you posted is after cluster termination, which is not the root cause. Could you search your log and find the real cause? On the executor tab screenshot, I saw only 200MB is used. Did you cache

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
I used k-means||, which is the default. And it took less than 1 minute to finish. 50 iterations took less than 25 minutes on a cluster of 9 m3.2xlarge EC2 nodes. Which deploy mode did you use? Is it yarn-client? -Xiangrui On Tue, Oct 14, 2014 at 6:03 PM, Ray ray-w...@outlook.com wrote: Hi

Re: MLUtil.kfold generates overlapped training and validation set?

2014-10-10 Thread Xiangrui Meng
1. No. 2. The seed per partition is fixed. So it should generate non-overlapping subsets. 3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1. Best, Xiangrui On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all When we use MLUtils.kfold to generate training
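
For reference, the helper's shape, assuming data is an RDD[LabeledPoint]:

    import org.apache.spark.mllib.util.MLUtils

    // Returns an array of (training, validation) pairs; with the per-partition
    // seeding described above, the validation sets do not overlap.
    val folds = MLUtils.kFold(data, 3, 11)
    folds.foreach { case (training, validation) =>
      // fit on training, evaluate on validation
    }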

Re: DIMSUM item similarity tests

2014-10-09 Thread Xiangrui Meng
please re-try with --driver-memory 10g . The default is 256m. -Xiangrui On Thu, Oct 9, 2014 at 2:33 AM, Clive Cox clive@rummble.com wrote: Hi, I'm trying out the DIMSUM item similarity from github master commit 69c3f441a9b6e942d6c08afecd59a0349d61cc7b . My matrix is: Num items : 8860

Re: MLLib Linear regression

2014-10-08 Thread Xiangrui Meng
The proper step size partially depends on the Lipschitz constant of the objective. You should let the machine try different combinations of parameters and select the best. We are working with people from AMPLab to make hyperparameter tuning easier in MLlib 1.2. For the theory, Nesterov's book
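
A minimal grid-search sketch over step size and regularization, using RidgeRegressionWithSGD so that regParam takes effect (values are illustrative; training is an RDD[LabeledPoint]):

    import org.apache.spark.mllib.regression.RidgeRegressionWithSGD

    for (step <- Seq(0.01, 0.1, 1.0); reg <- Seq(0.0, 0.01, 0.1)) {
      val model = RidgeRegressionWithSGD.train(
        training, 100 /* iterations */, step, reg, 1.0 /* miniBatchFraction */)
      // evaluate on a validation set and keep the best (step, reg) combination
    }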

Re: MLLib Linear regression

2014-10-07 Thread Xiangrui Meng
Did you test different regularization parameters and step sizes? In the combination that works, I don't see A + D. Did you test that combination? Are there any linear dependency between A's columns and D's columns? -Xiangrui On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak ssti...@live.com wrote:

Re: Breeze Library usage in Spark

2014-10-03 Thread Xiangrui Meng
Did you add a different version of breeze to the classpath? In Spark 1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze version you used is different from the one comes with Spark, you might see class not found. -Xiangrui On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread Xiangrui Meng
The current impl of ALS constructs least squares subproblems in memory. So for rank 100, the total memory it requires is about 480,189 * 100^2 / 2 * 8 bytes ~ 20GB, divided by the number of blocks. For rank 1000, this number goes up to 2TB, unfortunately. There is a JIRA for optimizing ALS:

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread Xiangrui Meng
It would be really helpful if you can help test the scalability of the new ALS impl: https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala . It should be faster and more scalable, but the code is messy now. Best, Xiangrui On Fri, Oct 3, 2014 at 11:57

Re: MLLib ALS question

2014-10-01 Thread Xiangrui Meng
ALS still needs to load and deserialize the in/out blocks (one by one) from disk and then construct least squares subproblems. All happen in RAM. The final model is also stored in memory. -Xiangrui On Wed, Oct 1, 2014 at 4:36 AM, Alex T chiorts...@gmail.com wrote: Hi, thanks for the reply. I

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Xiangrui Meng
Yes, the bigram in that demo only has two characters, which could separate different character sets. -Xiangrui On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei liquan...@gmail.com wrote: The program computes hashing bi-gram frequency normalized by total number of bigrams then filter out zero values.

Re: Help Troubleshooting Naive Bayes

2014-10-01 Thread Xiangrui Meng
The cost depends on the feature dimension, number of instances, number of classes, and number of partitions. Do you mind sharing those numbers? -Xiangrui On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico mike.bern...@gmail.com wrote: Hi Everyone, I'm working on training mllib's Naive Bayes to

Re: Print Decision Tree Models

2014-10-01 Thread Xiangrui Meng
Which Spark version are you using? It works in 1.1.0 but not in 1.0.0. -Xiangrui On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain ji...@sellpoints.com wrote: So I am trying to print the model output from MLlib however I am only getting things like the following:

Re: MLLib ALS question

2014-09-30 Thread Xiangrui Meng
You may need a cluster with more memory. The current ALS implementation constructs all subproblems in memory. With rank=10, that means (6.5M + 2.5M) * 10^2 / 2 * 8 bytes = 3.5GB. The ratings need 2GB, not counting the overhead. ALS creates in/out blocks to optimize the computation, which takes

Re: MLLib: Missing value imputation

2014-09-30 Thread Xiangrui Meng
We don't handle missing value imputation in the current version of MLlib. In future releases, we can store feature information in the dataset metadata, which may store the default value to replace missing values. But no one is committed to work on this feature. For now, you can filter out examples

Re: MLlib 1.2 New Interesting Features

2014-09-29 Thread Xiangrui Meng
Hi Krishna, Some planned features for MLlib 1.2 can be found via Spark JIRA: http://bit.ly/1ywotkm , though this list is not fixed. The feature freeze will happen by the end of Oct. Then we will cut branch-1.2 and start QA. I don't recommend using branch-1.2 for hands-on tutorial around Oct 29th

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Xiangrui Meng
The test accuracy doesn't mean the total loss. All points between (-1, 1) can separate points -1 and +1 and give you 1.0 accuracy, but their corresponding losses are different. -Xiangrui On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi We have used LogisticRegression

Re: Build error when using spark with breeze

2014-09-26 Thread Xiangrui Meng
We removed commons-math3 from dependencies to avoid version conflict with hadoop-common. hadoop-common-2.3+ depends on commons-math3-3.1.1, while breeze depends on commons-math3-3.3. 3.3 is not backward compatible with 3.1.1. So we removed it because the breeze functions we use do not touch

Re: java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-25 Thread Xiangrui Meng
7000x7000 is not a tall-and-skinny matrix. Storing the dense matrix requires 784MB. The driver needs more storage for collecting results from executors as well as making a copy for LAPACK's dgesvd. So you need more memory. Do you need the full SVD? If not, try to use a small k, e.g., 50. -Xiangrui On

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-25 Thread Xiangrui Meng
For the vectorizer, what's the output feature dimension and are you creating sparse vectors or dense vectors? The model on the driver consists of numClasses * numFeatures doubles. However, the driver needs more memory in order to receive the task result (of the same size) from executors. So you

Re: K-means faster on Mahout then on Spark

2014-09-25 Thread Xiangrui Meng
Please also check the load balance of the RDD on YARN. How many partitions are you using? Does it match the number of CPU cores? -Xiangrui On Thu, Sep 25, 2014 at 12:28 PM, bhusted brian.hus...@gmail.com wrote: What is the size of your vector mine is set to 20? I am seeing slow results as well

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread Xiangrui Meng
Your dataset is small. NaiveBayes should work under the default settings, even in local mode. Could you try local mode first without changing any Spark settings? Since your dataset is small, could you save the vectorized data (RDD[LabeledPoint]) and send me a sample? I want to take a look at the

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-22 Thread Xiangrui Meng
Does feature size 43839 equal the number of terms? Check the output dimension of your feature vectorizer and reduce the number of partitions to match the number of physical cores. I saw you set spark.storage.memoryFraction to 0.0. Maybe it is better to keep the default. Also please confirm the

Re: MLLib regression model weights

2014-09-18 Thread Xiangrui Meng
The importance should be based on some statistics, for example, the standard deviation of the feature column and the magnitude of the weight. If the columns are scaled to unit standard deviation (using StandardScaler), you can tell the importance by the absolute value of the weight. But there are

Re: SVD on larger than taller matrix

2014-09-18 Thread Xiangrui Meng
Did you cache `features`? Without caching it is slow because we need O(k) iterations. The storage requirement on the driver is about 2 * n * k doubles = 2 * 3 million * 200 * 8 bytes ~= 9.6GB, not considering any overhead. Computing U is also an expensive task in your case. We should use some randomized SVD

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Xiangrui Meng
We don't support kernels because it doesn't scale well. Please check When to use LIBLINEAR but not LIBSVM on http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html . I like Jey's suggestion on expanding features. -Xiangrui On Thu, Sep 18, 2014 at 12:29 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

Re: Unable to load app logs for MLLib programs in history server

2014-09-18 Thread Xiangrui Meng
Could you create a JIRA for it? We can either remove special characters or encode with alphanumerics. -Xiangrui On Thu, Sep 18, 2014 at 3:50 PM, SK skrishna...@gmail.com wrote: Hi, The default log files for the Mllib examples use a rather long naming convention that includes special

Re: Efficient way to sum multiple columns

2014-09-15 Thread Xiangrui Meng
Please check the colStats method defined under mllib.stat.Statistics. -Xiangrui On Mon, Sep 15, 2014 at 1:00 PM, jamborta jambo...@gmail.com wrote: Hi all, I have an RDD that contains around 50 columns. I need to sum each column, which I am doing by running it through a for loop, creating an
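
A sketch of summing all columns at once, assuming rows is an RDD[Vector]:

    import org.apache.spark.mllib.stat.Statistics

    val summary = Statistics.colStats(rows)
    // sum of each column = mean * count
    val columnSums = summary.mean.toArray.map(_ * summary.count)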

Re: Define the name of the outputs with Java-Spark.

2014-09-15 Thread Xiangrui Meng
Spark doesn't support MultipleOutput at this time. You can cache the parent RDD. Then create RDDs from it and save them separately. -Xiangrui On Fri, Sep 12, 2014 at 7:45 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I would like to define the names of my output in Spark, I have a process

Re: Accuracy hit in classification with Spark

2014-09-15 Thread Xiangrui Meng
Thanks for the update! -Xiangrui On Sun, Sep 14, 2014 at 11:33 PM, jatinpreet jatinpr...@gmail.com wrote: Hi, I have been able to get the same accuracy with MLlib as Mahout's. The pre-processing phase of Mahout was the reason behind the accuracy mismatch. After studying and applying the

Re: MLLib sparse vector

2014-09-15 Thread Xiangrui Meng
Or you can use the factory method `Vectors.sparse`: val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0))) where numProducts should be the largest product id plus one. Best, Xiangrui On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore cdg...@cdgore.com wrote: Hi Sameer, MLLib uses

Re: sc.textFile problem due to newlines within a CSV record

2014-09-12 Thread Xiangrui Meng
I wrote an input format for Redshift tables unloaded with the UNLOAD ESCAPE option: https://github.com/mengxr/redshift-input-format , which can recognize multi-line records. Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and the delimiter character. You can apply the same escaping

Re: Accuracy hit in classification with Spark

2014-09-09 Thread Xiangrui Meng
If you are using Mahout's Multinomial Naive Bayes, it should be the same as MLlib's. I tried MLlib with news20.scale downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html and the test accuracy is 82.4%. -Xiangrui On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Xiangrui Meng
You can try LinearRegression with sparse input. It converges to the least squares solution if the linear system is over-determined, while the convergence rate depends on the condition number. Applying standard scaling is a popular heuristic to reduce the condition number. If you are interested in

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Xiangrui Meng
There is an undocumented configuration to put user jars in front of the Spark jar. But I'm not very certain that it works as expected (and this is why it is undocumented). Please try turning on spark.yarn.user.classpath.first . -Xiangrui On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Xiangrui Meng
When you submit the job to yarn with spark-submit, set --conf spark.yarn.user.classpath.first=true . On Mon, Sep 8, 2014 at 10:46 AM, Penny Espinoza pesp...@societyconsulting.com wrote: I don't understand what you mean. Can you be more specific? From: Victor
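
For example (class and jar names are hypothetical):

    spark-submit --master yarn-cluster \
      --conf spark.yarn.user.classpath.first=true \
      --class com.example.Main app.jar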

Re: A problem for running MLLIB in amazon clound

2014-09-08 Thread Xiangrui Meng
Could you attach the driver log? -Xiangrui On Mon, Sep 8, 2014 at 7:23 AM, Hui Li hli161...@gmail.com wrote: I am running a very simple example using SVMWithSGD on Amazon EMR. I haven't got any result after one hour. My instance-type is: m3.large instance-count is: 3 Dataset

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Xiangrui Meng
to partition the problem cleanly and don't need consensus step (which is what I implemented in the code) Thanks Deb On Sep 7, 2014 11:35 PM, Xiangrui Meng men...@gmail.com wrote: You can try LinearRegression with sparse input. It converges the least squares solution if the linear system is over

Re: MLLib decision tree: Weights

2014-09-03 Thread Xiangrui Meng
This is not supported in MLlib. Hopefully, we will add support for weighted examples in v1.2. If you want to train weighted instances with the current tree implementation, please try importance sampling first to adjust the weights. For instance, an example with weight 0.3 is sampled with
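
A heavily simplified sketch of such sampling, assuming weighted is an RDD[(Double, LabeledPoint)] of (weight in [0, 1], example):

    val sampled = weighted.flatMap { case (w, lp) =>
      // keep an example with probability equal to its weight,
      // e.g. weight 0.3 is kept roughly 30% of the time
      if (scala.util.Random.nextDouble() < w) Iterator(lp) else Iterator.empty
    }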

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-03 Thread Xiangrui Meng
We have a pending PR (https://github.com/apache/spark/pull/216) for discretization but it has performance issues. We will try to spend more time to improve it. -Xiangrui On Tue, Sep 2, 2014 at 2:56 AM, filipus floe...@gmail.com wrote: i guess i found it

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-03 Thread Xiangrui Meng
I think they are the same. If you have hub (https://hub.github.com/) installed, you can run hub checkout https://github.com/apache/spark/pull/216 and then `sbt/sbt assembly` -Xiangrui On Wed, Sep 3, 2014 at 12:03 AM, filipus floe...@gmail.com wrote: howto install? just clone by git clone

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Xiangrui Meng
There is a sliding method implemented in MLlib (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala), which is used in computing Area Under Curve:
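
Usage looks roughly like this, assuming the implicit conversions in RDDFunctions are accessible (they were private[mllib] at the time; see the PR mentioned in an earlier thread):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val windows = sc.parallelize(1 to 10).sliding(2) // RDD[Array[Int]]
    windows.collect().foreach(w => println(w.mkString(",")))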
