Re: word2vec more distributed

2015-02-09 Thread Xiangrui Meng
The C implementation of Word2Vec updates the model using multi-threads without locking. It is hard to implement it in a distributed way. In the MLlib implementation, each work holds the entire model in memory and output the part of model that gets updated. The driver still need to collect and

Re: calculating the mean of SparseVector RDD

2015-01-07 Thread Xiangrui Meng
There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote: I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector.

Re: Discrepancy in PCA values

2015-01-08 Thread Xiangrui Meng
The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote: Hi All, I tried to do PCA for the Iris dataset [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Xiangrui Meng
Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS performance should be improved in 1.3.0. -Xiangrui On Fri, Feb 20, 2015 at 1:32 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi Ilya, thanks for your insight, this was the right clue. I had default parallelism already

Re: high GC in the Kmeans algorithm

2015-02-20 Thread Xiangrui Meng
huge? On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng men...@gmail.com wrote: Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to avoid GC triggered frequently. Please monitor the memory usage using

Re: Scaling problem in RandomForest?

2015-03-16 Thread Xiangrui Meng
Try increasing the driver memory. We store trees on the driver node. If maxDepth=20 and numTrees=50, you may need a large driver memory to store all tree models. You might want to start with a smaller maxDepth and then increase it and see whether deep trees really help (vs. the cost). -Xiangrui

Re: Top rows per group

2015-03-16 Thread Xiangrui Meng
https://issues.apache.org/jira/browse/SPARK-5954 is for this issue and Shuo is working on it. We will first implement topByKey for RDD and them we could add it to DataFrames. -Xiangrui On Mon, Mar 9, 2015 at 9:43 PM, Moss rhoud...@gmail.com wrote: I do have a schemaRDD where I want to group by

Re: High GC time

2015-03-17 Thread Xiangrui Meng
The official guide may help: http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning -Xiangrui On Tue, Mar 17, 2015 at 8:27 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am getting very high GC time in my jobs. For smaller/real-time load, this becomes a real problem.

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-17 Thread Xiangrui Meng
! Thanks, Jay On Mar 16, 2015, at 11:35 AM, Xiangrui Meng men...@gmail.com wrote: Try this: val ratings = purchase.map { line = line.split(',') match { case Array(user, item, rate) = (user.toInt, item.toInt, rate.toFloat) }.toDF(user, item, rate) Doc for DataFrames: http

Re: Garbage stats in Random Forest leaf node?

2015-03-17 Thread Xiangrui Meng
This is the default value (Double.MinValue) for invalid gain: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67 Please ignore it. Maybe we should update `toString` to use scientific notation. -Xiangrui On Mon, Mar

Re: IllegalAccessError in GraphX (Spark 1.3.0 LDA)

2015-03-17 Thread Xiangrui Meng
Please check your classpath and make sure you don't have multiple Spark versions deployed. If the classpath looks correct, please create a JIRA for this issue. Thanks! -Xiangrui On Tue, Mar 17, 2015 at 2:03 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi all, I'm trying to use the new LDA

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-17 Thread Xiangrui Meng
that I needed to bug you :) Jay On Mar 17, 2015, at 11:48 AM, Xiangrui Meng men...@gmail.com wrote: Please check this section in the user guide: http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection You need `import sqlContext.implicits

Re: Garbage stats in Random Forest leaf node?

2015-03-17 Thread Xiangrui Meng
17, 2015, at 11:53 AM, Xiangrui Meng men...@gmail.com wrote: This is the default value (Double.MinValue) for invalid gain: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67 Please ignore it. Maybe we should update

Re: MLlib Spam example gets stuck in Stage X

2015-03-20 Thread Xiangrui Meng
Su, which Spark version did you use? -Xiangrui On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: To get these metrics out, you need to open the driver ui running on port 4040. And in there you will see Stages information and for each stage you can see how much time

Re: Any way to find out feature importance in Spark SVM?

2015-03-16 Thread Xiangrui Meng
You can compute the standard deviations of the training data using Statistics.colStats and then compare them with model coefficients to compute feature importance. -Xiangrui On Fri, Mar 13, 2015 at 11:35 AM, Natalia Connolly natalia.v.conno...@gmail.com wrote: Hello, While running an

Re: Logistic Regression displays ERRORs

2015-03-16 Thread Xiangrui Meng
Actually, they should be INFO or DEBUG. Line search steps are expected. You can configure log4j.properties to ignore those. A better solution would be reporting this at https://github.com/scalanlp/breeze/issues -Xiangrui On Thu, Mar 12, 2015 at 5:46 PM, cjwang c...@cjwang.us wrote: I am running

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-16 Thread Xiangrui Meng
Try this: val ratings = purchase.map { line = line.split(',') match { case Array(user, item, rate) = (user.toInt, item.toInt, rate.toFloat) }.toDF(user, item, rate) Doc for DataFrames: http://spark.apache.org/docs/latest/sql-programming-guide.html -Xiangrui On Mon, Mar 16, 2015 at 9:08 AM,

Re: MLlib/kmeans newbie question(s)

2015-03-09 Thread Xiangrui Meng
You need to change `== 1` to `== i`. `println(t)` happens on the workers, which may not be what you want. Try the following: noSets.filter(t = model.predict(Utils.featurize(t)) == i).collect().foreach(println) -Xiangrui On Sat, Mar 7, 2015 at 3:20 PM, Pierce Lamb richard.pierce.l...@gmail.com

Re: Can't cache RDD of collaborative filtering on MLlib

2015-03-09 Thread Xiangrui Meng
cache() is lazy. The data is stored into memory after the first time it gets materialized. So the first time you call `predict` after you load the model back from HDFS, it still takes time to load the actual data. The second time will be much faster. Or you can call `userJavaRDD.count()` and

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
Hi Xi, Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k? Best, Xiangrui On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote: Hi Burak, After I added .repartition(sc.defaultParallelism), I can see from the log the partition number is set

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-30 Thread Xiangrui Meng
to get a similar result in 1.3. Sean and Shuo, which approach do you prefer? Do you know any existing work discussing this? Best, Xiangrui On Fri, Mar 27, 2015 at 11:27 AM, Xiangrui Meng men...@gmail.com wrote: This sounds like a bug ... Did you try a different lambda? It would be great if you

Re: kmeans|| in Spark is not real paralleled?

2015-03-30 Thread Xiangrui Meng
This PR updated the k-means|| initialization: https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d, which was included in 1.3.0. It should fix kmean|| initialization with large k. Please create a JIRA for this issue and send me the code and the dataset to produce this

Re: k-means can only run on one executor with one thread?

2015-03-30 Thread Xiangrui Meng
Hey Xi, Have you tried Spark 1.3.0? The initialization happens on the driver node and we fixed an issue with the initialization in 1.3.0. Again, please start with a smaller k, and increase it gradually, Let us know at what k the problem happens. Best, Xiangrui On Sat, Mar 28, 2015 at 3:11 AM,

Re: Setting a custom loss function for GradientDescent

2015-03-30 Thread Xiangrui Meng
You can extend Gradient, e.g., https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala#L266, and use it in GradientDescent:

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
has vectors of 200 dimensions. It is possible people never tested large dimension case. Thanks, David On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng men...@gmail.com wrote: Hi Xi, Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k? Best, Xiangrui

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Xiangrui Meng
as the input grew anyway. So, basically I don't know anything more than you do, sorry! On Tue, Mar 31, 2015 at 10:41 PM, Xiangrui Meng men...@gmail.com wrote: Hey Sean, That is true for explicit model, but not for implicit. The ALS-WR paper doesn't cover the implicit model. In implicit formulation

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread Xiangrui Meng
to solve my problem? Sendong Li 在 2015年3月31日,上午12:11,Xiangrui Meng men...@gmail.com 写道: setCheckpointInterval was added in the current master and branch-1.3. Please help check whether it works. It will be included in the 1.3.1 and 1.4.0 release. -Xiangrui On Mon, Mar 30, 2015 at 7:27 AM

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-03-31 Thread Xiangrui Meng
I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, DataFrame with an user defined type (here mllib.Vector)

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Xiangrui Meng
it comes to invariance. But FWIW I had always understood the regularization to be multiplied by the number of explicit ratings. On Mon, Mar 30, 2015 at 5:51 PM, Xiangrui Meng men...@gmail.com wrote: Okay, I didn't realize that I changed the behavior of lambda in 1.3. to make it scale

Re: gc time too long when using mllib als

2015-03-03 Thread Xiangrui Meng
Also try 1.3.0-RC1 or the current master. ALS should performance much better in 1.3. -Xiangrui On Tue, Mar 3, 2015 at 1:00 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You need to increase the parallelism/repartition the data to a higher number to get ride of those. Thanks Best Regards

Re: how to save Word2VecModel

2015-03-04 Thread Xiangrui Meng
+user On Wed, Mar 4, 2015, 8:21 AM Xiangrui Meng men...@gmail.com wrote: You can use the save/load implementation in naive Bayes as reference: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala Ping me on the JIRA page

Re: UnsatisfiedLinkError related to libgfortran when running MLLIB code on RHEL 5.8

2015-03-03 Thread Xiangrui Meng
libgfortran.x86_64 4.1.2-52.el5_8.1 comes with libgfortran.so.1 but not libgfortran.so.3. JBLAS requires the latter. If you have root access, you can try to install a newer version of libgfortran. Otherwise, maybe you can try Spark 1.3, which doesn't use JBLAS in ALS. -Xiangrui On Tue, Mar 3,

Re: Efficient way of scoring all items and users in an ALS model

2015-02-23 Thread Xiangrui Meng
You can use rdd.cartesian then find top-k by key to distribute the work to executors. There is a trick to boost the performance: you need to blockify user/product features and then use native matrix-matrix multiplication. There is a relevant PR from Deb: https://github.com/apache/spark/pull/3098 .

Re: shuffle data taking immense disk space during ALS

2015-02-23 Thread Xiangrui Meng
Did you try to use less number of partitions (user/product blocks)? Did you use implicit feedback? In the current implementation, we only do checkpointing with implicit feedback. We should adopt the checkpoint strategy implemented in LDA:

Re: Movie Recommendation tutorial

2015-02-23 Thread Xiangrui Meng
Which Spark version did you use? Btw, there are three datasets from MovieLens. The tutorial used the medium one (1 million). -Xiangrui On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez guilla...@databerries.com wrote: What do you mean? -- View this message in context:

Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Xiangrui Meng
Yes, we are going to expose the developer API. There was a long discussion in the PR: https://github.com/apache/spark/pull/3637. So we marked them package private and look for feedback on how to improve it. Please implement your classes under `spark.ml` for now and let us know your feedback.

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Xiangrui Meng
Made 3 votes to each of the talks. Looking forward to see them in Hadoop Summit:) -Xiangrui On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin r...@databricks.com wrote: Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could

Re: Movie Recommendation tutorial

2015-02-23 Thread Xiangrui Meng
= 1.0, and numIter = 20) 1.1.1 - RSME = 1.335831 (rank = 8 and lambda = 1.0, and numIter = 10) Cheers k/ On Mon, Feb 23, 2015 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: Which Spark version did you use? Btw, there are three datasets from MovieLens. The tutorial used the medium one (1

Re: Pyspark save Decison Tree Module with joblib/pickle

2015-02-23 Thread Xiangrui Meng
FYI, in 1.3 we support save/load tree models in Scala and Java. We will add save/load support to Python soon. -Xiangrui On Mon, Feb 23, 2015 at 2:57 PM, Sebastián Ramírez sebastian.rami...@senseta.com wrote: In your log it says: pickle.PicklingError: Can't pickle type 'thread.lock': it's not

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread Xiangrui Meng
Lisen, did you use all m-by-n pairs during training? Implicit model penalizes unobserved ratings, while explicit model doesn't. -Xiangrui On Feb 26, 2015 6:26 AM, Sean Owen so...@cloudera.com wrote: +user On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen so...@cloudera.com wrote: I think I may

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Xiangrui Meng
If you make `Image` a case class, then select(image.data) should work. On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I have a DataFrame that contains a user defined type. The type is an image with the following attribute class Image(w: Int, h: Int,

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Xiangrui Meng
Btw, the correct syntax for alias should be `df.select($image.data.as(features))`. On Tue, Feb 24, 2015 at 3:35 PM, Xiangrui Meng men...@gmail.com wrote: If you make `Image` a case class, then select(image.data) should work. On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com

Re: How to augment data to existing MatrixFactorizationModel?

2015-02-26 Thread Xiangrui Meng
It may take some work to do online updates with an MatrixFactorizationModel because you need to update some rows of the user/item factors. You may be interested in spark-indexedrdd (http://spark-packages.org/package/amplab/spark-indexedrdd). We support save/load in Scala/Java. We are going to add

Re: Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread Xiangrui Meng
Try the following: df.map { case Row(id: Int, num: Int, value: Double, x: Float) = // replace those with your types (id, Vectors.dense(num, value, x)) }.toDF(id, features) -Xiangrui On Thu, Feb 26, 2015 at 3:08 PM, mobsniuk mobsn...@gmail.com wrote: I've been searching around and see others

Re: Reg. KNN on MLlib

2015-02-26 Thread Xiangrui Meng
It is not in MLlib. There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-2336 and Ashutosh has an implementation for integer values. -Xiangrui On Thu, Feb 26, 2015 at 8:18 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Has KNN classification algorithm been implemented on MLlib?

Re: Using ORC input for mllib algorithms

2015-03-27 Thread Xiangrui Meng
This is a PR in review to support ORC via the SQL data source API: https://github.com/apache/spark/pull/3753. You can try pulling that PR and help test it. -Xiangrui On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I use sc.hadoopFile(directory,

Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Xiangrui Meng
Hi Martin, Could you attach the code snippet and the stack trace? The default implementation of some methods uses reflection, which may be the cause. Best, Xiangrui On Wed, Mar 25, 2015 at 3:18 PM, zapletal-mar...@email.cz wrote: Thanks Peter, I ended up doing something similar. I however

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-27 Thread Xiangrui Meng
This sounds like a bug ... Did you try a different lambda? It would be great if you can share your dataset or re-produce this issue on the public dataset. Thanks! -Xiangrui On Thu, Mar 26, 2015 at 7:56 AM, Ravi Mody rmody...@gmail.com wrote: After upgrading to 1.3.0, ALS.trainImplicit() has been

Re: Training Random Forest

2015-03-05 Thread Xiangrui Meng
We don't support warm starts or online updates for decision trees. So if you call train twice, only the second dataset is used for training. -Xiangrui On Thu, Mar 5, 2015 at 12:31 PM, drarse drarse.a...@gmail.com wrote: I am testing the Random Forest in Spark, but I have a question... If I train

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-04-01 Thread Xiangrui Meng
Ravi, we just merged https://issues.apache.org/jira/browse/SPARK-6642 and used the same lambda scaling as in 1.2. The change will be included in Spark 1.3.1, which will be released soon. Thanks for reporting this issue! -Xiangrui On Tue, Mar 31, 2015 at 8:53 PM, Xiangrui Meng men...@gmail.com

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Xiangrui Meng
I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1,

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-02 Thread Xiangrui Meng
:18 PM, Xiangrui Meng men...@gmail.com wrote: I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all

Re: feature scaling in GeneralizedLinearAlgorithm.scala

2015-04-13 Thread Xiangrui Meng
Correct. Prediction doesn't touch that code path. -Xiangrui On Mon, Apr 13, 2015 at 9:58 AM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, In the GeneralizedLinearAlgorithm, which Logistic Regression relied on, it says if userFeatureScaling is enabled, we will standardize the training

Re: Streaming Linear Regression problem

2015-04-20 Thread Xiangrui Meng
Did you keep adding new files under the `train/` folder? What was the exact warn message? -Xiangrui On Fri, Apr 17, 2015 at 4:56 AM, barisak baris.akg...@gmail.com wrote: Hi, I write this code for just train the Stream Linear Regression, but I took no data found warn, so no weights were not

Re: ChiSquared Test from user response flat files to RDD[Vector]?

2015-04-20 Thread Xiangrui Meng
You can find the user guide for vector creation here: http://spark.apache.org/docs/latest/mllib-data-types.html#local-vector. -Xiangrui On Mon, Apr 20, 2015 at 2:32 PM, Dan DeCapria, CivicScience dan.decap...@civicscience.com wrote: Hi Spark community, I'm very new to the Apache Spark

Re: MLlib - Naive Bayes Problem

2015-04-20 Thread Xiangrui Meng
Could you attach the full stack trace? Please also include the stack trace from executors, which you can find on the Spark WebUI. -Xiangrui On Thu, Apr 16, 2015 at 1:00 PM, riginos samarasrigi...@gmail.com wrote: I have a big dataset of categories of cars and descriptions of cars. So i want to

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-04-20 Thread Xiangrui Meng
You should check where MyDenseVectorUDT is defined and whether it was on the classpath (or in the assembly jar) at runtime. Make sure the full class name (with package name) is used. Btw, UDTs are not public yet, so please use it with caution. -Xiangrui On Fri, Apr 17, 2015 at 12:45 AM, Jaonary

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Xiangrui Meng
SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in 1.3. We should allow DataFrames in ALS.train. I will submit a patch. You can use `ALS.train(training.rdd, ...)` for now as a workaround. -Xiangrui On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley jos...@databricks.com wrote:

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-22 Thread Xiangrui Meng
This is the size of the serialized task closure. Is stage 246 part of ALS iterations, or something before or after it? -Xiangrui On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone christian.per...@gmail.com wrote: Hi Sean, thanks for the answer. I tried to call repartition() on the input

Re: setting cost in linear SVM [Python]

2015-04-23 Thread Xiangrui Meng
If by C you mean the parameter C in LIBLINEAR, the corresponding parameter in MLlib is regParam: https://github.com/apache/spark/blob/master/python/pyspark/mllib/classification.py#L273, while regParam = 1/C. -Xiangrui On Wed, Apr 22, 2015 at 3:25 PM, Pagliari, Roberto rpagli...@appcomsci.com

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-22 Thread Xiangrui Meng
The patched was merged and it will be included in 1.3.2 and 1.4.0. Thanks for reporting the bug! -Xiangrui On Tue, Apr 21, 2015 at 2:51 PM, ayan guha guha.a...@gmail.com wrote: Thank you all. On 22 Apr 2015 04:29, Xiangrui Meng men...@gmail.com wrote: SchemaRDD subclasses RDD in 1.2

Re: StackOverflow Error when run ALS with 100 iterations

2015-04-23 Thread Xiangrui Meng
ALS.setCheckpointInterval was added in Spark 1.3.1. You need to upgrade Spark to use this feature. -Xiangrui On Wed, Apr 22, 2015 at 9:03 PM, amghost zhengweita...@outlook.com wrote: Hi, would you please how to checkpoint the training set rdd since all things are done in ALS.train method.

Re: Problem with using Spark ML

2015-04-22 Thread Xiangrui Meng
Please try reducing the step size. The native BLAS library is not required. -Xiangrui On Tue, Apr 21, 2015 at 5:15 AM, Staffan staffan.arvids...@gmail.com wrote: Hi, I've written an application that performs some machine learning on some data. I've validated that the data _should_ give a good

Re: [MLlib] fail to run word2vec

2015-04-22 Thread Xiangrui Meng
We store the vectors on the driver node. So it is hard to handle a really large vocabulary. You can use setMinCount to filter out infrequent word to reduce the model size. -Xiangrui On Wed, Apr 22, 2015 at 12:32 AM, gm yu husty...@gmail.com wrote: When use Mllib.Word2Vec, I meet the following

Re: the indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Xiangrui Meng
Having ordered indices is a contract of SparseVector: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector. We do not verify it for performance. -Xiangrui On Wed, Apr 22, 2015 at 8:26 AM, yaochunnan yaochun...@gmail.com wrote: Hi all, I am using

Re: LDA code little error @Xiangrui Meng

2015-04-22 Thread Xiangrui Meng
to see there is error -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LDA-code-little-error-Xiangrui-Meng-tp22621.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Xiangrui Meng
What is the feature dimension? Did you set the driver memory? -Xiangrui On Tue, Apr 21, 2015 at 6:59 AM, rok rokros...@gmail.com wrote: I'm trying to use the StandardScaler in pyspark on a relatively small (a few hundred Mb) dataset of sparse vectors with 800k features. The fit method of

Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Xiangrui Meng
How did you run the example app? Did you use spark-submit? -Xiangrui On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote: Sorry, accidentally sent the last email before finishing. I had asked this question before, but wanted to ask again as I think it is now related to my pom

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-27 Thread Xiangrui Meng
) org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527) org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203) On Thu, Apr 23, 2015 at 2:49 AM, Xiangrui Meng men...@gmail.com wrote: This is the size of the serialized task closure. Is stage 246 part of ALS iterations, or something

Re: StandardScaler failing with OOM errors in PySpark

2015-04-27 Thread Xiangrui Meng
-- I have to run in client mode so I can't set spark.driver.memory -- I've tried setting the spark.yarn.am.memory and overhead parameters but it doesn't seem to have an effect. Thanks, Rok On Apr 23, 2015, at 7:47 AM, Xiangrui Meng men...@gmail.com wrote: What is the feature dimension

Re: gridsearch - python

2015-04-27 Thread Xiangrui Meng
We will try to make them available in 1.4, which is coming soon. -Xiangrui On Thu, Apr 23, 2015 at 10:18 PM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I know grid search with cross validation is not supported. However, I was wondering if there is something availalable for the time being.

Re: OOM error with GMMs on 4GB dataset

2015-05-06 Thread Xiangrui Meng
Did you set `--driver-memory` with spark-submit? -Xiangrui On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni vmuttin...@ebay.com wrote: Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760). The spark (1.3.1) job is allocated 120 executors with 6GB each and the driver also

Re: MLLib SVMWithSGD is failing for large dataset

2015-05-18 Thread Xiangrui Meng
Reducing the number of instances won't help in this case. We use the driver to collect partial gradients. Even with tree aggregation, it still puts heavy workload on the driver with 20M features. Please try to reduce the number of partitions before training. We are working on a more scalable

Re: StandardScaler failing with OOM errors in PySpark

2015-05-18 Thread Xiangrui Meng
in how the JVM is created. No matter which memory settings I specify, the JVM for the driver is always made with 512Mb of memory. So I'm not sure if this is a feature or a bug? rok On Mon, Apr 27, 2015 at 6:54 PM, Xiangrui Meng men...@gmail.com wrote: You might need to specify driver memory

Re: bug: numClasses is not a valid argument of LogisticRegressionWithSGD

2015-05-18 Thread Xiangrui Meng
LogisticRegressionWithSGD doesn't support multi-class. Please use LogisticRegressionWithLBFGS instead. -Xiangrui On Mon, Apr 27, 2015 at 12:37 PM, Pagliari, Roberto rpagli...@appcomsci.com wrote: With the Python APIs, the available arguments I got (using inspect module) are the following:

Re: org.apache.spark.ml.recommendation.ALS

2015-04-14 Thread Xiangrui Meng
? Thanks, Jay On Apr 9, 2015, at 4:38 PM, Xiangrui Meng men...@gmail.com wrote: Could you share ALSNew.scala? Which Scala version did you use? -Xiangrui On Wed, Apr 8, 2015 at 4:09 PM, Jay Katukuri jkatuk...@apple.com wrote: Hi Xiangrui, I tried running this on my local machine

Re: multinomial and Bernoulli model in NaiveBayes

2015-04-15 Thread Xiangrui Meng
CC Leah, who added Bernoulli option to MLlib's NaiveBayes. -Xiangrui On Wed, Apr 15, 2015 at 4:49 AM, 姜林和 linhe_ji...@163.com wrote: Dear meng: Thanks for the great work for park machine learning, and I saw the changes for NaiveBayes algorithm , separate the algorithm to : multinomial

Re: spark ml model info

2015-04-14 Thread Xiangrui Meng
If you are using Scala/Java or pyspark.mllib.classification.LogisticRegressionModel, you should be able to call weights and intercept to get the model coefficients. If you are using the pipeline API in Python, you can try model._java_model.weights(), we are going to add a method to get the weights

Re: Help understanding the FP-Growth algrithm

2015-04-14 Thread Xiangrui Meng
If you want to see an example that calls MLlib's FPGrowth, you can find them under the examples/ folder: Scala: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/FPGrowthExample.scala, Java:

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Xiangrui Meng
tried passing the spark-sql jar using the -jar spark-sql_2.11-1.3.0.jar Thanks, Jay On Mar 17, 2015, at 12:50 PM, Xiangrui Meng men...@gmail.com wrote: Please remember to copy the user list next time. I might not be able to respond quickly. There are many others who can help or who can

Re: java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint

2015-04-06 Thread Xiangrui Meng
Did you try to treat RDD[(Double, Vector)] as RDD[LabeledPoint]? If that is the case, you need to cast them explicitly: rdd.map { case (label, features) = LabeledPoint(label, features) } -Xiangrui On Mon, Apr 6, 2015 at 11:59 AM, Joanne Contact joannenetw...@gmail.com wrote: Hello Sparkers,

Re: DataFrame -- help with encoding factor variables

2015-04-06 Thread Xiangrui Meng
Before OneHotEncoder or LabelIndexer is merged, you can define an UDF to do the mapping. val labelToIndex = udf { ... } featureDF.withColumn(f3_dummy, labelToIndex(col(f3))) See instructions here

Re: How to work with sparse data in Python?

2015-04-06 Thread Xiangrui Meng
We support sparse vectors in MLlib, which recognizes MLlib's sparse vector and SciPy's csc_matrix with a single column. You can create RDD of sparse vectors for your data and save/load them to/from parquet format using dataframes. Sparse matrix supported will be added in 1.4. -Xiangrui On Mon,

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Xiangrui Meng
) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Thanks, Jay On Apr 6, 2015, at 12:24 PM, Xiangrui Meng men...@gmail.com wrote: Please attach the full stack trace. -Xiangrui On Mon, Apr 6, 2015 at 12:06 PM, Jay

Re: MLlib: save models to HDFS?

2015-04-03 Thread Xiangrui Meng
In 1.3, you can use model.save(sc, hdfs path). You can check the code examples here: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples. -Xiangrui On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip yipjus...@prediction.io wrote: Hello Zhou, You can look at the

Re: Need help with ALS Recommendation code

2015-04-05 Thread Xiangrui Meng
Could you try `sbt package` or `sbt compile` and see whether there are errors? It seems that you haven't reached the ALS code yet. -Xiangrui On Sat, Apr 4, 2015 at 5:06 AM, Phani Yadavilli -X (pyadavil) pyada...@cisco.com wrote: Hi , I am trying to run the following command in the Movie

Re: Add row IDs column to data frame

2015-04-05 Thread Xiangrui Meng
Sorry, it should be toDF(text, id). On Sun, Apr 5, 2015 at 9:21 PM, Xiangrui Meng men...@gmail.com wrote: Try: sc.textFile(path/file).zipWithIndex().toDF(id, text) -Xiangrui On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote: What would be the most efficient neat method

Re: Add row IDs column to data frame

2015-04-05 Thread Xiangrui Meng
Try: sc.textFile(path/file).zipWithIndex().toDF(id, text) -Xiangrui On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote: What would be the most efficient neat method to add a column with row ids to dataframe? I can think of something as below, but it completes with errors (at

Re: ML consumption time based on data volume - same cluster

2015-04-07 Thread Xiangrui Meng
This could be empirically verified in spark-perf: https://github.com/databricks/spark-perf. Theoretically, it would be 2x for k-means and logistic regression, because computation is doubled but communication cost remains the same. -Xiangrui On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv

Re: How to implement an Evaluator for a ML pipeline?

2015-05-19 Thread Xiangrui Meng
The documentation needs to be updated to state that higher metric values are better (https://issues.apache.org/jira/browse/SPARK-7740). I don't know why if you negate the return value of the Evaluator you still get the highest regularization parameter candidate. Maybe you should check the log

Re: Word2Vec with billion-word corpora

2015-05-19 Thread Xiangrui Meng
With vocabulary size 4M and 400 vector size, you need 400 * 4M = 16B floats to store the model. That is 64GB. We store the model on the driver node in the current implementation. So I don't think it would work. You might try increasing the minCount to decrease the vocabulary size and reduce the

Re: Discretization

2015-05-19 Thread Xiangrui Meng
Thanks for asking! We should improve the documentation. The sample dataset is actually mimicking the MNIST digits dataset, where the values are gray levels (0-255). So by dividing by 16, we want to map it to 16 coarse bins for the gray levels. Actually, there is a bug in the doc, we should convert

Re: Stratified sampling with DataFrames

2015-05-19 Thread Xiangrui Meng
You need to convert DataFrame to RDD, call sampleByKey, and then apply the schema back to create DataFrame. val df: DataFrame = ... val schema = df.schema val sampledRDD = df.rdd.keyBy(r = r.getAs[Int](0)).sampleByKey(...).values val sampled = sqlContext.createDataFrame(sampledRDD, schema)

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-19 Thread Xiangrui Meng
: MyDenseVectorUDT do exist in the assembly jar and in this example all the code is in a single file to make sure every thing is included. On Tue, Apr 21, 2015 at 1:17 AM, Xiangrui Meng men...@gmail.com wrote: You should check where MyDenseVectorUDT is defined and whether it was on the classpath

Re: Increase maximum amount of columns for covariance matrix for principal components

2015-05-19 Thread Xiangrui Meng
We use a dense array to store the covariance matrix on the driver node. So its length is limited by the integer range, which is 65536 * 65536 (actually half). -Xiangrui On Wed, May 13, 2015 at 1:57 AM, Sebastian Alfers sebastian.alf...@googlemail.com wrote: Hello, in order to compute a huge

Re: k-means core function for temporal geo data

2015-05-19 Thread Xiangrui Meng
I'm not sure whether k-means would converge with this customized distance measure. You can list (weighted) time as a feature along with coordinates, and then use Euclidean distance. For other supported distance measures, you can check Derrick's package:

Re: spark mllib kmeans

2015-05-19 Thread Xiangrui Meng
Just curious, what distance measure do you need? -Xiangrui On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: take a look at this https://github.com/derrickburns/generalized-kmeans-clustering Best, Jao On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko

Re: LogisticRegressionWithLBFGS with large feature set

2015-05-19 Thread Xiangrui Meng
For ML applications, the best setting to set the number of partitions to match the number of cores to reduce shuffle size. You have 3072 partitions but 128 executors, which causes the overhead. For the MultivariateOnlineSummarizer, we plan to add flags to specify what need to be computed to reduce

Re: Find KNN in Spark SQL

2015-05-19 Thread Xiangrui Meng
Spark SQL doesn't provide spatial features. Large-scale KNN is usually combined with locality-sensitive hashing (LSH). This Spark package may be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash. -Xiangrui On Sat, May 9, 2015 at 9:25 PM, Dong Li lid...@lidong.net.cn wrote: Hello

Re: question about customize kmeans distance measure

2015-05-19 Thread Xiangrui Meng
MLlib only supports Euclidean distance for k-means. You can find Bregman divergence support in Derrick's package: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. Which distance measure do you want to use? -Xiangrui On Tue, May 12, 2015 at 7:23 PM, June

<    1   2   3   4   5   >