Re: Help understanding the FP-Growth algorithm

2015-04-14 Thread Xiangrui Meng
If you want to see an example that calls MLlib's FPGrowth, you can find them under the examples/ folder: Scala: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/FPGrowthExample.scala, Java:

Re: feature scaling in GeneralizedLinearAlgorithm.scala

2015-04-13 Thread Xiangrui Meng
Correct. Prediction doesn't touch that code path. -Xiangrui On Mon, Apr 13, 2015 at 9:58 AM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, In the GeneralizedLinearAlgorithm, which Logistic Regression relies on, it says if useFeatureScaling is enabled, we will standardize the training

Re: ML consumption time based on data volume - same cluster

2015-04-07 Thread Xiangrui Meng
This could be empirically verified in spark-perf: https://github.com/databricks/spark-perf. Theoretically, it would be 2x for k-means and logistic regression, because computation is doubled but communication cost remains the same. -Xiangrui On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Xiangrui Meng
tried passing the spark-sql jar using the -jar spark-sql_2.11-1.3.0.jar Thanks, Jay On Mar 17, 2015, at 12:50 PM, Xiangrui Meng men...@gmail.com wrote: Please remember to copy the user list next time. I might not be able to respond quickly. There are many others who can help or who can

Re: java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint

2015-04-06 Thread Xiangrui Meng
Did you try to treat RDD[(Double, Vector)] as RDD[LabeledPoint]? If that is the case, you need to convert them explicitly: rdd.map { case (label, features) => LabeledPoint(label, features) } -Xiangrui On Mon, Apr 6, 2015 at 11:59 AM, Joanne Contact joannenetw...@gmail.com wrote: Hello Sparkers,

Re: DataFrame -- help with encoding factor variables

2015-04-06 Thread Xiangrui Meng
Before OneHotEncoder or LabelIndexer is merged, you can define a UDF to do the mapping. val labelToIndex = udf { ... } featureDF.withColumn("f3_dummy", labelToIndex(col("f3"))) See instructions here

Re: How to work with sparse data in Python?

2015-04-06 Thread Xiangrui Meng
We support sparse vectors in MLlib, which recognizes MLlib's sparse vector and SciPy's csc_matrix with a single column. You can create an RDD of sparse vectors for your data and save/load them to/from parquet format using dataframes. Sparse matrix support will be added in 1.4. -Xiangrui On Mon,

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Xiangrui Meng
) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Thanks, Jay On Apr 6, 2015, at 12:24 PM, Xiangrui Meng men...@gmail.com wrote: Please attach the full stack trace. -Xiangrui On Mon, Apr 6, 2015 at 12:06 PM, Jay

Re: Need help with ALS Recommendation code

2015-04-05 Thread Xiangrui Meng
Could you try `sbt package` or `sbt compile` and see whether there are errors? It seems that you haven't reached the ALS code yet. -Xiangrui On Sat, Apr 4, 2015 at 5:06 AM, Phani Yadavilli -X (pyadavil) pyada...@cisco.com wrote: Hi , I am trying to run the following command in the Movie

Re: Add row IDs column to data frame

2015-04-05 Thread Xiangrui Meng
Sorry, it should be toDF("text", "id"). On Sun, Apr 5, 2015 at 9:21 PM, Xiangrui Meng men...@gmail.com wrote: Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote: What would be the most efficient neat method
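The corrected column order follows from what zipWithIndex returns: each element is paired as (value, index). Below is a minimal local sketch that uses a plain Scala Seq in place of an RDD, so it runs without Spark; the sample lines are invented for illustration. RDD.zipWithIndex pairs elements the same way, except that its index is a Long.

```scala
object ZipWithIndexOrder {
  // Stand-in for the lines of a text file; the contents are invented.
  val lines: Seq[String] = Seq("first line", "second line", "third line")

  // Each element is paired with its position: (text, id), not (id, text).
  val indexed: Seq[(String, Int)] = lines.zipWithIndex

  def main(args: Array[String]): Unit =
    indexed.foreach { case (text, id) => println(s"$id\t$text") }
}
```

Because the text comes first in each pair, the column names have to be given in that order as well.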

Re: Add row IDs column to data frame

2015-04-05 Thread Xiangrui Meng
Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote: What would be the most efficient neat method to add a column with row ids to dataframe? I can think of something as below, but it completes with errors (at

Re: MLlib: save models to HDFS?

2015-04-03 Thread Xiangrui Meng
In 1.3, you can use model.save(sc, "hdfs path"). You can check the code examples here: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples. -Xiangrui On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip yipjus...@prediction.io wrote: Hello Zhou, You can look at the

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Xiangrui Meng
I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1,

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-02 Thread Xiangrui Meng
:18 PM, Xiangrui Meng men...@gmail.com wrote: I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-04-01 Thread Xiangrui Meng
Ravi, we just merged https://issues.apache.org/jira/browse/SPARK-6642 and used the same lambda scaling as in 1.2. The change will be included in Spark 1.3.1, which will be released soon. Thanks for reporting this issue! -Xiangrui On Tue, Mar 31, 2015 at 8:53 PM, Xiangrui Meng men...@gmail.com

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Xiangrui Meng
as the input grew anyway. So, basically I don't know anything more than you do, sorry! On Tue, Mar 31, 2015 at 10:41 PM, Xiangrui Meng men...@gmail.com wrote: Hey Sean, That is true for explicit model, but not for implicit. The ALS-WR paper doesn't cover the implicit model. In implicit formulation

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread Xiangrui Meng
to solve my problem? Sendong Li On Mar 31, 2015, at 12:11 AM, Xiangrui Meng men...@gmail.com wrote: setCheckpointInterval was added in the current master and branch-1.3. Please help check whether it works. It will be included in the 1.3.1 and 1.4.0 release. -Xiangrui On Mon, Mar 30, 2015 at 7:27 AM

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-03-31 Thread Xiangrui Meng
I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, DataFrame with a user-defined type (here mllib.Vector)

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Xiangrui Meng
it comes to invariance. But FWIW I had always understood the regularization to be multiplied by the number of explicit ratings. On Mon, Mar 30, 2015 at 5:51 PM, Xiangrui Meng men...@gmail.com wrote: Okay, I didn't realize that I changed the behavior of lambda in 1.3. to make it scale

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
Hi Xi, Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k? Best, Xiangrui On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote: Hi Burak, After I added .repartition(sc.defaultParallelism), I can see from the log the partition number is set

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-30 Thread Xiangrui Meng
to get a similar result in 1.3. Sean and Shuo, which approach do you prefer? Do you know any existing work discussing this? Best, Xiangrui On Fri, Mar 27, 2015 at 11:27 AM, Xiangrui Meng men...@gmail.com wrote: This sounds like a bug ... Did you try a different lambda? It would be great if you

Re: kmeans|| in Spark is not real paralleled?

2015-03-30 Thread Xiangrui Meng
This PR updated the k-means|| initialization: https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d, which was included in 1.3.0. It should fix k-means|| initialization with large k. Please create a JIRA for this issue and send me the code and the dataset to produce this

Re: k-means can only run on one executor with one thread?

2015-03-30 Thread Xiangrui Meng
Hey Xi, Have you tried Spark 1.3.0? The initialization happens on the driver node and we fixed an issue with the initialization in 1.3.0. Again, please start with a smaller k, and increase it gradually. Let us know at what k the problem happens. Best, Xiangrui On Sat, Mar 28, 2015 at 3:11 AM,

Re: Setting a custom loss function for GradientDescent

2015-03-30 Thread Xiangrui Meng
You can extend Gradient, e.g., https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala#L266, and use it in GradientDescent:

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
has vectors of 200 dimensions. It is possible people never tested large dimension case. Thanks, David On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng men...@gmail.com wrote: Hi Xi, Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k? Best, Xiangrui

Re: Using ORC input for mllib algorithms

2015-03-27 Thread Xiangrui Meng
This is a PR in review to support ORC via the SQL data source API: https://github.com/apache/spark/pull/3753. You can try pulling that PR and help test it. -Xiangrui On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I use sc.hadoopFile(directory,

Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Xiangrui Meng
Hi Martin, Could you attach the code snippet and the stack trace? The default implementation of some methods uses reflection, which may be the cause. Best, Xiangrui On Wed, Mar 25, 2015 at 3:18 PM, zapletal-mar...@email.cz wrote: Thanks Peter, I ended up doing something similar. I however

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-27 Thread Xiangrui Meng
This sounds like a bug ... Did you try a different lambda? It would be great if you can share your dataset or reproduce this issue on a public dataset. Thanks! -Xiangrui On Thu, Mar 26, 2015 at 7:56 AM, Ravi Mody rmody...@gmail.com wrote: After upgrading to 1.3.0, ALS.trainImplicit() has been

Re: MLlib Spam example gets stuck in Stage X

2015-03-20 Thread Xiangrui Meng
Su, which Spark version did you use? -Xiangrui On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: To get these metrics out, you need to open the driver ui running on port 4040. And in there you will see Stages information and for each stage you can see how much time

Re: High GC time

2015-03-17 Thread Xiangrui Meng
The official guide may help: http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning -Xiangrui On Tue, Mar 17, 2015 at 8:27 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am getting very high GC time in my jobs. For smaller/real-time load, this becomes a real problem.

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-17 Thread Xiangrui Meng
! Thanks, Jay On Mar 16, 2015, at 11:35 AM, Xiangrui Meng men...@gmail.com wrote: Try this: val ratings = purchase.map { line => line.split(',') match { case Array(user, item, rate) => (user.toInt, item.toInt, rate.toFloat) } }.toDF("user", "item", "rate") Doc for DataFrames: http

Re: Garbage stats in Random Forest leaf node?

2015-03-17 Thread Xiangrui Meng
This is the default value (Double.MinValue) for invalid gain: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67 Please ignore it. Maybe we should update `toString` to use scientific notation. -Xiangrui On Mon, Mar

Re: IllegalAccessError in GraphX (Spark 1.3.0 LDA)

2015-03-17 Thread Xiangrui Meng
Please check your classpath and make sure you don't have multiple Spark versions deployed. If the classpath looks correct, please create a JIRA for this issue. Thanks! -Xiangrui On Tue, Mar 17, 2015 at 2:03 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi all, I'm trying to use the new LDA

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-17 Thread Xiangrui Meng
that I needed to bug you :) Jay On Mar 17, 2015, at 11:48 AM, Xiangrui Meng men...@gmail.com wrote: Please check this section in the user guide: http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection You need `import sqlContext.implicits

Re: Garbage stats in Random Forest leaf node?

2015-03-17 Thread Xiangrui Meng
17, 2015, at 11:53 AM, Xiangrui Meng men...@gmail.com wrote: This is the default value (Double.MinValue) for invalid gain: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67 Please ignore it. Maybe we should update

Re: Scaling problem in RandomForest?

2015-03-16 Thread Xiangrui Meng
Try increasing the driver memory. We store trees on the driver node. If maxDepth=20 and numTrees=50, you may need a large driver memory to store all tree models. You might want to start with a smaller maxDepth and then increase it and see whether deep trees really help (vs. the cost). -Xiangrui
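To get a feel for why deep trees strain the driver, here is a small worst-case count, assuming binary trees with the root at depth 0. It is only an upper bound on node counts: real trees are usually far smaller, and the memory taken per node is not estimated here.

```scala
object TreeSizeBound {
  // A binary tree of depth d (root at depth 0) has at most 2^(d+1) - 1 nodes.
  def maxNodesPerTree(maxDepth: Int): Long = (1L << (maxDepth + 1)) - 1

  // Upper bound on the number of nodes in the whole forest.
  def maxNodesInForest(maxDepth: Int, numTrees: Int): Long =
    maxNodesPerTree(maxDepth) * numTrees

  def main(args: Array[String]): Unit = {
    println(maxNodesPerTree(20))      // 2097151
    println(maxNodesInForest(20, 50)) // 104857550
  }
}
```

With maxDepth=20 and numTrees=50 the bound is roughly 10^8 nodes, which is why starting with a smaller maxDepth is a reasonable first step.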

Re: Top rows per group

2015-03-16 Thread Xiangrui Meng
https://issues.apache.org/jira/browse/SPARK-5954 is for this issue and Shuo is working on it. We will first implement topByKey for RDD and then we could add it to DataFrames. -Xiangrui On Mon, Mar 9, 2015 at 9:43 PM, Moss rhoud...@gmail.com wrote: I do have a schemaRDD where I want to group by

Re: Any way to find out feature importance in Spark SVM?

2015-03-16 Thread Xiangrui Meng
You can compute the standard deviations of the training data using Statistics.colStats and then compare them with model coefficients to compute feature importance. -Xiangrui On Fri, Mar 13, 2015 at 11:35 AM, Natalia Connolly natalia.v.conno...@gmail.com wrote: Hello, While running an
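One possible way to combine the two quantities is sketched below in plain Scala, without Spark. The scoring rule |coefficient| * standard deviation is an assumption made for illustration (the message only says to compare them), and the sample standard deviation uses an n - 1 denominator. The object and method names are hypothetical.

```scala
object FeatureImportanceSketch {
  // Column-wise sample standard deviation (n - 1 denominator) of row-major data.
  def columnStdDevs(rows: Seq[Array[Double]]): Array[Double] = {
    val n = rows.length
    val dim = rows.head.length
    val means = Array.tabulate(dim)(j => rows.map(r => r(j)).sum / n)
    Array.tabulate(dim) { j =>
      val sumSq = rows.map(r => math.pow(r(j) - means(j), 2)).sum
      math.sqrt(sumSq / (n - 1))
    }
  }

  // Assumed scoring rule: |coefficient| scaled by the feature's spread.
  def importance(weights: Array[Double], stdDevs: Array[Double]): Array[Double] =
    weights.zip(stdDevs).map { case (w, s) => math.abs(w) * s }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 30.0))
    val scores = importance(Array(-2.0, 0.5), columnStdDevs(rows))
    println(scores.mkString(", "))
  }
}
```

Scaling by the standard deviation matters because a small coefficient on a feature with a large spread can move the margin more than a large coefficient on a nearly constant feature.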

Re: Logistic Regression displays ERRORs

2015-03-16 Thread Xiangrui Meng
Actually, they should be INFO or DEBUG. Line search steps are expected. You can configure log4j.properties to ignore those. A better solution would be reporting this at https://github.com/scalanlp/breeze/issues -Xiangrui On Thu, Mar 12, 2015 at 5:46 PM, cjwang c...@cjwang.us wrote: I am running

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-16 Thread Xiangrui Meng
Try this: val ratings = purchase.map { line => line.split(',') match { case Array(user, item, rate) => (user.toInt, item.toInt, rate.toFloat) } }.toDF("user", "item", "rate") Doc for DataFrames: http://spark.apache.org/docs/latest/sql-programming-guide.html -Xiangrui On Mon, Mar 16, 2015 at 9:08 AM,

Re: MLlib/kmeans newbie question(s)

2015-03-09 Thread Xiangrui Meng
You need to change `== 1` to `== i`. `println(t)` happens on the workers, which may not be what you want. Try the following: noSets.filter(t => model.predict(Utils.featurize(t)) == i).collect().foreach(println) -Xiangrui On Sat, Mar 7, 2015 at 3:20 PM, Pierce Lamb richard.pierce.l...@gmail.com

Re: Can't cache RDD of collaborative filtering on MLlib

2015-03-09 Thread Xiangrui Meng
cache() is lazy. The data is stored into memory after the first time it gets materialized. So the first time you call `predict` after you load the model back from HDFS, it still takes time to load the actual data. The second time will be much faster. Or you can call `userJavaRDD.count()` and

Re: Training Random Forest

2015-03-05 Thread Xiangrui Meng
We don't support warm starts or online updates for decision trees. So if you call train twice, only the second dataset is used for training. -Xiangrui On Thu, Mar 5, 2015 at 12:31 PM, drarse drarse.a...@gmail.com wrote: I am testing the Random Forest in Spark, but I have a question... If I train

Re: how to save Word2VecModel

2015-03-04 Thread Xiangrui Meng
+user On Wed, Mar 4, 2015, 8:21 AM Xiangrui Meng men...@gmail.com wrote: You can use the save/load implementation in naive Bayes as reference: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala Ping me on the JIRA page

Re: gc time too long when using mllib als

2015-03-03 Thread Xiangrui Meng
Also try 1.3.0-RC1 or the current master. ALS should perform much better in 1.3. -Xiangrui On Tue, Mar 3, 2015 at 1:00 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You need to increase the parallelism/repartition the data to a higher number to get rid of those. Thanks Best Regards

Re: UnsatisfiedLinkError related to libgfortran when running MLLIB code on RHEL 5.8

2015-03-03 Thread Xiangrui Meng
libgfortran.x86_64 4.1.2-52.el5_8.1 comes with libgfortran.so.1 but not libgfortran.so.3. JBLAS requires the latter. If you have root access, you can try to install a newer version of libgfortran. Otherwise, maybe you can try Spark 1.3, which doesn't use JBLAS in ALS. -Xiangrui On Tue, Mar 3,

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread Xiangrui Meng
Lisen, did you use all m-by-n pairs during training? Implicit model penalizes unobserved ratings, while explicit model doesn't. -Xiangrui On Feb 26, 2015 6:26 AM, Sean Owen so...@cloudera.com wrote: +user On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen so...@cloudera.com wrote: I think I may

Re: How to augment data to existing MatrixFactorizationModel?

2015-02-26 Thread Xiangrui Meng
It may take some work to do online updates with a MatrixFactorizationModel because you need to update some rows of the user/item factors. You may be interested in spark-indexedrdd (http://spark-packages.org/package/amplab/spark-indexedrdd). We support save/load in Scala/Java. We are going to add

Re: Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread Xiangrui Meng
Try the following: df.map { case Row(id: Int, num: Int, value: Double, x: Float) => /* replace those with your types */ (id, Vectors.dense(num, value, x)) }.toDF("id", "features") -Xiangrui On Thu, Feb 26, 2015 at 3:08 PM, mobsniuk mobsn...@gmail.com wrote: I've been searching around and see others

Re: Reg. KNN on MLlib

2015-02-26 Thread Xiangrui Meng
It is not in MLlib. There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-2336 and Ashutosh has an implementation for integer values. -Xiangrui On Thu, Feb 26, 2015 at 8:18 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Has KNN classification algorithm been implemented on MLlib?

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Xiangrui Meng
Made 3 votes to each of the talks. Looking forward to seeing them at Hadoop Summit :) -Xiangrui On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin r...@databricks.com wrote: Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Xiangrui Meng
If you make `Image` a case class, then select("image.data") should work. On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I have a DataFrame that contains a user defined type. The type is an image with the following attribute class Image(w: Int, h: Int,

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Xiangrui Meng
Btw, the correct syntax for alias should be `df.select($"image.data".as("features"))`. On Tue, Feb 24, 2015 at 3:35 PM, Xiangrui Meng men...@gmail.com wrote: If you make `Image` a case class, then select("image.data") should work. On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com

Re: Efficient way of scoring all items and users in an ALS model

2015-02-23 Thread Xiangrui Meng
You can use rdd.cartesian then find top-k by key to distribute the work to executors. There is a trick to boost the performance: you need to blockify user/product features and then use native matrix-matrix multiplication. There is a relevant PR from Deb: https://github.com/apache/spark/pull/3098 .

Re: shuffle data taking immense disk space during ALS

2015-02-23 Thread Xiangrui Meng
Did you try to use fewer partitions (user/product blocks)? Did you use implicit feedback? In the current implementation, we only do checkpointing with implicit feedback. We should adopt the checkpoint strategy implemented in LDA:

Re: Movie Recommendation tutorial

2015-02-23 Thread Xiangrui Meng
Which Spark version did you use? Btw, there are three datasets from MovieLens. The tutorial used the medium one (1 million). -Xiangrui On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez guilla...@databerries.com wrote: What do you mean? -- View this message in context:

Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Xiangrui Meng
Yes, we are going to expose the developer API. There was a long discussion in the PR: https://github.com/apache/spark/pull/3637. So we marked them package private and look for feedback on how to improve it. Please implement your classes under `spark.ml` for now and let us know your feedback.

Re: Movie Recommendation tutorial

2015-02-23 Thread Xiangrui Meng
= 1.0, and numIter = 20) 1.1.1 - RMSE = 1.335831 (rank = 8 and lambda = 1.0, and numIter = 10) Cheers k/ On Mon, Feb 23, 2015 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: Which Spark version did you use? Btw, there are three datasets from MovieLens. The tutorial used the medium one (1

Re: Pyspark save Decision Tree Module with joblib/pickle

2015-02-23 Thread Xiangrui Meng
FYI, in 1.3 we support save/load tree models in Scala and Java. We will add save/load support to Python soon. -Xiangrui On Mon, Feb 23, 2015 at 2:57 PM, Sebastián Ramírez sebastian.rami...@senseta.com wrote: In your log it says: pickle.PicklingError: Can't pickle type 'thread.lock': it's not

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Xiangrui Meng
Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS performance should be improved in 1.3.0. -Xiangrui On Fri, Feb 20, 2015 at 1:32 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi Ilya, thanks for your insight, this was the right clue. I had default parallelism already

Re: high GC in the Kmeans algorithm

2015-02-20 Thread Xiangrui Meng
huge? On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng men...@gmail.com wrote: Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to avoid frequent GC. Please monitor the memory usage using

Re: Unknown sample in Naive Bayes

2015-02-19 Thread Xiangrui Meng
on this. Thanks, Jatin On Wed, Feb 18, 2015 at 3:07 AM, Xiangrui Meng men...@gmail.com wrote: If there exists a sample that doesn't belong to A/B/C, it means that there exists another class D or Unknown besides A/B/C. You should have some of these samples in the training set in order to let naive

Re: Stepsize with Linear Regression

2015-02-17 Thread Xiangrui Meng
The best step size depends on the condition number of the problem. You can try some conditioning heuristics first, e.g., normalizing the columns, and then try a common step size like 0.01. We should implement line search for linear regression in the future, as in LogisticRegressionWithLBFGS. Line

Re: high GC in the Kmeans algorithm

2015-02-17 Thread Xiangrui Meng
Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to avoid frequent GC. Please monitor the memory usage using YourKit or VisualVM. -Xiangrui On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com

Re: feeding DataFrames into predictive algorithms

2015-02-17 Thread Xiangrui Meng
Hey Sandy, The work should be done by a VectorAssembler, which combines multiple columns (double/int/vector) into a vector column, which becomes the features column for regression. We are going to create JIRAs for each of these standard feature transformers. It would be great if you can help

Re: MLlib usage on Spark Streaming

2015-02-17 Thread Xiangrui Meng
JavaDStream.foreachRDD (https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function)) and Statistics.corr

Re: Large Similarity Job failing

2015-02-17 Thread Xiangrui Meng
The complexity of DIMSUM is independent of the number of rows but still has a quadratic dependency on the number of columns. 1.5M columns may be too large to use DIMSUM. Try to increase the threshold and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das

Re: [POWERED BY] Radius Intelligence

2015-02-17 Thread Xiangrui Meng
Thanks! I added Radius to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. -Xiangrui On Tue, Feb 10, 2015 at 12:02 AM, Alexis Roos alexis.r...@gmail.com wrote: Also long due given our usage of Spark .. Radius Intelligence: URL: radius.com Description: Spark, MLLib Using

Re: Unknown sample in Naive Bayes

2015-02-17 Thread Xiangrui Meng
If there exists a sample that doesn't belong to A/B/C, it means that there exists another class D or Unknown besides A/B/C. You should have some of these samples in the training set in order to let naive Bayes learn the priors. -Xiangrui On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet

Re: Naive Bayes model fails after a few predictions

2015-02-17 Thread Xiangrui Meng
Could you share the error log? What do you mean by 500 instead of 200? If this is the number of files, try to use `repartition` before calling naive Bayes, which works best when the number of partitions matches the number of cores, or is even smaller. -Xiangrui On Tue, Feb 10, 2015 at 10:34 PM,

Re: WARN from Similarity Calculation

2015-02-17 Thread Xiangrui Meng
It may be caused by GC pause. Did you check the GC time in the Spark UI? -Xiangrui On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing

Re: MLLib: feature standardization

2015-02-09 Thread Xiangrui Meng
`mean()` and `variance()` are not defined in `Vector`. You can use the mean and variance implementation from commons-math3 (http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html) if you don't want to implement them. -Xiangrui On Fri, Feb 6, 2015 at 12:50 PM, SK
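If pulling in commons-math3 is not wanted, the two statistics are short to write by hand. The sketch below works on a plain Array[Double] (for an MLlib vector, `toArray` gives such an array); the object name is hypothetical, and the variance uses the sample (n - 1) denominator, which is a choice made here rather than something the thread specifies.

```scala
object VectorStats {
  // Arithmetic mean of the entries.
  def mean(values: Array[Double]): Double = values.sum / values.length

  // Sample variance (n - 1 denominator); divide by n for population variance.
  def variance(values: Array[Double]): Double = {
    val m = mean(values)
    values.map(v => (v - m) * (v - m)).sum / (values.length - 1)
  }

  def main(args: Array[String]): Unit = {
    val v = Array(1.0, 2.0, 3.0, 4.0)
    println(s"mean=${mean(v)} variance=${variance(v)}")
  }
}
```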

Re: Number of goals to win championship

2015-02-09 Thread Xiangrui Meng
Logistic regression outputs probabilities if the data fits the model assumption. Otherwise, you might need to calibrate its output to correctly read it. You may be interested in reading this: http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/. We have isotonic

Re: no option to add intercepts for StreamingLinearAlgorithm

2015-02-09 Thread Xiangrui Meng
No particular reason. We didn't add it in the first version. Let's add it in 1.4. -Xiangrui On Thu, Feb 5, 2015 at 3:44 PM, jamborta jambo...@gmail.com wrote: hi all, just wondering if there is a reason why it is not possible to add intercepts for streaming regression models? I understand

Re: [MLlib] Performance issues when building GBM models

2015-02-09 Thread Xiangrui Meng
Could you check the Spark UI and see whether there are RDDs being kicked out during the computation? We cache the residual RDD after each iteration. If we don't have enough memory/disk, it gets recomputed and results in something like `t(n) = t(n-1) + const`. We might cache the features multiple
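Reading t(n) as the time of the n-th iteration and c as the constant (an interpretation of the recurrence above, not something the thread spells out), the per-iteration time grows linearly and the total time over N iterations grows quadratically:

```latex
\begin{align}
t(n) &= t(n-1) + c \;\Longrightarrow\; t(n) = t(1) + (n-1)\,c \\
T(N) &= \sum_{n=1}^{N} t(n) = N\,t(1) + c\,\frac{N(N-1)}{2}
\end{align}
```

So doubling the number of iterations would roughly quadruple the total time once the second term dominates, which is the symptom to look for when cached RDDs are being evicted.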

Re: word2vec more distributed

2015-02-09 Thread Xiangrui Meng
The C implementation of Word2Vec updates the model using multiple threads without locking. It is hard to implement it in a distributed way. In the MLlib implementation, each worker holds the entire model in memory and outputs the part of the model that gets updated. The driver still needs to collect and

Re: MLlib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-20 Thread Xiangrui Meng
The assumption of implicit feedback model is that the unobserved ratings are more likely to be negative. So you may want to add some negatives for evaluation. Otherwise, the input ratings are all 1 and the test ratings are all 1 as well. The baseline predictor, which uses the average rating (that

Re: How to create distributed matrixes from hive tables.

2015-01-20 Thread Xiangrui Meng
You can get a SchemaRDD from the Hive table, map it into an RDD of Vectors, and then construct a RowMatrix. The transformations are lazy, so there is no external storage requirement for intermediate data. -Xiangrui On Sun, Jan 18, 2015 at 4:07 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, We

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Xiangrui Meng
You can save the cluster centers as a SchemaRDD of two columns (id: Int, center: Array[Double]). When you load it back, you can construct the k-means model from its cluster centers. -Xiangrui On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian lian.cs@gmail.com wrote: This is because KMeansModel is
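The round trip can be pictured with a local stand-in, assuming the stored rows have the (id, center) shape described above. No Spark classes are used here; the object, the sample centers, and the `predict` helper are hypothetical and only mimic what a restored k-means model would do.

```scala
object CentersRoundTrip {
  // Rows as they might be stored: (id, center). Values are invented.
  val saved: Seq[(Int, Array[Double])] = Seq(
    (1, Array(10.0, 10.0)),
    (0, Array(0.0, 0.0))
  )

  // Restore the centers in id order so that cluster indices stay stable.
  val centers: Array[Array[Double]] = saved.sortBy(_._1).map(_._2).toArray

  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the nearest restored center.
  def predict(point: Array[Double]): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), point))

  def main(args: Array[String]): Unit =
    println(predict(Array(9.0, 8.0)))
}
```

Sorting by id before rebuilding matters because rows may come back in any order, and the cluster index returned by prediction is just the position of the center.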

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-14 Thread Xiangrui Meng
. Best, Xiangrui On Wed, Jan 14, 2015 at 1:04 PM, Nishanth P S nishant...@gmail.com wrote: Yes, we are close to having more than 2 billion users. In this case, what is the best way to handle this? Thanks, Nishanth On Fri, Jan 9, 2015 at 9:50 PM, Xiangrui Meng men...@gmail.com wrote: Do you have

Re: Using a RowMatrix inside a map

2015-01-14 Thread Xiangrui Meng
Yes, you can only use RowMatrix.multiply() within the driver. We are working on distributed block matrices and linear algebra operations on top of it, which would fit your use cases well. It may take several PRs to finish. You can find the first one here: https://github.com/apache/spark/pull/3200

Re: Discrepancy in PCA values

2015-01-12 Thread Xiangrui Meng
values given by Spark and other two. Thanks, Upul On Sat, Jan 10, 2015 at 11:17 AM, Xiangrui Meng men...@gmail.com wrote: You need to subtract mean values to obtain the covariance matrix (http://en.wikipedia.org/wiki/Covariance_matrix). On Fri, Jan 9, 2015 at 6:41 PM, Upul Bandara upulband

Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Xiangrui Meng
): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8 Just calling colStats doesn't actually compute those statistics, does it? It looks like the computation is only carried out once you call the .mean() method. On Sat, Jan 10, 2015 at 7:04 AM, Xiangrui Meng men...@gmail.com wrote

Re: including the spark-mllib in build.sbt

2015-01-12 Thread Xiangrui Meng
I don't know the root cause. Could you try including only libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1" It should be sufficient because mllib depends on core. -Xiangrui On Mon, Jan 12, 2015 at 2:27 PM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I am trying to build my

Re: OptionalDataException during Naive Bayes Training

2015-01-09 Thread Xiangrui Meng
How big is your data? Did you see other error messages from executors? It seems to me like a shuffle communication error. This thread may be relevant: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccalrnvjuvtgae_ag1rqey_cod1nmrlfpesxgsb7g8r21h0bm...@mail.gmail.com%3E

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-09 Thread Xiangrui Meng
Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer (check RDD.zipWithUniqueId), use them in ALS, and then join the original bigInt IDs back after training. -Xiangrui On Fri, Jan 9, 2015 at 5:12 PM, nishanthps nishant...@gmail.com wrote: Hi,
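The mapping idea can be sketched locally with plain Scala collections standing in for RDDs. The ids and ratings are invented, and the object and helper names are hypothetical. On a real RDD, zipWithUniqueId returns Long values that are not necessarily consecutive, so they would need to be checked against the Int range before use.

```scala
object IdMappingSketch {
  // Original ids that do not fit in an Int (values invented for illustration).
  val ratings: Seq[(BigInt, BigInt, Double)] = Seq(
    (BigInt("90000000000"), BigInt("70000000001"), 5.0),
    (BigInt("90000000001"), BigInt("70000000001"), 3.0),
    (BigInt("90000000000"), BigInt("70000000002"), 1.0)
  )

  // Assign each distinct id a compact Int, in order of first appearance.
  def index(ids: Seq[BigInt]): Map[BigInt, Int] = ids.distinct.zipWithIndex.toMap

  val userIndex: Map[BigInt, Int] = index(ratings.map(_._1))
  val productIndex: Map[BigInt, Int] = index(ratings.map(_._2))

  // Int-keyed ratings, the shape an Int-based recommender expects.
  val encoded: Seq[(Int, Int, Double)] =
    ratings.map { case (u, p, r) => (userIndex(u), productIndex(p), r) }

  // Reverse lookup to restore the original user ids after training.
  val userById: Map[Int, BigInt] = userIndex.map(_.swap)

  def main(args: Array[String]): Unit = encoded.foreach(println)
}
```

In a distributed setting the two maps would themselves be pair RDDs, and the encode and decode steps would be joins rather than in-memory lookups.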

Re: Discrepancy in PCA values

2015-01-09 Thread Xiangrui Meng
Thanks, Upul On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng men...@gmail.com wrote: The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote: Hi All, I tried

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Xiangrui Meng
is there an easy/obvious fix? On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng men...@gmail.com wrote: There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote

Re: Zipping RDDs of equal size not possible

2015-01-09 Thread Xiangrui Meng
sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out How do you guarantee that the two RDDs have the same size? -Xiangrui On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke 1wil...@informatik.uni-hamburg.de wrote: Hi Spark community, I have

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-09 Thread Xiangrui Meng
exception On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote: Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote: Hi All, Word2Vec and TF-IDF algorithms

Re: Discrepancy in PCA values

2015-01-08 Thread Xiangrui Meng
The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote: Hi All, I tried to do PCA for the Iris dataset [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib
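
To make the Gram-versus-covariance distinction concrete, here is a small plain-Python sketch (no Spark, no Julia). The function names and the toy data are made up for illustration. The only difference between the two matrices is that the covariance is computed from mean-centered columns and scaled by 1/(n - 1).

```python
def gram_matrix(rows):
    """Return A^T A for a row-major matrix, with no centering."""
    n_cols = len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
        for i in range(n_cols)
    ]


def covariance_matrix(rows):
    """Return the sample covariance: center each column, then A^T A / (n - 1)."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    centered = [[r[j] - means[j] for j in range(n_cols)] for r in rows]
    return [[v / (n - 1) for v in row] for row in gram_matrix(centered)]


data = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
gram = gram_matrix(data)        # uses the raw values
cov = covariance_matrix(data)   # uses the values after subtracting column means
```

On this toy data the two matrices differ by much more than the 1/(n - 1) scale factor, so a decomposition of one is not a decomposition of the other. The results agree only when the column means are already zero.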

Re: calculating the mean of SparseVector RDD

2015-01-07 Thread Xiangrui Meng
There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote: I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector.
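
For reference, the quantity being asked for can be sketched in plain Python as follows. Sparse vectors are represented here as {index: value} dicts, which is an assumption made for the example and not the MLlib SparseVector type. The helper sparse_mean is hypothetical.

```python
def sparse_mean(vectors, size):
    """Column-wise mean of sparse vectors, returned as a dense list.

    `vectors` is an iterable of {index: value} dicts, each representing
    a vector of dimension `size`. Missing indices count as zero.
    """
    totals = [0.0] * size
    count = 0
    for vec in vectors:
        # Only the stored (non-zero) entries are touched.
        for idx, val in vec.items():
            totals[idx] += val
        count += 1
    if count == 0:
        raise ValueError("cannot take the mean of zero vectors")
    return [t / count for t in totals]


vectors = [{0: 2.0, 3: 4.0}, {1: 6.0}, {0: 4.0, 3: 2.0}]
means = sparse_mean(vectors, size=4)
```

The accumulation only visits stored entries, so the cost scales with the number of non-zeros and not with the full dimension. The result is still dense, since any column may end up with a non-zero mean.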

Re: MLLIB and Openblas library in non-default dir

2015-01-06 Thread Xiangrui Meng
script Am I correct? On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng men...@gmail.com wrote: It might be hard to do that with spark-submit, because the executor JVMs may be already up and running before a user runs spark-submit. You can try to use `System.setProperty` to change the property

Re: [MLLib] storageLevel in ALS

2015-01-06 Thread Xiangrui Meng
Which Spark version are you using? We made this configurable in 1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202 -Xiangrui On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. fot...@gmail.com wrote: Hi, I was doing a tests

Re: confidence/probability for prediction in MLlib

2015-01-06 Thread Xiangrui Meng
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789. In the new pipeline API, we can simply output two columns, one for the best predicted class, and the other for probabilities or confidence scores for each class. -Xiangrui On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-06 Thread Xiangrui Meng
Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote: Hi All, Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in local mode and not in distributed mode. Null

Re: MLLIB and Openblas library in non-default dir

2015-01-05 Thread Xiangrui Meng
It might be hard to do that with spark-submit, because the executor JVMs may be already up and running before a user runs spark-submit. You can try to use `System.setProperty` to change the property at runtime, though it doesn't seem to be a good solution. -Xiangrui On Fri, Jan 2, 2015 at 6:28

Re: python API for gradient boosting?

2015-01-05 Thread Xiangrui Meng
I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-5094. Hopefully someone will work on it and make it available in the 1.3 release. -Xiangrui On Sun, Jan 4, 2015 at 6:58 PM, Christopher Thom christopher.t...@quantium.com.au wrote: Hi, I wonder if anyone knows when a

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Xiangrui Meng
How big is your dataset, and what is the vocabulary size? -Xiangrui On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi, When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU usage. Here is the jstack output: main prio=10 tid=0x40112800

Re: Using TF-IDF from MLlib

2014-12-29 Thread Xiangrui Meng
Hopefully the new pipeline API addresses this problem. We have a code example here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala -Xiangrui On Mon, Dec 29, 2014 at 5:22 AM, andy petrella
