Re: KMeansModel Constructor error

2014-07-15 Thread Xiangrui Meng
I don't think MLlib supports model serialization/deserialization. You got the error because the constructor is private. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2488 and we'll try to make sure it is implemented in v1.1. For now, you can modify the KMeansModel and remove
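For reference, a minimal Scala sketch of the workaround (not an official MLlib API): persist the cluster centers yourself and assign points to the nearest center by squared Euclidean distance, which avoids the private constructor entirely. The helper names and the comma-separated encoding are assumptions.

~~~
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeansModel

// Save each center as one comma-separated line; the encoding is arbitrary.
def saveCenters(sc: SparkContext, model: KMeansModel, path: String): Unit =
  sc.parallelize(model.clusterCenters.map(_.toArray.mkString(","))).saveAsTextFile(path)

def loadCenters(sc: SparkContext, path: String): Array[Array[Double]] =
  sc.textFile(path).map(_.split(",").map(_.toDouble)).collect()

// Nearest-center assignment, standing in for KMeansModel.predict.
def predict(centers: Array[Array[Double]], point: Array[Double]): Int =
  centers.indices.minBy { i =>
    centers(i).zip(point).map { case (c, x) => (c - x) * (c - x) }.sum
  }
~~~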

Re: Error when testing with large sparse svm

2014-07-15 Thread Xiangrui Meng
crater, was the error message the same as what you posted before: 14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote Akka client disassociated 14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0) 14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on

Re: akka disassociated on GC

2014-07-16 Thread Xiangrui Meng
, 2014 at 10:48 PM, Makoto Yui yuin...@gmail.com wrote: Hello, (2014/06/19 23:43), Xiangrui Meng wrote: The execution was slow for the larger KDD Cup 2012, Track 2 dataset (235M+ records of 16.7M+ (2^24) sparse features in about 33.6GB) due to the sequential aggregation of dense vectors

Re: Error when testing with large sparse svm

2014-07-16 Thread Xiangrui Meng
Then it may be a new issue. Do you mind creating a JIRA to track this issue? It would be great if you can help locate the line in BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui On Tue, Jul 15, 2014 at 10:56 PM, crater cq...@ucmerced.edu wrote: I don't really have my code,

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
Check the number of inodes (df -i). The assembly build may create many small files. -Xiangrui On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote: Hi all, I am encountering the following error: INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
driver application Chris On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote: Check the number of inodes (df -i). The assembly build may create many small files. -Xiangrui On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote: Hi all, I am

Re: Kmeans

2014-07-17 Thread Xiangrui Meng
... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H/P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com On Thursday, July 17, 2014 11:57 AM, Xiangrui Meng men...@gmail.com wrote: kmeans.py contains a naive

Re: MLLib - Regularized logistic regression in python

2014-07-17 Thread Xiangrui Meng
1) This is a miss, unfortunately ... We will add support for regularization and intercept in the coming v1.1. (JIRA: https://issues.apache.org/jira/browse/SPARK-2550) 2) It has overflow problems in Python but not in Scala. We can stabilize the computation by ensuring exp only takes a negative

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g and repartition your data to match the number of CPU cores so that the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store the data. Make sure there is enough memory for caching. -Xiangrui On Thu, Jul 17, 2014 at

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
to less than 200 seconds from over 700 seconds. I am not sure how to repartition the data to match the CPU cores. How do I do it? Thank you. Ravi On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote: Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g

Re: Large scale ranked recommendation

2014-07-18 Thread Xiangrui Meng
Nick's suggestion is a good approach for your data. The item factors to broadcast should be a few MBs. -Xiangrui On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux decho...@gmail.com wrote: And you might want to apply clustering before. It is likely that every user and every item are not

Re: Python: saving/reloading RDD

2014-07-18 Thread Xiangrui Meng
You can save RDDs to text files using RDD.saveAsTextFile and load them back using sc.textFile. But make sure the record-to-string conversion is correctly implemented if the type is not primitive, and that you have a parser to load the records back. -Xiangrui On Jul 18, 2014, at 8:39 AM, Roch Denis
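The thread is about Python, but the round trip is the same in any language; here is a hedged Scala sketch for an RDD[LabeledPoint] named `labeled`, where the comma/space encoding and the path are arbitrary choices.

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Encode: label, then space-separated feature values.
labeled.map(p => s"${p.label},${p.features.toArray.mkString(" ")}")
  .saveAsTextFile("/tmp/points")

// Decode with a matching parser.
val reloaded = sc.textFile("/tmp/points").map { line =>
  val Array(label, feats) = line.split(",", 2)
  LabeledPoint(label.toDouble, Vectors.dense(feats.split(" ").map(_.toDouble)))
}
~~~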

Re: Large Task Size?

2014-07-20 Thread Xiangrui Meng
It was because of the latest change to task serialization: https://github.com/apache/spark/commit/1efb3698b6cf39a80683b37124d2736ebf3c9d9a The task size is no longer limited by akka.frameSize but we show warning messages if the task size is above 100KB. Please check the objects referenced in the

Re: LabeledPoint with weight

2014-07-21 Thread Xiangrui Meng
This is a useful feature but it may be hard to have it in v1.1 due to limited time. Hopefully, we can support it in v1.2. -Xiangrui On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen chenjiush...@gmail.com wrote: It seems MLlib right now doesn't support weighted training, training samples have

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-21 Thread Xiangrui Meng
You can also try a different region. I tested us-west-2 yesterday, and it worked well. -Xiangrui On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Actually the script in the master branch is also broken (it's pointing to an older AMI). Try 1.0.1 for launching

Re: error from DecisonTree Training:

2014-07-21 Thread Xiangrui Meng
This is a known issue: https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working on it. -Xiangrui On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang j...@uow.edu.au wrote: So this is a bug unsolved (for java) yet? From: Jack Yang [mailto:j...@uow.edu.au] Sent: Friday, 18 July 2014 4:52

Re: akka disassociated on GC

2014-07-23 Thread Xiangrui Meng
patch, the evaluation has been processed successfully. I expect that these patches will be merged in the next major release (v1.1?). Without them, it would be hard to use mllib for a large dataset. Thanks, Makoto (2014/07/16 15:05), Xiangrui Meng wrote: Hi Makoto, I don't remember I wrote

Re: spark github source build error

2014-07-23 Thread Xiangrui Meng
Try `sbt/sbt clean` first? -Xiangrui On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma sharm...@umn.edu wrote: I am trying to build spark after cloning from the github repo. I am executing: ./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly I am getting the following error: [warn]

Announcing Spark 0.9.2

2014-07-23 Thread Xiangrui Meng
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is a maintenance release with bug fixes across several areas of Spark, including Spark Core, PySpark, MLlib, Streaming, and GraphX. We recommend that all 0.9.x users upgrade to this stable release. Contributions to this release came

Re: NMF implementaion is Spark

2014-07-25 Thread Xiangrui Meng
It is ALS with setNonnegative. -Xiangrui On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Is there an implementation for Nonnegative Matrix Factorization in Spark? I understand that MLlib comes with matrix factorization, but it does not seem to cover the

Re: Questions about disk IOs

2014-07-25 Thread Xiangrui Meng
a shutdown On Jul 2, 2014, at 0:08, Xiangrui Meng men...@gmail.com wrote: Try to reduce number of partitions to match the number of cores. We will add treeAggregate to reduce the communication cost. PR: https://github.com/apache/spark/pull/1110 -Xiangrui On Tue, Jul 1, 2014 at 12:55 AM, Charles Li

Re: Kmeans: set initial centers explicitly

2014-07-27 Thread Xiangrui Meng
I think this is nice to have. Feel free to create a JIRA for it and it would be great if you can send a PR. Thanks! -Xiangrui On Thu, Jul 24, 2014 at 12:39 PM, SK skrishna...@gmail.com wrote: Hi, The mllib.clustering.kmeans implementation supports a random or parallel initialization mode to

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
1. I meant in the n (1k) by m (10k) case, we need to broadcast k centers and hence the total size is m * k. In 1.0, the driver needs to send the current centers to each partition one by one. In the current master, we use torrent to broadcast the centers to workers, which should be much faster. 2.

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
Great! Thanks for testing the new features! -Xiangrui On Mon, Jul 28, 2014 at 8:58 PM, durin m...@simon-schaefer.net wrote: Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes

Re: evaluating classification accuracy

2014-07-28 Thread Xiangrui Meng
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and master. If you don't want to switch to 1.0.1 or master, try to cache and count test first. -Xiangrui On Mon, Jul 28, 2014 at 6:07 PM, SK skrishna...@gmail.com wrote: Hi, In order to evaluate the ML classification accuracy, I am

Re: Reading hdf5 formats with pyspark

2014-07-28 Thread Xiangrui Meng
That looks good to me since there is no Hadoop InputFormat for HDF5. But remember to specify the number of partitions in sc.parallelize to use all the nodes. You can change `process` to `read` which yields records one-by-one. Then sc.parallelize(files, numPartitions).flatMap(read) returns an RDD

Re: SPARK OWLQN Exception: Iteration Stage is so slow

2014-07-29 Thread Xiangrui Meng
Do you mind sharing more details, for example, specs of nodes and data size? -Xiangrui 2014-07-29 2:51 GMT-07:00 John Wu j...@zamplus.com: Hi all, There is a problem we can't resolve. We implemented the OWLQN algorithm in parallel with Spark, and we don't know why it is very slow in every

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
Before torrent, HTTP was the default way of broadcasting. The driver holds the data and the executors request the data via HTTP, making the driver the bottleneck if the data is large. -Xiangrui On Tue, Jul 29, 2014 at 10:32 AM, durin m...@simon-schaefer.net wrote: Development is really rapid

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
Could you share more details about the dataset and the algorithm? For example, if the dataset has 10M+ features, it may be slow for the driver to collect the weights from executors (just a blind guess). -Xiangrui On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim unname...@gmail.com wrote: Hi, all

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
The weight vector is usually dense and if you have many partitions, the driver may slow down. You can also take a look at the driver memory inside the Executor tab in WebUI. Another setting to check is the HDFS block size and whether the input data is evenly distributed to the executors. Are the

Re: why a machine learning application run slowly on the spark cluster

2014-07-30 Thread Xiangrui Meng
the total number of tasks dropped from 200 to 64 after the first stage, just like this: But I don't know if this is reasonable. On Wed, Jul 30, 2014 at 2:11 PM, Xiangrui Meng men...@gmail.com wrote: After you load the data in, call `.repartition(number of executors).cache()`. If the data is evenly

Re: about spark and using machine learning model

2014-08-04 Thread Xiangrui Meng
Some extra work is needed to close the loop. One related example is streaming linear regression added by Jeremy very recently: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala You can use a model trained offline

Re: fail to run LBFGS on 5G KDD data in spark 1.0.1?

2014-08-06 Thread Xiangrui Meng
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver? I think the problem is with the feature dimension. KDD data has more than 20M features and in v1.0.1, the driver collects the partial gradients one by one, sums them up, does the update, and then sends the new weights back

Re: Naive Bayes parameters

2014-08-07 Thread Xiangrui Meng
It is used in data loading: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76 On Thu, Aug 7, 2014 at 12:47 AM, SK skrishna...@gmail.com wrote: I followed the example in

Re: Regularization parameters

2014-08-07 Thread Xiangrui Meng
Then this may be a bug. Do you mind sharing the dataset so that we can use it to reproduce the problem? -Xiangrui On Thu, Aug 7, 2014 at 1:20 AM, SK skrishna...@gmail.com wrote: Spark 1.0.1 thanks

Re: Low Performance of Shark over Spark.

2014-08-07 Thread Xiangrui Meng
Did you cache the table? There are a couple of ways of caching a table in Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide On Thu, Aug 7, 2014 at 6:51 AM, vinay.kash...@socialinfra.net wrote: Dear all, I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0. 6 worker

Re: questions about MLLib recommendation models

2014-08-07 Thread Xiangrui Meng
ratings.map { case Rating(u, m, r) => { val pred = model.predict(u, m); (r - pred) * (r - pred) } }.mean() The code doesn't work because the userFeatures and productFeatures stored in the model are RDDs. You tried to serialize them into the task closure, and execute `model.predict` on an
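The supported pattern (as in the MLlib ALS examples) calls `model.predict` on an RDD of (user, product) pairs from the driver instead of inside a closure; a sketch, assuming `ratings: RDD[Rating]` and `model: MatrixFactorizationModel`:

~~~
import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3)
import org.apache.spark.mllib.recommendation.Rating

val usersProducts = ratings.map { case Rating(u, m, _) => (u, m) }
val predictions = model.predict(usersProducts)
  .map { case Rating(u, m, pred) => ((u, m), pred) }
val mse = ratings.map { case Rating(u, m, r) => ((u, m), r) }
  .join(predictions)
  .map { case (_, (r, pred)) => (r - pred) * (r - pred) }
  .mean()
~~~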

Re: KMeans Input Format

2014-08-07 Thread Xiangrui Meng
Besides durin's suggestion, please also confirm driver and executor memory in the WebUI, since they are small according to the log: 14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB) -Xiangrui

Re: Spark: Could not load native gpl library

2014-08-07 Thread Xiangrui Meng
Is the GPL library only available on the driver node? If that is the case, you need to add it to the `--jars` option of spark-submit. -Xiangrui On Thu, Aug 7, 2014 at 6:59 PM, Jikai Lei hangel...@gmail.com wrote: I had the following error when trying to run a very simple spark job (which uses

Re: Where do my partitions go?

2014-08-08 Thread Xiangrui Meng
They are two different RDDs. Spark doesn't guarantee that the first partition of RDD1 and the first partition of RDD2 will stay on the same worker node. If it did, then with 1000 single-partition RDDs the first worker would carry a very heavy load. -Xiangrui On Thu, Aug 7, 2014 at 2:20

Re: scopt.OptionParser

2014-08-08 Thread Xiangrui Meng
Thanks for posting the solution! You can also append `% provided` to the `spark-mllib` dependency line and remove `spark-core` (because spark-mllib already depends on spark-core) to make the assembly jar smaller. -Xiangrui On Fri, Aug 8, 2014 at 10:05 AM, SK skrishna...@gmail.com wrote: i was
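For reference, a build.sbt sketch of this suggestion; the version number is illustrative.

~~~
// Mark spark-mllib as provided; it pulls in spark-core transitively,
// and neither will be bundled into the assembly jar.
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.0.2" % "provided"
~~~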

Re: Does Spark 1.0.1 stil collect results in serial???

2014-08-08 Thread Xiangrui Meng
For the reduce/aggregate question, driver collects results in sequence. We now use tree aggregation in MLlib to reduce driver's load: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89 It is faster than aggregate when there are many
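A hedged sketch of the tree-aggregation pattern from the linked RDDFunctions, assuming `grads: RDD[Array[Double]]` with a fixed dimension `dim`:

~~~
import org.apache.spark.mllib.rdd.RDDFunctions._  // brings treeAggregate into scope

val sum = grads.treeAggregate(new Array[Double](dim))(
  (acc, v) => { var i = 0; while (i < dim) { acc(i) += v(i); i += 1 }; acc },
  (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },
  2)  // depth 2: partial sums are combined on executors before reaching the driver
~~~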

Re: Partitioning a libsvm format file

2014-08-10 Thread Xiangrui Meng
If the file is big enough, you can try MLUtils.loadLibSVMFile with a minPartitions argument. This doesn't shuffle data but it might not give you the exact number of partitions. If you want to have the exact number, use RDD.repartition, which requires data shuffling. -Xiangrui On Sun, Aug 10, 2014
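A sketch of both options; the path is hypothetical and `-1` asks MLUtils to infer the feature dimension from the data.

~~~
import org.apache.spark.mllib.util.MLUtils

// Hint the partition count at load time: no shuffle, but the exact
// number of partitions is not guaranteed.
val data = MLUtils.loadLibSVMFile(sc, "data/sample.libsvm", -1, 64)

// Force an exact partition count afterwards (this shuffles the data).
val exact = data.repartition(64)
~~~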

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Assuming that your data is very sparse, I would recommend RDD.repartition. But if it is not the case and you don't want to shuffle the data, you can try a CombineInputFormat and then parse the lines into labeled points. Coalesce may cause locality problems if you didn't use the right number of

Re: How to save mllib model to hdfs and reload it

2014-08-12 Thread Xiangrui Meng
For linear models, the constructors are now public. You can save the weights to HDFS, then load the weights back and use the constructor to create the model. -Xiangrui On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com wrote: hello: I want to know,if I use history data to

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
into a large partition if I cache it, so same issues as #1? For coalesce, could you share some best practice how to set the right number of partitions to avoid locality problem? Thanks! On Tue, Aug 12, 2014 at 3:51 PM, Xiangrui Meng men...@gmail.com wrote: Assuming that your data is very

Re: training recsys model

2014-08-13 Thread Xiangrui Meng
You can define an evaluation metric first and then use a grid search to find the best set of training parameters. Ampcamp has a tutorial showing how to do this for ALS: http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html -Xiangrui On Tue, Aug 12, 2014 at 8:01 PM,
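In the spirit of that tutorial, a hedged sketch of the grid search, assuming `training` and `validation` are RDD[Rating] splits:

~~~
import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3)
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// RMSE on held-out ratings via model.predict over (user, product) pairs.
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val predictions = model.predict(data.map(r => (r.user, r.product)))
    .map(r => ((r.user, r.product), r.rating))
  val se = data.map(r => ((r.user, r.product), r.rating)).join(predictions)
    .map { case (_, (actual, pred)) => (actual - pred) * (actual - pred) }
  math.sqrt(se.mean())
}

var best = (Double.MaxValue, 0, 0.0)  // (rmse, rank, lambda)
for (rank <- Seq(8, 12); lambda <- Seq(0.1, 1.0, 10.0)) {
  val model = ALS.train(training, rank, 20, lambda)  // 20 iterations
  val rmse = computeRmse(model, validation)
  if (rmse < best._1) best = (rmse, rank, lambda)
}
~~~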

Re: training recsys model

2014-08-14 Thread Xiangrui Meng
, Aug 13, 2014 at 1:26 PM, Xiangrui Meng men...@gmail.com wrote: You can define an evaluation metric first and then use a grid search to find the best set of training parameters. Ampcamp has a tutorial showing how to do this for ALS: http://ampcamp.berkeley.edu/big-data-mini-course/movie

Re: Spark Akka/actor failures.

2014-08-14 Thread Xiangrui Meng
Could you try to map it to row-major first? Your approach may generate multiple copies of the data. The code should look like this: ~~~ val rows = rdd.map { case (j, values) => values.view.zipWithIndex.map { case (v, i) => (i, (j, v)) } }.groupByKey().map { case (i, entries) =>

Re: ALS checkpoint performance

2014-08-15 Thread Xiangrui Meng
Guoqiang reported some results in his PRs https://github.com/apache/spark/pull/828 and https://github.com/apache/spark/pull/929 . But this is really problem-dependent. -Xiangrui On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Are there any experiments

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
What is the ratio of examples labeled `s` to those labeled `b`? Also, Naive Bayes doesn't work on negative feature values. It assumes term frequencies as the input. We should throw an exception on negative feature values. -Xiangrui On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do phu...@vida.io wrote:

Re: Decision tree: categorical variables

2014-08-19 Thread Xiangrui Meng
The categorical features must be encoded into indices starting from 0: 0, 1, ..., numCategories - 1. Then you can provide the categoricalFeaturesInfo map to specify which columns contain categorical features and the number of categories in each. Joseph is updating the user guide. But if you want to
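A hedged sketch of this encoding against the 1.1-era API, assuming `data: RDD[LabeledPoint]` with features already encoded as category indices:

~~~
import org.apache.spark.mllib.tree.DecisionTree

// Feature 0 has 3 categories (encoded 0, 1, 2); feature 4 has 10 (0..9).
val categoricalFeaturesInfo = Map(0 -> 3, 4 -> 10)
val model = DecisionTree.trainClassifier(
  input = data,
  numClasses = 2,
  categoricalFeaturesInfo = categoricalFeaturesInfo,
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)
~~~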

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Xiangrui Meng
, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Thanks a lot Xiangrui. This will help. On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com wrote: Hi Rahul, We plan to add online model updates with Spark Streaming, perhaps in v1.1, starting with linear methods. Please open

Re: Only master is really busy at KMeans training

2014-08-19 Thread Xiangrui Meng
There are only 5 worker nodes. So please try to reduce the number of partitions to the number of available CPU cores. 1000 partitions are too many, because the driver needs to collect the task result from each partition. -Xiangrui On Tue, Aug 19, 2014 at 1:41 PM, durin m...@simon-schaefer.net

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
, Xiangrui Meng men...@gmail.com wrote: What is the ratio of examples labeled `s` to those labeled `b`? Also, Naive Bayes doesn't work on negative feature values. It assumes term frequencies as the input. We should throw an exception on negative feature values. -Xiangrui On Tue, Aug 19, 2014

Re: What about implementing various hypothesis test for Logistic Regression in MLlib

2014-08-20 Thread Xiangrui Meng
We implemented chi-squared tests in v1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166 and we will add more after v1.1. Feedback on which tests should come first would be greatly appreciated. -Xiangrui On Tue, Aug 19, 2014 at
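A sketch against the linked 1.1 API, testing each feature for independence from the label:

~~~
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// One ChiSqTestResult per feature column.
def printFeatureTests(data: RDD[LabeledPoint]): Unit =
  Statistics.chiSqTest(data).zipWithIndex.foreach { case (result, i) =>
    println(s"feature $i: statistic = ${result.statistic}, p-value = ${result.pValue}")
  }
~~~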

Re: What about implementing various hypothesis test for LogisticRegression in MLlib

2014-08-24 Thread Xiangrui Meng
other tests, which are also important when using logistic regression to build score cards. Xiaobo Gu -- Original -- From: Xiangrui Meng men...@gmail.com Send time: Wednesday, Aug 20, 2014 2:18 PM To: guxiaobo1...@qq.com Cc: user@spark.apache.org

Re: Only master is really busy at KMeans training

2014-08-26 Thread Xiangrui Meng
How many partitions now? Btw, which Spark version are you using? I checked your code and I don't understand why you want to broadcast vectors2, which is an RDD. var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) var broadcastVector =

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Xiangrui Meng
Hi Wei, Please keep us posted about the performance result you get. This would be very helpful. Best, Xiangrui On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function wise this may be a perfect solution to offload computation

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { ... } }, is the index guaranteed to be unique across all RDDs here? Thanks,

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi
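One hedged way to get stream-wide unique IDs despite the per-RDD reset: pair the per-RDD index with the batch time, e.g.

~~~
import org.apache.spark.streaming.Time

// (batch time, index within batch) is unique across the whole stream.
val indexed = dstream.transform { (rdd, time: Time) =>
  rdd.zipWithIndex.map { case (x, i) => ((time.milliseconds, i), x) }
}
~~~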

Re: minPartitions ignored for bz2?

2014-08-27 Thread Xiangrui Meng
Are you using hadoop-1.0? Hadoop doesn't support splittable bz2 files before 1.2 (or a later version). But due to a bug (https://issues.apache.org/jira/browse/HADOOP-10614), you should try hadoop-2.5.0. -Xiangrui On Wed, Aug 27, 2014 at 2:49 PM, jerryye jerr...@gmail.com wrote: Hi, I'm running

Re: MLLib decision tree: Weights

2014-09-03 Thread Xiangrui Meng
This is not supported in MLlib. Hopefully, we will add support for weighted examples in v1.2. If you want to train weighted instances with the current tree implementation, please try importance sampling first to adjust the weights. For instance, an example with weight 0.3 is sampled with
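A hedged sketch of that importance-sampling workaround, assuming `weighted: RDD[(LabeledPoint, Double)]` with weights in (0, maxWeight]:

~~~
import scala.util.Random

val maxWeight = weighted.map(_._2).reduce(math.max)
// Keep each example with probability weight / maxWeight, so expected
// multiplicities are proportional to the weights.
val resampled = weighted.mapPartitionsWithIndex { (idx, iter) =>
  val rng = new Random(17 + idx)  // per-partition seed for reproducibility
  iter.flatMap { case (point, w) =>
    if (rng.nextDouble() < w / maxWeight) Iterator(point) else Iterator.empty
  }
}
~~~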

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-03 Thread Xiangrui Meng
We have a pending PR (https://github.com/apache/spark/pull/216) for discretization but it has performance issues. We will try to spend more time to improve it. -Xiangrui On Tue, Sep 2, 2014 at 2:56 AM, filipus floe...@gmail.com wrote: i guess i found it

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-03 Thread Xiangrui Meng
I think they are the same. If you have hub (https://hub.github.com/) installed, you can run `hub checkout https://github.com/apache/spark/pull/216` and then `sbt/sbt assembly`. -Xiangrui On Wed, Sep 3, 2014 at 12:03 AM, filipus floe...@gmail.com wrote: howto install? just clone by git clone

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Xiangrui Meng
There is a sliding method implemented in MLlib (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala), which is used in computing Area Under Curve:
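A short sketch of that sliding view, assuming `values: RDD[Double]` in a meaningful order:

~~~
import org.apache.spark.mllib.rdd.RDDFunctions._  // adds sliding to RDDs

// Each window holds adjacent elements; e.g. pairwise differences.
val diffs = values.sliding(2).map { case Array(prev, cur) => cur - prev }
~~~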

Re: [MLib] How do you normalize features?

2014-09-03 Thread Xiangrui Meng
Maybe copy the implementation of StandardScaler from 1.1 and use it in v1.0.x. -Xiangrui On Wed, Sep 3, 2014 at 5:10 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: It seems like the next release will add a nice org.apache.spark.mllib.feature package but what is the recommended way to
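For reference, a sketch of the 1.1 StandardScaler usage being suggested for backporting, assuming `vectors: RDD[Vector]`:

~~~
import org.apache.spark.mllib.feature.StandardScaler

// Fit column means/stds once, then scale to zero mean and unit variance.
// Note: withMean = true densifies sparse input.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
val scaled = vectors.map(v => scaler.transform(v))
~~~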

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Xiangrui Meng
, or at least expose this parameter to users (CoalescedRDD is a private class, and RDD.coalesce also doesn't have a parameter to control that). On Wed, Aug 13, 2014 at 12:28 AM, Xiangrui Meng men...@gmail.com wrote: Sorry, I missed #2. My suggestion is the same as #2. You need to set a bigger

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Xiangrui Meng
You can try LinearRegression with sparse input. It converges to the least squares solution if the linear system is over-determined, while the convergence rate depends on the condition number. Applying standard scaling is a popular heuristic to reduce the condition number. If you are interested in

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Xiangrui Meng
There is an undocumented configuration to put user jars in front of the Spark jar. But I'm not very certain that it works as expected (and this is why it is undocumented). Please try turning on spark.yarn.user.classpath.first. -Xiangrui On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Xiangrui Meng
When you submit the job to yarn with spark-submit, set --conf spark.yarn.user.classpath.first=true . On Mon, Sep 8, 2014 at 10:46 AM, Penny Espinoza pesp...@societyconsulting.com wrote: I don't understand what you mean. Can you be more specific? From: Victor

Re: A problem for running MLLIB in amazon clound

2014-09-08 Thread Xiangrui Meng
Could you attach the driver log? -Xiangrui On Mon, Sep 8, 2014 at 7:23 AM, Hui Li hli161...@gmail.com wrote: I am running a very simple example using the SVMWithSGD on Amazon EMR. I haven't got any result after one hour long. My instance-type is: m3.large instance-count is: 3 Dataset

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Xiangrui Meng
to partition the problem cleanly and don't need a consensus step (which is what I implemented in the code) Thanks Deb On Sep 7, 2014 11:35 PM, Xiangrui Meng men...@gmail.com wrote: You can try LinearRegression with sparse input. It converges to the least squares solution if the linear system is over

Re: Accuracy hit in classification with Spark

2014-09-09 Thread Xiangrui Meng
If you are using Mahout's Multinomial Naive Bayes, it should be the same as MLlib's. I tried MLlib with news20.scale downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html and the test accuracy is 82.4%. -Xiangrui On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet

Re: sc.textFile problem due to newlines within a CSV record

2014-09-12 Thread Xiangrui Meng
I wrote an input format for Redshift tables unloaded via UNLOAD with the ESCAPE option: https://github.com/mengxr/redshift-input-format , which can recognize multi-line records. Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and the delimiter character. You can apply the same escaping

Re: Efficient way to sum multiple columns

2014-09-15 Thread Xiangrui Meng
Please check the colStats method defined under mllib.stat.Statistics. -Xiangrui On Mon, Sep 15, 2014 at 1:00 PM, jamborta jambo...@gmail.com wrote: Hi all, I have an RDD that contains around 50 columns. I need to sum each column, which I am doing by running it through a for loop, creating an
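A sketch of that approach, assuming `data: RDD[Array[Double]]` with one entry per column; colStats makes a single pass, and the per-column sums fall out as mean * count:

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val rows = data.map(arr => Vectors.dense(arr))
val summary = Statistics.colStats(rows)  // one pass over the data
val columnSums = summary.mean.toArray.map(_ * summary.count)
~~~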

Re: Define the name of the outputs with Java-Spark.

2014-09-15 Thread Xiangrui Meng
Spark doesn't support MultipleOutput at this time. You can cache the parent RDD. Then create RDDs from it and save them separately. -Xiangrui On Fri, Sep 12, 2014 at 7:45 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I would like to define the names of my output in Spark, I have a process

Re: Accuracy hit in classification with Spark

2014-09-15 Thread Xiangrui Meng
Thanks for the update! -Xiangrui On Sun, Sep 14, 2014 at 11:33 PM, jatinpreet jatinpr...@gmail.com wrote: Hi, I have been able to get the same accuracy with MLlib as Mahout's. The pre-processing phase of Mahout was the reason behind the accuracy mismatch. After studying and applying the

Re: MLLib sparse vector

2014-09-15 Thread Xiangrui Meng
Or you can use the factory method `Vectors.sparse`: val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0))) where numProducts should be the largest product id plus one. Best, Xiangrui On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore cdg...@cdgore.com wrote: Hi Sameer, MLLib uses

Re: MLLib regression model weights

2014-09-18 Thread Xiangrui Meng
The importance should be based on some statistics, for example, the standard deviation of the feature column and the magnitude of the weight. If the columns are scaled to unit standard deviation (using StandardScaler), you can tell the importance by the absolute value of the weight. But there are

Re: SVD on larger than taller matrix

2014-09-18 Thread Xiangrui Meng
Did you cache `features`? Without caching it is slow because we need O(k) iterations. The storage requirement on the driver is about 2 * n * k doubles = 2 * 3 million * 200 * 8 bytes ~= 9GB, not considering any overhead. Computing U is also an expensive task in your case. We should use some randomized SVD
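A sketch of the caching advice, assuming `features: RDD[Vector]`:

~~~
import org.apache.spark.mllib.linalg.distributed.RowMatrix

features.cache()  // the SVD makes O(k) passes; keep the input in memory
val mat = new RowMatrix(features)
val svd = mat.computeSVD(200, computeU = true)  // drop computeU if U isn't needed
~~~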

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Xiangrui Meng
We don't support kernels because it doesn't scale well. Please check When to use LIBLINEAR but not LIBSVM on http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html . I like Jey's suggestion on expanding features. -Xiangrui On Thu, Sep 18, 2014 at 12:29 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

Re: Unable to load app logs for MLLib programs in history server

2014-09-18 Thread Xiangrui Meng
Could you create a JIRA for it? We can either remove special characters or encode with alphanumerics. -Xiangrui On Thu, Sep 18, 2014 at 3:50 PM, SK skrishna...@gmail.com wrote: Hi, The default log files for the Mllib examples use a rather long naming convention that includes special

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-22 Thread Xiangrui Meng
Does the feature size 43839 equal the number of terms? Check the output dimension of your feature vectorizer and reduce the number of partitions to match the number of physical cores. I saw you set spark.storage.memoryFraction to 0.0. Maybe it is better to keep the default. Also please confirm the

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread Xiangrui Meng
Your dataset is small. NaiveBayes should work under the default settings, even in local mode. Could you try local mode first without changing any Spark settings? Since your dataset is small, could you save the vectorized data (RDD[LabeledPoint]) and send me a sample? I want to take a look at the

Re: java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-25 Thread Xiangrui Meng
7000x7000 is not a tall-and-skinny matrix. Storing the dense matrix requires 784MB. The driver needs more storage for collecting results from executors as well as making a copy for LAPACK's dgesvd. So you need more memory. Do you need the full SVD? If not, try to use a small k, e.g., 50. -Xiangrui On

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-25 Thread Xiangrui Meng
For the vectorizer, what's the output feature dimension and are you creating sparse vectors or dense vectors? The model on the driver consists of numClasses * numFeatures doubles. However, the driver needs more memory in order to receive the task result (of the same size) from executors. So you

Re: K-means faster on Mahout then on Spark

2014-09-25 Thread Xiangrui Meng
Please also check the load balance of the RDD on YARN. How many partitions are you using? Does it match the number of CPU cores? -Xiangrui On Thu, Sep 25, 2014 at 12:28 PM, bhusted brian.hus...@gmail.com wrote: What is the size of your vector? Mine is set to 20. I am seeing slow results as well

Re: Build error when using spark with breeze

2014-09-26 Thread Xiangrui Meng
We removed commons-math3 from dependencies to avoid version conflict with hadoop-common. hadoop-common-2.3+ depends on commons-math3-3.1.1, while breeze depends on commons-math3-3.3. 3.3 is not backward compatible with 3.1.1. So we removed it because the breeze functions we use do not touch

Re: MLlib 1.2 New Interesting Features

2014-09-29 Thread Xiangrui Meng
Hi Krishna, Some planned features for MLlib 1.2 can be found via Spark JIRA: http://bit.ly/1ywotkm , though this list is not fixed. The feature freeze will happen by the end of Oct. Then we will cut branch-1.2 and start QA. I don't recommend using branch-1.2 for hands-on tutorial around Oct 29th

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Xiangrui Meng
The test accuracy doesn't determine the total loss. Any point in (-1, 1) can separate points -1 and +1 and give you 1.0 accuracy, but the corresponding losses are different. -Xiangrui On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi We have used LogisticRegression

Re: MLLib ALS question

2014-09-30 Thread Xiangrui Meng
You may need a cluster with more memory. The current ALS implementation constructs all subproblems in memory. With rank=10, that means (6.5M + 2.5M) * 10^2 / 2 * 8 bytes = 3.5GB. The ratings need 2GB, not counting the overhead. ALS creates in/out blocks to optimize the computation, which takes

Re: MLLib: Missing value imputation

2014-09-30 Thread Xiangrui Meng
We don't handle missing value imputation in the current version of MLlib. In future releases, we can store feature information in the dataset metadata, which may store the default value to replace missing values. But no one is committed to working on this feature. For now, you can filter out examples

Re: MLLib ALS question

2014-10-01 Thread Xiangrui Meng
ALS still needs to load and deserialize the in/out blocks (one by one) from disk and then construct least squares subproblems. All happen in RAM. The final model is also stored in memory. -Xiangrui On Wed, Oct 1, 2014 at 4:36 AM, Alex T chiorts...@gmail.com wrote: Hi, thanks for the reply. I

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Xiangrui Meng
Yes, the bigram in that demo only has two characters, which could separate different character sets. -Xiangrui On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei liquan...@gmail.com wrote: The program computes hashing bi-gram frequency normalized by total number of bigrams then filter out zero values.

Re: Help Troubleshooting Naive Bayes

2014-10-01 Thread Xiangrui Meng
The cost depends on the feature dimension, number of instances, number of classes, and number of partitions. Do you mind sharing those numbers? -Xiangrui On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico mike.bern...@gmail.com wrote: Hi Everyone, I'm working on training mllib's Naive Bayes to

Re: Print Decision Tree Models

2014-10-01 Thread Xiangrui Meng
Which Spark version are you using? It works in 1.1.0 but not in 1.0.0. -Xiangrui On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain ji...@sellpoints.com wrote: So I am trying to print the model output from MLlib however I am only getting things like the following:

Re: Breeze Library usage in Spark

2014-10-03 Thread Xiangrui Meng
Did you add a different version of breeze to the classpath? In Spark 1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze version you used is different from the one that comes with Spark, you might see class-not-found errors. -Xiangrui On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread Xiangrui Meng
The current impl of ALS constructs least squares subproblems in memory. So for rank 100, the total memory it requires is about 480,189 * 100^2 / 2 * 8 bytes ~ 20GB, divided by the number of blocks. For rank 1000, this number goes up to 2TB, unfortunately. There is a JIRA for optimizing ALS:

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread Xiangrui Meng
It would be really helpful if you can help test the scalability of the new ALS impl: https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala . It should be faster and more scalable, but the code is messy now. Best, Xiangrui On Fri, Oct 3, 2014 at 11:57
