Re: [MLib] How do you normalize features?

2014-09-03 Thread Xiangrui Meng
Maybe copy the implementation of StandardScaler from 1.1 and use it in v1.0.x. -Xiangrui On Wed, Sep 3, 2014 at 5:10 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: It seems like the next release will add a nice org.apache.spark.mllib.feature package but what is the recommended way to

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Xiangrui Meng
, or at least expose this parameter to users (*CoalescedRDD* is a private class, and *RDD*.*coalesce* also doesn't have a parameter to control that). On Wed, Aug 13, 2014 at 12:28 AM, Xiangrui Meng men...@gmail.com wrote: Sorry, I missed #2. My suggestion is the same as #2. You need to set a bigger

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Xiangrui Meng
Hi Wei, Please keep us posted about the performance result you get. This would be very helpful. Best, Xiangrui On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function wise this may be a perfect solution to offload computation

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks,

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi
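The answer is truncated here, but the behavior can be sketched without a cluster: zipWithIndex restarts at 0 for every RDD, while zipWithUniqueId gives element k of partition p the id k * numPartitions + p — unique within one RDD, but still not unique across the RDDs of a DStream. A minimal simulation on plain Scala collections, assuming the id scheme documented for RDD.zipWithUniqueId:

```scala
// Simulate RDD.zipWithUniqueId's id scheme on plain collections.
// Element k of partition p receives id k * numPartitions + p.
val partitions = Vector(Vector("a", "b"), Vector("c"), Vector("d", "e"))
val n = partitions.size
val withIds = partitions.zipWithIndex.flatMap { case (part, p) =>
  part.zipWithIndex.map { case (v, k) => (v, k.toLong * n + p) }
}
// Ids are unique but not consecutive: gaps appear when partitions differ in size.
```

Getting ids that are unique across all RDDs of a stream would still require adding a per-batch offset on top of this, which is the point of the question above.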

Re: minPartitions ignored for bz2?

2014-08-27 Thread Xiangrui Meng
Are you using hadoop-1.0? Hadoop versions before 1.2 don't support splittable bz2 files. But due to a bug (https://issues.apache.org/jira/browse/HADOOP-10614), you should try hadoop-2.5.0. -Xiangrui On Wed, Aug 27, 2014 at 2:49 PM, jerryye jerr...@gmail.com wrote: Hi, I'm running

Re: Only master is really busy at KMeans training

2014-08-26 Thread Xiangrui Meng
How many partitions now? Btw, which Spark version are you using? I checked your code and I don't understand why you want to broadcast vectors2, which is an RDD. var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) var broadcastVector =

Re: What about implementing various hypothesis test for LogisticRegression in MLlib

2014-08-24 Thread Xiangrui Meng
other tests, which are also important when using logistic regression to build score cards. Xiaobo Gu -- Original -- From: Xiangrui Meng;men...@gmail.com; Send time: Wednesday, Aug 20, 2014 2:18 PM To: guxiaobo1...@qq.com; Cc: user@spark.apache.orguser

Re: What about implementing various hypothesis test for Logistic Regression in MLlib

2014-08-20 Thread Xiangrui Meng
We implemented chi-squared tests in v1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166 and we will add more after v1.1. Feedback on which tests should come first would be greatly appreciated. -Xiangrui On Tue, Aug 19, 2014 at

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
What is the ratio of examples labeled `s` to those labeled `b`? Also, Naive Bayes doesn't work on negative feature values. It assumes term frequencies as the input. We should throw an exception on negative feature values. -Xiangrui On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do phu...@vida.io wrote:

Re: Decision tree: categorical variables

2014-08-19 Thread Xiangrui Meng
The categorical features must be encoded into indices starting from 0: 0, 1, ..., numCategories - 1. Then you can provide the categoricalFeatureInfo map to specify which columns contain categorical features and the number of categories in each. Joseph is updating the user guide. But if you want to
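The encoding step described above can be sketched in plain Scala. The column values and the feature index used below are hypothetical; only the final map would be passed to DecisionTree as categoricalFeaturesInfo:

```scala
// Encode a categorical column into indices 0 .. numCategories - 1.
val colors = Seq("red", "green", "blue", "green", "red") // hypothetical raw column
val index: Map[String, Int] = colors.distinct.zipWithIndex.toMap
val encoded: Seq[Int] = colors.map(index)

// If this were feature column 2, tell the decision tree it is categorical,
// with index.size categories (map: feature column -> number of categories).
val categoricalFeaturesInfo: Map[Int, Int] = Map(2 -> index.size)
```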

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Xiangrui Meng
, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Thanks a lot Xiangrui. This will help. On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com wrote: Hi Rahul, We plan to add online model updates with Spark Streaming, perhaps in v1.1, starting with linear methods. Please open

Re: Only master is really busy at KMeans training

2014-08-19 Thread Xiangrui Meng
There are only 5 worker nodes. So please try to reduce the number of partitions to the number of available CPU cores. 1000 partitions are too many, because the driver needs to collect the task result from each partition. -Xiangrui On Tue, Aug 19, 2014 at 1:41 PM, durin m...@simon-schaefer.net

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
, Xiangrui Meng men...@gmail.com wrote: What is the ratio of examples labeled `s` to those labeled `b`? Also, Naive Bayes doesn't work on negative feature values. It assumes term frequencies as the input. We should throw an exception on negative feature values. -Xiangrui On Tue, Aug 19, 2014

Re: ALS checkpoint performance

2014-08-15 Thread Xiangrui Meng
Guoqiang reported some results in his PRs https://github.com/apache/spark/pull/828 and https://github.com/apache/spark/pull/929 . But this is really problem-dependent. -Xiangrui On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Are there any experiments

Re: training recsys model

2014-08-14 Thread Xiangrui Meng
, Aug 13, 2014 at 1:26 PM, Xiangrui Meng men...@gmail.com wrote: You can define an evaluation metric first and then use a grid search to find the best set of training parameters. Ampcamp has a tutorial showing how to do this for ALS: http://ampcamp.berkeley.edu/big-data-mini-course/movie

Re: Spark Akka/actor failures.

2014-08-14 Thread Xiangrui Meng
Could you try to map it to row-major format first? Your approach may generate multiple copies of the data. The code should look like this: ~~~ val rows = rdd.map { case (j, values) => values.view.zipWithIndex.map { case (v, i) => (i, (j, v)) } }.groupByKey().map { case (i, entries) =>

Re: training recsys model

2014-08-13 Thread Xiangrui Meng
You can define an evaluation metric first and then use a grid search to find the best set of training parameters. Ampcamp has a tutorial showing how to do this for ALS: http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html -Xiangrui On Tue, Aug 12, 2014 at 8:01 PM,

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Assuming that your data is very sparse, I would recommend RDD.repartition. But if it is not the case and you don't want to shuffle the data, you can try a CombineInputFormat and then parse the lines into labeled points. Coalesce may cause locality problems if you didn't use the right number of

Re: How to save mllib model to hdfs and reload it

2014-08-12 Thread Xiangrui Meng
For linear models, the constructors are now public. You can save the weights to HDFS, then load the weights back and use the constructor to create the model. -Xiangrui On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com wrote: hello: I want to know,if I use history data to

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
into a large partition if I cache it, so same issues as #1? For coalesce, could you share some best practice how to set the right number of partitions to avoid locality problem? Thanks! On Tue, Aug 12, 2014 at 3:51 PM, Xiangrui Meng men...@gmail.com wrote: Assuming that your data is very

Re: Partitioning a libsvm format file

2014-08-10 Thread Xiangrui Meng
If the file is big enough, you can try MLUtils.loadLibSVMFile with a minPartitions argument. This doesn't shuffle data but it might not give you the exact number of partitions. If you want to have the exact number, use RDD.repartition, which requires data shuffling. -Xiangrui On Sun, Aug 10, 2014

Re: Where do my partitions go?

2014-08-08 Thread Xiangrui Meng
They are two different RDDs. Spark doesn't guarantee that the first partition of RDD1 and the first partition of RDD2 will stay on the same worker node. If it did, and you had 1000 single-partition RDDs, the first worker would carry a very heavy load. -Xiangrui On Thu, Aug 7, 2014 at 2:20

Re: scopt.OptionParser

2014-08-08 Thread Xiangrui Meng
Thanks for posting the solution! You can also append `% provided` to the `spark-mllib` dependency line and remove `spark-core` (because spark-mllib already depends on spark-core) to make the assembly jar smaller. -Xiangrui On Fri, Aug 8, 2014 at 10:05 AM, SK skrishna...@gmail.com wrote: i was

Re: Does Spark 1.0.1 stil collect results in serial???

2014-08-08 Thread Xiangrui Meng
For the reduce/aggregate question, driver collects results in sequence. We now use tree aggregation in MLlib to reduce driver's load: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89 It is faster than aggregate when there are many
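The tree-aggregation idea linked above can be sketched without Spark: rather than the driver folding all partition results in sequence, partial results are merged pairwise in rounds, so no single step combines more than a handful of values. A toy simulation in plain Scala (not MLlib's actual implementation):

```scala
// Combine partition results in rounds of pairwise merges (depth ~ log2(n))
// instead of one node folding all n results sequentially.
def treeReduce(xs: Vector[Double]): Double =
  if (xs.size <= 1) xs.headOption.getOrElse(0.0)
  else treeReduce(xs.grouped(2).map(_.sum).toVector)

val partials = Vector(1.0, 2.0, 3.0, 4.0, 5.0)
// Same result as a sequential sum, but each round halves the number of values.
```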

Re: Naive Bayes parameters

2014-08-07 Thread Xiangrui Meng
It is used in data loading: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76 On Thu, Aug 7, 2014 at 12:47 AM, SK skrishna...@gmail.com wrote: I followed the example in

Re: Regularization parameters

2014-08-07 Thread Xiangrui Meng
Then this may be a bug. Do you mind sharing the dataset that we can use to reproduce the problem? -Xiangrui On Thu, Aug 7, 2014 at 1:20 AM, SK skrishna...@gmail.com wrote: Spark 1.0.1 thanks -- View this message in context:

Re: Low Performance of Shark over Spark.

2014-08-07 Thread Xiangrui Meng
Did you cache the table? There are a couple of ways to cache a table in Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide On Thu, Aug 7, 2014 at 6:51 AM, vinay.kash...@socialinfra.net wrote: Dear all, I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0. 6 worker

Re: questions about MLLib recommendation models

2014-08-07 Thread Xiangrui Meng
ratings.map { case Rating(u, m, r) => val pred = model.predict(u, m); (r - pred) * (r - pred) }.mean() The code doesn't work because the userFeatures and productFeatures stored in the model are RDDs. You tried to serialize them into the task closure, and execute `model.predict` on an

Re: KMeans Input Format

2014-08-07 Thread Xiangrui Meng
Besides durin's suggestion, please also confirm driver and executor memory in the WebUI, since they are small according to the log: 14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB) -Xiangrui

Re: Spark: Could not load native gpl library

2014-08-07 Thread Xiangrui Meng
Is the GPL library only available on the driver node? If that is the case, you need to add them to `--jars` option of spark-submit. -Xiangrui On Thu, Aug 7, 2014 at 6:59 PM, Jikai Lei hangel...@gmail.com wrote: I had the following error when trying to run a very simple spark job (which uses

Re: fail to run LBFS in 5G KDD data in spark 1.0.1?

2014-08-06 Thread Xiangrui Meng
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver? I think the problem is with the feature dimension. KDD data has more than 20M features and in v1.0.1, the driver collects the partial gradients one by one, sums them up, does the update, and then sends the new weights back

Re: about spark and using machine learning model

2014-08-04 Thread Xiangrui Meng
Some extra work is needed to close the loop. One related example is streaming linear regression added by Jeremy very recently: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala You can use a model trained offline

Re: why a machine learning application run slowly on the spark cluster

2014-07-30 Thread Xiangrui Meng
the total tasks reduced from 200 to 64 after the first stage, just like this: But I don't know if this is reasonable. On Wed, Jul 30, 2014 at 2:11 PM, Xiangrui Meng men...@gmail.com wrote: After you load the data in, call `.repartition(number of executors).cache()`. If the data is evenly

Re: SPARK OWLQN Exception: Iteration Stage is so slow

2014-07-29 Thread Xiangrui Meng
Do you mind sharing more details, for example, specs of nodes and data size? -Xiangrui 2014-07-29 2:51 GMT-07:00 John Wu j...@zamplus.com: Hi all, There is a problem we can't resolve. We implemented the OWLQN algorithm in parallel with Spark. We don't know why it is very slow in every

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
Before torrent, http is the default way for broadcasting. The driver holds the data and the executors request the data via http, making the driver the bottleneck if the data is large. -Xiangrui On Tue, Jul 29, 2014 at 10:32 AM, durin m...@simon-schaefer.net wrote: Development is really rapid

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
Could you share more details about the dataset and the algorithm? For example, if the dataset has 10M+ features, it may be slow for the driver to collect the weights from executors (just a blind guess). -Xiangrui On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim unname...@gmail.com wrote: Hi, all

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
The weight vector is usually dense and if you have many partitions, the driver may slow down. You can also take a look at the driver memory inside the Executor tab in WebUI. Another setting to check is the HDFS block size and whether the input data is evenly distributed to the executors. Are the

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
1. I meant in the n (1k) by m (10k) case, we need to broadcast k centers and hence the total size is m * k. In 1.0, the driver needs to send the current centers to each partition one by one. In the current master, we use torrent to broadcast the centers to workers, which should be much faster. 2.

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
Great! Thanks for testing the new features! -Xiangrui On Mon, Jul 28, 2014 at 8:58 PM, durin m...@simon-schaefer.net wrote: Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes

Re: evaluating classification accuracy

2014-07-28 Thread Xiangrui Meng
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and master. If you don't want to switch to 1.0.1 or master, try to cache and count test first. -Xiangrui On Mon, Jul 28, 2014 at 6:07 PM, SK skrishna...@gmail.com wrote: Hi, In order to evaluate the ML classification accuracy, I am

Re: Reading hdf5 formats with pyspark

2014-07-28 Thread Xiangrui Meng
That looks good to me since there is no Hadoop InputFormat for HDF5. But remember to specify the number of partitions in sc.parallelize to use all the nodes. You can change `process` to `read` which yields records one-by-one. Then sc.parallelize(files, numPartitions).flatMap(read) returns an RDD

Re: Kmeans: set initial centers explicitly

2014-07-27 Thread Xiangrui Meng
I think this is nice to have. Feel free to create a JIRA for it and it would be great if you can send a PR. Thanks! -Xiangrui On Thu, Jul 24, 2014 at 12:39 PM, SK skrishna...@gmail.com wrote: Hi, The mllib.clustering.kmeans implementation supports a random or parallel initialization mode to

Re: NMF implementaion is Spark

2014-07-25 Thread Xiangrui Meng
It is ALS with setNonnegative. -Xiangrui On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Is there an implementation for Nonnegative Matrix Factorization in Spark? I understand that MLlib comes with matrix factorization, but it does not seem to cover the

Re: Questions about disk IOs

2014-07-25 Thread Xiangrui Meng
a shutdown On Jul 2, 2014, at 0:08, Xiangrui Meng men...@gmail.com wrote: Try to reduce number of partitions to match the number of cores. We will add treeAggregate to reduce the communication cost. PR: https://github.com/apache/spark/pull/1110 -Xiangrui On Tue, Jul 1, 2014 at 12:55 AM, Charles Li

Re: akka disassociated on GC

2014-07-23 Thread Xiangrui Meng
patch, the evaluation has been processed successfully. I expect that these patches are merged in the next major release (v1.1?). Without them, it would be hard to use mllib for a large dataset. Thanks, Makoto (2014/07/16 15:05), Xiangrui Meng wrote: Hi Makoto, I don't remember I wrote

Re: spark github source build error

2014-07-23 Thread Xiangrui Meng
Try `sbt/sbt clean` first? -Xiangrui On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma sharm...@umn.edu wrote: I am trying to build spark after cloning from github repo: I am executing: ./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly I am getting following error: [warn]

Announcing Spark 0.9.2

2014-07-23 Thread Xiangrui Meng
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is a maintenance release with bug fixes across several areas of Spark, including Spark Core, PySpark, MLlib, Streaming, and GraphX. We recommend all 0.9.x users to upgrade to this stable release. Contributions to this release came

Re: LabeledPoint with weight

2014-07-21 Thread Xiangrui Meng
This is a useful feature but it may be hard to have it in v1.1 due to limited time. Hopefully, we can support it in v1.2. -Xiangrui On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen chenjiush...@gmail.com wrote: It seems MLlib right now doesn't support weighted training, training samples have

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-21 Thread Xiangrui Meng
You can also try a different region. I tested us-west-2 yesterday, and it worked well. -Xiangrui On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Actually the script in the master branch is also broken (it's pointing to an older AMI). Try 1.0.1 for launching

Re: error from DecisonTree Training:

2014-07-21 Thread Xiangrui Meng
This is a known issue: https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working on it. -Xiangrui On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang j...@uow.edu.au wrote: So this is a bug unsolved (for java) yet? From: Jack Yang [mailto:j...@uow.edu.au] Sent: Friday, 18 July 2014 4:52

Re: Large Task Size?

2014-07-20 Thread Xiangrui Meng
It was because of the latest change to task serialization: https://github.com/apache/spark/commit/1efb3698b6cf39a80683b37124d2736ebf3c9d9a The task size is no longer limited by akka.frameSize but we show warning messages if the task size is above 100KB. Please check the objects referenced in the

Re: Large scale ranked recommendation

2014-07-18 Thread Xiangrui Meng
Nick's suggestion is a good approach for your data. The item factors to broadcast should be a few MBs. -Xiangrui On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux decho...@gmail.com wrote: And you might want to apply clustering before. It is likely that every user and every item are not

Re: Python: saving/reloading RDD

2014-07-18 Thread Xiangrui Meng
You can save RDDs to text files using RDD.saveAsTextFile and load it back using sc.textFile. But make sure the record to string conversion is correctly implemented if the type is not primitive and you have the parser to load them back. -Xiangrui On Jul 18, 2014, at 8:39 AM, Roch Denis

Re: Kmeans

2014-07-17 Thread Xiangrui Meng
... Amin Mohebbi, PhD candidate in Software Engineering at University of Malaysia. H/P: +60 18 2040 017. E-Mail: tp025...@ex.apiit.edu.my amin_...@me.com On Thursday, July 17, 2014 11:57 AM, Xiangrui Meng men...@gmail.com wrote: kmeans.py contains a naive

Re: MLLib - Regularized logistic regression in python

2014-07-17 Thread Xiangrui Meng
1) This is a miss, unfortunately ... We will add support for regularization and intercept in the coming v1.1. (JIRA: https://issues.apache.org/jira/browse/SPARK-2550) 2) It has overflow problems in Python but not in Scala. We can stabilize the computation by ensuring exp only takes a negative
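The truncated sentence refers to the standard trick for a numerically stable logistic function: rewrite it so that exp only ever sees a non-positive argument. A sketch in Scala (the same idea applies to the Python code in question):

```scala
// Stable logistic function: exp is only applied to non-positive arguments,
// so math.exp never overflows to infinity.
def sigmoid(x: Double): Double =
  if (x >= 0) 1.0 / (1.0 + math.exp(-x))
  else {
    val e = math.exp(x) // x < 0, so e is in (0, 1)
    e / (1.0 + e)
  }
```

The naive form 1 / (1 + exp(-x)) overflows for large negative x; with the branch above, exp underflows to 0 instead, which is harmless.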

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g and repartition your data to match the number of CPU cores so that the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store the data. Make sure there is enough memory for caching. -Xiangrui On Thu, Jul 17, 2014 at
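The 400MB figure quoted above is simple arithmetic — 1M points times 50 dense double features times 8 bytes per double:

```scala
// Rough size of the cached data: numPoints * numFeatures * 8 bytes per double.
val numPoints = 1000000
val numFeatures = 50
val bytes = numPoints.toLong * numFeatures * 8 // 400,000,000 bytes, i.e. ~400 MB
```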

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
to less than 200 seconds from over 700 seconds. I am not sure how to repartition the data to match the CPU cores. How do I do it? Thank you. Ravi On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote: Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g

Re: akka disassociated on GC

2014-07-16 Thread Xiangrui Meng
, 2014 at 10:48 PM, Makoto Yui yuin...@gmail.com wrote: Hello, (2014/06/19 23:43), Xiangrui Meng wrote: The execution was slow for the larger KDD Cup 2012, Track 2 dataset (235M+ records of 16.7M+ (2^24) sparse features, about 33.6GB) due to the sequential aggregation of dense vectors

Re: Error when testing with large sparse svm

2014-07-16 Thread Xiangrui Meng
Then it may be a new issue. Do you mind creating a JIRA to track this issue? It would be great if you can help locate the line in BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui On Tue, Jul 15, 2014 at 10:56 PM, crater cq...@ucmerced.edu wrote: I don't really have my code,

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
Check the number of inodes (df -i). The assembly build may create many small files. -Xiangrui On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote: Hi all, I am encountering the following error: INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
driver application Chris On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote: Check the number of inodes (df -i). The assembly build may create many small files. -Xiangrui On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote: Hi all, I am

Re: ALS on EC2

2014-07-15 Thread Xiangrui Meng
Could you share the code of RecommendationALS and the complete spark-submit command line options? Thanks! -Xiangrui On Mon, Jul 14, 2014 at 11:23 PM, Srikrishna S srikrishna...@gmail.com wrote: Using properties file: null Main class: RecommendationALS Arguments: _train.csv _validation.csv

Re: KMeansModel Construtor error

2014-07-15 Thread Xiangrui Meng
I don't think MLlib supports model serialization/deserialization. You got the error because the constructor is private. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2488 and we will try to make sure it is implemented in v1.1. For now, you can modify KMeansModel and remove

Re: Error when testing with large sparse svm

2014-07-15 Thread Xiangrui Meng
crater, was the error message the same as what you posted before: 14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote Akka client disassociated 14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0) 14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on

Re: mapPartitionsWithIndex

2014-07-14 Thread Xiangrui Meng
You should return an iterator in mapPartitionsWithIndex. This is from the programming guide (http://spark.apache.org/docs/latest/programming-guide.html): mapPartitionsWithIndex(func): Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,
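The contract can be simulated on plain collections: the user function receives the partition index and an Iterator over that partition's elements, and must itself return an Iterator. A sketch (not Spark's implementation):

```scala
// The function passed to mapPartitionsWithIndex has type (Int, Iterator[T]) => Iterator[U].
def tagWithPartition(idx: Int, it: Iterator[Int]): Iterator[String] =
  it.map(v => s"$idx:$v")

// Simulate two partitions and apply the function to each one.
val parts = Vector(Vector(1, 2), Vector(3, 4))
val out = parts.zipWithIndex.flatMap { case (p, i) => tagWithPartition(i, p.iterator) }
```

Returning anything other than an Iterator (e.g. a single value) is the usual cause of the compile error the question ran into.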

Re: Error when testing with large sparse svm

2014-07-14 Thread Xiangrui Meng
You need to set a larger `spark.akka.frameSize`, e.g., 128, for the serialized weight vector. There is a JIRA about switching automatically between sending through akka or broadcast: https://issues.apache.org/jira/browse/SPARK-2361 . -Xiangrui On Mon, Jul 14, 2014 at 12:15 AM, crater

Re: Error when testing with large sparse svm

2014-07-14 Thread Xiangrui Meng
Is it on a standalone server? There are several settings worth checking: 1) the number of partitions, which should match the number of cores; 2) driver memory (you can see it in the executor tab of the Spark WebUI and set it with --driver-memory 10g); 3) the version of Spark you were running. Best,

Re: ML classifier and data format for dataset with variable number of features

2014-07-11 Thread Xiangrui Meng
You can load the dataset as an RDD of JSON object and use a flatMap to extract feature vectors at object level. Then you can filter the training examples you want for binary classification. If you want to try multiclass, checkout DB's PR at https://github.com/apache/spark/pull/1379 Best, Xiangrui

Re: KMeans code is rubbish

2014-07-10 Thread Xiangrui Meng
SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I ran the SparkKMeans example (not the

Re: Terminal freeze during SVM

2014-07-10 Thread Xiangrui Meng
news20.binary's feature dimension is 1.35M. So the serialized task size is above the default limit of 10M. You need to set spark.akka.frameSize to, e.g., 20. Due to a bug (SPARK-1112), this parameter is not passed to executors automatically, which causes Spark to freeze. This was fixed in the latest

Re: How to RDD.take(middle 10 elements)

2014-07-10 Thread Xiangrui Meng
This is expensive but doable: rdd.zipWithIndex().filter { case (_, idx) => idx >= 10 && idx < 20 }.collect() -Xiangrui On Thu, Jul 10, 2014 at 12:53 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Interesting question on Stack Overflow: http://stackoverflow.com/q/24677180/877069 Basically, is

Re: Execution stalls in LogisticRegressionWithSGD

2014-07-09 Thread Xiangrui Meng
, Xiangrui Meng men...@gmail.com wrote: It seems to me a setup issue. I just tested news20.binary (1355191 features) on a 2-node EC2 cluster and it worked well. I added one line to conf/spark-env.sh: export SPARK_JAVA_OPTS= -Dspark.akka.frameSize=20 and launched spark-shell with --driver-memory

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Xiangrui Meng
Hi Rahul, We plan to add online model updates with Spark Streaming, perhaps in v1.1, starting with linear methods. Please open a JIRA for Naive Bayes. For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for

Re: Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Xiangrui Meng
Well, I believe this is a correct implementation but please let us know if you run into problems. The NaiveBayes implementation in MLlib v1.0 supports sparse data, which is usually the case for text classification. I would recommend upgrading to v1.0. -Xiangrui On Tue, Jul 8, 2014 at 7:20 AM,

Re: got java.lang.AssertionError when run sbt/sbt compile

2014-07-08 Thread Xiangrui Meng
Try sbt/sbt clean first. On Tue, Jul 8, 2014 at 8:25 AM, bai阿蒙 smallmonkey...@hotmail.com wrote: Hi guys, when I try to compile the latest source by sbt/sbt compile, I get an error. Can anyone help me? The following is the detail: it may be caused by TestSQLContext.scala [error] [error]

Re: Error and doubts in using Mllib Naive bayes for text clasification

2014-07-08 Thread Xiangrui Meng
1) The feature dimension should be a fixed number before you run NaiveBayes. If you use bag of words, you need to handle the word-to-index dictionary by yourself. You can either ignore the words that never appear in training (because they have no effect in prediction), or use hashing to randomly
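The hashing option mentioned above can be sketched in plain Scala: map each word to a fixed-size index space so the feature dimension is constant regardless of the vocabulary. A toy sketch, not MLlib's implementation; numFeatures is deliberately tiny for illustration:

```scala
// Hashing trick: fixed feature dimension regardless of vocabulary size.
val numFeatures = 16 // hypothetical, tiny for illustration

def featurize(words: Seq[String]): Map[Int, Double] =
  words
    .map(w => ((w.hashCode % numFeatures) + numFeatures) % numFeatures) // index in [0, numFeatures)
    .groupBy(identity)
    .map { case (i, ws) => i -> ws.size.toDouble } // term frequencies

val vec = featurize(Seq("spark", "mllib", "spark"))
```

Collisions (two words hashing to the same index) are the price of the fixed dimension, which is why a larger numFeatures is used in practice.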

Re: Help for the large number of the input data files

2014-07-08 Thread Xiangrui Meng
You can either use sc.wholeTextFiles and then a flatMap to reduce the number of partitions, or give more memory to the driver process by using --driver-memory 20g and then call RDD.repartition(small number) after you load the data in. -Xiangrui On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun

Re: Execution stalls in LogisticRegressionWithSGD

2014-07-07 Thread Xiangrui Meng
with slave2. 2) The execution was successful when run in local mode with reduced number of partitions. Does this imply issues communicating/coordinating across processes (i.e. driver, master and workers)? Thanks, Bharath On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng men...@gmail.com wrote

Re: Dense to sparse vector converter

2014-07-07 Thread Xiangrui Meng
No, but it should be easy to add one. -Xiangrui On Mon, Jul 7, 2014 at 12:37 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, Is there a method in Spark/MLlib to convert DenseVector to SparseVector? Best regards, Alexander

Re: Execution stalls in LogisticRegressionWithSGD

2014-07-06 Thread Xiangrui Meng
task ID 2 ... On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng men...@gmail.com wrote: The feature dimension is small. You don't need a big akka.frameSize. The default one (10M) should be sufficient. Did you cache the data before calling LRWithSGD? -Xiangrui On Thu, Jul 3, 2014 at 10:02 AM

Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Xiangrui Meng
Hi Thunder, Please understand that both MLlib and breeze are in active development. Before v1.0, we used jblas but in the public APIs we only exposed Array[Double]. In v1.0, we introduced Vector that supports both dense and sparse data and switched the backend to breeze/netlib-java (except ALS).

Re: MLLib : Math on Vector and Matrix

2014-07-03 Thread Xiangrui Meng
Hi Dmitriy, It is sweet to have the bindings, but it is very easy to downgrade the performance with them. The BLAS/LAPACK APIs have been there for more than 20 years and they are still the top choice for high-performance linear algebra. I'm thinking about whether it is possible to make the

Re: One question about RDD.zip function when trying Naive Bayes

2014-07-02 Thread Xiangrui Meng
This is due to a bug in sampling, which was fixed in 1.0.1 and the latest master. See https://github.com/apache/spark/pull/1234 . -Xiangrui On Wed, Jul 2, 2014 at 8:23 PM, x wasedax...@gmail.com wrote: Hello, I am a newbie to Spark MLlib and ran into a curious case when following the instruction at

Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-01 Thread Xiangrui Meng
You can use either bin/run-example or bin/spark-submit to run example code. scalac -d classes/ SparkKMeans.scala doesn't recognize the Spark classpath. There are examples in the official doc: http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here -Xiangrui On Tue, Jul 1, 2014 at

Re: Questions about disk IOs

2014-07-01 Thread Xiangrui Meng
Try to reduce number of partitions to match the number of cores. We will add treeAggregate to reduce the communication cost. PR: https://github.com/apache/spark/pull/1110 -Xiangrui On Tue, Jul 1, 2014 at 12:55 AM, Charles Li littlee1...@gmail.com wrote: Hi Spark, I am running LBFGS on our

Re: why is toBreeze private everywhere in mllib?

2014-07-01 Thread Xiangrui Meng
We were not ready to expose it as a public API in v1.0. Both breeze and MLlib are in rapid development. It would be possible to expose it as a developer API in v1.1. For now, it should be easy to define a toBreeze method in your own project. -Xiangrui On Tue, Jul 1, 2014 at 12:17 PM, Koert

Re: TaskNotSerializable when invoking KMeans.run

2014-06-30 Thread Xiangrui Meng
Could you post the code snippet and the error stack trace? -Xiangrui On Mon, Jun 30, 2014 at 7:03 AM, Daniel Micol dmi...@gmail.com wrote: Hello, I’m trying to use KMeans with MLLib but am getting a TaskNotSerializable error. I’m using Spark 0.9.1 and invoking the KMeans.run method with k = 2

Re: Spark 1.0 and Logistic Regression Python Example

2014-06-30 Thread Xiangrui Meng
You were using an old version of numpy, 1.4? I think this is fixed in the latest master. Try to replace vec.dot(target) with numpy.dot(vec, target), or use the latest master. -Xiangrui On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs sam.jac...@us.abb.com wrote: Hi, I modified the example code for
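The suggested workaround is simple to show. On very old NumPy releases the `ndarray.dot` method is reportedly unavailable, but the module-level `numpy.dot` works everywhere (array values here are made up for illustration):

```python
# Workaround from the thread: prefer the module-level numpy.dot over the
# ndarray .dot method, which old NumPy versions may lack.
import numpy

vec = numpy.array([1.0, 2.0, 3.0])
target = numpy.array([4.0, 5.0, 6.0])

# Instead of vec.dot(target):
result = numpy.dot(vec, target)
print(result)  # 32.0
```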

Re: Improving Spark multithreaded performance?

2014-06-27 Thread Xiangrui Meng
Hi Kyle, A few questions: 1) Did you use `setIntercept(true)`? 2) How many features? I'm a little worried about driver's load because the final aggregation and weights update happen on the driver. Did you check driver's memory usage as well? Best, Xiangrui On Fri, Jun 27, 2014 at 8:10 AM,

Re: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-27 Thread Xiangrui Meng
Try --executor-memory 12g with spark-submit. Or you can set it in conf/spark-defaults.conf, rsync it to all workers, and then restart. -Xiangrui On Fri, Jun 27, 2014 at 1:05 PM, Peng Cheng pc...@uow.edu.au wrote: I give up, communication must be blocked by the complex EC2 network
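A sketch of the two approaches described above (the application class, jar name, and paths are hypothetical placeholders — substitute your own):

```shell
# Option 1: pass executor memory on the command line.
./bin/spark-submit --executor-memory 12g \
  --class com.example.MyApp myapp.jar

# Option 2: set it once in conf/spark-defaults.conf, then sync to workers
# and restart. The relevant line in that file would be:
#
#   spark.executor.memory   12g
#
# rsync conf/spark-defaults.conf to the same path on every worker.
```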

Re: Anything like grid search available for mlbase?

2014-06-20 Thread Xiangrui Meng
This is a planned feature for v1.1. I'm going to work on it after v1.0.1 release. -Xiangrui On Jun 20, 2014, at 6:46 AM, Charles Earl charles.ce...@gmail.com wrote: Looking for something like scikit's grid search module. C

Re: Performance problems on SQL JOIN

2014-06-20 Thread Xiangrui Meng
Your data source is S3 and data is used twice. m1.large does not have very good network performance. Please try file.count() and see how fast it goes. -Xiangrui On Jun 20, 2014, at 8:16 AM, mathias math...@socialsignificance.co.uk wrote: Hi there, We're trying out Spark and are

Re: Contribution to Spark MLLib

2014-06-19 Thread Xiangrui Meng
Denis, I think it is fine to have PLSA in MLlib. But I'm not familiar with the modification you mentioned since the paper is new. We may need to spend more time to learn the trade-offs. Feel free to create a JIRA for PLSA and we can move our discussion there. It would be great if you can share

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-19 Thread Xiangrui Meng
It is because the frame size is not set correctly in the executor backend; see SPARK-1112. We are going to fix it in v1.0.1. Did you try treeAggregate? On Jun 19, 2014, at 2:01 AM, Makoto Yui yuin...@gmail.com wrote: Xiangrui and Debasish, (2014/06/18 6:33), Debasish Das wrote: I did

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-18 Thread Xiangrui Meng
,Integer is unrelated to mllib. Thanks, Bharath On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Xiangrui , I'm using 1.0.0. Thanks, Bharath On 18-Jun-2014 1:43 am, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, Thanks for posting the details

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi Makoto, How many partitions did you set? If there are too many partitions, please do a coalesce before calling ML algorithms. Btw, could you try the tree branch in my repo? https://github.com/mengxr/spark/tree/tree I used tree aggregate in this branch. It should help with the scalability.

Re: Contribution to Spark MLLib

2014-06-17 Thread Xiangrui Meng
Hi Jayati, Thanks for asking! MLlib algorithms are all implemented in Scala. It makes us easier to maintain if we have the implementations in one place. For the roadmap, please visit http://www.slideshare.net/xrmeng/m-llib-hadoopsummit to see features planned for v1.1. Before contributing new

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi Bharath, Thanks for posting the details! Which Spark version are you using? Best, Xiangrui On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi, (Apologies for the long mail, but it's necessary to provide sufficient details considering the number of issues

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
, where n is the number of partitions. It would be great if someone can help test its scalability. Best, Xiangrui On Tue, Jun 17, 2014 at 1:32 PM, Makoto Yui yuin...@gmail.com wrote: Hi Xiangrui, (2014/06/18 4:58), Xiangrui Meng wrote: How many partitions did you set? If there are too many
