Re: No module named pyspark - latest built

2014-11-12 Thread Xiangrui Meng
You need to use Maven to include the Python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta wrote: > I have figured out that building the fat jar with sbt does not seem to > include the pyspark scripts using the following command: > > sbt/sb

Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-14 Thread Xiangrui Meng
If you use the Kryo serializer, you need to register mutable.BitSet and Rating: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102 The JIRA was marked resolved because chill resolved the problem in v0.4.0 and we have this workaro
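A minimal sketch of the registrator approach used in the linked example (the class name is illustrative):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.mllib.recommendation.Rating
    import org.apache.spark.serializer.KryoRegistrator
    import scala.collection.mutable

    class ALSRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Rating])
        kryo.register(classOf[mutable.BitSet])
      }
    }

    // when constructing the SparkConf:
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)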

Re: Client application that calls Spark and receives an MLlib model Scala Object and then predicts without Spark installed on hadoop

2014-11-15 Thread Xiangrui Meng
If Spark is not installed on the client side, you won't be able to deserialize the model. Instead of serializing the model object, you may serialize the model weights array and implement predict on the client side. -Xiangrui On Fri, Nov 14, 2014 at 2:54 PM, xiaoyan yu wrote: > I had the same need
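A minimal sketch of the weights-only route (assuming a linear model such as LogisticRegressionModel; names are illustrative):

    // Spark side: extract the plain numbers and serialize those instead of the model
    val weights: Array[Double] = model.weights.toArray
    val intercept: Double = model.intercept

    // client side, with no Spark dependency:
    def predict(features: Array[Double]): Double = {
      val margin = features.zip(weights).map { case (x, w) => x * w }.sum + intercept
      if (margin > 0.0) 1.0 else 0.0 // logistic decision at the 0.5 threshold
    }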

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev wrote: > Hi, > > I'm having trouble using both zipWithIndex and repartition. When I use them > both, the following action will get stuck and won't return. > I'm using spark 1.1.0. > > > Those 2 lines work as expe

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
I think I understand where the bug is now. I created a JIRA (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR soon. -Xiangrui On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng wrote: > This is a bug. Could you make a JIRA? -Xiangrui > > On Sat, Nov 15, 2014 at 3:2

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
PR: https://github.com/apache/spark/pull/3291 . For now, here is a workaround: val a = sc.parallelize(1 to 10).zipWithIndex() a.partitions // call .partitions explicitly a.repartition(10).count() Thanks for reporting the bug! -Xiangrui On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng wrote

Re: MLIB KMeans Exception

2014-11-20 Thread Xiangrui Meng
How many features and how many partitions? You set kmeans_clusters to 1. If the feature dimension is large, it would be really expensive. You can check the WebUI and see task failures there. The stack trace you posted is from the driver. Btw, the total memory you have is 64GB * 10, so you can c

Re: Python Logistic Regression error

2014-11-24 Thread Xiangrui Meng
The data is in LIBSVM format. So this line won't work: values = [float(s) for s in line.split(' ')] Please use the util function in MLUtils to load it as an RDD of LabeledPoint. http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point from pyspark.mllib.util import MLUtils examp

Re: Store kmeans model

2014-11-24 Thread Xiangrui Meng
KMeansModel is serializable, so you can use plain Java serialization. Try sc.parallelize(Seq(model)).saveAsObjectFile(outputDir) sc.objectFile[KMeansModel](outputDir).first() We will try to address model export/import more formally in 1.3, e.g., https://www.github.com/apache/spark/pull/3062 -Xiangrui

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Xiangrui Meng
Try building Spark with -Pnetlib-lgpl, which includes the JNI library in the Spark assembly jar. This is the simplest approach. If you want to include it as part of your project, make sure the library is inside the assembly jar or you specify it via `--jars` with spark-submit. -Xiangrui On Mon, No
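For reference, the profile is passed to the Maven build like this (per the MLlib docs of that era):

    mvn -Pnetlib-lgpl -DskipTests clean package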

Re: Is spark streaming +MlLib for online learning?

2014-11-25 Thread Xiangrui Meng
In 1.2, we added streaming k-means: https://github.com/apache/spark/pull/2942 . -Xiangrui On Mon, Nov 24, 2014 at 5:25 PM, Joanne Contact wrote: > Thank you Tobias! > > On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer wrote: >> >> Hi, >> >> On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact >> wro

Re: K-means clustering

2014-11-25 Thread Xiangrui Meng
There is a simple example here: https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py . You can take advantage of sparsity by computing the distance via inner products: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2 -Xiangrui On Tue, Nov 25, 2014 at 2:39

Re: MlLib Colaborative filtering factors

2014-11-25 Thread Xiangrui Meng
It is data-dependent, and hence needs hyper-parameter tuning, e.g., grid search. The first batch is certainly expensive. But after you figure out a small range for each parameter that fits your data, the following batches should not be that expensive. There is an example from AMPCamp: http://ampcamp.b

Re: why MatrixFactorizationModel private?

2014-11-25 Thread Xiangrui Meng
Besides API stability concerns, models constructed directly by users rather than returned by ALS may not work well. The userFeatures and productFeatures are both partitioned, so we can perform quick lookups for prediction. If you save userFeatures and productFeatures and load them back, it is

Re: why MatrixFactorizationModel private?

2014-11-25 Thread Xiangrui Meng
Sent out a PR to make the constructor public and leave a note in the doc: https://github.com/apache/spark/pull/3459 . If you load userFeatures and productFeatures back and want to make predictions on individual records, it might be useful to call partitionBy(...).cache() on the userFeatures and pro

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Xiangrui Meng
The training RMSE may increase due to regularization. Squared loss only represents part of the global loss. If you watch the sum of the squared loss and the regularization, it should be non-increasing. -Xiangrui On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen wrote: > I also modified the example to tr

Re: ALS failure with size > Integer.MAX_VALUE

2014-12-02 Thread Xiangrui Meng
Hi Bharath, You can try setting a smaller number of item blocks in this case. 1200 is definitely too large for ALS. Please try 30 or even smaller. I'm not sure whether this could solve the problem because you have 100 items connected with 10^8 users. There is a JIRA for this issue: https://issues.apache.org/

Re: printing mllib.linalg.vector

2014-12-15 Thread Xiangrui Meng
You can use the default toString method to get the string representation. If you want to customize it, check the indices/values fields. -Xiangrui On Fri, Dec 5, 2014 at 7:32 AM, debbie wrote: > Basic question: > > What is the best way to loop through one of these and print their > components? Conve
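A small sketch of the customized route (assuming the mllib.linalg vector types; the function name is illustrative):

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

    def describe(v: Vector): String = v match {
      case sv: SparseVector =>
        s"size=${sv.size} indices=[${sv.indices.mkString(",")}] values=[${sv.values.mkString(",")}]"
      case dv: DenseVector =>
        dv.values.mkString("[", ",", "]")
    }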

Re: MLlib(Logistic Regression) + Spark Streaming.

2014-12-15 Thread Xiangrui Meng
If you want to train offline and predict online, you can use the current LR implementation to train a model and then apply model.predict on the dstream. -Xiangrui On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan wrote: > I am new to spark. > Lets say i want to develop a machine learning model. which tr
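A minimal sketch of that offline-train/online-predict pattern (assuming trainingData: RDD[LabeledPoint] and featureStream: DStream[Vector]; names are illustrative):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    // offline: train once on a static RDD
    val model = LogisticRegressionWithSGD.train(trainingData, 100)

    // online: score each micro-batch of the stream
    val predictions = featureStream.map(v => model.predict(v))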

Re: MLLIb: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-15 Thread Xiangrui Meng
Is it possible that after filtering the feature dimension changed? This may happen if you use LIBSVM format but didn't specify the number of features. -Xiangrui On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak wrote: > Hi All, > > > I was able to run LinearRegressionwithSGD for a largeer dataset (> 2

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Xiangrui Meng
Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai wrote: > You just need to use the latest master code without any configuration > to get performance improvement from my PR. > > Since

Re: Stack overflow Error while executing spark SQL

2014-12-15 Thread Xiangrui Meng
Could you post the full stacktrace? It seems to be some recursive call in parsing. -Xiangrui On Tue, Dec 9, 2014 at 7:44 PM, wrote: > Hi > > > > I am getting Stack overflow Error > > Exception in main java.lang.stackoverflowerror > > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.sc

Re: Building Desktop application for ALS-MlLib/ Training ALS

2014-12-15 Thread Xiangrui Meng
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal wrote: > > > Hi, > > > > I am a newbie in the spark and scala world > > > > I have been trying to implement Collaborative filtering using MlLib supplied > out of the box with Spark and Scala > > > > I have 2 problems > > > > 1. The best model was

Re: ALS failure with size > Integer.MAX_VALUE

2014-12-15 Thread Xiangrui Meng
I'll try out setting a smaller number of item blocks. And >> yes, I've been following the JIRA for the new ALS implementation. I'll try >> it out when it's ready for testing. >> >> On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng wrote: >>> >>

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-18 Thread Xiangrui Meng
Hi Jay, Please try increasing executor memory (if the available memory is more than 2GB) and reduce numBlocks in ALS. The current implementation stores all subproblems in memory and hence the memory requirement is significant when k is large. You can also try reducing k and see whether the problem
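A sketch of the knobs mentioned above (the values are placeholders to tune):

    import org.apache.spark.mllib.recommendation.ALS

    val model = new ALS()
      .setRank(10)       // smaller k shrinks each subproblem
      .setIterations(10)
      .setLambda(0.01)
      .setBlocks(20)     // reduced numBlocks, per the advice above
      .run(ratings)      // ratings: RDD[Rating]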

Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install pac

Re: Interpreting MLLib's linear regression o/p

2014-12-22 Thread Xiangrui Meng
Did you check the indices in the LIBSVM data and the master file? Do they match? -Xiangrui On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak wrote: > Hi All, > I use LIBSVM format to specify my input feature vector, which used 1-based > index. When I run regression the o/p is 0-indexed based. I have

Re: MLLib beginner question

2014-12-22 Thread Xiangrui Meng
How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci wrote: > Hi! > > I want to try out spark mllib in my spark project, but I got a little > problem. I have training data (external file), but the real data comes from > another rdd. How can I do that

Re: MLlib + Streaming

2014-12-23 Thread Xiangrui Meng
We have streaming linear regression (since v1.1) and k-means (v1.2) in MLlib. You can check the user guide: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering -Xiangrui On Tue, D
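For the streaming linear regression case, a minimal sketch along the lines of the user guide (the feature dimension and stream names are assumptions):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingStream) // trainingStream: DStream[LabeledPoint]
    model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()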

Re: retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Xiangrui Meng
Sean's PR may be relevant to this issue (https://github.com/apache/spark/pull/3702). As a workaround, you can try to truncate the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643) before sending it to BinaryClassificationMetrics. This may not work well if the score distribution is very skewed. See
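A sketch of the truncation workaround (scoreAndLabels is an assumed RDD of (raw score, label); the 1e4 factor gives the suggested 4 digits):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    val binned = scoreAndLabels.map { case (score, label) =>
      (math.round(score * 1e4) / 1e4, label)
    }
    val metrics = new BinaryClassificationMetrics(binned)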

Re: Using TF-IDF from MLlib

2014-12-29 Thread Xiangrui Meng
Hopefully the new pipeline API addresses this problem. We have a code example here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala -Xiangrui On Mon, Dec 29, 2014 at 5:22 AM, andy petrella wrote: > Here is w

Re: weights not changed with different reg param

2014-12-29 Thread Xiangrui Meng
Could you post your code? It sounds like a bug. One thing to check is whether you set regType, which is None by default. -Xiangrui On Tue, Dec 23, 2014 at 3:36 PM, Thomas Kwan wrote: > Hi there > > We are on mllib 1.1.1, and trying different regularization parameters. We > noticed that the re

Re: MLLib beginner question

2014-12-29 Thread Xiangrui Meng
vice?) > > b0c1 > > -- > Skype: boci13, Hangout: boci.b...@gmail.com > > On Tue, Dec 23, 2014 at 1:35 AM, Xiangrui Meng wrote: >> >> How big is the dataset you want to use in prediction? -Xiangrui >> &g

Re: SVDPlusPlus Recommender in MLLib

2014-12-29 Thread Xiangrui Meng
There is an SVD++ implementation in GraphX. It would be nice if you can compare its performance vs. Mahout. -Xiangrui On Wed, Dec 24, 2014 at 6:46 AM, Prafulla Wani wrote: > hi , > > Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is > implemented in Mahout from this paper -

Re: MLLIB and Openblas library in non-default dir

2015-01-05 Thread Xiangrui Meng
It might be hard to do that with spark-submit, because the executor JVMs may be already up and running before a user runs spark-submit. You can try to use `System.setProperty` to change the property at runtime, though it doesn't seem to be a good solution. -Xiangrui On Fri, Jan 2, 2015 at 6:28 AM,

Re: python API for gradient boosting?

2015-01-05 Thread Xiangrui Meng
I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-5094. Hopefully someone would work on it and make it available in the 1.3 release. -Xiangrui On Sun, Jan 4, 2015 at 6:58 PM, Christopher Thom wrote: > Hi, > > > > I wonder if anyone knows when a python API will be added for Grad

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Xiangrui Meng
How big is your dataset, and what is the vocabulary size? -Xiangrui On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote: > Hi, > > When we run mllib word2vec(spark-1.1.0), the driver gets stuck with 100% CPU > usage. Here is the jstack output: > > "main" prio=10 tid=0x40112800 nid=0x46f2 runnable

Re: MLLIB and Openblas library in non-default dir

2015-01-06 Thread Xiangrui Meng
spark-submit script > Am I correct? > > > > On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng wrote: >> >> It might be hard to do that with spark-submit, because the executor >> JVMs may be already up and running before a user runs spark-submit. >> You can try to

Re: [MLLib] storageLevel in ALS

2015-01-06 Thread Xiangrui Meng
Which Spark version are you using? We made this configurable in 1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202 -Xiangrui On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. wrote: > Hi, >I was doing a tests with ALS and I

Re: confidence/probability for prediction in MLlib

2015-01-06 Thread Xiangrui Meng
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789. In the new pipeline API, we can simply output two columns, one for the best predicted class, and the other for probabilities or confidence scores for each class. -Xiangrui On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li wrote: > H

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-06 Thread Xiangrui Meng
Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch wrote: > Hi All, > > Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in > local mode and not on distributed mode. Null pointer exception has been > th

Re: calculating the mean of SparseVector RDD

2015-01-07 Thread Xiangrui Meng
There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok wrote: > I have an RDD of SparseVectors and I'd like to calculate the means returning > a dense vector. I've tried doing

Re: Discrepancy in PCA values

2015-01-08 Thread Xiangrui Meng
The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara wrote: > Hi All, > > I tried to do PCA for the Iris dataset > [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib > [http://spark.a
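In code terms, PCA on the covariance matrix amounts to centering the columns first; a sketch with the RowMatrix API (rows: RDD[Vector] is an assumption):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mean = new RowMatrix(rows).computeColumnSummaryStatistics().mean.toArray
    val centered = rows.map { v =>
      Vectors.dense(v.toArray.zip(mean).map { case (x, m) => x - m })
    }
    val pcs = new RowMatrix(centered).computePrincipalComponents(2)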

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-09 Thread Xiangrui Meng
On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng wrote: >> >> Could you attach the executor log? That may help identify the root >> cause. -Xiangrui >> >> On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch >> wrote: >> > Hi All, >> > >> &

Re: Discrepancy in PCA values

2015-01-09 Thread Xiangrui Meng
*X ; > > Thanks, > Upul > > On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng wrote: >> >> The Julia code is computing the SVD of the Gram matrix. PCA should be >> applied to the covariance matrix. -Xiangrui >> >> On Thu, Jan 8, 2015 at 8:27 AM, Upul Band

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-09 Thread Xiangrui Meng
Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer (check RDD.zipWithUniqueId), use them in ALS, and then join the original bigInt IDs back after training. -Xiangrui On Fri, Jan 9, 2015 at 5:12 PM, nishanthps wrote: > Hi, > > The userId's and
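A sketch of the ID-mapping trick (the input layout raw: RDD[(BigInt, BigInt, Double)] of (user, product, rating) is an assumption):

    import org.apache.spark.mllib.recommendation.Rating

    // assumes fewer than 2^31 distinct users/products, so toInt is safe
    val userIds = raw.map(_._1).distinct().zipWithUniqueId().mapValues(_.toInt)
    val productIds = raw.map(_._2).distinct().zipWithUniqueId().mapValues(_.toInt)

    val ratings = raw.map { case (u, p, r) => (u, (p, r)) }
      .join(userIds).map { case (_, ((p, r), ui)) => (p, (ui, r)) }
      .join(productIds).map { case (_, ((ui, r), pi)) => Rating(ui, pi, r) }
    // after training, join the factor matrices back against the swapped ID maps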

Re: Zipping RDDs of equal size not possible

2015-01-09 Thread Xiangrui Meng
"sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out" How do you guarantee that the two RDDs have the same size? -Xiangrui On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke <1wil...@informatik.uni-hamburg.de> wrote: > Hi Spark community, > >

Re: OptionalDataException during Naive Bayes Training

2015-01-09 Thread Xiangrui Meng
How big is your data? Did you see other error messages from executors? It seems to me like a shuffle communication error. This thread may be relevant: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccalrnvjuvtgae_ag1rqey_cod1nmrlfpesxgsb7g8r21h0bm...@mail.gmail.com%3E -Xiangru

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Xiangrui Meng
> required: 8 > > is there an easy/obvious fix? > > > On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng wrote: >> >> There is some serialization overhead. You can try >> >> https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 >> .

Re: Discrepancy in PCA values

2015-01-12 Thread Xiangrui Meng
> > Now PCA calculated using Julia and R is identical, but still I can see a > small > difference between PCA values given by Spark and the other two. > > Thanks, > Upul > > On Sat, Jan 10, 2015 at 11:17 AM, Xiangrui Meng wrote: >> >> You need to subtract mean values t

Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Xiangrui Meng
KryoException: Buffer overflow. Available: 5, > required: 8 > > Just calling colStats doesn't actually compute those statistics, does it? It > looks like the computation is only carried out once you call the .mean() > method. > > > > On Sat, Jan 10, 2015 at 7:04 AM, Xiangru

Re: including the spark-mllib in build.sbt

2015-01-12 Thread Xiangrui Meng
I don't know the root cause. Could you try including only libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1" It should be sufficient because mllib depends on core. -Xiangrui On Mon, Jan 12, 2015 at 2:27 PM, Jianguo Li wrote: > Hi, > > I am trying to build my own scala project

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-14 Thread Xiangrui Meng
. Best, Xiangrui On Wed, Jan 14, 2015 at 1:04 PM, Nishanth P S wrote: > Yes, we are close to having more than 2 billion users. In this case what is the > best way to handle this. > > Thanks, > Nishanth > > On Fri, Jan 9, 2015 at 9:50 PM, Xiangrui Meng wrote: >> >> Do y

Re: Using a RowMatrix inside a map

2015-01-14 Thread Xiangrui Meng
Yes, you can only use RowMatrix.multiply() within the driver. We are working on distributed block matrices and linear algebra operations on top of it, which would fit your use cases well. It may take several PRs to finish. You can find the first one here: https://github.com/apache/spark/pull/3200 -

Re: Pig on Spark

2014-03-10 Thread Xiangrui Meng
Hi Sameer, Lin (cc'ed) could also give you some updates about Pig on Spark development on her side. Best, Xiangrui On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak wrote: > Hi Mayur, > We are planning to upgrade our distribution MR1> MR2 (YARN) and the goal is > to get SPROK set up next month. I

Re: possible bug in Spark's ALS implementation...

2014-03-11 Thread Xiangrui Meng
Hi Michael, I can help check the current implementation. Would you please go to https://spark-project.atlassian.net/browse/SPARK and create a ticket about this issue with component "MLlib"? Thanks! Best, Xiangrui On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman wrote: > Hi, > > I'm implementing

Re: possible bug in Spark's ALS implementation...

2014-03-11 Thread Xiangrui Meng
Line 376 should be correct as it is computing \sum_i (c_i - 1) x_i x_i^T = \sum_i (alpha * r_i) x_i x_i^T. Are you computing some metrics to tell which recommendation is better? -Xiangrui On Tue, Mar 11, 2014 at 6:38 PM, Xiangrui Meng wrote: > Hi Michael, > > I can help check th

Re: possible bug in Spark's ALS implementation...

2014-03-14 Thread Xiangrui Meng
Hi Michael, Thanks for looking into the details! Computing X first and computing Y first can deliver different results, because the initial objective values could differ by a lot. But the algorithm should converge after a few iterations. It is hard to tell which should go first. After all, the def

Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Xiangrui Meng
The factor matrix Y is used twice in implicit ALS computation, one to compute global Y^T Y, and another to compute local Y_i^T C_i Y_i. -Xiangrui On Sun, Mar 16, 2014 at 1:18 PM, Matei Zaharia wrote: > On Mar 14, 2014, at 5:52 PM, Michael Allman wrote: > > I also found that the product and user

Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Xiangrui Meng
Hi Michael, I made a couple of changes to implicit ALS. One gives faster construction of YtY (https://github.com/apache/spark/pull/161), which was merged into master. The other caches intermediate matrix factors properly (https://github.com/apache/spark/pull/165). They should give you the same result a

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Xiangrui Meng
Sorry, the link was wrong. Should be https://github.com/apache/spark/pull/131 -Xiangrui On Tue, Mar 18, 2014 at 10:20 AM, Michael Allman wrote: > Hi Xiangrui, > > I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can > you explain? > > Also, thanks for addressing the issue

Re: Feed KMeans algorithm with a row major matrix

2014-03-18 Thread Xiangrui Meng
Hi Jaonary, With the current implementation, you need to call Array.slice to make each row an Array[Double] and cache the result RDD. There is a plan to support block-wise input data and I will keep you informed. Best, Xiangrui On Tue, Mar 18, 2014 at 2:46 AM, Jaonary Rabarisoa wrote: > Dear Al

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Xiangrui Meng
Glad to hear the speed-up. Wish we can improve the implementation further in the future. -Xiangrui On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman wrote: > I just ran a runtime performance comparison between 0.9.0-incubating and your > als branch. I saw a 1.5x improvement in performance. > > > >

Re: Kmeans example reduceByKey slow

2014-03-23 Thread Xiangrui Meng
Hi Tsai, Could you share more information about the machine you used and the training parameters (runs, k, and iterations)? It can help solve your issues. Thanks! Best, Xiangrui On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming wrote: > Hi, > > At the reduceBuyKey stage, it takes a few minutes befo

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
the first iteration. K=50. Here's the code I use: > http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example. > > Thanks! > > > > On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng wrote: > >> Hi Tsai, >> >> Could you share more informati

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
> Does the size of the input data matters for the example? Currently I have 50M > rows. What is a reasonable size to demonstrate the capability of Spark? > > > > > > On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng wrote: > >> K = 50 is certainly a large number for k-m

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
" here is the app/driver/spark-shell? > > Thanks! > > On 25 Mar, 2014, at 1:03 am, Xiangrui Meng wrote: >> >> Number of rows doesn't matter much as long as you have enough workers >> to distribute the work. K-means has complexity O(n * d * k), where n >> is

Re: ArrayIndexOutOfBoundsException in ALS.implicit

2014-03-28 Thread Xiangrui Meng
Hi bearrito, This is a known issue (https://spark-project.atlassian.net/browse/SPARK-1281) and it should be easy to fix by switching to a hash partitioner. CC'ed dev list in case someone volunteers to work on it. Best, Xiangrui On Thu, Mar 27, 2014 at 8:38 PM, bearrito wrote: > Usage of negati

Re: Issue with zip and partitions

2014-04-01 Thread Xiangrui Meng
From the API docs: "Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the
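A quick illustration of when zip is safe (both sides derived from the same parent via map, so the partition structure matches):

    val base = sc.parallelize(1 to 100, 4)
    val a = base.map(_ * 2)
    val b = base.map(_ + 1)
    a.zip(b).take(3) // OK: same number of partitions and elements per partition
    // zipping after e.g. a filter on one side breaks the contract and fails at runtime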

Re: ui broken in latest 1.0.0

2014-04-07 Thread Xiangrui Meng
This is fixed in https://github.com/apache/spark/pull/281. Please try again with the latest master. -Xiangrui On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers wrote: > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago > (apr 5) that the "application detail ui" no longer show

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
Fix storage UI bug > > > > On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers wrote: >> >> got it thanks >> >> >> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng wrote: >>> >>> This is fixed in https://github.com/apache/spark/pull/281. Ple

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
This is because the tasks associated with re-using the RDD do not >>> report the RDD's blocks as updated (which is correct). On stage submit, >>> however, we overwrite any existing >>> >>> Author: Andrew Or >>> >>> Closes #281 from

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Xiangrui Meng
After sbt/sbt gen-idea, do not import as an SBT project but choose "open project" and point it to the spark folder. -Xiangrui On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen wrote: > I let IntelliJ read the Maven build directly and that works fine. > -- > Sean Owen | Director, Data Science | London >

Re: SVD under spark/mllib/linalg

2014-04-11 Thread Xiangrui Meng
It was moved to mllib.linalg.distributed.RowMatrix. With RowMatrix, you can compute column summary statistics, gram matrix, covariance, SVD, and PCA. We will provide multiplication for distributed matrices, but not in v1.0. -Xiangrui On Fri, Apr 11, 2014 at 9:12 PM, wxhsdp wrote: > Hi, all > the
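For reference, a minimal sketch of those RowMatrix calls (the k values and rows are assumptions):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows) // rows: RDD[Vector]
    val svd = mat.computeSVD(20, computeU = true) // svd.U distributed; svd.s, svd.V local
    val pcs = mat.computePrincipalComponents(10)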

Re: checkpointing without streaming?

2014-04-21 Thread Xiangrui Meng
Checkpoint clears dependencies. You might need checkpoint to cut a long lineage in iterative algorithms. -Xiangrui On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll wrote: > I'm trying to understand when I would want to checkpoint an RDD rather than > just persist to disk. > > Every reference I can

Re: checkpointing without streaming?

2014-04-21 Thread Xiangrui Meng
sist and > replicate my RDD to avoid re-computation, if that's my goal. What advantage > does checkpointing provide over disk persistence with replication? > > > On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng wrote: >> >> Checkpoint clears dependencies. You might nee

Re: error in mllib lr example code

2014-04-23 Thread Xiangrui Meng
The doc is for 0.9.1. You are running a later snapshot, which added sparse vectors. Try LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble))). The examples are updated in the master branch. You can also check the examples there. -Xiangrui On Wed, Apr 23, 2014 at 9

Re: skip lines in spark

2014-04-23 Thread Xiangrui Meng
If the first partition doesn't have enough records, then it may not drop enough lines. Try rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1) It might trigger a job. Best, Xiangrui On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai wrote: > Hi Chengi, > > If you just want to skip first n lines in RDD,

Re: Spark hangs when i call parallelize + count on a ArrayList having 40k elements

2014-04-23 Thread Xiangrui Meng
How big is each entry, and how much memory do you have on each executor? You generated all data on driver and sc.parallelize(bytesList) will send the entire dataset to a single executor. You may run into I/O or memory issues. If the entries are generated, you should create a simple RDD sc.paralleli

Re: skip lines in spark

2014-04-23 Thread Xiangrui Meng
>> What happens, if I have to drop so many records that the number exceeds >> partition 0.. ?? >> How do i handle that case? >> >> >> >> >> On Wed, Apr 23, 2014 at 9:51 AM, Xiangrui Meng wrote: >>> >>> If the first partition

Re: Hadoop—streaming

2014-04-23 Thread Xiangrui Meng
PipedRDD is an RDD[String]. If you know how to parse each result line into (key, value) pairs, then you can call reduce after. piped.map(x => (key, value)).reduceByKey((v1, v2) => v) -Xiangrui On Wed, Apr 23, 2014 at 2:09 AM, zhxfl <291221...@qq.com> wrote: > Hello,we know Hadoop-streaming is us

Re: ArrayIndexOutOfBoundsException in ALS.implicit

2014-04-23 Thread Xiangrui Meng
Hi bearrito, this issue was fixed by Tor in https://github.com/apache/spark/pull/407. You can either try the master branch or wait for the 1.0 release. -Xiangrui On Fri, Mar 28, 2014 at 12:19 AM, Xiangrui Meng wrote: > Hi bearrito, > > This is a known issue > (https://spark-project.a

Re: Failed to run count?

2014-04-23 Thread Xiangrui Meng
Which spark version are you using? Could you also include the worker logs? -Xiangrui On Wed, Apr 23, 2014 at 3:19 PM, Ian Ferreira wrote: > I am getting this cryptic error running LinearRegressionwithSGD > > Data sample > LabeledPoint(39.0, [144.0, 1521.0, 20736.0, 59319.0, 2985984.0]) > > 14/04

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
Could you share the command you used and more of the error message? Also, is it an MLlib specific problem? -Xiangrui On Thu, Apr 24, 2014 at 11:49 AM, John King wrote: > ./spark-shell: line 153: 17654 Killed > $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@" > > > Any ideas?

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread Xiangrui Meng
Is your Spark cluster running? Try to start with generating simple RDDs and counting. -Xiangrui On Thu, Apr 24, 2014 at 11:38 AM, John King wrote: > I receive this error: > > Traceback (most recent call last): > > File "", line 1, in > > File > "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/ml

Re: spark mllib to jblas calls..and comparison with VW

2014-04-24 Thread Xiangrui Meng
The data array in RDD is passed by reference to jblas, so there is no data copying in this stage. However, if jblas uses the native interface, there is a copying overhead. I think jblas uses a Java implementation for at least Level 1 BLAS, and calls the native interface for Levels 2 & 3. -Xiangrui On Thu, Apr 24,

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread Xiangrui Meng
and mapping. Just > received this error when trying to classify. > > > On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng wrote: >> >> Is your Spark cluster running? Try to start with generating simple >> RDDs and counting. -Xiangrui >> >> On Thu, Apr 24,

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
Do you mind sharing more code and error messages? The information you provided is too little to identify the problem. -Xiangrui On Thu, Apr 24, 2014 at 1:55 PM, John King wrote: > Last command was: > > val model = new NaiveBayes().run(points) > > > > On Thu, Apr 24, 2014

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
} > >val vector = new SparseVector(2357815, indices.toArray, > featValues.toArray) > >return LabeledPoint(values(0).toDouble, vector) > > } > > > val data = sc.textFile("data.txt") > > val empty = data

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
mentioned in the error have anything to do with it? > > > On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng wrote: >> >> I don't see anything wrong with your code. Could you do points.count() >> to see how many training examples you have? Also, make sure you don't

Re: Running out of memory Naive Bayes

2014-04-26 Thread Xiangrui Meng
How many labels does your dataset have? -Xiangrui On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai wrote: > Which version of mllib are you using? For Spark 1.0, mllib will > support sparse feature vector which will improve performance a lot > when computing the distance between points and centroid. > > S

Re: Running out of memory Naive Bayes

2014-04-27 Thread Xiangrui Meng
partitions and giving driver more ram and see whether it can help? -Xiangrui On Sun, Apr 27, 2014 at 3:33 PM, John King wrote: > I'm already using the SparseVector class. > > ~200 labels > > > On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng wrote: >> >> How many label

Re: Running out of memory Naive Bayes

2014-04-27 Thread Xiangrui Meng
; cache in HDFS. > > > Sincerely, > > DB Tsai > --- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng wrote: >> >> Eve

Re: running SparkALS

2014-04-28 Thread Xiangrui Meng
Hi Diana, SparkALS is an example implementation of ALS. It doesn't call the ALS algorithm implemented in MLlib. M, U, and F are used to generate synthetic data. I'm updating the examples. In the meantime, you can take a look at the updated MLlib guide: http://50.17.120.186:4000/mllib-collaborativ

Re: Spark LIBLINEAR

2014-05-12 Thread Xiangrui Meng
Hi Chieh-Yen, Great to see the Spark implementation of LIBLINEAR! We will definitely consider adding a wrapper in MLlib to support it. Is the source code on github? Deb, Spark LIBLINEAR uses BSD license, which is compatible with Apache. Best, Xiangrui On Sun, May 11, 2014 at 10:29 AM, Debasish

Re: Turn BLAS on MacOSX

2014-05-12 Thread Xiangrui Meng
Those are warning messages instead of errors. You need to add netlib-java:all to use native BLAS/LAPACK. But it won't work if you include netlib-java:all in an assembly jar. It has to be a separate jar when you submit your job. For SGD, we only use level-1 BLAS, so I don't think native code is call

Re: Accuracy in mllib BinaryClassificationMetrics

2014-05-12 Thread Xiangrui Meng
Hi Deb, feel free to add accuracy along with precision and recall. -Xiangrui On Mon, May 12, 2014 at 1:26 PM, Debasish Das wrote: > Hi, > > I see precision and recall but no accuracy in mllib.evaluation.binary. > > Is it already under development or it needs to be added ? > > Thanks. > Deb >

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Xiangrui Meng
You have a long lineage that causes the StackOverflow error. Try rdd.checkpoint() and rdd.count() every 20~30 iterations. checkpoint() can cut the lineage. -Xiangrui On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan wrote: > Dear Sparkers: > > I am using Python spark of version 0.9.0 to implement so
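A sketch of the periodic-checkpoint pattern (in Scala; the step function, directory, and interval are placeholders):

    sc.setCheckpointDir("hdfs://...") // hypothetical checkpoint directory
    var rdd = initialRDD
    for (i <- 1 to numIterations) {
      rdd = step(rdd) // one iteration of the algorithm (assumed)
      if (i % 25 == 0) {
        rdd.checkpoint()
        rdd.count() // materialize so the lineage is actually cut
      }
    }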

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Xiangrui Meng
the RDD when it is materialized & it only > materializes in the end, then it runs out of stack. > > Regards > Mayur > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi > > > > On Tue, May 13, 2014 at 11:40 AM, Xiangru

Re: Reading from .bz2 files with Spark

2014-05-13 Thread Xiangrui Meng
Which Hadoop version did you use? I'm not sure whether Hadoop v2 fixes the problem you described, but it does contain several fixes to the bzip2 format. -Xiangrui On Wed, May 7, 2014 at 9:19 PM, Andrew Ash wrote: > Hi all, > > Is anyone reading and writing to .bz2 files stored in HDFS from Spark with
