Re: MLLib Linear regression
Did you test different regularization parameters and step sizes? In the combination that works, I don't see A + D. Did you test that combination? Is there any linear dependency between A's columns and D's columns? -Xiangrui

On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak ssti...@live.com wrote:

BTW, one detail: When the number of iterations is 100, all weights are zero or below and the indices are only from set A. When the number of iterations is 150, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets; however, MSE is high (5.xxx) and the result does not match the domain knowledge. When the number of iterations is 400, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets; however, MSE is high (6.xxx) and the result does not match the domain knowledge. Any help will be highly appreciated.

From: ssti...@live.com
To: user@spark.apache.org
Subject: MLLib Linear regression
Date: Tue, 7 Oct 2014 13:41:03 -0700

Hi All,

I have the following classes of features:
class A: 15000 features
class B: 170 features
class C: 900 features
class D: 6000 features

I use linear regression (over sparse data). I get excellent results with low RMSE (~0.06) for the following combinations of classes:
1. A + B + C
2. B + C + D
3. A + B
4. A + C
5. B + D
6. C + D
7. D

Unfortunately, when I use A + B + C + D (all the features) I get results that don't make any sense -- all weights are zero or below and the indices are only from set A. I also get high MSE. I changed the number of iterations from 100 to 150, 250, or even 400. I still get an MSE of 5 to 6. Are there any other parameters that I can play with? Any insight on what could be wrong? Is it somehow not able to scale up to 22K features? (I highly doubt that.)
Re: MLLib Linear regression
The proper step size partially depends on the Lipschitz constant of the objective. You should let the machine try different combinations of parameters and select the best. We are working with people from AMPLab to make hyperparameter tuning easier in MLlib 1.2. For the theory, Nesterov's book Introductory Lectures on Convex Optimization is a good one. We didn't use line search in the current implementation of LinearRegression; we should definitely add that option in the future. Best, Xiangrui

On Wed, Oct 8, 2014 at 7:21 AM, Sameer Tilak ssti...@live.com wrote:

Hi Xiangrui, Changing the default step size to 0.01 made a huge difference. The results make sense when I use A + B + C + D. MSE is ~0.07 and the outcome matches the domain knowledge. I was wondering whether there is any documentation on the parameters and when/how to vary them.
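Until line search and automated tuning are available, a small grid over step sizes can be run by hand. Below is a minimal sketch, assuming the data is already parsed into LabeledPoints and split into hypothetical `training` and `validation` RDDs; the helper `mse` and the grid values are illustrative, not a Spark API.

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

def mse(model: LinearRegressionModel, data: RDD[LabeledPoint]): Double =
  data.map { p => val e = model.predict(p.features) - p.label; e * e }.mean()

val numIterations = 100
val stepSizes = Seq(1.0, 0.1, 0.01, 0.001)
val results = stepSizes.map { s =>
  // train(input, numIterations, stepSize, miniBatchFraction)
  val model = LinearRegressionWithSGD.train(training, numIterations, s, 1.0)
  (s, mse(model, validation))
}
val (bestStep, bestMse) = results.minBy(_._2)

The same loop can be widened to cover regularization by swapping in RidgeRegressionWithSGD or LassoWithSGD and iterating over regParam values as well.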
Re: DIMSUM item similarity tests
please re-try with --driver-memory 10g . The default is 256m. -Xiangrui On Thu, Oct 9, 2014 at 2:33 AM, Clive Cox clive@rummble.com wrote: Hi, I'm trying out the DIMSUM item similarity from github master commit 69c3f441a9b6e942d6c08afecd59a0349d61cc7b . My matrix is: Num items : 8860 Number of users : 5138702 Implicit 1.0 values Running item similarity with threshold :0.5 I have a 2 slave spark cluster on EC2 with m3.xlarge (13G each) I'm running out of heap space: Exception in thread handle-read-write-executor-1 java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.network.nio.Message$.create(Message.scala:90) while Spark is doing: org.apache.spark.rdd.RDD.reduce(RDD.scala:865) org.apache.spark.mllib.rdd.RDDFunctions.treeAggregate(RDDFunctions.scala:111) org.apache.spark.mllib.linalg.distributed.RowMatrix.computeColumnSummaryStatistics(RowMatrix.scala:379) org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:483) The spark UI said the shuffle read on this task at that point had used: 162.6 MB I run spark submit from the master like below: ./spark/bin/spark-submit --executor-memory 13G --master spark://ec2 Just wanted to check this is expected as the matrix doesn't seem excessively big. Is there some memory setting I am missing? Thanks, Clive - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLUtil.kfold generates overlapped training and validation set?
1. No.
2. The seed per partition is fixed, so it should generate non-overlapping subsets.
3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.

Best, Xiangrui

On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

Hi, all

When we use MLUtils.kFold to generate training and validation sets for cross validation, we found that there is an overlapping part in the two sets. From the code, it samples the same dataset twice:

@Experimental
def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
  val numFoldsF = numFolds.toFloat
  (1 to numFolds).map { fold =>
    val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
      complement = false)
    val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
    val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
    (training, validation)
  }.toArray
}

Even though the training sampler is the complement, there is still a possibility of generating overlapping training and validation sets, because the sampling method looks like:

override def sample(items: Iterator[T]): Iterator[T] = {
  items.filter { item =>
    val x = rng.nextDouble()
    (x >= lb && x < ub) ^ complement
  }
}

I'm not a machine learning guy, so I guess I must fall into one of the following three situations:
1. Does it mean we actually allow overlapping training and validation sets? (counterintuitive to me)
2. Did I misunderstand the code?
3. Is it a bug?

Anyone can explain it to me?

Best,

-- Nan Zhu
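For reference, a minimal sketch of using kFold for cross validation (on 1.0.1+/1.1, where the fixed per-partition seed makes the folds disjoint). The data-loading line is a placeholder, and LogisticRegressionWithSGD stands in for whatever model you actually train:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ... // your labeled data
val folds = MLUtils.kFold(data, numFolds = 3, seed = 42)
val errors = folds.map { case (training, validation) =>
  val model = LogisticRegressionWithSGD.train(training.cache(), 100)
  // 0/1 loss on the held-out fold
  validation.map { p =>
    if (model.predict(p.features) == p.label) 0.0 else 1.0
  }.mean()
}
println("mean CV error: " + errors.sum / errors.length)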
Re: TF-IDF in Spark 1.1.0
You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID, and join back the result after training. HashingTF can transform an individual record:

val docs: RDD[(String, Seq[String])] = ...
val tf = new HashingTF()
val tfWithId: RDD[(String, Vector)] = docs.mapValues(tf.transform)
...

Best, Xiangrui

On Tue, Oct 14, 2014 at 9:15 AM, Burke Webster burke.webs...@gmail.com wrote:

I'm following the MLlib example for TF-IDF and ran into a problem due to my lack of knowledge of Scala and Spark. Any help would be greatly appreciated.

Following the MLlib example I could do something like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF

val sc: SparkContext = ...
val documents: RDD[Seq[String]] = sc.textFile(...).map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

As a result I would have an RDD containing the TF-IDF vectors for the input documents. My question is how do I map the vector back to the original input document? My end goal is to compute document similarity using cosine similarity. From what I can tell, I can compute TF-IDF, apply the L2 norm, and then compute the dot-product. Has anybody done this?

Currently, my example looks more like this:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext

val sc: SparkContext = ...
// input is a sequence file of the form (docid: Text, content: Text)
val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[(String, Vector)] = hashingTF.??

I'm trying to maintain some linking from the document identifier to its eventual vector representation. Am I going about this incorrectly?

Thanks
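Putting the pieces together, here is a rough sketch of one way to keep document IDs through the whole TF-IDF pipeline and compute cosine similarity afterwards. It assumes IDF.fit/transform operate on a plain RDD[Vector] (as in 1.1), that the 1-to-1 derivation from `docs` makes zipping the keys back safe, and that HashingTF produced sparse vectors (its default); treat it as a starting point rather than the official recipe.

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.{SparseVector, Vector}
import org.apache.spark.rdd.RDD

val docs: RDD[(String, Seq[String])] = ... // (docId, tokens)
val tf: RDD[Vector] = new HashingTF().transform(docs.values).cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
// tf and tfidf are derived 1-to-1 (map) from docs, so the keys line up with the vectors
val tfidfById: RDD[(String, Vector)] = docs.keys.zip(tfidf)

// cosine similarity between two sparse vectors
def cosine(a: Vector, b: Vector): Double = {
  def dot(x: SparseVector, y: SparseVector): Double = {
    val yMap = y.indices.zip(y.values).toMap
    x.indices.zip(x.values).map { case (i, v) => v * yMap.getOrElse(i, 0.0) }.sum
  }
  val (sa, sb) = (a.asInstanceOf[SparseVector], b.asInstanceOf[SparseVector])
  dot(sa, sb) / math.sqrt(dot(sa, sa) * dot(sb, sb))
}

// e.g. all-pairs similarity; this is quadratic, so sample or filter first on a big corpus
val sims = tfidfById.cartesian(tfidfById).map {
  case ((id1, v1), (id2, v2)) => (id1, id2, cosine(v1, v2))
}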
Re: Spark KMeans hangs at reduceByKey / collectAsMap
What is the feature dimension? I saw you used 100 partitions. How many cores does your cluster have? -Xiangrui On Tue, Oct 14, 2014 at 1:51 PM, Ray ray-w...@outlook.com wrote: Hi guys, An interesting thing, for the input dataset which has 1.5 million vectors, if set the KMeans's k_value = 100 or k_value = 50, it hangs as mentioned above. However, if decrease k_value = 10, the same error still appears in the log but the application finished successfully, without observable hanging. Hopefully this provides more information. Thanks. Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16417.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?
LBFGS is better. If your data is easily separable, LR might return values very close or equal to either 0.0 or 1.0. It is rare, but it may happen. -Xiangrui

On Tue, Oct 14, 2014 at 3:18 PM, Aris arisofala...@gmail.com wrote:

Wow... I just tried LogisticRegressionWithLBFGS, and using clearThreshold() DOES IN FACT work. It appears that LogisticRegressionWithSGD returns a model whose method is broken!!

On Tue, Oct 14, 2014 at 3:14 PM, Aris arisofala...@gmail.com wrote:

Hi folks, When I am predicting binary 1/0 responses with LogisticRegressionWithSGD, it returns a LogisticRegressionModel. In Spark 1.0.X I was using the clearThreshold method on the model to get the raw predicted probabilities when I ran the predict() method... It appears now that rather than getting a realistic probability between 0.0 and 1.0, I am only getting back predictions of 0.0 OR 1.0... never anything in between. The API says that clearThreshold is experimental... it was working before! Is it broken now?

Thanks!
Aris
Re: Spark KMeans hangs at reduceByKey / collectAsMap
Just ran a test on mnist8m (8m x 784) with k = 100 and numIter = 50. It worked fine. Ray, the error log you posted is after cluster termination, which is not the root cause. Could you search your log and find the real cause? On the executor tab screenshot, I saw only 200MB is used. Did you cache the input data? If yes, could you check the storage tab of Spark WebUI and see how the data is distributed across executors. -Xiangrui On Tue, Oct 14, 2014 at 4:26 PM, DB Tsai dbt...@dbtsai.com wrote: I saw similar bottleneck in reduceByKey operation. Maybe we can implement treeReduceByKey to reduce the pressure on single executor reducing the particular key. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Oct 15, 2014 at 12:16 AM, Burak Yavuz bya...@stanford.edu wrote: Hi Ray, The reduceByKey / collectAsMap does a lot of calculations. Therefore it can take a very long time if: 1) The parameter number of runs is set very high 2) k is set high (you have observed this already) 3) data is not properly repartitioned It seems that it is hanging, but there is a lot of calculation going on. Did you use a different value for the number of runs? If you look at the storage tab, does the data look balanced among executors? Best, Burak - Original Message - From: Ray ray-w...@outlook.com To: u...@spark.incubator.apache.org Sent: Tuesday, October 14, 2014 2:58:03 PM Subject: Re: Spark KMeans hangs at reduceByKey / collectAsMap Hi Xiangrui, The input dataset has 1.5 million sparse vectors. Each sparse vector has a dimension(cardinality) of 9153 and has less than 15 nonzero elements. Yes, if I set num-executors = 200, from the hadoop cluster scheduler, I can see the application got 201 vCores. From the spark UI, I can see it got 201 executors (as shown below). http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_core.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_executor.png Thanks. Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16428.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark KMeans hangs at reduceByKey / collectAsMap
I used k-means||, which is the default. And it took less than 1 minute to finish. 50 iterations took less than 25 minutes on a cluster of 9 m3.2xlarge EC2 nodes. Which deploy mode did you use? Is it yarn-client? -Xiangrui On Tue, Oct 14, 2014 at 6:03 PM, Ray ray-w...@outlook.com wrote: Hi Xiangrui, Thanks for the guidance. I read the log carefully and found the root cause. KMeans, by default, uses KMeans++ as the initialization mode. According to the log file, the 70-minute hanging is actually the computing time of Kmeans++, as pasted below: 14/10/14 14:48:18 INFO DAGScheduler: Stage 20 (collectAsMap at KMeans.scala:293) finished in 2.233 s 14/10/14 14:48:18 INFO SparkContext: Job finished: collectAsMap at KMeans.scala:293, took 85.590020124 s 14/10/14 14:48:18 INFO ShuffleBlockManager: Could not find files for shuffle 5 for deleting 14/10/14 *14:48:18* INFO ContextCleaner: Cleaned shuffle 5 14/10/14 15:50:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/10/14 15:50:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS *14/10/14 15:54:36 INFO LocalKMeans: Local KMeans++ converged in 11 iterations. 14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913 seconds.* 14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at KMeans.scala:190 14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at KMeans.scala:190) 14/10/14 15:54:37 INFO DAGScheduler: Got job 16 (collectAsMap at KMeans.scala:190) with 100 output partitions (allowLocal=false) 14/10/14 15:54:37 INFO DAGScheduler: Final stage: Stage 22(collectAsMap at KMeans.scala:190) I now use random as the Kmeans initialization mode, and other confs remain the same. This time, it just finished quickly~~ In your test on mnis8m, did you use KMeans++ as initialization mode? How long it takes? Thanks again for your help. Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16450.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark1.0 principal component analysis
computePrincipalComponents returns a local matrix whose columns are the principal components, ordered by the amount of variance they explain. Each column vector is in the same feature space as the input feature vectors, so the i-th entry of a principal component corresponds to the i-th original feature. -Xiangrui

On Thu, Oct 16, 2014 at 2:39 AM, al123 ant.lay...@hotmail.co.uk wrote:

Hi, I don't think anybody answered this question...

fintis wrote: "How do I match the principal components to the actual features since there is some sorting?"

Would anybody be able to shed a little light on it since I too am struggling with this?

Many thanks!!
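As a rough sketch of what that looks like in code, assuming `mat` is the RowMatrix you ran PCA on; the key detail is that Matrix.toArray is stored column-major, so entry (i, j) sits at position j * numRows + i:

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ... // rows are the original feature vectors
val k = 3
val pc: Matrix = mat.computePrincipalComponents(k)
// pc has numFeatures rows and k columns; entry (i, j) is the loading of
// original feature i on principal component j
val values = pc.toArray
for (j <- 0 until k) {
  val loadings = (0 until pc.numRows).map(i => (i, values(j * pc.numRows + i)))
  val top = loadings.sortBy { case (_, w) => -math.abs(w) }.take(5)
  println("component " + j + ", top features by |loading|: " + top)
}
// project the original data onto the principal components
val projected: RowMatrix = mat.multiply(pc)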
Re: MLLib libsvm format
Yes — within each line the indices need to be sorted, i.e. "the indices are one-based and **in ascending order**". -Xiangrui

On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak ssti...@live.com wrote:

Hi All,

I have a question regarding the ordering of indices. The documentation says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order?

Sparse data

It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.

For example, I have indices ranging from 1 to 1000. Is this OK as a libsvm data file?

1 110:1.0 80:0.5 310:0.0
0 890:0.5 20:0.0 200:0.5 400:1.0 82:0.0

and so on, OR do I need to sort them as:

1 80:0.5 110:1.0 310:0.0
0 20:0.0 82:0.0 200:0.5 400:1.0 890:0.5
Re: create a Row Matrix
Please check out the example code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala -Xiangrui On Tue, Oct 21, 2014 at 5:34 AM, viola viola.wiersc...@siemens.com wrote: Hi, I am VERY new to spark and mllib and ran into a couple of problems while trying to reproduce some examples. I am aware that this is a very simple question but could somebody please give me an example - how to create a RowMatrix in scala with the following entries: [1 2 3 4]? I would like to apply an SVD on it. Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
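In the same spirit as that example, here is a minimal sketch, reading "[1 2 3 4]" as the 2x2 matrix with rows (1, 2) and (3, 4); adjust the rows if you meant a different shape:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)

val svd = mat.computeSVD(2, computeU = true)
println(svd.s)                          // singular values
println(svd.V)                          // right singular vectors (a local matrix)
svd.U.rows.collect().foreach(println)   // left singular vectors (a distributed RowMatrix)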
Re: Read a TextFile(1 record contains 4 lines) into a RDD
If your file is not very large, try

sc.wholeTextFiles(...).values.flatMap(_.split("\n").grouped(4).map(_.mkString("\n")))

-Xiangrui

On Sat, Oct 25, 2014 at 12:57 AM, Parthus peng.wei@gmail.com wrote:

Hi,

It might be a naive question, but I still wish that somebody could help me handle it. I have a text file in which every 4 lines represent a record. Since the SparkContext.textFile() API treats each line as a record, it does not fit my case. I know that the SparkContext.hadoopFile or newAPIHadoopFile API can read a file in an arbitrary format, but I do not know how to use them. I think that there must be some API which can easily solve this problem, but I am kind of a bad googler and cannot find it by myself online. Would it be possible for somebody to tell me how to use the API? I run Spark based on Hadoop 1.2.1 rather than Hadoop 2.x. I wish that I could get several lines of code which actually works, if possible.

Thanks very much.
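If the file is too large for wholeTextFiles (which loads each file as one string), an alternative sketch is to number the lines and group them in fours. It assumes line order is preserved by textFile over a single uncompressed file, which holds for the standard text input format; the path is a placeholder.

val lines = sc.textFile("hdfs:///path/to/records.txt")
val records = lines.zipWithIndex()
  .map { case (line, idx) => (idx / 4, (idx % 4, line)) }   // record id, position within record
  .groupByKey()
  .map { case (_, group) => group.toSeq.sortBy(_._1).map(_._2).mkString("\n") }
records.take(2).foreach(println)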
Re: deploying a model built in mllib
We are working on the pipeline features, which would make this procedure much easier in MLlib. This is still a WIP and the main JIRA is at: https://issues.apache.org/jira/browse/SPARK-1856 Best, Xiangrui On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Hello, I have been prototyping a text classification model that my company would like to eventually put into production. Our technology stack is currently Java based but we would like to be able to build our models in Spark/MLlib and then export something like a PMML file which can be used for model scoring in real-time. I have been using scikit learn where I am able to take the training data convert the text data into a sparse data format and then take the other features and use the dictionary vectorizer to do one-hot encoding for the other categorical variables. All of those things seem to be possible in mllib but I am still puzzled about how that can be packaged in such a way that the incoming data can be first made into feature vectors and then evaluated as well. Are there any best practices for this type of thing in Spark? I hope this is clear but if there are any confusions then please let me know. Thanks, Chirag - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0
Could you save the data before ALS and try to reproduce the problem? You might try reducing the number of partitions and not using Kryo serialization, just to narrow down the issue. -Xiangrui On Mon, Oct 27, 2014 at 1:29 PM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Burak. I always see this error. I'm running the CDH 5.2 version of Spark 1.1.0. I load my data from HDFS. By the time it hits the recommender it had gone through many spark operations. On Oct 27, 2014 4:03 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, I've come across this multiple times, but not in a consistent manner. I found it hard to reproduce. I have a jira for it: SPARK-3080 Do you observe this error every single time? Where do you load your data from? Which version of Spark are you running? Figuring out the similarities may help in pinpointing the bug. Thanks, Burak - Original Message - From: Ilya Ganelin ilgan...@gmail.com To: user user@spark.apache.org Sent: Monday, October 27, 2014 11:36:46 AM Subject: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0 Hello all - I am attempting to run MLLib's ALS algorithm on a substantial test vector - approx. 200 million records. I have resolved a few issues I've had with regards to garbage collection, KryoSeralization, and memory usage. I have not been able to get around this issue I see below however: java.lang. ArrayIndexOutOfBoundsException: 6106 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS. scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) I do not have any negative indices or indices that exceed Int-Max. 
I have partitioned the input data into 300 partitions and my Spark config is below:

.set("spark.executor.memory", "14g")
.set("spark.storage.memoryFraction", "0.8")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "MyRegistrator")
.set("spark.core.connection.ack.wait.timeout", "600")
.set("spark.akka.frameSize", "50")
.set("spark.yarn.executor.memoryOverhead", "1024")

Does anyone have any suggestions as to why I'm seeing the above error or how to get around it? It may be possible to upgrade to the latest version of Spark, but the mechanism for doing so in our environment isn't obvious yet.

-Ilya Ganelin
Re: How to import mllib.rdd.RDDFunctions into the spark-shell
FYI, there is a PR to make mllib.rdd.RDDFunctions public: https://github.com/apache/spark/pull/2907 -Xiangrui On Tue, Oct 28, 2014 at 5:18 AM, Yanbo Liang yanboha...@gmail.com wrote: Yes, it can import org.apache.spark.mllib.rdd.RDDFunctions but you can not use any method in this class or even new an object of this class. So I infer that if you import org.apache.spark.mllib.rdd.RDDFunctions._, it may call some method of that object. 2014-10-28 17:29 GMT+08:00 Stephen Boesch java...@gmail.com: HI Yanbo, That is not the issue: notice that importing the object is fine: scala import org.apache.spark.mllib.rdd.RDDFunctions import org.apache.spark.mllib.rdd.RDDFunctions scala import org.apache.spark.mllib.rdd.RDDFunctions._ console:11: error: object RDDFunctions in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import org.apache.spark.mllib.rdd.RDDFunctions._ It has to do with the implicits. 2014-10-28 2:25 GMT-07:00 Yanbo Liang yanboha...@gmail.com: Because that org.apache.spark.mllib.rdd.RDDFunctions._ is mllib private class, it can only be called by function in mllib. 2014-10-28 17:09 GMT+08:00 Stephen Boesch java...@gmail.com: I seem to recall there were some specific requirements on how to import the implicits. Here is the issue: scala import org.apache.spark.mllib.rdd.RDDFunctions._ console:10: error: object RDDFunctions in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import org.apache.spark.mllib.rdd.RDDFunctions._ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0
Hi Ilya, Let's move the discussion to the JIRA page. I saw couple users reporting this issue but I have never seen it myself. Best, Xiangrui On Tue, Oct 28, 2014 at 8:50 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi all - I've simplified the code so now I'm literally feeding in 200 million ratings directly to ALS.train. Nothing else is happening in the program. I've also tried with both the regular serializer and the KryoSerializer. With Kryo, I get the same ArrayIndex exceptions. With the regular serializer I get the following error stack: 14/10/28 10:43:14 WARN TaskSetManager: Lost task 119.0 in stage 10.0 (TID 2282, innovationdatanode07.cof.ds.capitalone.com): java.io.FileNotFoundException: /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1414016059040_0715/spark-local-20141028102246-dde7/06/shuffle_7_119_8 (No such file or directory) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192) org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67) org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) 14/10/28 10:43:14 INFO TaskSetManager: Starting task 119.1 in stage 10.0 (TID 2303, innovationdatanode07.cof.ds.capitalone.com, PROCESS_LOCAL, 5642 bytes) 14/10/28 10:43:14 WARN TaskSetManager: Lost task 119.1 in stage 10.0 (TID 2303, innovationdatanode07.cof.ds.capitalone.com): java.io.FileNotFoundException: /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1414016059040_0715/spark-local-20141028102246-dde7/23/shuffle_8_90_119 (No such file or directory) java.io.RandomAccessFile.open(Native Method) java.io.RandomAccessFile.init(RandomAccessFile.java:241) org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:93) org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:116) org.apache.spark.shuffle.FileShuffleBlockManager.getBytes(FileShuffleBlockManager.scala:190) org.apache.spark.storage.BlockManager.getLocalShuffleFromDisk(BlockManager.scala:361) org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1$$anonfun$apply$10.apply(BlockFetcherIterator.scala:208) org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1$$anonfun$apply$10.apply(BlockFetcherIterator.scala:208) org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.next(BlockFetcherIterator.scala:258) org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.next(BlockFetcherIterator.scala:77) . 
This is an issue I referenced in the past here: https://www.google.com/url?sa=trct=jq=esrc=ssource=webcd=1cad=rjauact=8ved=0CB4QFjAAurl=https%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fincubator-spark-user%2F201410.mbox%2F%253CCAM-S9zS-%2B-MSXVcohWEhjiAEKaCccOKr_N5e0HPXcNgnxZd%3DHw%40mail.gmail.com%253Eei=97FPVIfyCsbgsASL94CoDQusg=AFQjCNEQ6gUlwpr6KzlcZVd0sQeCSdjQgQsig2=Ne7pL_Z94wN4g9BwSutsXQ -Ilya Ganelin On Mon, Oct 27, 2014 at 6:12 PM, Xiangrui Meng men...@gmail.com wrote: Could you save the data before ALS and try to reproduce the problem? You might try reducing the number of partitions and not using Kryo serialization, just to narrow down the issue. -Xiangrui On Mon, Oct 27, 2014 at 1:29 PM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Burak. I always see this error. I'm running the CDH 5.2 version of Spark 1.1.0. I load my data from HDFS. By the time it hits the recommender it had gone through many spark operations. On Oct 27, 2014 4:03 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, I've come across this multiple times, but not in a consistent manner. I found it hard to reproduce. I have a jira for it: SPARK-3080 Do you observe this error every single time? Where do you load your data from? Which version of Spark are you running? Figuring out the similarities may help in pinpointing the bug. Thanks, Burak - Original Message
Re: issue on applying SVM to 5 million examples.
Did you cache the data and check the load balancing? How many features? Which API are you using: Scala, Java, or Python? -Xiangrui

On Thu, Oct 30, 2014 at 9:13 AM, Jimmy ji...@sellpoints.com wrote:

Watch the app manager, it should tell you what's running and taking a while... My guess is it's a distinct function on the data. J

Sent from my iPhone

On Oct 30, 2014, at 8:22 AM, peng xia toxiap...@gmail.com wrote:

Hi,

Previously we applied the SVM algorithm in MLlib to 5 million records (600 MB), and it took more than 25 minutes to finish. The Spark version we are using is 1.0 and we were running this program on a 4-node cluster. Each node has 4 CPU cores and 11 GB RAM. The 5 million records only have two distinct records (one positive and one negative); the others are all duplications. Does anyone have any idea why it takes so long on this small data?

Thanks, Best, Peng
Re: MLLib: libsvm - default value initialization
You can subtract 0.5 from all the stored (non-zero) values, so that an omitted index effectively stands for 0.5 instead of 0. -Xiangrui

On Wed, Oct 29, 2014 at 9:20 PM, Sameer Tilak ssti...@live.com wrote:

Hi All,

I have my sparse data in libsvm format.

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")

I am running linear regression. Let us say that my data has the following entry:

1 1:0 4:1

I think it will assume 0 for indices 2 and 3, right? I would like the default value to be 0.5 instead of 0. Is that possible? If not, I will have to switch to dense data and it will significantly increase the data size for me.
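A minimal sketch of that shift, assuming the stored values hold the actual feature values and you train a model with an intercept (so the constant shift of 0.5 per feature is absorbed); the file path comes from the original question:

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
// subtract 0.5 from every stored value; an omitted index now represents 0.5
val shifted = examples.map { p =>
  val sv = p.features.asInstanceOf[SparseVector]
  LabeledPoint(p.label, Vectors.sparse(sv.size, sv.indices, sv.values.map(_ - 0.5)))
}
// train with an intercept so the shift is absorbed, e.g.
// new LinearRegressionWithSGD().setIntercept(true).run(shifted)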
Re: issue on applying SVM to 5 million examples.
Then caching should solve the problem. Otherwise, it is just loading and parsing the data from disk for each iteration. -Xiangrui

On Thu, Oct 30, 2014 at 11:44 AM, peng xia toxiap...@gmail.com wrote:

Thanks for all your help. I think I didn't cache the data. My previous cluster has expired and I don't have a chance to check the load balance or app manager. Below is my code. There are 18 features for each record and I am using the Scala API.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import java.util.Calendar

object BenchmarkClassification {
  def main(args: Array[String]) {
    // Load and parse the data file
    val conf = new SparkConf()
      .setAppName("SVM")
      .set("spark.executor.memory", "8g")
      // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }
    val testData = sc.textFile(args(1))
    val testParsedData = testData.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }
    // Run training algorithm to build the model
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)
    // Evaluate model on training examples and compute training error
    // val labelAndPreds = testParsedData.map { point =>
    //   val prediction = model.predict(point.features)
    //   (point.label, prediction)
    // }
    // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
    // println("Training Error = " + trainErr)
    println(Calendar.getInstance().getTime())
  }
}

Thanks, Best, Peng
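Concretely, the only change needed in the code above is to cache the training RDD before handing it to SGD (a sketch; the count() is optional and just forces materialization up front):

val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()           // keep the parsed data in memory across the 20 SGD iterations
parsedData.count()  // optional: materialize once before training
val model = SVMWithSGD.train(parsedData, numIterations)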
Re: Prediction using Classification with text attributes in Apache Spark MLLib
This operation requires two transformers:

1) Indexer, which maps string features into categorical features
2) OneHotEncoder, which flattens categorical features into binary features

We are working on the new dataset implementation, so we can easily express those transformations. Sorry for the late reply! If you want a quick and dirty solution, you can try hashing:

val rdd: RDD[(Double, Array[String])] = ...
val training = rdd.mapValues { factors =>
  val indices = mutable.Set.empty[Int]
  factors.view.zipWithIndex.foreach { case (f, idx) =>
    indices += math.abs(f.## ^ idx) % 10
  }
  Vectors.sparse(10, indices.toSeq.map(x => (x, 1.0)))
}

It creates a training dataset with all binary features, with a chance of collision. You can use it in SVM, LR, or DecisionTree.

Best, Xiangrui

On Sun, Nov 2, 2014 at 9:20 AM, ashu ashutosh.triv...@iiitb.org wrote:

Hi,

Sorry to bounce back the old thread. What is the state now? Is this problem solved? How does Spark handle categorical data now?

Regards, Ashutosh
Re: is spark a good fit for sequential machine learning algorithms?
Many ML algorithms are sequential because they were not designed to be parallel. However, ML is not driven by algorithms in practice, but by data and applications. As datasets get bigger and bigger, some algorithms have been revised to work in parallel, like SGD and matrix factorization. MLlib tries to implement those scalable algorithms that can handle large-scale datasets.

That being said, even with sequential ML algorithms, Spark is helpful, because in practice we need to test multiple sets of parameters and select the best one. Though the algorithm is sequential, the training part is embarrassingly parallel. We can broadcast the whole dataset, and then train model 1 on node 1, model 2 on node 2, etc. Cross validation also falls into this category. -Xiangrui

On Mon, Nov 3, 2014 at 1:55 PM, ll duy.huynh@gmail.com wrote:

I'm struggling with implementing a few algorithms with Spark and hope to get help from the community.

Most of the machine learning algorithms today are sequential, while Spark is all about parallelism. It seems to me that using Spark doesn't actually help much, because in most cases you can't really parallelize a sequential algorithm. There must be some strong reasons why MLlib was created and so many people claim Spark is ideal for machine learning. What are those reasons? What are some specific examples of when and how to use Spark to implement sequential machine learning algorithms?

Any comment/feedback/answer is much appreciated. Thanks!
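A sketch of the "broadcast the data, train one model per task" pattern described above. `trainLocally` is a hypothetical placeholder for any single-machine, sequential learner (your own code, Breeze, etc.), not a Spark API, and the whole dataset must fit in each executor's memory for this to make sense:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// placeholder: train a model locally and return its validation error
def trainLocally(data: Seq[LabeledPoint], regParam: Double): Double = ???

val trainingRDD: RDD[LabeledPoint] = ... // your data
val localData = trainingRDD.collect().toSeq   // must fit in memory
val bcData = sc.broadcast(localData)

val params = Seq(0.001, 0.01, 0.1, 1.0)
val scores = sc.parallelize(params, params.size).map { reg =>
  (reg, trainLocally(bcData.value, reg))      // each task trains one model
}.collect()
val (bestParam, bestError) = scores.minBy(_._2)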
Re: pass unique ID to mllib algorithms pyspark
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address this issue. We carry over extra columns with training and prediction and then leverage on Spark SQL's execution plan optimization to decide which columns are really needed. For the current set of APIs, we can add `predictOnValues` to models, which carries over the input keys. StreamingKMeans and StreamingLinearRegression implement this method. -Xiangrui On Tue, Nov 4, 2014 at 2:30 AM, jamborta jambo...@gmail.com wrote: Hi all, There are a few algorithms in pyspark where the prediction part is implemented in scala (e.g. ALS, decision trees) where it is not very easy to manipulate the prediction methods. I think it is a very common scenario that the user would like to generate prediction for a datasets, so that each predicted value is identifiable (e.g. have a unique id attached to it). this is not possible in the current implementation as predict functions take a feature vector and return the predicted values where, I believe, the order is not guaranteed, so there is no way to join it back with the original data the predictions are generated from. Is there a way around this at the moment? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pass-unique-ID-to-mllib-algorithms-pyspark-tp18051.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: sparse x sparse matrix multiplication
local matrix-matrix multiplication or distributed? On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote: what is the best way to implement a sparse x sparse matrix multiplication with spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sparse-x-sparse-matrix-multiplication-tp18163.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Matrix multiplication in spark
We are working on distributed block matrices. The main JIRA is at: https://issues.apache.org/jira/browse/SPARK-3434 The goal is to support basic distributed linear algebra, (dense first and then sparse). -Xiangrui On Wed, Nov 5, 2014 at 12:23 AM, ll duy.huynh@gmail.com wrote: @sowen.. i am looking for distributed operations, especially very large sparse matrix x sparse matrix multiplication. what is the best way to implement this in spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-tp12562p18164.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: sparse x sparse matrix multiplication
You can use breeze for local sparse-sparse matrix multiplication and then define an RDD of sub-matrices RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix) and then use join and aggregateByKey to implement this feature, which is the same as in MapReduce. -Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
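A rough sketch of that idea, using reduceByKey rather than aggregateByKey for the final sum; it assumes your Breeze version supports * and + on CSCMatrix and that the block dimensions line up:

import breeze.linalg.CSCMatrix
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// blocks keyed by (blockRowId, blockColId); each block is a local sparse matrix
type Block = ((Int, Int), CSCMatrix[Double])

def multiply(a: RDD[Block], b: RDD[Block]): RDD[Block] = {
  val aByInner = a.map { case ((i, k), m) => (k, (i, m)) }   // key A blocks by their column-block id
  val bByInner = b.map { case ((k, j), m) => (k, (j, m)) }   // key B blocks by their row-block id
  aByInner.join(bByInner)
    .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }  // local sparse multiply per pair
    .reduceByKey(_ + _)                                         // sum partial products per output block
}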
Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe
Which Spark version did you use? Could you check the WebUI and attach the error message on executors? -Xiangrui On Wed, Nov 5, 2014 at 8:23 AM, rok rokros...@gmail.com wrote: yes, the training set is fine, I've verified it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182p18195.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MatrixFactorizationModel predict(Int, Int) API
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066

The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do a block-wise cross product and then find the top-k for each user.

Best, Xiangrui

On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com wrote:

model.recommendProducts can only be called from the master then? I have a set of 20% of the users on whom I am performing the test... the 20% of users are in an RDD... if I have to collect them all to the master node and then call model.recommendProducts, that's an issue...

Any idea how to optimize this so that we can calculate MAP statistics on large samples of data?

On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote:

ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map`. -Xiangrui

On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote:

I reproduced the problem in the mllib tests ALSSuite.scala using the following functions:

val arrayPredict = userProductsRDD.map { case (user, product) =>
  val recommendedProducts = model.recommendProducts(user, products)
  val productScore = recommendedProducts.find { x => x.product == product }
  require(productScore != None)
  productScore.get
}.collect

arrayPredict.foreach { elem =>
  if (allRatings.get(elem.user, elem.product) != elem.rating)
    fail("Prediction APIs don't match")
}

If the usage of model.recommendProducts is correct, the test fails with the same error I sent before...

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)

It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug...

Do I have to cache the models to make userFeatures.lookup(user).head work?

On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:

Was the user present in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui

On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote:

Hi,

I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head

In computeRmse, MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called, and in all the test cases that API has been used... I can perhaps refactor my code to do the same, but I was wondering whether people test the lookup(user) version of the code...

Do I need to cache the model to make it work? I think right now the default is STORAGE_AND_DISK...

Thanks. Deb
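For the MAP-style evaluation, one workaround with the current API is to score all held-out (user, product) pairs in bulk with MatrixFactorizationModel.predict(RDD[(Int, Int)]) instead of calling recommendProducts inside a closure (the model holds RDDs, so it cannot be shipped to executors). A sketch, where `model` is the trained MatrixFactorizationModel and `testRatings` stands in for the 20% sample:

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

val testRatings: RDD[Rating] = ... // held-out (user, product, rating) triples
val userProducts: RDD[(Int, Int)] = testRatings.map(r => (r.user, r.product))
val predictions: RDD[((Int, Int), Double)] =
  model.predict(userProducts).map(r => ((r.user, r.product), r.rating))
val ratesAndPreds = testRatings
  .map(r => ((r.user, r.product), r.rating))
  .join(predictions)
// ratesAndPreds pairs each true rating with the predicted score, keyed by (user, product)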
Re: Status of MLLib exporting models to PMML
Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello Spark and MLLib folks, So a common problem in the real world of using machine learning is that some data analysis use tools like R, but the more data engineers out there will use more advanced systems like Spark MLLib or even Python Scikit Learn. In the real world, I want to have a system where multiple different modeling environments can learn from data / build models, represent the models in a common language, and then have a layer which just takes the model and run model.predict() all day long -- scores the models in other words. It looks like the project openscoring.io and jpmml-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's boosted ensemble tree methods be representable in PMML? Thank you!! Aris - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib Decision Tress algorithm hangs, others fine
Could you provide more information? For example, spark version, dataset size (number of instances/number of features), cluster size, error messages from both the drive and the executor. -Xiangrui On Mon, Nov 10, 2014 at 11:28 AM, tsj tsj...@gmail.com wrote: Hello all, I have some text data that I am running different algorithms on. I had no problems with LibSVM and Naive Bayes on the same data, but when I run Decision Tree, the execution hangs in the middle of DecisionTree.trainClassifier(). The only difference from the example given on the site is that I am using 6 categories instead of 2, and the input is text that is transformed to labeled points using TF-IDF. It halts shortly after this log output: spark.SparkContext: Job finished: collect at DecisionTree.scala:1347, took 1.019579676 s Any ideas as to what could be causing this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tress-algorithm-hangs-others-fine-tp18515.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: scala.MatchError
I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, This is my Instrument java constructor. public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer; this.issuing = issuing; } I am trying to create javaschemaRDD JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, Instrument.class); Remarks: Instrument, Issue, Issuer, Issuing all are java classes distData is holding List Instrument I am getting the following error. Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: scala.MatchError: class sample.spark.test.Issue (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) at sample.spark.test.SparkJob.main(SparkJob.java:33) ... 5 more Please help me. Regards, Naveen. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLIB usage: BLAS dependency warning
Could you try jar tf on the assembly jar and grep netlib-native_system-linux-x86_64.so? -Xiangrui On Tue, Nov 11, 2014 at 7:11 PM, jpl jlefe...@soe.ucsc.edu wrote: Hi, I am having trouble using the BLAS libs with the MLLib functions. I am using org.apache.spark.mllib.clustering.KMeans (on a single machine) and running the Spark-shell with the kmeans example code (from https://spark.apache.org/docs/latest/mllib-clustering.html) which runs successfully but I get the following warning in the log: WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS I compiled spark 1.1.0 with mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Pnetlib-lgpl -DskipTests clean package If anyone could please clarify the steps to get the dependencies correctly installed and visible to spark (from https://spark.apache.org/docs/latest/mllib-guide.html), that would be greatly appreciated. Using yum, I installed blas.x86_64, lapack.x86_64, gcc-gfortran.x86_64, libgfortran.x86_64 and then downloaded Breeze and built that successfully with Maven. I verified that I do have /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3 present on the machine and ldconf -p shows these listed. I also tried adding /usr/lib/ to spark.executor.extraLibraryPath and I verified it is present in the Spark webUI environment tab. I downloaded and compiled jblas with mvn clean install, which creates jblas-1.2.4-SNAPSHOT.jar, and then also tried adding that to spark.executor.extraClassPath but I still get the same WARN message. Maybe there are a few simple steps that I am missing? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-usage-BLAS-dependency-warning-tp18660.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLIB usage: BLAS dependency warning
That means the -Pnetlib-lgpl option didn't work. Could you use sbt to build the assembly jar and see whether the .so file is inside the assembly jar? Which system and Java version are you using? -Xiangrui On Wed, Nov 12, 2014 at 2:22 PM, jpl jlefe...@soe.ucsc.edu wrote: Hi Xiangrui, thank you very much for your response. I looked for the .so as you suggested. It is not here: $ jar tf assembly/target/spark-assembly_2.10-1.1.0-dist/spark-assembly-1.1.0-hadoop2.4.0.jar | grep netlib-native_system-linux-x86_64.so or here: $ jar tf assembly/target/spark-assembly_2.10-1.1.0-dist/spark-mllib_2.10-1.1.0.jar | grep netlib-native_system-linux-x86_64.so However, I do find it here: $ jar tf /root/.m2/repository/com/github/fommil/netlib/netlib-native_system-linux-x86_64/1.1/netlib-native_system-linux-x86_64-1.1-natives.jar | grep netlib-native_system-linux-x86_64.so Am I not building it correctly? Should I just add the above jar to the Spark classpath (if so, where exactly do I add that, I tried adding to .extraClassPath but did not help)? Thanks a lot, jeff -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-usage-BLAS-dependency-warning-tp18660p18775.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: No module named pyspark - latest built
You need to use maven to include python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote: I have figured out that building the fat jar with sbt does not seem to included the pyspark scripts using the following command: sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean publish-local assembly however the maven command works OK: mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean package am I running the correct sbt command? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18787.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2
If you use the Kryo serializer, you need to register mutable.BitSet and Rating: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102 The JIRA was marked resolved because chill resolved the problem in v0.4.0 and we have this workaround. -Xiangrui On Fri, Nov 14, 2014 at 12:41 AM, aaronlin aaron...@kkbox.com wrote: Hi folks, Although SPARK-1977 says this problem is resolved in 1.0.2, I still hit it while running the script in AWS EC2 via spark-c2.py. I checked SPARK-1977 and found that twitter.chill resolves the problem in v0.4.0, not v0.3.6, but Spark depends on twitter.chill v0.3.6 according to the Maven page. For more information, you can check the following pages - https://github.com/twitter/chill - http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.0.2 Can anyone give me advice? Thanks -- Aaron Lin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
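For reference, a minimal sketch of the registration the linked MovieLensALS example performs (Scala; assumes the Spark 1.x MLlib ALS and the Kryo serializer, with class and property names following that example):

    import scala.collection.mutable
    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.recommendation.Rating
    import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

    // Register the classes ALS shuffles internally, as in the linked example.
    class ALSRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Rating])
        kryo.register(classOf[mutable.BitSet])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", classOf[KryoSerializer].getName)
      .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)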
Re: Client application that calls Spark and receives an MLlib model Scala Object and then predicts without Spark installed on hadoop
If Spark is not installed on the client side, you won't be able to deserialize the model. Instead of serializing the model object, you may serialize the model weights array and implement predict on the client side. -Xiangrui On Fri, Nov 14, 2014 at 2:54 PM, xiaoyan yu xiaoyan...@gmail.com wrote: I had the same need as those documented back to July archived at http://qnalist.com/questions/5013193/client-application-that-calls-spark-and-receives-an-mllib-model-scala-object-not-just-result. I wonder if anyone would like to share any successful stories. Thanks, Xiaoyan - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
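A rough sketch of that suggestion (assumes a linear model such as LinearRegressionModel; trainingData, the file name, and the predict helper are illustrative, not from the thread):

    import java.io.{FileOutputStream, ObjectOutputStream}
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // On the Spark side: train, then persist only the weights and the intercept.
    val model = LinearRegressionWithSGD.train(trainingData, 100)
    val out = new ObjectOutputStream(new FileOutputStream("model-weights.bin"))
    out.writeObject((model.weights.toArray, model.intercept))
    out.close()

    // On the client (no Spark on the classpath): a plain dot product plus intercept.
    def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double =
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept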
Re: repartition combined with zipWithIndex get stuck
This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev kat...@gmail.com wrote: Hi, I'm having trouble using both zipWithIndex and repartition. When I use them both, the following action will get stuck and won't return. I'm using spark 1.1.0. Those 2 lines work as expected: scala sc.parallelize(1 to 10).repartition(10).count() res0: Long = 10 scala sc.parallelize(1 to 10).zipWithIndex.count() res1: Long = 10 But this statement get stuck and doesn't return: scala sc.parallelize(1 to 10).zipWithIndex.repartition(10).count() 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at Option.scala:120 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at Option.scala:120) with 3 output partitions (allowLocal=false) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at Option.scala:120) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4 (ParallelCollectionRDD[7] at parallelize at console:13), which has no missing parents 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called with curMem=7616, maxMem=138938941 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1096.0 B, free 132.5 MB) Am I doing something wrong here or is it a bug? Is there some work around? Thanks, Lev. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: repartition combined with zipWithIndex get stuck
I think I understand where the bug is now. I created a JIRA (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR soon. -Xiangrui On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng men...@gmail.com wrote: This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev kat...@gmail.com wrote: Hi, I'm having trouble using both zipWithIndex and repartition. When I use them both, the following action will get stuck and won't return. I'm using spark 1.1.0. Those 2 lines work as expected: scala sc.parallelize(1 to 10).repartition(10).count() res0: Long = 10 scala sc.parallelize(1 to 10).zipWithIndex.count() res1: Long = 10 But this statement get stuck and doesn't return: scala sc.parallelize(1 to 10).zipWithIndex.repartition(10).count() 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at Option.scala:120 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at Option.scala:120) with 3 output partitions (allowLocal=false) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at Option.scala:120) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4 (ParallelCollectionRDD[7] at parallelize at console:13), which has no missing parents 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called with curMem=7616, maxMem=138938941 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1096.0 B, free 132.5 MB) Am I doing something wrong here or is it a bug? Is there some work around? Thanks, Lev. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: repartition combined with zipWithIndex get stuck
PR: https://github.com/apache/spark/pull/3291 . For now, here is a workaround: val a = sc.parallelize(1 to 10).zipWithIndex() a.partitions // call .partitions explicitly a.repartition(10).count() Thanks for reporting the bug! -Xiangrui On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng men...@gmail.com wrote: I think I understand where the bug is now. I created a JIRA (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR soon. -Xiangrui On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng men...@gmail.com wrote: This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev kat...@gmail.com wrote: Hi, I'm having trouble using both zipWithIndex and repartition. When I use them both, the following action will get stuck and won't return. I'm using spark 1.1.0. Those 2 lines work as expected: scala sc.parallelize(1 to 10).repartition(10).count() res0: Long = 10 scala sc.parallelize(1 to 10).zipWithIndex.count() res1: Long = 10 But this statement get stuck and doesn't return: scala sc.parallelize(1 to 10).zipWithIndex.repartition(10).count() 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at Option.scala:120 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at Option.scala:120) with 3 output partitions (allowLocal=false) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at Option.scala:120) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4 (ParallelCollectionRDD[7] at parallelize at console:13), which has no missing parents 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called with curMem=7616, maxMem=138938941 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1096.0 B, free 132.5 MB) Am I doing something wrong here or is it a bug? Is there some work around? Thanks, Lev. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLIB KMeans Exception
How many features and how many partitions? You set kmeans_clusters to 1. If the feature dimension is large, it would be really expensive. You can check the WebUI and see task failures there. The stack trace you posted is from the driver. Btw, the total memory you have is 64GB * 10, so you can cache about 300GB of data under the default setting, which is not enough for the 1.2TB data you want to process. -Xiangrui On Thu, Nov 20, 2014 at 5:57 AM, Alan Prando a...@scanboo.com.br wrote: Hi Folks! I'm running a Python Spark job on a cluster with 1 master and 10 slaves (64G RAM and 32 cores each machine). This job reads a file with 1.2 terabytes and 1128201847 lines on HDFS and call Kmeans method as following: # SLAVE CODE - Reading features from HDFS def get_features_from_images_hdfs(self, timestamp): def shallow(lista): for row in lista: for col in row: yield col features = self.sc.textFile(hdfs://999.999.99:/FOLDER/) return features.map(lambda row: eval(row)[1]).mapPartitions(shallow) # SLAVE CODE - Extract centroids with Kmeans def extract_centroids_on_slaves(self, features, kmeans_clusters, kmeans_max_iterations, kmeans_mode): #Error line clusters = KMeans.train( features, kmeans_clusters, maxIterations=kmeans_max_iterations, runs=1, initializationMode=kmeans_mode ) return clusters.clusterCenters # MASTER CODE - Main features = get_features_from_images_hdfs(kwargs.get(timestamp)) kmeans_clusters = 1 kmeans_max_interations = 13 kmeans_mode = random centroids = extract_centroids_on_slaves( features, kmeans_clusters, kmeans_max_interations, kmeans_mode ) centroids_rdd = sc.parallelize(centroids) I'm getting the following exception when I call KMeans.train: 14/11/20 13:19:34 INFO TaskSetManager: Starting task 2539.0 in stage 0.0 (TID 2327, ip-172-31-7-120.ec2.internal, NODE_LOCAL, 1649 bytes) 14/11/20 13:19:34 WARN TaskSetManager: Lost task 2486.0 in stage 0.0 (TID 2257, ip-172-31-7-120.ec2.internal): java.io.IOException: Filesystem closed org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:765) org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:783) org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844) java.io.DataInputStream.read(DataInputStream.java:100) org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180) org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216) org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246) org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47) org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:220) org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:189) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1314) 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_4_2174 in memory on ip-172-31-7-121.ec2.internal:57211 (size: 5.3 MB, free: 23.0 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_2_2349 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 47.6 MB, free: 23.5 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_2_2386 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 46.0 MB, free: 23.5 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_2_2341 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 47.3 MB, free: 23.4 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_4_2279 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 5.1 MB, free: 23.4 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_2_2324 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 46.1 MB, free: 23.4 GB) 14/11/20 13:19:34 INFO TaskSetManager: Starting task 2525.0 in stage 0.0 (TID 2328,
Re: Python Logistic Regression error
The data is in LIBSVM format. So this line won't work: values = [float(s) for s in line.split(' ')] Please use the util function in MLUtils to load it as an RDD of LabeledPoint. http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point from pyspark.mllib.util import MLUtils examples = MLUtils.loadLibSVMFile(sc, data/mllib/sample_libsvm_data.txt) -Xiangrui On Sun, Nov 23, 2014 at 11:38 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: Can you please suggest sample data for running the logistic_regression.py? I am trying to use a sample data file at https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt I am running this on CDH5.2 Quickstart VM. [cloudera@quickstart mllib]$ spark-submit logistic_regression.py lr.txt 3 But, getting below error. 14/11/23 11:23:55 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job 14/11/23 11:23:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/11/23 11:23:55 INFO TaskSchedulerImpl: Cancelling stage 0 14/11/23 11:23:55 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296 Traceback (most recent call last): File /usr/lib/spark/examples/lib/mllib/logistic_regression.py, line 50, in module model = LogisticRegressionWithSGD.train(points, iterations) File /usr/lib/spark/python/pyspark/mllib/classification.py, line 110, in train initialWeights) File /usr/lib/spark/python/pyspark/mllib/_common.py, line 430, in _regression_train_wrapper initial_weights = _get_initial_weights(initial_weights, data) File /usr/lib/spark/python/pyspark/mllib/_common.py, line 415, in _get_initial_weights initial_weights = _convert_vector(data.first().features) File /usr/lib/spark/python/pyspark/rdd.py, line 1127, in first rs = self.take(1) File /usr/lib/spark/python/pyspark/rdd.py, line 1109, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /usr/lib/spark/python/pyspark/context.py, line 770, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.139.145): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /usr/lib/spark/python/pyspark/worker.py, line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File /usr/lib/spark/python/pyspark/serializers.py, line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /usr/lib/spark/python/pyspark/serializers.py, line 127, in dump_stream for obj in iterator: File /usr/lib/spark/python/pyspark/serializers.py, line 185, in _batched for item in iterator: File /usr/lib/spark/python/pyspark/rdd.py, line 1105, in takeUpToNumLeft yield next(iterator) File /usr/lib/spark/examples/lib/mllib/logistic_regression.py, line 37, in parsePoint values = [float(s) for s in line.split(' ')] ValueError: invalid literal for float(): 1:0.4551273600657362 Regards, Venkat This communication is the property of CenturyLink and may contain confidential or privileged information. 
Unauthorized use of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender by reply e-mail and destroy all copies of the communication and any attachments. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Mllib native netlib-java/OpenBLAS
Try building Spark with -Pnetlib-lgpl, which includes the JNI library in the Spark assembly jar. This is the simplest approach. If you want to include it as part of your project, make sure the library is inside the assembly jar or you specify it via `--jars` with spark-submit. -Xiangrui On Mon, Nov 24, 2014 at 8:51 AM, agg212 alexander_galaka...@brown.edu wrote: Hi, i'm trying to improve performance for Spark's Mllib, and I am having trouble getting native netlib-java libraries installed/recognized by Spark. I am running on a single machine, Ubuntu 14.04 and here is what I've tried: sudo apt-get install libgfortran3 sudo apt-get install libatlas3-base libopenblas-base (this is how netlib-java's website says to install it) I also double checked and it looks like the libraries are linked correctly in /usr/lib (see below): /usr/lib/libblas.so.3 - /etc/alternatives/libblas.so.3 /usr/lib/liblapack.so.3 - /etc/alternatives/liblapack.so.3 The Dependencies section on Spark's Mllib website also says to include com.github.fommil.netlib:all:1.1.2 as a dependency. I therefore tried adding this to my sbt file like so: libraryDependencies += com.github.fommil.netlib % all % 1.1.2 After all this, i'm still seeing the following error message. Does anyone have more detailed installation instructions? 14/11/24 16:49:29 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/11/24 16:49:29 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is spark streaming +MlLib for online learning?
In 1.2, we added streaming k-means: https://github.com/apache/spark/pull/2942 . -Xiangrui On Mon, Nov 24, 2014 at 5:25 PM, Joanne Contact joannenetw...@gmail.com wrote: Thank you Tobias! On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com wrote: I seemed to read somewhere that spark is still batch learning, but spark streaming could allow online learning. Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently can do online learning only for linear regression https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html#streaming-linear-regression, as far as I know. Tobias - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
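For reference, streaming linear regression (in MLlib since 1.1) looks roughly like this in Scala; ssc, the directories, and numFeatures are placeholders:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

    val trainingStream = ssc.textFileStream("/training/dir").map(LabeledPoint.parse)
    val testStream = ssc.textFileStream("/test/dir").map(LabeledPoint.parse)

    // The model is updated online as each training batch arrives.
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingStream)
    model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()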
Re: K-means clustering
There is a simple example here: https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py . You can take advantage of sparsity by computing the distance via inner products: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2 -Xiangrui On Tue, Nov 25, 2014 at 2:39 AM, amin mohebbi aminn_...@yahoo.com.invalid wrote: I have generated a sparse matrix by python, which has the size of 4000*174000 (.pkl), the following is a small part of this matrix : (0, 45) 1 (0, 413) 1 (0, 445) 1 (0, 107) 4 (0, 80) 2 (0, 352) 1 (0, 157) 1 (0, 191) 1 (0, 315) 1 (0, 395) 4 (0, 282) 3 (0, 184) 1 (0, 403) 1 (0, 169) 1 (0, 267) 1 (0, 148) 1 (0, 449) 1 (0, 241) 1 (0, 303) 1 (0, 364) 1 (0, 257) 1 (0, 372) 1 (0, 73) 1 (0, 64) 1 (0, 427) 1 : : (2, 399) 1 (2, 277) 1 (2, 229) 1 (2, 255) 1 (2, 409) 1 (2, 355) 1 (2, 391) 1 (2, 28) 1 (2, 384) 1 (2, 86) 1 (2, 285) 2 (2, 166) 1 (2, 165) 1 (2, 419) 1 (2, 367) 2 (2, 133) 1 (2, 61) 1 (2, 434) 1 (2, 51) 1 (2, 423) 1 (2, 398) 1 (2, 438) 1 (2, 389) 1 (2, 26) 1 (2, 455) 1 I am new in Spark and would like to cluster this matrix by k-means algorithm. Can anyone explain to me what kind of problems I might be faced. Please note that I do not want to use Mllib and would like to write my own k-means. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
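A sketch of that inner-product trick for a hand-written k-means (plain algebra, not MLlib's internal code): since ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, the distance computation only needs to touch the point's non-zeros once the squared norms are precomputed.

    // Squared Euclidean distance between a sparse point (indices/values) and a
    // dense center, touching only the point's non-zero entries.
    def sparseSquaredDistance(indices: Array[Int], values: Array[Double], xNormSq: Double,
                              center: Array[Double], cNormSq: Double): Double = {
      var dot = 0.0
      var i = 0
      while (i < values.length) {
        dot += values(i) * center(indices(i))
        i += 1
      }
      xNormSq - 2.0 * dot + cNormSq
    }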
Re: MlLib Colaborative filtering factors
It is data-dependent, and hence needs hyper-parameter tuning, e.g., grid search. The first batch is certainly expensive. But after you figure out a small range for each parameter that fits your data, following batches should be not that expensive. There is an example from AMPCamp: http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html -Xiangrui On Tue, Nov 25, 2014 at 4:28 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: HI, I am trying to execute Collaborative filtering using MlLib. Can somebody please suggest how to calculate the following 1. Rank 2. Iterations 3. Lambda I understand these are adjustment factors and they help reduce the MSE in turn defining accuracy of algorithm but then is it all hit and trial or is there a definitive way to calculate them? Thanks!! Regards, Saurabh Agrawal This e-mail, including accompanying communications and attachments, is strictly confidential and only for the intended recipient. Any retention, use or disclosure not expressly authorised by Markit is prohibited. This email is subject to all waivers and other terms at the following link: http://www.markit.com/en/about/legal/email-disclaimer.page Please visit http://www.markit.com/en/about/contact/contact-us.page? for contact information on our offices worldwide. MarkitSERV Limited has its registered office located at Level 4, Ropemaker Place, 25 Ropemaker Street, London, EC2Y 9LY and is authorized and regulated by the Financial Conduct Authority with registration number 207294 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
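A minimal sketch of such a grid search, in the spirit of the AMPCamp exercise (Scala; the parameter grids, the training/validation splits, and the computeRmse helper are illustrative assumptions):

    import org.apache.spark.mllib.recommendation.ALS

    val ranks = Seq(8, 12)
    val lambdas = Seq(0.1, 1.0, 10.0)
    val numIters = Seq(10, 20)

    // Train on the training split, score on a held-out validation split,
    // and keep the combination with the lowest RMSE.
    val results = for (rank <- ranks; lambda <- lambdas; iter <- numIters) yield {
      val model = ALS.train(training, rank, iter, lambda)
      ((rank, lambda, iter), computeRmse(model, validation))
    }
    val ((bestRank, bestLambda, bestIter), bestRmse) = results.minBy(_._2)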
Re: why MatrixFactorizationModel private?
Besides API stability concerns, models constructed directly from users rather than returned by ALS may not work well. The userFeatures and productFeatures are both with partitioners so we can perform quick lookup for prediction. If you save userFeatures and productFeatures and load them back, it is very likely the partitioning info is missing. That being said, we will try to address model export/import in v1.3: https://issues.apache.org/jira/browse/SPARK-4587 . -Xiangrui On Tue, Nov 25, 2014 at 8:26 AM, jamborta jambo...@gmail.com wrote: Hi all, seems that all the mllib models are declared accessible in the package, except MatrixFactorizationModel, which is declared private to mllib. Any reason why? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-MatrixFactorizationModel-private-tp19763.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: RMSE in MovieLensALS increases or stays stable as iterations increase.
The training RMSE may increase due to regularization. Squared loss only represents part of the global loss. If you watch the sum of the squared loss and the regularization, it should be non-increasing. -Xiangrui On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen so...@cloudera.com wrote: I also modified the example to try 1, 5, 9, ... iterations as you did, and also ran with the same default parameters. I used the sample_movielens_data.txt file. Is that what you're using? My result is: Iteration 1 Test RMSE = 1.426079653593016 Train RMSE = 1.5013155094216357 Iteration 5 Test RMSE = 1.405598012724468 Train RMSE = 1.4847078708333596 Iteration 9 Test RMSE = 1.4055990901261632 Train RMSE = 1.484713206769993 Iteration 13 Test RMSE = 1.4055990999738366 Train RMSE = 1.4847132332994588 Iteration 17 Test RMSE = 1.40559910003368 Train RMSE = 1.48471323345531 Iteration 21 Test RMSE = 1.4055991000342158 Train RMSE = 1.4847132334567061 Iteration 25 Test RMSE = 1.4055991000342174 Train RMSE = 1.4847132334567108 Train error is higher than test error, consistently, which could be underfitting. A higher rank=50 gets a reasonable result: Iteration 1 Test RMSE = 1.5981883186995312 Train RMSE = 1.4841671360432005 Iteration 5 Test RMSE = 1.5745145659678204 Train RMSE = 1.4672341345080382 Iteration 9 Test RMSE = 1.5745147110505406 Train RMSE = 1.4672385714907996 Iteration 13 Test RMSE = 1.5745147108258577 Train RMSE = 1.4672385929631868 Iteration 17 Test RMSE = 1.5745147108246424 Train RMSE = 1.4672385930428344 Iteration 21 Test RMSE = 1.5745147108246367 Train RMSE = 1.4672385930431973 Iteration 25 Test RMSE = 1.5745147108246367 Train RMSE = 1.467238593043199 I'm not sure what the difference is. I looked at your modifications and they seem very similar. Is it the data you're using? On Wed, Nov 26, 2014 at 3:34 PM, Kostas Kloudas kklou...@gmail.com wrote: For the training I am using the code in the MovieLensALS example with trainImplicit set to false and for the training RMSE I use the val rmseTr = computeRmse(model, training, params.implicitPrefs). The computeRmse() method is provided in the MovieLensALS class. Thanks a lot, Kostas On Nov 26, 2014, at 2:41 PM, Sean Owen so...@cloudera.com wrote: How are you computing RMSE? and how are you training the model -- not with trainImplicit right? I wonder if you are somehow optimizing something besides RMSE. On Wed, Nov 26, 2014 at 2:36 PM, Kostas Kloudas kklou...@gmail.com wrote: Once again, the error even with the training dataset increases. The results are: Running 1 iterations For 1 iter.: Test RMSE = 1.2447121194304893 Training RMSE = 1.2394166987104076 (34.751317636 s). Running 5 iterations For 5 iter.: Test RMSE = 1.3253957117600659 Training RMSE = 1.3206317416138509 (37.69311802304 s). Running 9 iterations For 9 iter.: Test RMSE = 1.3255293380139364 Training RMSE = 1.3207661218210436 (41.046175661 s). Running 13 iterations For 13 iter.: Test RMSE = 1.3255295352665748 Training RMSE = 1.3207663201865092 (47.763619515 s). Running 17 iterations For 17 iter.: Test RMSE = 1.32552953555787 Training RMSE = 1.3207663204794406 (59.68236110305 s). Running 21 iterations For 21 iter.: Test RMSE = 1.3255295355583026 Training RMSE = 1.3207663204798756 (57.210578232 s). Running 25 iterations For 25 iter.: Test RMSE = 1.325529535558303 Training RMSE = 1.3207663204798765 (65.785485882 s). 
Thanks a lot, Kostas On Nov 26, 2014, at 12:04 PM, Nick Pentreath nick.pentre...@gmail.com wrote: copying user group - I keep replying directly vs reply all :) On Wed, Nov 26, 2014 at 2:03 PM, Nick Pentreath nick.pentre...@gmail.com wrote: ALS will be guaranteed to decrease the squared error (therefore RMSE) in each iteration, on the training set. This does not hold for the test set / cross validation. You would expect the test set RMSE to stabilise as iterations increase, since the algorithm converges - but not necessarily to decrease. On Wed, Nov 26, 2014 at 1:57 PM, Kostas Kloudas kklou...@gmail.com wrote: Hi all, I am getting familiarized with Mllib and a thing I noticed is that running the MovieLensALS example on the movieLens dataset for increasing number of iterations does not decrease the rmse. The results for 0.6% training set and 0.4% test are below. For training set to 0.8%, the results are almost identical. Shouldn’t it be normal to see a decreasing error? Especially going from 1 to 5 iterations. Running 1 iterations Test RMSE for 1 iter. = 1.2452964343277886 (52.75712592704 s). Running 5 iterations Test RMSE for 5 iter. = 1.3258973764470259 (61.183927666 s). Running 9 iterations Test RMSE for 9 iter. = 1.3260308117704385 (61.8494887581 s). Running 13 iterations Test RMSE for 13 iter. = 1.3260310099809915 (73.799510125 s). Running 17 iterations Test RMSE for 17 iter. = 1.3260310102735398 (77.5651218531 s). Running 21 iterations Test RMSE for 21 iter.
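For reference, the quantity that is guaranteed non-increasing on the training set is the full regularized objective, not the squared error alone. With the weighted-lambda regularization ALS uses, a sketch of that objective (an assumption based on the standard formulation, not quoted from the thread):

\[
\min_{U,V} \sum_{(u,i) \in \Omega} \left( r_{ui} - u_u^\top v_i \right)^2
  + \lambda \left( \sum_u n_u \lVert u_u \rVert^2 + \sum_i m_i \lVert v_i \rVert^2 \right)
\]

where n_u and m_i count the ratings of user u and item i. The squared-error term (and hence the RMSE) can go up between iterations while this objective still goes down.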
Re: ALS failure with size Integer.MAX_VALUE
Hi Bharath, You can try setting a small item blocks in this case. 1200 is definitely too large for ALS. Please try 30 or even smaller. I'm not sure whether this could solve the problem because you have 100 items connected with 10^8 users. There is a JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-3735 which I will try to implement in 1.3. I'll ping you when it is ready. Best, Xiangrui On Tue, Dec 2, 2014 at 10:40 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Yes, the issue appears to be due to the 2GB block size limitation. I am hence looking for (user, product) block sizing suggestions to work around the block size limitation. On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: (It won't be that, since you see that the error occur when reading a block from disk. I think this is an instance of the 2GB block size limitation.) On Sun, Nov 30, 2014 at 4:36 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi Bharath – I’m unsure if this is your problem but the MatrixFactorizationModel in MLLIB which is the underlying component for ALS expects your User/Product fields to be integers. Specifically, the input to ALS is an RDD[Rating] and Rating is an (Int, Int, Double). I am wondering if perhaps one of your identifiers exceeds MAX_INT, could you write a quick check for that? I have been running a very similar use case to yours (with more constrained hardware resources) and I haven’t seen this exact problem but I’m sure we’ve seen similar issues. Please let me know if you have other questions. From: Bharath Ravi Kumar reachb...@gmail.com Date: Thursday, November 27, 2014 at 1:30 PM To: user@spark.apache.org user@spark.apache.org Subject: ALS failure with size Integer.MAX_VALUE We're training a recommender with ALS in mllib 1.1 against a dataset of 150M users and 4.5K items, with the total number of training records being 1.2 Billion (~30GB data). The input data is spread across 1200 partitions on HDFS. For the training, rank=10, and we've configured {number of user data blocks = number of item data blocks}. The number of user/item blocks was varied between 50 to 1200. Irrespective of the block size (e.g. at 1200 blocks each), there are atleast a couple of tasks that end up shuffle reading 9.7G each in the aggregate stage (ALS.scala:337) and failing with the following exception: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108) at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124) at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
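A sketch of setting the block count explicitly (Scala; rank = 10 and 30 blocks come from the thread, lambda and the iteration count are illustrative, and ratings is assumed to be an RDD[Rating]):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // In 1.1, setBlocks controls both the user and the product block counts.
    val model = new ALS()
      .setRank(10)
      .setIterations(10)
      .setLambda(0.01)
      .setBlocks(30)
      .run(ratings)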
Re: printing mllib.linalg.vector
You can use the default toString method to get the string representation. If you want to customize the output, check the indices/values fields. -Xiangrui On Fri, Dec 5, 2014 at 7:32 AM, debbie debbielarso...@hotmail.com wrote: Basic question: What is the best way to loop through one of these and print their components? Convert them to an array? Thanks Deb - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
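A small sketch of both options (assumes org.apache.spark.mllib.linalg vectors):

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

    def show(v: Vector): Unit = {
      println(v.toString)   // the default string representation
      v match {
        // Sparse vectors store only the non-zeros in the indices/values fields.
        case sv: SparseVector =>
          sv.indices.zip(sv.values).foreach { case (i, x) => println(s"$i -> $x") }
        // Dense vectors expose all components via the values field (or toArray).
        case dv: DenseVector =>
          dv.values.zipWithIndex.foreach { case (x, i) => println(s"$i -> $x") }
      }
    }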
Re: MLlib(Logistic Regression) + Spark Streaming.
If you want to train offline and predict online, you can use the current LR implementation to train a model and then apply model.predict on the dstream. -Xiangrui On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote: I am new to spark. Lets say i want to develop a machine learning model. which trained on normal method in MLlib. I want to use that model with classifier Logistic regression and predict the streaming data coming from a file or socket. Streaming data - Logistic Regression - binary label prediction. Is it possible? since there is no streaming logistic regression algo like streaming linear regression. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Logistic-Regression-Spark-Streaming-tp20564.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
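A rough sketch of that train-offline/predict-online pattern (Scala; the stream source, parsing, and iteration count are placeholders, and ssc is an existing StreamingContext):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Offline: train once on a static RDD[LabeledPoint].
    val model = LogisticRegressionWithSGD.train(trainingData, 100)

    // Online: parse each incoming line into a feature vector and predict a 0/1 label.
    val predictions = ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .map(features => model.predict(features))
    predictions.print()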
Re: MLLIb: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException
Is it possible that after filtering the feature dimension changed? This may happen if you use LIBSVM format but didn't specify the number of features. -Xiangrui On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I was able to run LinearRegressionwithSGD for a largeer dataset ( 2GB sparse). I have now filtered the data and I am running regression on a subset of it (~ 200 MB). I see this error, which is strange since it was running fine with the superset data. Is this a formatting issue (which I doubt) or is this some other issue in data preparation? I confirmed that there is no empty line in my dataset. Any help with this will be highly appreciated. 14/12/08 20:32:03 WARN TaskSetManager: Lost TID 5 (task 3.0:1) 14/12/08 20:32:03 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: 150323 at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:231) at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:216) at breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60) at breeze.linalg.VectorOps$$anon$178.apply(Vector.scala:391) at breeze.linalg.NumericOps$class.dot(NumericOps.scala:83) at breeze.linalg.DenseVector.dot(DenseVector.scala:47) at org.apache.spark.mllib.optimization.LeastSquaresGradient.compute(Gradient.scala:125) at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:180) at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:179) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838) at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838) at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116) at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
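A sketch of pinning the dimension when loading LIBSVM data, so a filtered subset keeps the same vector size as the full data (the path and the exact count are placeholders; the count must be at least as large as the highest feature index in the original data):

    import org.apache.spark.mllib.util.MLUtils

    val numFeatures = 150324   // placeholder: must cover every feature index in the full data
    val subset = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/filtered.libsvm", numFeatures)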
Re: Why KMeans with mllib is so slow ?
Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai dbt...@dbtsai.com wrote: You just need to use the latest master code without any configuration to get performance improvement from my PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: After some investigation, I learned that I can't compare kmeans in mllib with another kmeans implementation directly. The kmeans|| initialization step takes more time than the algorithm implemented in julia for example. There is also the ability to run multiple runs of kmeans algorithm in mllib even by default the number of runs is 1. DB Tsai can you please tell me the configuration you took for the improvement you mention in your pull request. I'd like to run the same benchmark on mnist8m on my computer. Cheers; On Fri, Dec 5, 2014 at 10:34 PM, DB Tsai dbt...@dbtsai.com wrote: Also, are you using the latest master in this experiment? A PR merged into the master couple days ago will spend up the k-means three times. See https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: The code is really simple : object TestKMeans { def main(args: Array[String]) { val conf = new SparkConf() .setAppName(Test KMeans) .setMaster(local[8]) .set(spark.executor.memory, 8g) val sc = new SparkContext(conf) val numClusters = 500; val numIterations = 2; val data = sc.textFile(sample.csv).map(x = Vectors.dense(x.split(',').map(_.toDouble))) data.cache() val clusters = KMeans.train(data, numClusters, numIterations) println(clusters.clusterCenters.size) val wssse = clusters.computeCost(data) println(serror : $wssse) } } For the testing purpose, I was generating a sample random data with julia and store it in a csv file delimited by comma. The dimensions is 248000 x 384. In the target application, I will have more than 248k data to cluster. On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu dav...@databricks.com wrote: Could you post you script to reproduce the results (also how to generate the dataset)? That will help us to investigate it. On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, here I use spark on local mode on my laptop with 8 cores. The data is on my local filesystem. Event thought, there an overhead due to the distributed computation, I found the difference between the runtime of the two implementations really, really huge. Is there a benchmark on how well the algorithm implemented in mllib performs ? On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen so...@cloudera.com wrote: Spark has much more overhead, since it's set up to distribute the computation. Julia isn't distributed, and so has no such overhead in a completely in-core implementation. You generally use Spark when you have a problem large enough to warrant distributing, or, your data already lives in a distributed store like HDFS. But it's also possible you're not configuring the implementations the same way, yes. There's not enough info here really to say. On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm trying to a run clustering with kmeans algorithm. The size of my data set is about 240k vectors of dimension 384. 
Solving the problem with the kmeans available in julia (kmean++) http://clusteringjl.readthedocs.org/en/latest/kmeans.html take about 8 minutes on a single core. Solving the same problem with spark kmean|| take more than 1.5 hours with 8 cores Either they don't implement the same algorithm either I don't understand how the kmeans in spark works. Is my data not big enough to take full advantage of spark ? At least, I expect to the same runtime. Cheers, Jao - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Stack overflow Error while executing spark SQL
Could you post the full stacktrace? It seems to be some recursive call in parsing. -Xiangrui On Tue, Dec 9, 2014 at 7:44 PM, jishnu.prat...@wipro.com wrote: Hi I am getting Stack overflow Error Exception in main java.lang.stackoverflowerror scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) while executing the following code sqlContext.sql(SELECT text FROM tweetTable LIMIT 10).collect().foreach(println) The complete code is from github https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala import com.google.gson.{GsonBuilder, JsonParser} import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.mllib.clustering.KMeans /** * Examine the collected tweets and trains a model based on them. */ object ExamineAndTrain { val jsonParser = new JsonParser() val gson = new GsonBuilder().setPrettyPrinting().create() def main(args: Array[String]) { // Process program arguments and set properties /*if (args.length 3) { System.err.println(Usage: + this.getClass.getSimpleName + tweetInput outputModelDir numClusters numIterations) System.exit(1) } * */ val outputModelDir=C:\\MLModel val tweetInput=C:\\MLInput val numClusters=10 val numIterations=20 //val Array(tweetInput, outputModelDir, Utils.IntParam(numClusters), Utils.IntParam(numIterations)) = args val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster(local[4]) val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) // Pretty print some of the tweets. val tweets = sc.textFile(tweetInput) println(Sample JSON Tweets---) for (tweet - tweets.take(5)) { println(gson.toJson(jsonParser.parse(tweet))) } val tweetTable = sqlContext.jsonFile(tweetInput).cache() tweetTable.registerTempTable(tweetTable) println(--Tweet table Schema---) tweetTable.printSchema() println(Sample Tweet Text-) sqlContext.sql(SELECT text FROM tweetTable LIMIT 10).collect().foreach(println) println(--Sample Lang, Name, text---) sqlContext.sql(SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000).collect().foreach(println) println(--Total count by languages Lang, count(*)---) sqlContext.sql(SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25).collect.foreach(println) println(--- Training the model and persist it) val texts = sqlContext.sql(SELECT text from tweetTable).map(_.head.toString) // Cache the vectors RDD since it will be used for all the KMeans iterations. val vectors = texts.map(Utils.featurize).cache() vectors.count() // Calls an action on the RDD to populate the vectors cache. 
val model = KMeans.train(vectors, numClusters, numIterations) sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir) val some_tweets = texts.take(100) println(Example tweets from the clusters) for (i - 0 until numClusters) { println(s\nCLUSTER $i:) some_tweets.foreach { t = if (model.predict(Utils.featurize(t)) == i) { println(t) } } } } } Thanks Regards Jishnu Menath Prathap - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Building Desktop application for ALS-MlLib/ Training ALS
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: Hi, I am a new bee in spark and scala world I have been trying to implement Collaborative filtering using MlLib supplied out of the box with Spark and Scala I have 2 problems 1. The best model was trained with rank = 20 and lambda = 5.0, and numIter = 10, and its RMSE on the test set is 25.718710831912485. The best model improves the baseline by 18.29%. Is there a scientific way in which RMSE could be brought down? What is a descent acceptable value for RMSE? The grid search approach used in the AMPCamp tutorial is pretty standard. Whether an RMSE is good or not really depends on your dataset. 2. I picked up the Collaborative filtering algorithm from http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html and executed the given code with my dataset. Now, I want to build a desktop application around it. a. What is the best language to do this Java/ Scala? Any possibility to do this using C#? We support Java/Scala/Python. Start with the one your are most familiar with. C# is not supported. b. Can somebody please share any relevant documents/ source or any helper links to help me get started on this? For ALS, you can check the API documentation. Your help is greatly appreciated Thanks!! Regards, Saurabh Agrawal This e-mail, including accompanying communications and attachments, is strictly confidential and only for the intended recipient. Any retention, use or disclosure not expressly authorised by Markit is prohibited. This email is subject to all waivers and other terms at the following link: http://www.markit.com/en/about/legal/email-disclaimer.page Please visit http://www.markit.com/en/about/contact/contact-us.page? for contact information on our offices worldwide. MarkitSERV Limited has its registered office located at Level 4, Ropemaker Place, 25 Ropemaker Street, London, EC2Y 9LY and is authorized and regulated by the Financial Conduct Authority with registration number 207294 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: ALS failure with size Integer.MAX_VALUE
Unfortunately, it will depends on the Sorter API in 1.2. -Xiangrui On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Xiangrui, The block size limit was encountered even with reduced number of item blocks as you had expected. I'm wondering if I could try the new implementation as a standalone library against a 1.1 deployment. Does it have dependencies on any core API's in the current master? Thanks, Bharath On Wed, Dec 3, 2014 at 10:10 PM, Bharath Ravi Kumar reachb...@gmail.com wrote: Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And yes, I've been following the JIRA for the new ALS implementation. I'll try it out when it's ready for testing. . On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, You can try setting a small item blocks in this case. 1200 is definitely too large for ALS. Please try 30 or even smaller. I'm not sure whether this could solve the problem because you have 100 items connected with 10^8 users. There is a JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-3735 which I will try to implement in 1.3. I'll ping you when it is ready. Best, Xiangrui On Tue, Dec 2, 2014 at 10:40 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Yes, the issue appears to be due to the 2GB block size limitation. I am hence looking for (user, product) block sizing suggestions to work around the block size limitation. On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: (It won't be that, since you see that the error occur when reading a block from disk. I think this is an instance of the 2GB block size limitation.) On Sun, Nov 30, 2014 at 4:36 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi Bharath – I’m unsure if this is your problem but the MatrixFactorizationModel in MLLIB which is the underlying component for ALS expects your User/Product fields to be integers. Specifically, the input to ALS is an RDD[Rating] and Rating is an (Int, Int, Double). I am wondering if perhaps one of your identifiers exceeds MAX_INT, could you write a quick check for that? I have been running a very similar use case to yours (with more constrained hardware resources) and I haven’t seen this exact problem but I’m sure we’ve seen similar issues. Please let me know if you have other questions. From: Bharath Ravi Kumar reachb...@gmail.com Date: Thursday, November 27, 2014 at 1:30 PM To: user@spark.apache.org user@spark.apache.org Subject: ALS failure with size Integer.MAX_VALUE We're training a recommender with ALS in mllib 1.1 against a dataset of 150M users and 4.5K items, with the total number of training records being 1.2 Billion (~30GB data). The input data is spread across 1200 partitions on HDFS. For the training, rank=10, and we've configured {number of user data blocks = number of item data blocks}. The number of user/item blocks was varied between 50 to 1200. Irrespective of the block size (e.g. 
at 1200 blocks each), there are atleast a couple of tasks that end up shuffle reading 9.7G each in the aggregate stage (ALS.scala:337) and failing with the following exception: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108) at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124) at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space
Hi Jay, Please try increasing executor memory (if the available memory is more than 2GB) and reduce numBlocks in ALS. The current implementation stores all subproblems in memory and hence the memory requirement is significant when k is large. You can also try reducing k and see whether the problem is still there. I made a PR that improves the ALS implementation, which generates subproblems one by one. You can try that as well. https://github.com/apache/spark/pull/3720 Best, Xiangrui On Wed, Dec 17, 2014 at 6:57 PM, buring qyqb...@gmail.com wrote: I am not sure this can help you. I have 57 million rating,about 4million user and 4k items. I used 7-14 total-executor-cores,executal-memory 13g,cluster have 4 nodes,each have 4cores,max memory 16g. I found set as follows may help avoid this problem: conf.set(spark.shuffle.memoryFraction,0.65) //default is 0.2 conf.set(spark.storage.memoryFraction,0.3)//default is 0.6 I have to set rank value under 40, otherwise occure this problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-java-lang-OutOfMemoryError-Java-heap-space-tp20584p20755.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Announcing Spark Packages
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. I’d like to invite you to contribute and use Spark Packages and provide feedback! As a disclaimer: Spark Packages is a community index maintained by Databricks and (by design) will include packages outside of the ASF Spark project. We are excited to help showcase and support all of the great work going on in the broader Spark community! Cheers, Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Interpreting MLLib's linear regression o/p
Did you check the indices in the LIBSVM data and the master file? Do they match? -Xiangrui On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I use LIBSVM format to specify my input feature vector, which uses 1-based indices. When I run regression the output is 0-indexed. I have a master lookup file that maps these indices back to what they stand for. However, I need to add an offset of 2 and not 1 to the regression outcome during the mapping. So for example to map the index of 800 from the regression output file, I look for 802 in my master lookup file and then things make sense. I can understand adding an offset of 1, but not sure why adding an offset of 2 is working fine. Have others seen something like this as well? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
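MLUtils.loadLibSVMFile shifts the 1-based LIBSVM indices down by one internally, so weight position i should map back to LIBSVM feature i + 1; if an offset of 2 is needed, the lookup file itself is probably shifted by one. A tiny sketch of the expected mapping (assumes a trained linear model):

    // Weight at 0-based position i corresponds to 1-based LIBSVM feature i + 1.
    model.weights.toArray.zipWithIndex.foreach { case (w, i) =>
      println(s"libsvm feature ${i + 1} -> weight $w")
    }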
Re: MLLib beginner question
How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote: Hi! I want to try out spark mllib in my spark project, but I got a little problem. I have training data (external file), but the real data comes from another RDD. How can I do that? I tried simply using the same SparkContext to build both RDDs (first I create an RDD using sc.textFile() and then NaiveBayes.train...). After that I want to fetch the real data using the same context and call predict inside the map. But my application never exits (I think it is stuck or something). Why doesn't this solution work? Thanks b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
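As a rough illustration, a sketch of training on one RDD and predicting on another within a single SparkContext (the file path, the otherRdd input, and the toFeatures helper are hypothetical):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// training file with lines like "label,f1 f2 f3" -- hypothetical format
val training = sc.textFile("training.txt").map { line =>
  val parts = line.split(",")
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
}.cache()
val model = NaiveBayes.train(training, lambda = 1.0)

// the "real" data comes from another RDD built with the same SparkContext
val predictions = otherRdd.map(record => model.predict(toFeatures(record))) // toFeatures is hypothetical
predictions.take(10).foreach(println)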
Re: MLlib + Streaming
We have streaming linear regression (since v1.1) and k-means (v1.2) in MLlib. You can check the user guide: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering -Xiangrui On Tue, Dec 23, 2014 at 10:01 AM, Gianmarco De Francisci Morales g...@apache.org wrote: Hi, I have recently seen a demo of Spark where different pieces were put together (training via MLlib + deploying on Spark Streaming). I was wondering if MLlib currently works to directly train on Streaming. And, if so, what are the semantics of the algorithms? If not, would it be interesting to have ML algorithms developed for the streaming setting? Thanks, -- Gianmarco - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
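A minimal sketch of the streaming linear regression API referenced above (the StreamingContext ssc, the two DStreams, and the feature count are hypothetical placeholders):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

// trainingStream and testStream: DStream[LabeledPoint], e.g. built with
// ssc.textFileStream(...).map(LabeledPoint.parse)
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
model.trainOn(trainingStream)   // the model is updated on every batch
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()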
Re: retry in combineByKey at BinaryClassificationMetrics.scala
Sean's PR may be relevant to this issue (https://github.com/apache/spark/pull/3702). As a workaround, you can try to truncate the raw scores to 4 digits (e.g., 0.5643215 - 0.5643) before sending it to BinaryClassificationMetrics. This may not work well if he score distribution is very skewed. See discussion on https://issues.apache.org/jira/browse/SPARK-4547 -Xiangrui On Tue, Dec 23, 2014 at 9:00 AM, Thomas Kwan thomas.k...@manage.com wrote: Hi there, We are using mllib 1.1.1, and doing Logistics Regression with a dataset of about 150M rows. The training part usually goes pretty smoothly without any retries. But during the prediction stage and BinaryClassificationMetrics stage, I am seeing retries with error of fetch failure. The prediction part is just as follows: val predictionAndLabel = testRDD.map { point = val prediction = model.predict(point.features) (prediction, point.label) } ... val metrics = new BinaryClassificationMetrics(predictionAndLabel) The fetch failure happened with the following stack trace: org.apache.spark.rdd.PairRDDFunctions.combineByKey(PairRDDFunctions.scala:515) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3$lzycompute(BinaryClassificationMetrics.scala:101) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3(BinaryClassificationMetrics.scala:96) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:98) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:98) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:142) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:50) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:60) com.manage.ml.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:14) ... We are doing this in the yarn-client mode. 32 executors, 16G executor memory, and 12 cores as the spark-submit settings. I wonder if anyone has suggestion on how to debug this. thanks in advance thomas - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
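A sketch of the truncation workaround (scoreAndLabel stands for the RDD[(Double, Double)] of raw scores and labels built as in the snippet above):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val truncated = scoreAndLabel.map { case (score, label) =>
  (math.rint(score * 1e4) / 1e4, label)   // keep 4 decimal digits to shrink the number of distinct score keys
}
val metrics = new BinaryClassificationMetrics(truncated)
println(metrics.areaUnderROC())

In later releases the constructor also accepts an optional numBins argument that down-samples the curve for you.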
Re: Using TF-IDF from MLlib
Hopefully the new pipeline API addresses this problem. We have a code example here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala -Xiangrui On Mon, Dec 29, 2014 at 5:22 AM, andy petrella andy.petre...@gmail.com wrote: Here is what I did for this case: https://github.com/andypetrella/tf-idf On Mon, Dec 29, 2014 at 11:31, Sean Owen so...@cloudera.com wrote: Given (label, terms) you can just transform the values to a TF vector, then a TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can make a LabeledPoint from (label, vector) pairs. Is that what you're looking for? On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote: I found the TF-IDF feature extraction and all the MLlib code that works with pure Vector RDDs very difficult to work with, due to the lack of ability to associate vectors back to the original data. Why can't Spark MLlib support LabeledPoint? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
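Along the lines Sean describes, a minimal sketch that keeps the labels attached (docs stands for a hypothetical RDD[(Double, Seq[String])] of label and terms):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val hashingTF = new HashingTF()
val tf = docs.map { case (label, terms) => hashingTF.transform(terms) }.cache()
val tfidf = new IDF().fit(tf).transform(tf)
// tfidf keeps the same ordering and partitioning as docs, so zip realigns the labels
val training = docs.map(_._1).zip(tfidf).map { case (label, vector) => LabeledPoint(label, vector) }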
Re: weights not changed with different reg param
Could you post your code? It sounds like a bug. One thing to check is whether you set regType, which is None by default. -Xiangrui On Tue, Dec 23, 2014 at 3:36 PM, Thomas Kwan thomas.k...@manage.com wrote: Hi there We are on mllib 1.1.1, and trying different regularization parameters. We noticed that the regParam doesn't affect the weights at all. Is setting the reg param via the optimizer the right thing to do? Do we need to set our own updater? Anyone else seeing the same behaviour? thanks again thomas - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
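For reference, a sketch of setting the regularization through the optimizer in the Scala API (training is a hypothetical RDD[LabeledPoint]); if I recall the SGD defaults correctly, regParam has no effect unless an updater with a penalty is also set:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new SquaredL2Updater)   // an updater with an L2 penalty, so regParam actually matters
val model = lr.run(training)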
Re: MLLib beginner question
b0c1, did you apply model.predict to a DStream? Maybe it would help understand your question better if you can post your code. -Xiangrui On Tue, Dec 23, 2014 at 11:54 AM, boci boci.b...@gmail.com wrote: Xiangrui: Hi, I want to using this with streaming and with job too. I using kafka (streaming) and elasticsearch (job) as source and want to calculate sentiment value from the input text. Simon: great, you have any doc how can I embed into my application without using the http interface? (how can I direct call the service?) b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Tue, Dec 23, 2014 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote: How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote: Hi! I want to try out spark mllib in my spark project, but I got a little problem. I have training data (external file), but the real data com from another rdd. How can I do that? I try to simple using same SparkContext to boot rdd (first I create rdd using sc.textFile() and after NaiveBayes.train... After that I want to fetch the real data using same context and internal the map using the predict. But My application never exit (I think stucked or something). Why not work this solution? Thanks b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SVDPlusPlus Recommender in MLLib
There is an SVD++ implementation in GraphX. It would be nice if you can compare its performance vs. Mahout. -Xiangrui On Wed, Dec 24, 2014 at 6:46 AM, Prafulla Wani prafulla.w...@gmail.com wrote: hi , Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is implemented in Mahout from this paper - http://research.yahoo.com/files/kdd08koren.pdf Regards, Prafulla. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to use BigInteger for userId and productId in collaborative Filtering?
Hi Nishanth, Just found out where you work:) We had some discussion in https://issues.apache.org/jira/browse/SPARK-2465 . Having long IDs will increase the communication cost, which may not worth the benefit. Not many companies have more than 1 billion users. If they do, maybe they can mirror the implementation for their use cases. I can suggest several possible solutions: 1. Hash user IDs into integers before training. If the collision rate is high and it is crucial for your business, you can recompute user features from product features by solving least squares after training. This works when the product IDs could be mapped to integers. 2. Make type aliases in ALS, so that you can easily mirror the implementation to use long IDs and track future changes. 3. Make ALS implementation use generic ID types. This would be the best solution, but it requires some refactoring of the code. Best, Xiangrui On Wed, Jan 14, 2015 at 1:04 PM, Nishanth P S nishant...@gmail.com wrote: Yes, we are close to having more 2 billion users. In this case what is the best way to handle this. Thanks, Nishanth On Fri, Jan 9, 2015 at 9:50 PM, Xiangrui Meng men...@gmail.com wrote: Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer (check RDD.zipWithUniqueId), use them in ALS, and then join the original bigInt IDs back after training. -Xiangrui On Fri, Jan 9, 2015 at 5:12 PM, nishanthps nishant...@gmail.com wrote: Hi, The userId's and productId's in my data are bigInts, what is the best way to run collaborative filtering on this data. Should I modify MLlib's implementation to support more types? or is there an easy way. Thanks!, Nishanth -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-BigInteger-for-userId-and-productId-in-collaborative-Filtering-tp21072.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
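A sketch of the zipWithUniqueId approach from the quoted reply (bigIntRatings and its shape are hypothetical):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// bigIntRatings: RDD[(BigInt, BigInt, Double)] of (userId, productId, rating) -- hypothetical
val userIds = bigIntRatings.map(_._1).distinct().zipWithUniqueId()     // BigInt -> Long
val productIds = bigIntRatings.map(_._2).distinct().zipWithUniqueId()
// ALS takes Int ids, so this assumes fewer than ~2 billion distinct users and products
val ratings = bigIntRatings.map { case (u, p, r) => (u, (p, r)) }
  .join(userIds).map { case (_, ((p, r), uIdx)) => (p, (uIdx, r)) }
  .join(productIds).map { case (_, ((uIdx, r), pIdx)) => Rating(uIdx.toInt, pIdx.toInt, r) }
val model = ALS.train(ratings, 10, 10, 0.01)
// join userIds / productIds back onto the results to recover the original BigInt ids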
Re: Using a RowMatrix inside a map
Yes, you can only use RowMatrix.multiply() within the driver. We are working on distributed block matrices and linear algebra operations on top of it, which would fit your use cases well. It may take several PRs to finish. You can find the first one here: https://github.com/apache/spark/pull/3200 -Xiangrui On Wed, Jan 14, 2015 at 2:06 PM, Alex Minnaar aminn...@verticalscope.com wrote: I am working with a RowMatrix and I noticed in the multiply() method that the local matrix with which it is being multiplied is being distributed to all of the rows of the RowMatrix. If this is the case, then is it impossible to multiply a row matrix within a map operation? Because this would essentially be creating RDDs within RDDs. For example, If you had an RDD of local matrices and you wanted to perform a map operation where each local matrix is multiplied with a distributed matrix. This does not seem possible since it would require distributing each local matrix in the map when multiplication occurs (i.e. creating an RDD in each element of the original RDD). If this is true then does it mean you can only multiply a RowMatrix within the driver i.e. you cannot parallelize RowMatrix multiplications? Thanks, Alex - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
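To make the driver-side restriction concrete, a small sketch (the row data is made up):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0, 3.0), Vectors.dense(4.0, 5.0, 6.0)))
val A = new RowMatrix(rows)
val B = Matrices.dense(3, 2, Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0))  // a local 3x2 matrix
val product: RowMatrix = A.multiply(B)   // called on the driver; B is shipped to the executors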
Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
The assumption of the implicit feedback model is that the unobserved ratings are more likely to be negative. So you may want to add some negatives for evaluation. Otherwise, the input ratings are all 1 and the test ratings are all 1 as well. The baseline predictor, which uses the average rating (that is 1), could easily give you an RMSE of 0.0. -Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
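A rough sketch of what that evaluation setup can look like (observed, heldOutPositives, and sampleUnobserved are hypothetical placeholders):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// observed: RDD[Rating] with implicit preference 1.0 (e.g. a click) -- hypothetical
val model = ALS.trainImplicit(observed, 10, 10, 0.01, 40.0)   // rank, iterations, lambda, alpha

// include sampled unobserved (user, product) pairs with preference 0.0 in the test set,
// otherwise every test preference is 1 and a constant predictor looks perfect
val test = heldOutPositives.union(sampleUnobserved(observed))  // both hypothetical RDD[Rating]
val predictions = model.predict(test.map(r => (r.user, r.product)))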
Re: How to create distributed matrixes from hive tables.
You can get a SchemaRDD from the Hive table, map it into a RDD of Vectors, and then construct a RowMatrix. The transformations are lazy, so there is no external storage requirement for intermediate data. -Xiangrui On Sun, Jan 18, 2015 at 4:07 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, We have large datasets with data format for Spark MLLib matrix, but there are pre-computed by Hive and stored inside Hive, my question is can we create a distributed matrix such as IndexedRowMatrix directlly from Hive tables, avoiding reading data from Hive tables and feed them into an empty Matrix. Regards - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
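A minimal sketch of that path (the table and column names are hypothetical):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val rows = hiveContext.sql("SELECT c1, c2, c3 FROM features")
  .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1), row.getDouble(2)))
val mat = new RowMatrix(rows)   // nothing is materialized until an action runs on it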
Re: Saving a mllib model in Spark SQL
You can save the cluster centers as a SchemaRDD of two columns (id: Int, center: Array[Double]). When you load it back, you can construct the k-means model from its cluster centers. -Xiangrui On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian lian.cs@gmail.com wrote: This is because KMeanModel is neither a built-in type nor a user defined type recognized by Spark SQL. I think you can write your own UDT version of KMeansModel in this case. You may refer to o.a.s.mllib.linalg.Vector and o.a.s.mllib.linalg.VectorUDT as an example. Cheng On 1/20/15 5:34 AM, Divyansh Jain wrote: Hey people, I have run into some issues regarding saving the k-means mllib model in Spark SQL by converting to a schema RDD. This is what I am doing: case class Model(id: String, model: org.apache.spark.mllib.clustering.KMeansModel) import sqlContext.createSchemaRDD val rowRdd = sc.makeRDD(Seq(id, model)).map(p = Model(id, model)) This is the error that I get : scala.MatchError: org.apache.spark.mllib.classification.ClassificationModel (of class scala.reflect.internal.Types$TypeRef$anon$6) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:53) at org.apache.spark.sql.catalyst.ScalaReflection$anonfun$schemaFor$1.apply(ScalaReflection.scala:64) at org.apache.spark.sql.catalyst.ScalaReflection$anonfun$schemaFor$1.apply(ScalaReflection.scala:62) at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:62) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:50) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:44) at org.apache.spark.sql.execution.ExistingRdd$.fromProductRdd(basicOperators.scala:229) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:94) Any help would be appreciated. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Saving-a-mllib-model-in-Spark-SQL-tp21264.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
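A sketch of that round trip, assuming the KMeansModel constructor that takes an Array[Vector] is accessible in your version; the save path and the element type of the loaded array column are assumptions as well:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

case class Center(id: Int, center: Array[Double])

// save: one row per cluster center
import sqlContext.createSchemaRDD
val centers = sc.parallelize(model.clusterCenters.zipWithIndex.map { case (v, i) => Center(i, v.toArray) })
centers.saveAsParquetFile("kmeans-centers.parquet")   // hypothetical path

// load: rebuild the model from the stored centers
val restored = new KMeansModel(
  sqlContext.parquetFile("kmeans-centers.parquet")
    .map(r => (r.getInt(0), Vectors.dense(r(1).asInstanceOf[Seq[Double]].toArray)))
    .collect().sortBy(_._1).map(_._2))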
Re: Stepsize with Linear Regression
The best step size depends on the condition number of the problem. You can try some conditioning heuristics first, e.g., normalizing the columns, and then try a common step size like 0.01. We should implement line search for linear regression in the future, as in LogisticRegressionWithLBFGS. Line search can determine the step size automatically for you. -Xiangrui On Tue, Feb 10, 2015 at 8:56 AM, Rishi Yadav ri...@infoobjects.com wrote: Are there any rules of thumb on how to set the step size with gradient descent? I am using it for Linear Regression, but I am sure it applies to gradient descent in general. At present I am deriving the number that fits the training set's response variable values most closely. I am sure there is a better way to do it. Thanks and Regards, Rishi @meditativesoul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
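As a sketch of the conditioning idea (training is a hypothetical RDD[LabeledPoint]; withMean is left false to preserve sparsity):

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val scaler = new StandardScaler(withMean = false, withStd = true).fit(training.map(_.features))
val scaled = training.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features))).cache()

val lr = new LinearRegressionWithSGD()
lr.optimizer.setStepSize(0.01).setNumIterations(200)   // start around 0.01 and tune from there
val model = lr.run(scaled)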
Re: high GC in the Kmeans algorithm
Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to avoid GC triggered frequently. Please monitor the memory usage using YourKit or VisualVM. -Xiangrui On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com wrote: I just want to make the best use of CPU, and test the performance of spark if there is a lot of task in a single node. On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen so...@cloudera.com wrote: Good, worth double-checking that's what you got. That's barely 1GB per task though. Why run 48 if you have 24 cores? On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote: I give 50GB to the executor, so it seem that there is no reason the memory is not enough. On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen so...@cloudera.com wrote: Meaning, you have 128GB per machine but how much memory are you giving the executors? On Wed, Feb 11, 2015 at 8:49 AM, lihu lihu...@gmail.com wrote: What do you mean? Yes,I an see there is some data put in the memory from the web ui. On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen so...@cloudera.com wrote: Are you actually using that memory for executors? On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote: Hi, I run the kmeans(MLlib) in a cluster with 12 workers. Every work own a 128G RAM, 24Core. I run 48 task in one machine. the total data is just 40GB. When the dimension of the data set is about 10^7, for every task the duration is about 30s, but the cost for GC is about 20s. When I reduce the dimension to 10^4, then the gc is small. So why gc is so high when the dimension is larger? or this is the reason caused by MLlib? -- Best Wishes! Li Hu(李浒) | Graduate Student Institute for Interdisciplinary Information Sciences(IIIS) Tsinghua University, China Email: lihu...@gmail.com Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ -- Best Wishes! Li Hu(李浒) | Graduate Student Institute for Interdisciplinary Information Sciences(IIIS) Tsinghua University, China Email: lihu...@gmail.com Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ -- Best Wishes! Li Hu(李浒) | Graduate Student Institute for Interdisciplinary Information Sciences(IIIS) Tsinghua University, China Email: lihu...@gmail.com Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
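For reference, a minimal k-means run with the input cached up front (the path and parameters are placeholders); the storage tab of the UI should show the RDD as fully cached:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///path/to/vectors")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(data, 100, 20)   // k = 100, maxIterations = 20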
Re: feeding DataFrames into predictive algorithms
Hey Sandy, The work should be done by a VectorAssembler, which combines multiple columns (double/int/vector) into a vector column, which becomes the features column for regression. We can going to create JIRAs for each of these standard feature transformers. It would be great if you can help implement some of them. Best, Xiangrui On Wed, Feb 11, 2015 at 7:55 PM, Patrick Wendell pwend...@gmail.com wrote: I think there is a minor error here in that the first example needs a tail after the seq: df.map { row = (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame(label, features) On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust mich...@databricks.com wrote: It sounds like you probably want to do a standard Spark map, that results in a tuple with the structure you are looking for. You can then just assign names to turn it back into a dataframe. Assuming the first column is your label and the rest are features you can do something like this: val df = sc.parallelize( (1.0, 2.3, 2.4) :: (1.2, 3.4, 1.2) :: (1.2, 2.3, 1.2) :: Nil).toDataFrame(a, b, c) df.map { row = (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double])) }.toDataFrame(label, features) df: org.apache.spark.sql.DataFrame = [label: double, features: arraydouble] If you'd prefer to stick closer to SQL you can define a UDF: val createArray = udf((a: Double, b: Double) = Seq(a, b)) df.select('a as 'label, createArray('b,'c) as 'features) df: org.apache.spark.sql.DataFrame = [label: double, features: arraydouble] We'll add createArray as a first class member of the DSL. Michael On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column that I can feed into a regression. So far all the paths I've gone down have led me to internal APIs or convoluted casting in and out of RDD[Row] and DataFrame. Is there a simple way of accomplishing this? any assistance (lookin' at you Xiangrui) much appreciated, Sandy - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
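A restatement of the two approaches above as they look with the released 1.3 API (where toDataFrame became toDF); the column names follow Michael's example:

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val df = sc.parallelize(Seq((1.0, 2.3, 2.4), (1.2, 3.4, 1.2), (1.2, 2.3, 1.2))).toDF("a", "b", "c")

// 1. plain map over rows, then assign names
val labeled = df.map { row =>
  (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
}.toDF("label", "features")

// 2. staying in the DSL, with a UDF that builds the array column
val createArray = udf((b: Double, c: Double) => Seq(b, c))
val labeled2 = df.select(df("a").as("label"), createArray(df("b"), df("c")).as("features"))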
Re: MLib usage on Spark Streaming
JavaDStream.foreachRDD (https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function)) and Statistics.corr (https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/mllib/stat/Statistics.html#corr(org.apache.spark.rdd.RDD)) should be good starting points. -Xiangrui On Mon, Feb 16, 2015 at 6:39 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I'm newbie to Spark and I have the following case study: 1. Client sending at 100ms the following data: {uniqueId, timestamp, measure1, measure2 } 2. Each 30 seconds I would like to correlate the data collected in the window, with some predefined double vector pattern for each given key. The predefined pattern has 300 records. The data should be also sorted by timestamp. 3. When the correlation is greater than a predefined threshold (e.g 0.9) I would like to emit an new message containing {uniqueId, doubleCorrelationValue} 4. For the correlation I would like to use MLlib 5. As a programming language I would like to muse Java 7. Can you please give me some suggestions on how to create the skeleton for the above scenario? Thanks. Regards, Florin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
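A rough sketch of those two pieces put together (the measures DStream and the pattern RDD are hypothetical, and both series must have the same length and ordering for corr to make sense):

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.streaming.Seconds

// measures: DStream[Double], pattern: RDD[Double] with the 300-point reference -- hypothetical
measures.window(Seconds(30)).foreachRDD { rdd =>
  if (rdd.count() == pattern.count()) {
    val corr = Statistics.corr(rdd, pattern, "pearson")
    if (corr > 0.9) println(s"correlation above threshold: $corr")
  }
}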
Re: Large Similarity Job failing
The complexity of DIMSUM is independent of the number of rows but still have quadratic dependency on the number of columns. 1.5M columns may be too large to use DIMSUM. Try to increase the threshold and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am running brute force similarity from RowMatrix on a job with 5M x 1.5M sparse matrix with 800M entries. With 200M entries the job run fine but with 800M I am getting exceptions like too many files open and no space left on device... Seems like I need more nodes or use dimsum sampling ? I am running on 10 nodes where ulimit on each node is set at 65K...Memory is not an issue since I can cache the dataset before similarity computation starts. I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable. Both the jobs failed with FetchFailed msgs. Thanks. Deb - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
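For reference, the threshold is the single argument of columnSimilarities (mat stands for the RowMatrix already built from the data):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val exact = mat.columnSimilarities()        // brute force; cost grows with the square of the number of columns
val approx = mat.columnSimilarities(0.1)    // DIMSUM sampling; a higher threshold prunes more column pairs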
Re: [POWERED BY] Radius Intelligence
Thanks! I added Radius to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. -Xiangrui On Tue, Feb 10, 2015 at 12:02 AM, Alexis Roos alexis.r...@gmail.com wrote: Also long due given our usage of Spark .. Radius Intelligence: URL: radius.com Description: Spark, MLLib Using Scala, Spark and MLLib for Radius Marketing and Sales intelligence platform including data aggregation, data processing, data clustering, data analysis and predictive modeling of all US businesses. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unknown sample in Naive Baye's
If there exists a sample that doesn't not belong to A/B/C, it means that there exists another class D or Unknown besides A/B/C. You should have some of these samples in the training set in order to let naive Bayes learn the priors. -Xiangrui On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am using MLlib's Naive Baye's classifier to classify textual data. I am accessing the posterior probabilities through a hack for each class. Once I have trained the model, I want to remove documents whose confidence of classification is low. Say for a document, if the highest class probability is lesser than a pre-defined threshold(separate for each class), categorize this document as 'unknown'. Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33 respectively defined after training and testing. If I score a sample that belongs to neither of the three categories, I wish to classify it as 'unknown'. But the issue is I can get a probability higher than these thresholds for a document that doesn't belong to the trained categories. Is there any technique which I can apply to segregate documents that belong to untrained classes with certain degree of confidence? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Naive Bayes model fails after a few predictions
Could you share the error log? What do you mean by 500 instead of 200? If this is the number of files, try to use `repartition` before calling naive Bayes, which works the best when the number of partitions matches the number of cores, or even less. -Xiangrui On Tue, Feb 10, 2015 at 10:34 PM, rkgurram rkgur...@gmail.com wrote: Further I have tried HttpBroadcast but that too does not work. It is almost like there is a MemoryLeak because if I increase the input files to 500 instead of 200 the system crashes early. The code is as follows logger.info(Training the model Fold:[+ fold +]) logger.info(Step 1: Split the input into Training and Testing sets) val splits = labeledPointRDD.randomSplit(Array(0.6, 0.4), seed = 11L) logger.info(Step 1: splits successful...) val training = splits(0) val test = splits(1) status = ModelStatus.IN_TRAINING //logger.info(Fold:[ + fold + ] Training count: + training.count() + Testing/Verification count: + test.count()) logger.info(Step 2: Train the NB classifier) model = NaiveBayes.train(training, lambda = 1.0) logger.info(Step 2: NB model training complete Fold:[ + fold + ]) logger.info(Step 3: Testing/Verification of the model) status = ModelStatus.IN_VERIFICATION val predictionAndLabel = test.map(p = (model.predict(p.features), p.label)) val arry = predictionAndLabel.filter(x = x._1 == x._2) val accuracy = 1.0 * predictionAndLabel.filter(x = x._1 == x._2).count() / test.count() logger.info(Step 3: Testing complete) status = ModelStatus.INITIALIZED logger.info(Fold[+ fold +] Accuracy:[ + accuracy + ] Model Status:[ + status + ]) -Ravi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Naive-Bayes-model-fails-after-a-few-predictions-tp21592p21593.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
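The repartition suggestion in a couple of lines (labeledPointRDD is the RDD from the quoted code):

import org.apache.spark.mllib.classification.NaiveBayes

// match the number of partitions to the total number of cores before training
val training = labeledPointRDD.repartition(sc.defaultParallelism).cache()
val model = NaiveBayes.train(training, lambda = 1.0)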
Re: WARN from Similarity Calculation
It may be caused by GC pause. Did you check the GC time in the Spark UI? -Xiangrui On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms exceeds 45000ms Do I need to increase the default 45 s to larger values for cases where we are doing blocked operation or long compute in the mapPartitions ? Thanks. Deb - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unknown sample in Naive Baye's
If you know there are data doesn't belong to any existing category, put them into the training set and make a new category for them. It won't help much if instances from this unknown category are all outliers. In that case, lower the thresholds and tune the parameters to get a lower error rate. -Xiangrui On Thu, Feb 19, 2015 at 8:58 AM, Jatinpreet Singh jatinpr...@gmail.com wrote: Hi Xiangrui, Thanks for the answer. The problem is that in my application, I can not stop user from scoring any type of sample against trained model. So, even if the class of a completely unknown sample has not been trained, the model will put it in one of the categories with high priority. I wish to eliminate this with come kind of probability threshold. Is this possible in any way with Naive Baye's? Can changing the classification algorithm help in this regard? I appreciate any help on this. Thanks, Jatin On Wed, Feb 18, 2015 at 3:07 AM, Xiangrui Meng men...@gmail.com wrote: If there exists a sample that doesn't not belong to A/B/C, it means that there exists another class D or Unknown besides A/B/C. You should have some of these samples in the training set in order to let naive Bayes learn the priors. -Xiangrui On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am using MLlib's Naive Baye's classifier to classify textual data. I am accessing the posterior probabilities through a hack for each class. Once I have trained the model, I want to remove documents whose confidence of classification is low. Say for a document, if the highest class probability is lesser than a pre-defined threshold(separate for each class), categorize this document as 'unknown'. Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33 respectively defined after training and testing. If I score a sample that belongs to neither of the three categories, I wish to classify it as 'unknown'. But the issue is I can get a probability higher than these thresholds for a document that doesn't belong to the trained categories. Is there any technique which I can apply to segregate documents that belong to untrained classes with certain degree of confidence? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Regards, Jatinpreet Singh - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLIB and Openblas library in non-default dir
It might be hard to do that with spark-submit, because the executor JVMs may be already up and running before a user runs spark-submit. You can try to use `System.setProperty` to change the property at runtime, though it doesn't seem to be a good solution. -Xiangrui On Fri, Jan 2, 2015 at 6:28 AM, xhudik xhu...@gmail.com wrote: Hi I have compiled OpenBlas library into nonstandard directory and I want to inform Spark app about it via: -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so which is a standard option in netlib-java (https://github.com/fommil/netlib-java) I tried 2 ways: 1. via *--conf* parameter /bin/spark-submit -v --class org.apache.spark.examples.mllib.LinearRegression *--conf -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so* examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt/ 2. via *--driver-java-options* parameter /bin/spark-submit -v *--driver-java-options -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so* --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt / How can I force spark-submit to propagate info about non-standard placement of openblas library to netlib-java lib? thanks, Tomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-and-Openblas-library-in-non-default-dir-tp20943.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
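A sketch of the System.setProperty route; it has to run in the driver before the first native BLAS call, and for executor JVMs the same -D flag may need to go through spark.executor.extraJavaOptions:

// point netlib-java at the custom OpenBLAS build; the property name is the one from the thread above
System.setProperty("com.github.fommil.netlib.NativeSystemBLAS.natives",
  "/usr/local/lib/libopenblas.so")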
Re: python API for gradient boosting?
I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-5094. Hopefully someone would work on it and make it available in the 1.3 release. -Xiangrui On Sun, Jan 4, 2015 at 6:58 PM, Christopher Thom christopher.t...@quantium.com.au wrote: Hi, I wonder if anyone knows when a python API will be added for Gradient Boosted Trees? I see that java and scala APIs were added for the 1.2 release, and would love to be able to build GBMs in pyspark too. cheers chris Christopher Thom QUANTIUM Level 25, 8 Chifley, 8-12 Chifley Square Sydney NSW 2000 T: +61 2 8222 3577 F: +61 2 9292 6444 W: quantium.com.au linkedin.com/company/quantium facebook.com/QuantiumAustralia twitter.com/QuantiumAU The contents of this email, including attachments, may be confidential information. If you are not the intended recipient, any use, disclosure or copying of the information is unauthorised. If you have received this email in error, we would be grateful if you would notify us immediately by email reply, phone (+ 61 2 9292 6400) or fax (+ 61 2 9292 6444) and delete the message from your system. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
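In the meantime, the Scala API that the JIRA would mirror in Python looks roughly like this (training is a hypothetical RDD[LabeledPoint]):

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 100
boostingStrategy.treeStrategy.maxDepth = 5
val model = GradientBoostedTrees.train(training, boostingStrategy)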
Re: Driver hangs on running mllib word2vec
How big is your dataset, and what is the vocabulary size? -Xiangrui On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi, When we run mllib word2vec(spark-1.1.0), driver get stuck with 100% cup usage. Here is the jstack output: main prio=10 tid=0x40112800 nid=0x46f2 runnable [0x4162e000] java.lang.Thread.State: RUNNABLE at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847) at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778) at java.io.DataOutputStream.writeInt(DataOutputStream.java:182) at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225) at java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) at com.baidu.inf.WordCount$.main(WordCount.scala:31) at com.baidu.inf.WordCount.main(WordCount.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- Best Regards - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: OptionalDataException during Naive Bayes Training
How big is your data? Did you see other error messages from executors? It seems to me like a shuffle communication error. This thread may be relevant: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccalrnvjuvtgae_ag1rqey_cod1nmrlfpesxgsb7g8r21h0bm...@mail.gmail.com%3E -Xiangrui On Fri, Jan 9, 2015 at 3:19 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am using Spark Version 1.1 in standalone mode in the cluster. Sometimes, during Naive Baye's training, I get OptionalDataException at line, map at NaiveBayes.scala:109 I am getting following exception on the console, java.io.OptionalDataException: java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1371) java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) java.util.HashMap.readObject(HashMap.java:1394) sun.reflect.GeneratedMethodAccessor626.invoke(Unknown Source) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:483) java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) What could be the reason behind this? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OptionalDataException-during-Naive-Bayes-Training-tp21059.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to use BigInteger for userId and productId in collaborative Filtering?
Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer (check RDD.zipWithUniqueId), use them in ALS, and then join the original bigInt IDs back after training. -Xiangrui On Fri, Jan 9, 2015 at 5:12 PM, nishanthps nishant...@gmail.com wrote: Hi, The userId's and productId's in my data are bigInts, what is the best way to run collaborative filtering on this data. Should I modify MLlib's implementation to support more types? or is there an easy way. Thanks!, Nishanth -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-BigInteger-for-userId-and-productId-in-collaborative-Filtering-tp21072.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Discrepancy in PCA values
You need to subtract mean values to obtain the covariance matrix (http://en.wikipedia.org/wiki/Covariance_matrix). On Fri, Jan 9, 2015 at 6:41 PM, Upul Bandara upulband...@gmail.com wrote: Hi Xiangrui, Thanks for the reply. Julia code is also using the covariance matrix: (1/n)*X'*X ; Thanks, Upul On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng men...@gmail.com wrote: The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote: Hi All, I tried to do PCA for the Iris dataset [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib [http://spark.apache.org/docs/1.1.1/mllib-dimensionality-reduction.html]. Also, PCA was calculated in Julia using following method: Sigma = (1/numRow(X))*X'*X ; [U, S, V] = svd(Sigma); Ureduced = U(:, 1:k); Z = X*Ureduced; However, I'm seeing a little difference between values given by MLLib and the method shown above . Does anyone have any idea about this difference? Additionally, I have attached two visualizations, related to two approaches. Thanks, Upul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: calculating the mean of SparseVector RDD
colStats() computes the mean values along with several other summary statistics, which makes it slower. How is the performance if you don't use kryo? -Xiangrui On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar rokros...@gmail.com wrote: thanks for the suggestion -- however, looks like this is even slower. With the small data set I'm using, my aggregate function takes ~ 9 seconds and the colStats.mean() takes ~ 1 minute. However, I can't get it to run with the Kyro serializer -- I get the error: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8 is there an easy/obvious fix? On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng men...@gmail.com wrote: There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote: I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector. I've tried doing this with the following (using pyspark, spark v1.2.0): def aggregate_partition_values(vec1, vec2) : vec1[vec2.indices] += vec2.values return vec1 def aggregate_combined_vectors(vec1, vec2) : if all(vec1 == vec2) : # then the vector came from only one partition return vec1 else: return vec1 + vec2 means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values, aggregate_combined_vectors) means = means / nvals This turns out to be really slow -- and doesn't seem to depend on how many vectors there are so there seems to be some overhead somewhere that I'm not understanding. Is there a better way of doing this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/calculating-the-mean-of-SparseVector-RDD-tp21019.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
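For reference, the colStats call mentioned above, written in Scala (vectors is a hypothetical RDD[Vector] of the sparse vectors):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics

val summary = Statistics.colStats(vectors)   // one pass over the data
val mean: Vector = summary.mean              // dense vector of per-column means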
Re: Zipping RDDs of equal size not possible
sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out How do you guarantee that the two RDDs have the same size? -Xiangrui On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke 1wil...@informatik.uni-hamburg.de wrote: Hi Spark community, I have a problem with zipping two RDDs of the same size and same number of partitions. The error message says that zipping is only allowed on RDDs which are partitioned into chunks of exactly the same sizes. How can I assure this? My workaround at the moment is to repartition both RDDs to only one partition but that obviously does not scale. This problem originates from my problem to draw n random tuple pairs (Tuple, Tuple) from an RDD[Tuple]. What I do is to sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out and zipping them together. I would appreciate to read better approaches for both problems. Thanks in advance, Niklas - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
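One commonly used workaround that avoids collapsing to a single partition is to key both RDDs by a position index and join, since join does not require matching partition sizes; a sketch (rdd1 and rdd2 are the two equally sized RDDs):

val a = rdd1.zipWithIndex().map(_.swap)   // (index, element)
val b = rdd2.zipWithIndex().map(_.swap)
val zipped = a.join(b).values             // pairs up elements that share the same index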
Re: Discrepancy in PCA values
Could you compare V directly and tell us more about the difference you saw? The column of V should be the same subject to signs. For example, the first column of V could be either [0.8, -0.6, 0.0] or [-0.8, 0.6, 0.0]. -Xiangrui On Sat, Jan 10, 2015 at 8:08 PM, Upul Bandara upulband...@gmail.com wrote: Hi Xiangrui, Thanks a lot for you answer. So I fixed my Julia code, also calculated PCA using R as well. R Code: - data - read.csv('/home/upul/Desktop/iris.csv'); X - data[,1:4] pca - prcomp(X, center = TRUE, scale=FALSE) transformed - predict(pca, newdata = X) Julia Code (Fixed) -- data = readcsv(/home/upul/temp/iris.csv); X = data[:,1:end-1]; meanX = mean(X,1); m,n = size(X); X = X - repmat(x, m,1); u,s,v = svd(X); transformed = X*v; Now PCA calculated using Julia and R is identical, but still I can see a small difference between PCA values given by Spark and other two. Thanks, Upul On Sat, Jan 10, 2015 at 11:17 AM, Xiangrui Meng men...@gmail.com wrote: You need to subtract mean values to obtain the covariance matrix (http://en.wikipedia.org/wiki/Covariance_matrix). On Fri, Jan 9, 2015 at 6:41 PM, Upul Bandara upulband...@gmail.com wrote: Hi Xiangrui, Thanks for the reply. Julia code is also using the covariance matrix: (1/n)*X'*X ; Thanks, Upul On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng men...@gmail.com wrote: The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote: Hi All, I tried to do PCA for the Iris dataset [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib [http://spark.apache.org/docs/1.1.1/mllib-dimensionality-reduction.html]. Also, PCA was calculated in Julia using following method: Sigma = (1/numRow(X))*X'*X ; [U, S, V] = svd(Sigma); Ureduced = U(:, 1:k); Z = X*Ureduced; However, I'm seeing a little difference between values given by MLLib and the method shown above . Does anyone have any idea about this difference? Additionally, I have attached two visualizations, related to two approaches. Thanks, Upul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
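For comparison on the Spark side, RowMatrix can compute the principal components directly; rows is a hypothetical RDD[Vector] with the four iris measurements per row:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2)   // components derived from the covariance matrix; columns are the top PCs
val projected = mat.multiply(pc)             // projection of the (uncentered) rows onto those components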
Re: calculating the mean of SparseVector RDD
No, colStats() computes all summary statistics in one pass and store the values. It is not lazy. On Mon, Jan 12, 2015 at 4:42 AM, Rok Roskar rokros...@gmail.com wrote: This was without using Kryo -- if I use kryo, I got errors about buffer overflows (see above): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8 Just calling colStats doesn't actually compute those statistics, does it? It looks like the computation is only carried out once you call the .mean() method. On Sat, Jan 10, 2015 at 7:04 AM, Xiangrui Meng men...@gmail.com wrote: colStats() computes the mean values along with several other summary statistics, which makes it slower. How is the performance if you don't use kryo? -Xiangrui On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar rokros...@gmail.com wrote: thanks for the suggestion -- however, looks like this is even slower. With the small data set I'm using, my aggregate function takes ~ 9 seconds and the colStats.mean() takes ~ 1 minute. However, I can't get it to run with the Kyro serializer -- I get the error: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8 is there an easy/obvious fix? On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng men...@gmail.com wrote: There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote: I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector. I've tried doing this with the following (using pyspark, spark v1.2.0): def aggregate_partition_values(vec1, vec2) : vec1[vec2.indices] += vec2.values return vec1 def aggregate_combined_vectors(vec1, vec2) : if all(vec1 == vec2) : # then the vector came from only one partition return vec1 else: return vec1 + vec2 means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values, aggregate_combined_vectors) means = means / nvals This turns out to be really slow -- and doesn't seem to depend on how many vectors there are so there seems to be some overhead somewhere that I'm not understanding. Is there a better way of doing this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/calculating-the-mean-of-SparseVector-RDD-tp21019.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: including the spark-mllib in build.sbt
I don't know the root cause. Could you try including only libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1" It should be sufficient because mllib depends on core. -Xiangrui On Mon, Jan 12, 2015 at 2:27 PM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I am trying to build my own scala project using sbt. The project depends on both spark-core and spark-mllib. I included the following two dependencies in my build.sbt file libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" However, when I run the package command in sbt, I got an error message indicating that object mllib is not a member of package org.apache.spark. Did I do anything wrong? Thanks, Jianguo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
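A minimal build.sbt along those lines (the project name and Scala version are placeholders; match scalaVersion to your Spark build):

name := "my-spark-app"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"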
Re: TF-IDF from spark-1.1.0 not working on cluster mode
This is worker log, not executor log. The executor log can be found in folders like /newdisk2/rta/rtauser/workerdir/app-20150109182514-0001/0/ . -Xiangrui On Fri, Jan 9, 2015 at 5:03 AM, Priya Ch learnings.chitt...@gmail.com wrote: Please find the attached worker log. I could see stream closed exception On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote: Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote: Hi All, Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in local mode and not on distributed mode. Null pointer exception has been thrown. Is this a bug in spark-1.1.0 ? Following is the code: def main(args:Array[String]) { val conf=new SparkConf val sc=new SparkContext(conf) val documents=sc.textFile(hdfs://IMPETUS-DSRV02:9000/nlp/sampletext).map(_.split( ).toSeq) val hashingTF = new HashingTF() val tf= hashingTF.transform(documents) tf.cache() val idf = new IDF().fit(tf) val tfidf = idf.transform(tf) val rdd=tfidf.map { vec = println(vector is+vec) (10) } rdd.saveAsTextFile(/home/padma/usecase) } Exception thrown: 15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0 15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes) 15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes) 15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM 15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888] 15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130] 15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending 15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB) 15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB) 15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB) 15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB) 15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException: org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) java.lang.Thread.run(Thread.java:722) Thanks, Padma Ch - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLIB and Openblas library in non-default dir
spark-submit may not share the same JVM with Spark master and executors. On Tue, Jan 6, 2015 at 11:40 AM, Tomas Hudik xhu...@gmail.com wrote: thanks Xiangrui I'll try it. BTW: spark-submit is a standalone program (bin/spark-submit). Therefore, JVM has to be executed after spark-submit script Am I correct? On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng men...@gmail.com wrote: It might be hard to do that with spark-submit, because the executor JVMs may be already up and running before a user runs spark-submit. You can try to use `System.setProperty` to change the property at runtime, though it doesn't seem to be a good solution. -Xiangrui On Fri, Jan 2, 2015 at 6:28 AM, xhudik xhu...@gmail.com wrote: Hi I have compiled OpenBlas library into nonstandard directory and I want to inform Spark app about it via: -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so which is a standard option in netlib-java (https://github.com/fommil/netlib-java) I tried 2 ways: 1. via *--conf* parameter /bin/spark-submit -v --class org.apache.spark.examples.mllib.LinearRegression *--conf -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so* examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt/ 2. via *--driver-java-options* parameter /bin/spark-submit -v *--driver-java-options -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so* --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt / How can I force spark-submit to propagate info about non-standard placement of openblas library to netlib-java lib? thanks, Tomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-and-Openblas-library-in-non-default-dir-tp20943.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: [MLLib] storageLevel in ALS
Which Spark version are you using? We made this configurable in 1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202 -Xiangrui On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. fot...@gmail.com wrote: Hi, I was doing some tests with ALS and I noticed that if I persist the inner RDDs from a MatrixFactorizationModel the RDD is not replicated; it seems like the storage level is hardcoded to MEMORY_AND_DISK. Do you think it makes sense to make that configurable? [image: Inline image 1]
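For reference, a minimal sketch of overriding the storage level on the RDD-based ALS estimator, assuming a Spark version (1.1+) that exposes setIntermediateRDDStorageLevel and an existing ratings RDD[Rating]:
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

// ratings: RDD[Rating] prepared elsewhere
val als = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_2) // replicate the intermediate RDDs
val model = als.run(ratings)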
Re: confidence/probability for prediction in MLlib
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789. In the new pipeline API, we can simply output two columns, one for the best predicted class, and the other for probabilities or confidence scores for each class. -Xiangrui On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, A while ago, somebody asked about getting a confidence value for a prediction with MLlib's implementation of Naive Bayes classification. I was wondering if there is any plan in the near future for the predict function to return both a label and a confidence/probability? Or could the private variables in the various machine learning models be exposed so we could write our own functions which return both? Having a confidence/probability could be very useful in real applications. For one thing, you can choose to trust the predicted label only if it has a high confidence level. Also, if you want to combine the results from multiple classifiers, the confidence/probability could be used as a weight when combining them. Thanks, Jianguo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
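To illustrate the pipeline behaviour Xiangrui describes, here is a rough sketch with the spark.ml LogisticRegression estimator (available from Spark 1.3); training and test are assumed to be DataFrames with "label" and "features" columns, and which classifiers expose a probability column depends on the Spark version:
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(50)
val model = lr.fit(training)
// "prediction" holds the best class, "probability" the per-class probabilities
model.transform(test).select("label", "prediction", "probability").show()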
Re: TF-IDF from spark-1.1.0 not working on cluster mode
Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote: Hi All, Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in local mode and not in distributed mode. A null pointer exception has been thrown. Is this a bug in spark-1.1.0? Following is the code:
def main(args: Array[String]) {
  val conf = new SparkConf
  val sc = new SparkContext(conf)
  val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext").map(_.split(" ").toSeq)
  val hashingTF = new HashingTF()
  val tf = hashingTF.transform(documents)
  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)
  val rdd = tfidf.map { vec => println("vector is " + vec); (10) }
  rdd.saveAsTextFile("/home/padma/usecase")
}
Exception thrown: 15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0 15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes) 15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes) 15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM 15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888] 15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130] 15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending 15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB) 15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB) 15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB) 15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB) 15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException: org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) java.lang.Thread.run(Thread.java:722) Thanks, Padma Ch - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib: feature standardization
`mean()` and `variance()` are not defined in `Vector`. You can use the mean and variance implementation from commons-math3 (http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html) if you don't want to implement them. -Xiangrui On Fri, Feb 6, 2015 at 12:50 PM, SK skrishna...@gmail.com wrote: Hi, I have a dataset in csv format and I am trying to standardize the features before using k-means clustering. The data does not have any labels but has the following format: s1, f12,f13,... s2, f21,f22,... where s is a string id, and f is a floating point feature value. To perform feature standardization, I need to compute the mean and variance/std deviation of the feature values in each element of the RDD (i.e. each row). However, the summary statistics library in Spark MLlib provides only a colStats() method that provides column-wise mean and variance. I tried to compute the mean and variance per row using the code below, but got a compilation error that there is no mean() or variance() method for a tuple or Vector object. Is there a Spark library to compute the row-wise mean and variance for an RDD, where each row (i.e. element) of the RDD is a Vector or tuple of N feature values? Thanks. My code for standardization is as follows:
// read the data
val data = sc.textFile(file_name).map(_.split(","))
// extract the features. For this example I am using only 2 features, but the data has more features
val features = data.map(d => Vectors.dense(d(1).toDouble, d(2).toDouble))
val std_features = features.map(f => {
  val fmean = f.mean() // Error: NO mean() for a Vector or Tuple object
  val fstd = scala.math.sqrt(f.variance()) // Error: NO variance() for a Vector or Tuple object
  for (i <- 0 to f.length) { // standardize the features
    var fs = 0.0
    if (fstd > 0.0) fs = (f(i) - fmean) / fstd
    fs
  }
})
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-feature-standardization-tp21539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
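As a rough sketch of the workaround Xiangrui suggests, the per-row statistics can be computed with commons-math3's DescriptiveStatistics (class and method names below are from commons-math3 3.x, and `features` is the RDD of Vectors from the question); note DescriptiveStatistics uses the sample standard deviation:
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.apache.spark.mllib.linalg.Vectors

val std_features = features.map { f =>
  val values = f.toArray
  val stats = new DescriptiveStatistics(values)
  val fmean = stats.getMean
  val fstd = stats.getStandardDeviation
  // standardize each feature; leave the row unchanged if its std deviation is 0
  if (fstd > 0.0) Vectors.dense(values.map(v => (v - fmean) / fstd)) else f
}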
Re: Number of goals to win championship
Logistic regression outputs probabilities if the data fits the model assumption. Otherwise, you might need to calibrate its output to read it correctly. You may be interested in reading this: http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/. We have isotonic regression implemented in Spark 1.3. Another problem with your input is that the dataset is too small. Try adding more points and see the result. Also, use LogisticRegressionWithLBFGS, which is better than the SGD implementation. -Xiangrui On Thu, Feb 5, 2015 at 10:40 AM, jvuillermet jeremy.vuiller...@gmail.com wrote: I want to find the minimum number of goals for a player that likely allows its team to win the championship. My data (goals, win/lose):
25 1
5 0
10 1
20 0
After some reading and courses, I think I need a logistic regression model for this data. I create my LabeledPoints with this data (1/0 being the label) and use val model = LogisticRegressionWithSGD.train(...) followed by model.clearThreshold(). I then try model.predict(Vectors.dense(10)) but don't understand the output. All the results are 0.5 and I'm not even sure how to use the predicted value. Am I using the right model? How do I read the predicted value? What more do I need to find the number of goals from which a team is likely to win the championship, or say it has a 3/4 chance of winning it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-goals-to-win-championship-tp21519.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
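To make the suggestion concrete, here is a rough sketch using LogisticRegressionWithLBFGS on the four points from the question; the tiny dataset is kept only for illustration, so the probabilities it produces will not be meaningful:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(25.0)),
  LabeledPoint(0.0, Vectors.dense(5.0)),
  LabeledPoint(1.0, Vectors.dense(10.0)),
  LabeledPoint(0.0, Vectors.dense(20.0))))

val model = new LogisticRegressionWithLBFGS()
  .setIntercept(true) // fit an intercept so the decision boundary is not forced through 0 goals
  .run(training)
model.clearThreshold() // return class-1 probabilities instead of 0/1 labels
println(model.predict(Vectors.dense(15.0)))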
Re: no option to add intercepts for StreamingLinearAlgorithm
No particular reason. We didn't add it in the first version. Let's add it in 1.4. -Xiangrui On Thu, Feb 5, 2015 at 3:44 PM, jamborta jambo...@gmail.com wrote: hi all, just wondering if there is a reason why it is not possible to add intercepts for streaming regression models? I understand that the run method in the underlying GeneralizedLinearModel does not take an intercept parameter either. Any reason for that? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/no-option-to-add-intercepts-for-StreamingLinearAlgorithm-tp21526.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
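As a stop-gap for batch training, the non-streaming estimator does let you turn on an intercept; a minimal sketch, assuming Spark 1.x's RDD-based API and an existing trainingData: RDD[LabeledPoint]:
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val lr = new LinearRegressionWithSGD()
lr.setIntercept(true) // fit an intercept term in addition to the weights
lr.optimizer.setNumIterations(100).setStepSize(0.01)
val model = lr.run(trainingData)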
Re: [MLlib] Performance issues when building GBM models
Could you check the Spark UI and see whether there are RDDs being kicked out during the computation? We cache the residual RDD after each iteration. If we don't have enough memory/disk, it gets recomputed, resulting in something like `t(n) = t(n-1) + const`. We might cache the features multiple times, which could be improved. -Xiangrui On Sun, Feb 8, 2015 at 5:32 PM, Christopher Thom christopher.t...@quantium.com.au wrote: Hi All, I wonder if anyone else has some experience building a Gradient Boosted Trees model using spark/mllib? I have noticed when building decent-size models that the process slows down over time. We observe that the time to build tree n is approximately a constant time longer than the time to build tree n-1, i.e. t(n) = t(n-1) + const. The implication is that the total build time goes as something like N^2, where N is the total number of trees. I would expect the algorithm to be approximately linear in total time (i.e. each boosting iteration takes roughly the same time to complete). So I have a couple of questions: 1. Is this behaviour expected, or consistent with what others are seeing? 2. Does anyone know if there are tuning parameters (e.g. in the boosting strategy or tree strategy) that may be impacting this? All aspects of the build seem to slow down as I go. Here's a random example culled from the logs, from the beginning and end of the model build: 15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count at DecisionTreeMetadata.scala:111, took 0.077957 s 15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished: count at DecisionTreeMetadata.scala:111, took 5.495166 s Any thoughts or advice, or even suggestions on where to dig for more info would be welcome. thanks chris Christopher Thom QUANTIUM Level 25, 8 Chifley, 8-12 Chifley Square Sydney NSW 2000 T: +61 2 8222 3577 F: +61 2 9292 6444 W: www.quantium.com.au www.linkedin.com/company/quantium www.facebook.com/QuantiumAustralia www.twitter.com/QuantiumAU The contents of this email, including attachments, may be confidential information. If you are not the intended recipient, any use, disclosure or copying of the information is unauthorised. If you have received this email in error, we would be grateful if you would notify us immediately by email reply, phone (+ 61 2 9292 6400) or fax (+ 61 2 9292 6444) and delete the message from your system. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
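For anyone wanting to experiment, a rough sketch of a GBT run with the training data explicitly cached, using the RDD-based tree API from Spark 1.2+; trainingData: RDD[LabeledPoint] is assumed, and caching the input alone may not remove the per-iteration growth Xiangrui describes, since the residual RDDs cached internally can still be evicted:
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.storage.StorageLevel

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 100
boostingStrategy.treeStrategy.maxDepth = 5

// keep the features in memory (spilling to disk) so each boosting iteration
// does not have to re-read or recompute the input
val cachedInput = trainingData.persist(StorageLevel.MEMORY_AND_DISK)
val model = GradientBoostedTrees.train(cachedInput, boostingStrategy)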