Re: MLLib Linear regression
Did you test different regularization parameters and step sizes? In the combination that works, I don't see A + D. Did you test that combination? Is there any linear dependency between A's columns and D's columns? -Xiangrui

On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak ssti...@live.com wrote:

BTW, one detail: When the number of iterations is 100, all weights are zero or below and the indices are only from set A. When the number of iterations is 150, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets; however, MSE is high (5.xxx) and the result does not match the domain knowledge. When the number of iterations is 400, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets; however, MSE is high (6.xxx) and the result does not match the domain knowledge. Any help will be highly appreciated.

From: ssti...@live.com
To: user@spark.apache.org
Subject: MLLib Linear regression
Date: Tue, 7 Oct 2014 13:41:03 -0700

Hi All,

I have the following classes of features:
class A: 15000 features
class B: 170 features
class C: 900 features
class D: 6000 features

I use linear regression (over sparse data). I get excellent results with low RMSE (~0.06) for the following combinations of classes:
1. A + B + C
2. B + C + D
3. A + B
4. A + C
5. B + D
6. C + D
7. D

Unfortunately, when I use A + B + C + D (all the features) I get results that don't make any sense -- all weights are zero or below and the indices are only from set A. I also get high MSE. I changed the number of iterations from 100 to 150, 250, or even 400. I still get an MSE of 5 to 6. Are there any other parameters that I can play with? Any insight on what could be wrong? Is it somehow not able to scale up to 22K features? (I highly doubt that.)
Re: MLLib Linear regression
The proper step size partially depends on the Lipschitz constant of the objective. You should let the machine try different combinations of parameters and select the best. We are working with people from AMPLab to make hyperparameter tuning easier in MLlib 1.2. For the theory, Nesterov's book Introductory Lectures on Convex Optimization is a good one. We didn't use line search in the current implementation of LinearRegression; we should definitely add that option in the future. Best, Xiangrui

On Wed, Oct 8, 2014 at 7:21 AM, Sameer Tilak ssti...@live.com wrote:

Hi Xiangrui, Changing the default step size to 0.01 made a huge difference. The results make sense when I use A + B + C + D. MSE is ~0.07 and the outcome matches the domain knowledge. I was wondering whether there is any documentation on the parameters and when/how to vary them.
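Until line search and automated tuning are available, a small grid over step sizes can be run by hand. Below is a minimal sketch, assuming the data is already parsed into LabeledPoints and split into hypothetical `training` and `validation` RDDs; the helper `mse` and the grid values are illustrative, not a Spark API.

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

def mse(model: LinearRegressionModel, data: RDD[LabeledPoint]): Double =
  data.map { p => val e = model.predict(p.features) - p.label; e * e }.mean()

val numIterations = 100
val stepSizes = Seq(1.0, 0.1, 0.01, 0.001)
val results = stepSizes.map { s =>
  // train(input, numIterations, stepSize, miniBatchFraction)
  val model = LinearRegressionWithSGD.train(training, numIterations, s, 1.0)
  (s, mse(model, validation))
}
val (bestStep, bestMse) = results.minBy(_._2)

The same loop can be widened to cover regularization by swapping in RidgeRegressionWithSGD or LassoWithSGD and iterating over regParam values as well.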
Re: DIMSUM item similarity tests
please re-try with --driver-memory 10g . The default is 256m. -Xiangrui On Thu, Oct 9, 2014 at 2:33 AM, Clive Cox clive@rummble.com wrote: Hi, I'm trying out the DIMSUM item similarity from github master commit 69c3f441a9b6e942d6c08afecd59a0349d61cc7b . My matrix is: Num items : 8860 Number of users : 5138702 Implicit 1.0 values Running item similarity with threshold :0.5 I have a 2 slave spark cluster on EC2 with m3.xlarge (13G each) I'm running out of heap space: Exception in thread handle-read-write-executor-1 java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.network.nio.Message$.create(Message.scala:90) while Spark is doing: org.apache.spark.rdd.RDD.reduce(RDD.scala:865) org.apache.spark.mllib.rdd.RDDFunctions.treeAggregate(RDDFunctions.scala:111) org.apache.spark.mllib.linalg.distributed.RowMatrix.computeColumnSummaryStatistics(RowMatrix.scala:379) org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:483) The spark UI said the shuffle read on this task at that point had used: 162.6 MB I run spark submit from the master like below: ./spark/bin/spark-submit --executor-memory 13G --master spark://ec2 Just wanted to check this is expected as the matrix doesn't seem excessively big. Is there some memory setting I am missing? Thanks, Clive - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLUtil.kfold generates overlapped training and validation set?
1. No.
2. The seed per partition is fixed, so it should generate non-overlapping subsets.
3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.

Best, Xiangrui

On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

Hi, all

When we use MLUtils.kFold to generate training and validation sets for cross validation, we found that there is an overlapping part in the two sets. From the code, it samples the same dataset twice:

@Experimental
def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
  val numFoldsF = numFolds.toFloat
  (1 to numFolds).map { fold =>
    val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
      complement = false)
    val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
    val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
    (training, validation)
  }.toArray
}

Even though the training sampler is the complement, there is still a possibility of generating overlapping training and validation sets, because the sampling method looks like:

override def sample(items: Iterator[T]): Iterator[T] = {
  items.filter { item =>
    val x = rng.nextDouble()
    (x >= lb && x < ub) ^ complement
  }
}

I'm not a machine learning guy, so I guess I must fall into one of the following three situations:
1. Does it mean we actually allow overlapping training and validation sets? (counterintuitive to me)
2. Did I misunderstand the code?
3. Is it a bug?

Anyone can explain it to me?

Best,

-- Nan Zhu
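For reference, a minimal sketch of using kFold for cross validation (on 1.0.1+/1.1, where the fixed per-partition seed makes the folds disjoint). The data-loading line is a placeholder, and LogisticRegressionWithSGD stands in for whatever model you actually train:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ... // your labeled data
val folds = MLUtils.kFold(data, numFolds = 3, seed = 42)
val errors = folds.map { case (training, validation) =>
  val model = LogisticRegressionWithSGD.train(training.cache(), 100)
  // 0/1 loss on the held-out fold
  validation.map { p =>
    if (model.predict(p.features) == p.label) 0.0 else 1.0
  }.mean()
}
println("mean CV error: " + errors.sum / errors.length)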
Re: TF-IDF in Spark 1.1.0
You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID, and join back the result after training. HashingTF can transform an individual record:

val docs: RDD[(String, Seq[String])] = ...
val tf = new HashingTF()
val tfWithId: RDD[(String, Vector)] = docs.mapValues(tf.transform)
...

Best, Xiangrui

On Tue, Oct 14, 2014 at 9:15 AM, Burke Webster burke.webs...@gmail.com wrote:

I'm following the MLlib example for TF-IDF and ran into a problem due to my lack of knowledge of Scala and Spark. Any help would be greatly appreciated.

Following the MLlib example I could do something like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF

val sc: SparkContext = ...
val documents: RDD[Seq[String]] = sc.textFile(...).map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

As a result I would have an RDD containing the TF-IDF vectors for the input documents. My question is how do I map the vector back to the original input document? My end goal is to compute document similarity using cosine similarity. From what I can tell, I can compute TF-IDF, apply the L2 norm, and then compute the dot-product. Has anybody done this?

Currently, my example looks more like this:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext

val sc: SparkContext = ...
// input is a sequence file of the form (docid: Text, content: Text)
val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[(String, Vector)] = hashingTF.??

I'm trying to maintain some linking from the document identifier to its eventual vector representation. Am I going about this incorrectly?

Thanks
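Putting the pieces together, here is a rough sketch of one way to keep document IDs through the whole TF-IDF pipeline and compute cosine similarity afterwards. It assumes IDF.fit/transform operate on a plain RDD[Vector] (as in 1.1), that the 1-to-1 derivation from `docs` makes zipping the keys back safe, and that HashingTF produced sparse vectors (its default); treat it as a starting point rather than the official recipe.

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.{SparseVector, Vector}
import org.apache.spark.rdd.RDD

val docs: RDD[(String, Seq[String])] = ... // (docId, tokens)
val tf: RDD[Vector] = new HashingTF().transform(docs.values).cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
// tf and tfidf are derived 1-to-1 (map) from docs, so the keys line up with the vectors
val tfidfById: RDD[(String, Vector)] = docs.keys.zip(tfidf)

// cosine similarity between two sparse vectors
def cosine(a: Vector, b: Vector): Double = {
  def dot(x: SparseVector, y: SparseVector): Double = {
    val yMap = y.indices.zip(y.values).toMap
    x.indices.zip(x.values).map { case (i, v) => v * yMap.getOrElse(i, 0.0) }.sum
  }
  val (sa, sb) = (a.asInstanceOf[SparseVector], b.asInstanceOf[SparseVector])
  dot(sa, sb) / math.sqrt(dot(sa, sa) * dot(sb, sb))
}

// e.g. all-pairs similarity; this is quadratic, so sample or filter first on a big corpus
val sims = tfidfById.cartesian(tfidfById).map {
  case ((id1, v1), (id2, v2)) => (id1, id2, cosine(v1, v2))
}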
Re: Spark KMeans hangs at reduceByKey / collectAsMap
What is the feature dimension? I saw you used 100 partitions. How many cores does your cluster have? -Xiangrui On Tue, Oct 14, 2014 at 1:51 PM, Ray ray-w...@outlook.com wrote: Hi guys, An interesting thing, for the input dataset which has 1.5 million vectors, if set the KMeans's k_value = 100 or k_value = 50, it hangs as mentioned above. However, if decrease k_value = 10, the same error still appears in the log but the application finished successfully, without observable hanging. Hopefully this provides more information. Thanks. Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16417.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?
LBFGS is better. If your data is easily separable, LR might return values very close or equal to either 0.0 or 1.0. It is rare, but it may happen. -Xiangrui

On Tue, Oct 14, 2014 at 3:18 PM, Aris arisofala...@gmail.com wrote:

Wow... I just tried LogisticRegressionWithLBFGS, and using clearThreshold() DOES IN FACT work. It appears that LogisticRegressionWithSGD returns a model whose method is broken!!

On Tue, Oct 14, 2014 at 3:14 PM, Aris arisofala...@gmail.com wrote:

Hi folks, When I am predicting binary 1/0 responses with LogisticRegressionWithSGD, it returns a LogisticRegressionModel. In Spark 1.0.X I was using the clearThreshold method on the model to get the raw predicted probabilities when I ran the predict() method... It appears now that rather than getting a realistic probability between 0.0 and 1.0, I am only getting back predictions of 0.0 OR 1.0... never anything in between. The API says that clearThreshold is experimental... it was working before! Is it broken now?

Thanks!
Aris
Re: Spark KMeans hangs at reduceByKey / collectAsMap
Just ran a test on mnist8m (8m x 784) with k = 100 and numIter = 50. It worked fine. Ray, the error log you posted is after cluster termination, which is not the root cause. Could you search your log and find the real cause? On the executor tab screenshot, I saw only 200MB is used. Did you cache the input data? If yes, could you check the storage tab of Spark WebUI and see how the data is distributed across executors. -Xiangrui On Tue, Oct 14, 2014 at 4:26 PM, DB Tsai dbt...@dbtsai.com wrote: I saw similar bottleneck in reduceByKey operation. Maybe we can implement treeReduceByKey to reduce the pressure on single executor reducing the particular key. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Oct 15, 2014 at 12:16 AM, Burak Yavuz bya...@stanford.edu wrote: Hi Ray, The reduceByKey / collectAsMap does a lot of calculations. Therefore it can take a very long time if: 1) The parameter number of runs is set very high 2) k is set high (you have observed this already) 3) data is not properly repartitioned It seems that it is hanging, but there is a lot of calculation going on. Did you use a different value for the number of runs? If you look at the storage tab, does the data look balanced among executors? Best, Burak - Original Message - From: Ray ray-w...@outlook.com To: u...@spark.incubator.apache.org Sent: Tuesday, October 14, 2014 2:58:03 PM Subject: Re: Spark KMeans hangs at reduceByKey / collectAsMap Hi Xiangrui, The input dataset has 1.5 million sparse vectors. Each sparse vector has a dimension(cardinality) of 9153 and has less than 15 nonzero elements. Yes, if I set num-executors = 200, from the hadoop cluster scheduler, I can see the application got 201 vCores. From the spark UI, I can see it got 201 executors (as shown below). http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_core.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_executor.png Thanks. Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16428.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark KMeans hangs at reduceByKey / collectAsMap
I used k-means||, which is the default. And it took less than 1 minute to finish. 50 iterations took less than 25 minutes on a cluster of 9 m3.2xlarge EC2 nodes. Which deploy mode did you use? Is it yarn-client? -Xiangrui On Tue, Oct 14, 2014 at 6:03 PM, Ray ray-w...@outlook.com wrote: Hi Xiangrui, Thanks for the guidance. I read the log carefully and found the root cause. KMeans, by default, uses KMeans++ as the initialization mode. According to the log file, the 70-minute hanging is actually the computing time of Kmeans++, as pasted below: 14/10/14 14:48:18 INFO DAGScheduler: Stage 20 (collectAsMap at KMeans.scala:293) finished in 2.233 s 14/10/14 14:48:18 INFO SparkContext: Job finished: collectAsMap at KMeans.scala:293, took 85.590020124 s 14/10/14 14:48:18 INFO ShuffleBlockManager: Could not find files for shuffle 5 for deleting 14/10/14 *14:48:18* INFO ContextCleaner: Cleaned shuffle 5 14/10/14 15:50:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/10/14 15:50:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS *14/10/14 15:54:36 INFO LocalKMeans: Local KMeans++ converged in 11 iterations. 14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913 seconds.* 14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at KMeans.scala:190 14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at KMeans.scala:190) 14/10/14 15:54:37 INFO DAGScheduler: Got job 16 (collectAsMap at KMeans.scala:190) with 100 output partitions (allowLocal=false) 14/10/14 15:54:37 INFO DAGScheduler: Final stage: Stage 22(collectAsMap at KMeans.scala:190) I now use random as the Kmeans initialization mode, and other confs remain the same. This time, it just finished quickly~~ In your test on mnis8m, did you use KMeans++ as initialization mode? How long it takes? Thanks again for your help. Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16450.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark1.0 principal component analysis
computePrincipalComponents returns a local matrix whose columns are the principal components, ordered by the amount of variance they explain. Each column vector is in the same feature space as the input feature vectors, so the i-th entry of a principal component corresponds to the i-th original feature. -Xiangrui

On Thu, Oct 16, 2014 at 2:39 AM, al123 ant.lay...@hotmail.co.uk wrote:

Hi, I don't think anybody answered this question...

fintis wrote: "How do I match the principal components to the actual features since there is some sorting?"

Would anybody be able to shed a little light on it since I too am struggling with this?

Many thanks!!
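As a rough sketch of what that looks like in code, assuming `mat` is the RowMatrix you ran PCA on; the key detail is that Matrix.toArray is stored column-major, so entry (i, j) sits at position j * numRows + i:

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ... // rows are the original feature vectors
val k = 3
val pc: Matrix = mat.computePrincipalComponents(k)
// pc has numFeatures rows and k columns; entry (i, j) is the loading of
// original feature i on principal component j
val values = pc.toArray
for (j <- 0 until k) {
  val loadings = (0 until pc.numRows).map(i => (i, values(j * pc.numRows + i)))
  val top = loadings.sortBy { case (_, w) => -math.abs(w) }.take(5)
  println("component " + j + ", top features by |loading|: " + top)
}
// project the original data onto the principal components
val projected: RowMatrix = mat.multiply(pc)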
Re: MLLib libsvm format
Yes — within each line the indices need to be sorted, i.e. "the indices are one-based and **in ascending order**". -Xiangrui

On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak ssti...@live.com wrote:

Hi All,

I have a question regarding the ordering of indices. The documentation says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order?

Sparse data

It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.

For example, I have indices ranging from 1 to 1000. Is this OK as a libsvm data file?

1 110:1.0 80:0.5 310:0.0
0 890:0.5 20:0.0 200:0.5 400:1.0 82:0.0

and so on, OR do I need to sort them as:

1 80:0.5 110:1.0 310:0.0
0 20:0.0 82:0.0 200:0.5 400:1.0 890:0.5
Re: create a Row Matrix
Please check out the example code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala -Xiangrui On Tue, Oct 21, 2014 at 5:34 AM, viola viola.wiersc...@siemens.com wrote: Hi, I am VERY new to spark and mllib and ran into a couple of problems while trying to reproduce some examples. I am aware that this is a very simple question but could somebody please give me an example - how to create a RowMatrix in scala with the following entries: [1 2 3 4]? I would like to apply an SVD on it. Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
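In the same spirit as that example, here is a minimal sketch, reading "[1 2 3 4]" as the 2x2 matrix with rows (1, 2) and (3, 4); adjust the rows if you meant a different shape:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)

val svd = mat.computeSVD(2, computeU = true)
println(svd.s)                          // singular values
println(svd.V)                          // right singular vectors (a local matrix)
svd.U.rows.collect().foreach(println)   // left singular vectors (a distributed RowMatrix)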
Re: Read a TextFile(1 record contains 4 lines) into a RDD
If your file is not very large, try

sc.wholeTextFiles(...).values.flatMap(_.split("\n").grouped(4).map(_.mkString("\n")))

-Xiangrui

On Sat, Oct 25, 2014 at 12:57 AM, Parthus peng.wei@gmail.com wrote:

Hi,

It might be a naive question, but I still wish that somebody could help me handle it. I have a text file in which every 4 lines represent a record. Since the SparkContext.textFile() API treats each line as a record, it does not fit my case. I know that the SparkContext.hadoopFile or newAPIHadoopFile API can read a file in an arbitrary format, but I do not know how to use them. I think that there must be some API which can easily solve this problem, but I am kind of a bad googler and cannot find it by myself online. Would it be possible for somebody to tell me how to use the API? I run Spark based on Hadoop 1.2.1 rather than Hadoop 2.x. I wish that I could get several lines of code which actually works, if possible.

Thanks very much.
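If the file is too large for wholeTextFiles (which loads each file as one string), an alternative sketch is to number the lines and group them in fours. It assumes line order is preserved by textFile over a single uncompressed file, which holds for the standard text input format; the path is a placeholder.

val lines = sc.textFile("hdfs:///path/to/records.txt")
val records = lines.zipWithIndex()
  .map { case (line, idx) => (idx / 4, (idx % 4, line)) }   // record id, position within record
  .groupByKey()
  .map { case (_, group) => group.toSeq.sortBy(_._1).map(_._2).mkString("\n") }
records.take(2).foreach(println)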
Re: deploying a model built in mllib
We are working on the pipeline features, which would make this procedure much easier in MLlib. This is still a WIP and the main JIRA is at: https://issues.apache.org/jira/browse/SPARK-1856 Best, Xiangrui On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Hello, I have been prototyping a text classification model that my company would like to eventually put into production. Our technology stack is currently Java based but we would like to be able to build our models in Spark/MLlib and then export something like a PMML file which can be used for model scoring in real-time. I have been using scikit learn where I am able to take the training data convert the text data into a sparse data format and then take the other features and use the dictionary vectorizer to do one-hot encoding for the other categorical variables. All of those things seem to be possible in mllib but I am still puzzled about how that can be packaged in such a way that the incoming data can be first made into feature vectors and then evaluated as well. Are there any best practices for this type of thing in Spark? I hope this is clear but if there are any confusions then please let me know. Thanks, Chirag - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0
Could you save the data before ALS and try to reproduce the problem? You might try reducing the number of partitions and not using Kryo serialization, just to narrow down the issue. -Xiangrui On Mon, Oct 27, 2014 at 1:29 PM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Burak. I always see this error. I'm running the CDH 5.2 version of Spark 1.1.0. I load my data from HDFS. By the time it hits the recommender it had gone through many spark operations. On Oct 27, 2014 4:03 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, I've come across this multiple times, but not in a consistent manner. I found it hard to reproduce. I have a jira for it: SPARK-3080 Do you observe this error every single time? Where do you load your data from? Which version of Spark are you running? Figuring out the similarities may help in pinpointing the bug. Thanks, Burak - Original Message - From: Ilya Ganelin ilgan...@gmail.com To: user user@spark.apache.org Sent: Monday, October 27, 2014 11:36:46 AM Subject: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0 Hello all - I am attempting to run MLLib's ALS algorithm on a substantial test vector - approx. 200 million records. I have resolved a few issues I've had with regards to garbage collection, KryoSeralization, and memory usage. I have not been able to get around this issue I see below however: java.lang. ArrayIndexOutOfBoundsException: 6106 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS. scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) I do not have any negative indices or indices that exceed Int-Max. 
I have partitioned the input data into 300 partitions and my Spark config is below:

.set("spark.executor.memory", "14g")
.set("spark.storage.memoryFraction", "0.8")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "MyRegistrator")
.set("spark.core.connection.ack.wait.timeout", "600")
.set("spark.akka.frameSize", "50")
.set("spark.yarn.executor.memoryOverhead", "1024")

Does anyone have any suggestions as to why I'm seeing the above error or how to get around it? It may be possible to upgrade to the latest version of Spark, but the mechanism for doing so in our environment isn't obvious yet.

-Ilya Ganelin
Re: How to import mllib.rdd.RDDFunctions into the spark-shell
FYI, there is a PR to make mllib.rdd.RDDFunctions public: https://github.com/apache/spark/pull/2907 -Xiangrui On Tue, Oct 28, 2014 at 5:18 AM, Yanbo Liang yanboha...@gmail.com wrote: Yes, it can import org.apache.spark.mllib.rdd.RDDFunctions but you can not use any method in this class or even new an object of this class. So I infer that if you import org.apache.spark.mllib.rdd.RDDFunctions._, it may call some method of that object. 2014-10-28 17:29 GMT+08:00 Stephen Boesch java...@gmail.com: HI Yanbo, That is not the issue: notice that importing the object is fine: scala import org.apache.spark.mllib.rdd.RDDFunctions import org.apache.spark.mllib.rdd.RDDFunctions scala import org.apache.spark.mllib.rdd.RDDFunctions._ console:11: error: object RDDFunctions in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import org.apache.spark.mllib.rdd.RDDFunctions._ It has to do with the implicits. 2014-10-28 2:25 GMT-07:00 Yanbo Liang yanboha...@gmail.com: Because that org.apache.spark.mllib.rdd.RDDFunctions._ is mllib private class, it can only be called by function in mllib. 2014-10-28 17:09 GMT+08:00 Stephen Boesch java...@gmail.com: I seem to recall there were some specific requirements on how to import the implicits. Here is the issue: scala import org.apache.spark.mllib.rdd.RDDFunctions._ console:10: error: object RDDFunctions in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import org.apache.spark.mllib.rdd.RDDFunctions._ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0
Hi Ilya, Let's move the discussion to the JIRA page. I saw couple users reporting this issue but I have never seen it myself. Best, Xiangrui On Tue, Oct 28, 2014 at 8:50 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi all - I've simplified the code so now I'm literally feeding in 200 million ratings directly to ALS.train. Nothing else is happening in the program. I've also tried with both the regular serializer and the KryoSerializer. With Kryo, I get the same ArrayIndex exceptions. With the regular serializer I get the following error stack: 14/10/28 10:43:14 WARN TaskSetManager: Lost task 119.0 in stage 10.0 (TID 2282, innovationdatanode07.cof.ds.capitalone.com): java.io.FileNotFoundException: /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1414016059040_0715/spark-local-20141028102246-dde7/06/shuffle_7_119_8 (No such file or directory) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192) org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67) org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) 14/10/28 10:43:14 INFO TaskSetManager: Starting task 119.1 in stage 10.0 (TID 2303, innovationdatanode07.cof.ds.capitalone.com, PROCESS_LOCAL, 5642 bytes) 14/10/28 10:43:14 WARN TaskSetManager: Lost task 119.1 in stage 10.0 (TID 2303, innovationdatanode07.cof.ds.capitalone.com): java.io.FileNotFoundException: /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1414016059040_0715/spark-local-20141028102246-dde7/23/shuffle_8_90_119 (No such file or directory) java.io.RandomAccessFile.open(Native Method) java.io.RandomAccessFile.init(RandomAccessFile.java:241) org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:93) org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:116) org.apache.spark.shuffle.FileShuffleBlockManager.getBytes(FileShuffleBlockManager.scala:190) org.apache.spark.storage.BlockManager.getLocalShuffleFromDisk(BlockManager.scala:361) org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1$$anonfun$apply$10.apply(BlockFetcherIterator.scala:208) org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1$$anonfun$apply$10.apply(BlockFetcherIterator.scala:208) org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.next(BlockFetcherIterator.scala:258) org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.next(BlockFetcherIterator.scala:77) . 
This is an issue I referenced in the past here: https://www.google.com/url?sa=trct=jq=esrc=ssource=webcd=1cad=rjauact=8ved=0CB4QFjAAurl=https%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fincubator-spark-user%2F201410.mbox%2F%253CCAM-S9zS-%2B-MSXVcohWEhjiAEKaCccOKr_N5e0HPXcNgnxZd%3DHw%40mail.gmail.com%253Eei=97FPVIfyCsbgsASL94CoDQusg=AFQjCNEQ6gUlwpr6KzlcZVd0sQeCSdjQgQsig2=Ne7pL_Z94wN4g9BwSutsXQ -Ilya Ganelin On Mon, Oct 27, 2014 at 6:12 PM, Xiangrui Meng men...@gmail.com wrote: Could you save the data before ALS and try to reproduce the problem? You might try reducing the number of partitions and not using Kryo serialization, just to narrow down the issue. -Xiangrui On Mon, Oct 27, 2014 at 1:29 PM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Burak. I always see this error. I'm running the CDH 5.2 version of Spark 1.1.0. I load my data from HDFS. By the time it hits the recommender it had gone through many spark operations. On Oct 27, 2014 4:03 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, I've come across this multiple times, but not in a consistent manner. I found it hard to reproduce. I have a jira for it: SPARK-3080 Do you observe this error every single time? Where do you load your data from? Which version of Spark are you running? Figuring out the similarities may help in pinpointing the bug. Thanks, Burak - Original Message
Re: issue on applying SVM to 5 million examples.
Did you cache the data and check the load balancing? How many features? Which API are you using: Scala, Java, or Python? -Xiangrui

On Thu, Oct 30, 2014 at 9:13 AM, Jimmy ji...@sellpoints.com wrote:

Watch the app manager, it should tell you what's running and taking a while... My guess is it's a distinct function on the data. J

Sent from my iPhone

On Oct 30, 2014, at 8:22 AM, peng xia toxiap...@gmail.com wrote:

Hi,

Previously we applied the SVM algorithm in MLlib to 5 million records (600 MB), and it took more than 25 minutes to finish. The Spark version we are using is 1.0 and we were running this program on a 4-node cluster. Each node has 4 CPU cores and 11 GB RAM. The 5 million records only have two distinct records (one positive and one negative); the others are all duplications. Does anyone have any idea why it takes so long on this small data?

Thanks, Best, Peng
Re: MLLib: libsvm - default value initialization
You can subtract 0.5 from all the stored (non-zero) values, so that an omitted index effectively stands for 0.5 instead of 0. -Xiangrui

On Wed, Oct 29, 2014 at 9:20 PM, Sameer Tilak ssti...@live.com wrote:

Hi All,

I have my sparse data in libsvm format.

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")

I am running linear regression. Let us say that my data has the following entry:

1 1:0 4:1

I think it will assume 0 for indices 2 and 3, right? I would like the default value to be 0.5 instead of 0. Is that possible? If not, I will have to switch to dense data and it will significantly increase the data size for me.
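A minimal sketch of that shift, assuming the stored values hold the actual feature values and you train a model with an intercept (so the constant shift of 0.5 per feature is absorbed); the file path comes from the original question:

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
// subtract 0.5 from every stored value; an omitted index now represents 0.5
val shifted = examples.map { p =>
  val sv = p.features.asInstanceOf[SparseVector]
  LabeledPoint(p.label, Vectors.sparse(sv.size, sv.indices, sv.values.map(_ - 0.5)))
}
// train with an intercept so the shift is absorbed, e.g.
// new LinearRegressionWithSGD().setIntercept(true).run(shifted)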
Re: issue on applying SVM to 5 million examples.
Then caching should solve the problem. Otherwise, it is just loading and parsing the data from disk for each iteration. -Xiangrui

On Thu, Oct 30, 2014 at 11:44 AM, peng xia toxiap...@gmail.com wrote:

Thanks for all your help. I think I didn't cache the data. My previous cluster has expired and I don't have a chance to check the load balance or app manager. Below is my code. There are 18 features for each record and I am using the Scala API.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import java.util.Calendar

object BenchmarkClassification {
  def main(args: Array[String]) {
    // Load and parse the data file
    val conf = new SparkConf()
      .setAppName("SVM")
      .set("spark.executor.memory", "8g")
      // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }
    val testData = sc.textFile(args(1))
    val testParsedData = testData.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x => x.toDouble)))
    }
    // Run training algorithm to build the model
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)
    // Evaluate model on training examples and compute training error
    // val labelAndPreds = testParsedData.map { point =>
    //   val prediction = model.predict(point.features)
    //   (point.label, prediction)
    // }
    // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
    // println("Training Error = " + trainErr)
    println(Calendar.getInstance().getTime())
  }
}

Thanks, Best, Peng
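Concretely, the only change needed in the code above is to cache the training RDD before handing it to SGD (a sketch; the count() is optional and just forces materialization up front):

val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()           // keep the parsed data in memory across the 20 SGD iterations
parsedData.count()  // optional: materialize once before training
val model = SVMWithSGD.train(parsedData, numIterations)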
Re: Prediction using Classification with text attributes in Apache Spark MLLib
This operation requires two transformers:

1) Indexer, which maps string features into categorical features
2) OneHotEncoder, which flattens categorical features into binary features

We are working on the new dataset implementation, so we can easily express those transformations. Sorry for the late reply! If you want a quick and dirty solution, you can try hashing:

val rdd: RDD[(Double, Array[String])] = ...
val training = rdd.mapValues { factors =>
  val indices = mutable.Set.empty[Int]
  factors.view.zipWithIndex.foreach { case (f, idx) =>
    indices += math.abs(f.## ^ idx) % 10
  }
  Vectors.sparse(10, indices.toSeq.map(x => (x, 1.0)))
}

It creates a training dataset with all binary features, with a chance of collision. You can use it in SVM, LR, or DecisionTree.

Best, Xiangrui

On Sun, Nov 2, 2014 at 9:20 AM, ashu ashutosh.triv...@iiitb.org wrote:

Hi,

Sorry to bounce back the old thread. What is the state now? Is this problem solved? How does Spark handle categorical data now?

Regards, Ashutosh
Re: is spark a good fit for sequential machine learning algorithms?
Many ML algorithms are sequential because they were not designed to be parallel. However, ML is not driven by algorithms in practice, but by data and applications. As datasets get bigger and bigger, some algorithms have been revised to work in parallel, like SGD and matrix factorization. MLlib tries to implement those scalable algorithms that can handle large-scale datasets.

That being said, even with sequential ML algorithms, Spark is helpful, because in practice we need to test multiple sets of parameters and select the best one. Though the algorithm is sequential, the training part is embarrassingly parallel. We can broadcast the whole dataset, and then train model 1 on node 1, model 2 on node 2, etc. Cross validation also falls into this category. -Xiangrui

On Mon, Nov 3, 2014 at 1:55 PM, ll duy.huynh@gmail.com wrote:

I'm struggling with implementing a few algorithms with Spark and hope to get help from the community.

Most of the machine learning algorithms today are sequential, while Spark is all about parallelism. It seems to me that using Spark doesn't actually help much, because in most cases you can't really parallelize a sequential algorithm. There must be some strong reasons why MLlib was created and so many people claim Spark is ideal for machine learning. What are those reasons? What are some specific examples of when and how to use Spark to implement sequential machine learning algorithms?

Any comment/feedback/answer is much appreciated. Thanks!
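A sketch of the "broadcast the data, train one model per task" pattern described above. `trainLocally` is a hypothetical placeholder for any single-machine, sequential learner (your own code, Breeze, etc.), not a Spark API, and the whole dataset must fit in each executor's memory for this to make sense:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// placeholder: train a model locally and return its validation error
def trainLocally(data: Seq[LabeledPoint], regParam: Double): Double = ???

val trainingRDD: RDD[LabeledPoint] = ... // your data
val localData = trainingRDD.collect().toSeq   // must fit in memory
val bcData = sc.broadcast(localData)

val params = Seq(0.001, 0.01, 0.1, 1.0)
val scores = sc.parallelize(params, params.size).map { reg =>
  (reg, trainLocally(bcData.value, reg))      // each task trains one model
}.collect()
val (bestParam, bestError) = scores.minBy(_._2)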
Re: pass unique ID to mllib algorithms pyspark
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address this issue. We carry over extra columns with training and prediction and then leverage on Spark SQL's execution plan optimization to decide which columns are really needed. For the current set of APIs, we can add `predictOnValues` to models, which carries over the input keys. StreamingKMeans and StreamingLinearRegression implement this method. -Xiangrui On Tue, Nov 4, 2014 at 2:30 AM, jamborta jambo...@gmail.com wrote: Hi all, There are a few algorithms in pyspark where the prediction part is implemented in scala (e.g. ALS, decision trees) where it is not very easy to manipulate the prediction methods. I think it is a very common scenario that the user would like to generate prediction for a datasets, so that each predicted value is identifiable (e.g. have a unique id attached to it). this is not possible in the current implementation as predict functions take a feature vector and return the predicted values where, I believe, the order is not guaranteed, so there is no way to join it back with the original data the predictions are generated from. Is there a way around this at the moment? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pass-unique-ID-to-mllib-algorithms-pyspark-tp18051.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: sparse x sparse matrix multiplication
local matrix-matrix multiplication or distributed? On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote: what is the best way to implement a sparse x sparse matrix multiplication with spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sparse-x-sparse-matrix-multiplication-tp18163.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Matrix multiplication in spark
We are working on distributed block matrices. The main JIRA is at: https://issues.apache.org/jira/browse/SPARK-3434 The goal is to support basic distributed linear algebra, (dense first and then sparse). -Xiangrui On Wed, Nov 5, 2014 at 12:23 AM, ll duy.huynh@gmail.com wrote: @sowen.. i am looking for distributed operations, especially very large sparse matrix x sparse matrix multiplication. what is the best way to implement this in spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-tp12562p18164.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: sparse x sparse matrix multiplication
You can use breeze for local sparse-sparse matrix multiplication and then define an RDD of sub-matrices RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix) and then use join and aggregateByKey to implement this feature, which is the same as in MapReduce. -Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
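A rough sketch of that idea, using reduceByKey rather than aggregateByKey for the final sum; it assumes your Breeze version supports * and + on CSCMatrix and that the block dimensions line up:

import breeze.linalg.CSCMatrix
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// blocks keyed by (blockRowId, blockColId); each block is a local sparse matrix
type Block = ((Int, Int), CSCMatrix[Double])

def multiply(a: RDD[Block], b: RDD[Block]): RDD[Block] = {
  val aByInner = a.map { case ((i, k), m) => (k, (i, m)) }   // key A blocks by their column-block id
  val bByInner = b.map { case ((k, j), m) => (k, (j, m)) }   // key B blocks by their row-block id
  aByInner.join(bByInner)
    .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }  // local sparse multiply per pair
    .reduceByKey(_ + _)                                         // sum partial products per output block
}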
Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe
Which Spark version did you use? Could you check the WebUI and attach the error message on executors? -Xiangrui On Wed, Nov 5, 2014 at 8:23 AM, rok rokros...@gmail.com wrote: yes, the training set is fine, I've verified it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182p18195.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MatrixFactorizationModel predict(Int, Int) API
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066

The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do a block-wise cross product and then find the top-k for each user.

Best, Xiangrui

On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com wrote:

model.recommendProducts can only be called from the master then? I have a set of 20% of the users on whom I am performing the test... the 20% of users are in an RDD... if I have to collect them all to the master node and then call model.recommendProducts, that's an issue...

Any idea how to optimize this so that we can calculate MAP statistics on large samples of data?

On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote:

ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map`. -Xiangrui

On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote:

I reproduced the problem in the mllib tests ALSSuite.scala using the following functions:

val arrayPredict = userProductsRDD.map { case (user, product) =>
  val recommendedProducts = model.recommendProducts(user, products)
  val productScore = recommendedProducts.find { x => x.product == product }
  require(productScore != None)
  productScore.get
}.collect

arrayPredict.foreach { elem =>
  if (allRatings.get(elem.user, elem.product) != elem.rating)
    fail("Prediction APIs don't match")
}

If the usage of model.recommendProducts is correct, the test fails with the same error I sent before...

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)

It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug...

Do I have to cache the models to make userFeatures.lookup(user).head work?

On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:

Was the user present in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui

On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote:

Hi,

I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head

In computeRmse, MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called, and in all the test cases that API has been used... I can perhaps refactor my code to do the same, but I was wondering whether people test the lookup(user) version of the code...

Do I need to cache the model to make it work? I think right now the default is STORAGE_AND_DISK...

Thanks. Deb
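For the MAP-style evaluation, one workaround with the current API is to score all held-out (user, product) pairs in bulk with MatrixFactorizationModel.predict(RDD[(Int, Int)]) instead of calling recommendProducts inside a closure (the model holds RDDs, so it cannot be shipped to executors). A sketch, where `model` is the trained MatrixFactorizationModel and `testRatings` stands in for the 20% sample:

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

val testRatings: RDD[Rating] = ... // held-out (user, product, rating) triples
val userProducts: RDD[(Int, Int)] = testRatings.map(r => (r.user, r.product))
val predictions: RDD[((Int, Int), Double)] =
  model.predict(userProducts).map(r => ((r.user, r.product), r.rating))
val ratesAndPreds = testRatings
  .map(r => ((r.user, r.product), r.rating))
  .join(predictions)
// ratesAndPreds pairs each true rating with the predicted score, keyed by (user, product)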
Re: Status of MLLib exporting models to PMML
Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello Spark and MLLib folks, So a common problem in the real world of using machine learning is that some data analysis use tools like R, but the more data engineers out there will use more advanced systems like Spark MLLib or even Python Scikit Learn. In the real world, I want to have a system where multiple different modeling environments can learn from data / build models, represent the models in a common language, and then have a layer which just takes the model and run model.predict() all day long -- scores the models in other words. It looks like the project openscoring.io and jpmml-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's boosted ensemble tree methods be representable in PMML? Thank you!! Aris - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib Decision Tress algorithm hangs, others fine
Could you provide more information? For example, spark version, dataset size (number of instances/number of features), cluster size, error messages from both the drive and the executor. -Xiangrui On Mon, Nov 10, 2014 at 11:28 AM, tsj tsj...@gmail.com wrote: Hello all, I have some text data that I am running different algorithms on. I had no problems with LibSVM and Naive Bayes on the same data, but when I run Decision Tree, the execution hangs in the middle of DecisionTree.trainClassifier(). The only difference from the example given on the site is that I am using 6 categories instead of 2, and the input is text that is transformed to labeled points using TF-IDF. It halts shortly after this log output: spark.SparkContext: Job finished: collect at DecisionTree.scala:1347, took 1.019579676 s Any ideas as to what could be causing this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tress-algorithm-hangs-others-fine-tp18515.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: scala.MatchError
I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, This is my Instrument java constructor. public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer; this.issuing = issuing; } I am trying to create javaschemaRDD JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, Instrument.class); Remarks: Instrument, Issue, Issuer, Issuing all are java classes distData is holding List Instrument I am getting the following error. Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: scala.MatchError: class sample.spark.test.Issue (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188) at org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90) at sample.spark.test.SparkJob.main(SparkJob.java:33) ... 5 more Please help me. Regards, Naveen. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLIB usage: BLAS dependency warning
Could you try jar tf on the assembly jar and grep netlib-native_system-linux-x86_64.so? -Xiangrui On Tue, Nov 11, 2014 at 7:11 PM, jpl jlefe...@soe.ucsc.edu wrote: Hi, I am having trouble using the BLAS libs with the MLLib functions. I am using org.apache.spark.mllib.clustering.KMeans (on a single machine) and running the Spark-shell with the kmeans example code (from https://spark.apache.org/docs/latest/mllib-clustering.html) which runs successfully but I get the following warning in the log: WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS I compiled spark 1.1.0 with mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Pnetlib-lgpl -DskipTests clean package If anyone could please clarify the steps to get the dependencies correctly installed and visible to spark (from https://spark.apache.org/docs/latest/mllib-guide.html), that would be greatly appreciated. Using yum, I installed blas.x86_64, lapack.x86_64, gcc-gfortran.x86_64, libgfortran.x86_64 and then downloaded Breeze and built that successfully with Maven. I verified that I do have /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3 present on the machine and ldconf -p shows these listed. I also tried adding /usr/lib/ to spark.executor.extraLibraryPath and I verified it is present in the Spark webUI environment tab. I downloaded and compiled jblas with mvn clean install, which creates jblas-1.2.4-SNAPSHOT.jar, and then also tried adding that to spark.executor.extraClassPath but I still get the same WARN message. Maybe there are a few simple steps that I am missing? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-usage-BLAS-dependency-warning-tp18660.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLIB usage: BLAS dependency warning
That means the -Pnetlib-lgpl option didn't work. Could you use sbt to build the assembly jar and see whether the .so file is inside the assembly jar? Which system and Java version are you using? -Xiangrui On Wed, Nov 12, 2014 at 2:22 PM, jpl jlefe...@soe.ucsc.edu wrote: Hi Xiangrui, thank you very much for your response. I looked for the .so as you suggested. It is not here: $ jar tf assembly/target/spark-assembly_2.10-1.1.0-dist/spark-assembly-1.1.0-hadoop2.4.0.jar | grep netlib-native_system-linux-x86_64.so or here: $ jar tf assembly/target/spark-assembly_2.10-1.1.0-dist/spark-mllib_2.10-1.1.0.jar | grep netlib-native_system-linux-x86_64.so However, I do find it here: $ jar tf /root/.m2/repository/com/github/fommil/netlib/netlib-native_system-linux-x86_64/1.1/netlib-native_system-linux-x86_64-1.1-natives.jar | grep netlib-native_system-linux-x86_64.so Am I not building it correctly? Should I just add the above jar to the Spark classpath (if so, where exactly do I add that, I tried adding to .extraClassPath but did not help)? Thanks a lot, jeff -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-usage-BLAS-dependency-warning-tp18660p18775.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: No module named pyspark - latest built
You need to use maven to include python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote: I have figured out that building the fat jar with sbt does not seem to included the pyspark scripts using the following command: sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean publish-local assembly however the maven command works OK: mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean package am I running the correct sbt command? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18787.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2
If you use the Kryo serializer, you need to register mutable.BitSet and Rating: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102 The JIRA was marked resolved because chill resolved the problem in v0.4.0 and we have this workaround. -Xiangrui On Fri, Nov 14, 2014 at 12:41 AM, aaronlin aaron...@kkbox.com wrote: Hi folks, Although SPARK-1977 says this problem is resolved in 1.0.2, I still hit it while running the script in AWS EC2 via spark-c2.py. I checked SPARK-1977 and found that twitter.chill resolves the problem in v0.4.0, not v0.3.6, but Spark depends on twitter.chill v0.3.6 according to the Maven page. For more information, you can check the following pages - https://github.com/twitter/chill - http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.0.2 Can anyone give me advice? Thanks -- Aaron Lin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
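For reference, a minimal sketch of the registration the linked MovieLensALS example performs (Scala; assumes the Spark 1.x MLlib ALS and the Kryo serializer, with class and property names following that example):

    import scala.collection.mutable
    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.recommendation.Rating
    import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

    // Register the classes ALS shuffles internally, as in the linked example.
    class ALSRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Rating])
        kryo.register(classOf[mutable.BitSet])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", classOf[KryoSerializer].getName)
      .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)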
Re: Client application that calls Spark and receives an MLlib model Scala Object and then predicts without Spark installed on hadoop
If Spark is not installed on the client side, you won't be able to deserialize the model. Instead of serializing the model object, you may serialize the model weights array and implement predict on the client side. -Xiangrui On Fri, Nov 14, 2014 at 2:54 PM, xiaoyan yu xiaoyan...@gmail.com wrote: I had the same need as those documented back to July archived at http://qnalist.com/questions/5013193/client-application-that-calls-spark-and-receives-an-mllib-model-scala-object-not-just-result. I wonder if anyone would like to share any successful stories. Thanks, Xiaoyan - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
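A rough sketch of that suggestion (assumes a linear model such as LinearRegressionModel; trainingData, the file name, and the predict helper are illustrative, not from the thread):

    import java.io.{FileOutputStream, ObjectOutputStream}
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // On the Spark side: train, then persist only the weights and the intercept.
    val model = LinearRegressionWithSGD.train(trainingData, 100)
    val out = new ObjectOutputStream(new FileOutputStream("model-weights.bin"))
    out.writeObject((model.weights.toArray, model.intercept))
    out.close()

    // On the client (no Spark on the classpath): a plain dot product plus intercept.
    def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double =
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept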
Re: repartition combined with zipWithIndex get stuck
This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev kat...@gmail.com wrote: Hi, I'm having trouble using both zipWithIndex and repartition. When I use them both, the following action will get stuck and won't return. I'm using spark 1.1.0. Those 2 lines work as expected: scala sc.parallelize(1 to 10).repartition(10).count() res0: Long = 10 scala sc.parallelize(1 to 10).zipWithIndex.count() res1: Long = 10 But this statement get stuck and doesn't return: scala sc.parallelize(1 to 10).zipWithIndex.repartition(10).count() 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at Option.scala:120 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at Option.scala:120) with 3 output partitions (allowLocal=false) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at Option.scala:120) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4 (ParallelCollectionRDD[7] at parallelize at console:13), which has no missing parents 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called with curMem=7616, maxMem=138938941 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1096.0 B, free 132.5 MB) Am I doing something wrong here or is it a bug? Is there some work around? Thanks, Lev. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: repartition combined with zipWithIndex get stuck
I think I understand where the bug is now. I created a JIRA (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR soon. -Xiangrui On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng men...@gmail.com wrote: This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev kat...@gmail.com wrote: Hi, I'm having trouble using both zipWithIndex and repartition. When I use them both, the following action will get stuck and won't return. I'm using spark 1.1.0. Those 2 lines work as expected: scala sc.parallelize(1 to 10).repartition(10).count() res0: Long = 10 scala sc.parallelize(1 to 10).zipWithIndex.count() res1: Long = 10 But this statement get stuck and doesn't return: scala sc.parallelize(1 to 10).zipWithIndex.repartition(10).count() 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at Option.scala:120 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at Option.scala:120) with 3 output partitions (allowLocal=false) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at Option.scala:120) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4 (ParallelCollectionRDD[7] at parallelize at console:13), which has no missing parents 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called with curMem=7616, maxMem=138938941 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1096.0 B, free 132.5 MB) Am I doing something wrong here or is it a bug? Is there some work around? Thanks, Lev. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: repartition combined with zipWithIndex get stuck
PR: https://github.com/apache/spark/pull/3291 . For now, here is a workaround: val a = sc.parallelize(1 to 10).zipWithIndex() a.partitions // call .partitions explicitly a.repartition(10).count() Thanks for reporting the bug! -Xiangrui On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng men...@gmail.com wrote: I think I understand where the bug is now. I created a JIRA (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR soon. -Xiangrui On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng men...@gmail.com wrote: This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev kat...@gmail.com wrote: Hi, I'm having trouble using both zipWithIndex and repartition. When I use them both, the following action will get stuck and won't return. I'm using spark 1.1.0. Those 2 lines work as expected: scala sc.parallelize(1 to 10).repartition(10).count() res0: Long = 10 scala sc.parallelize(1 to 10).zipWithIndex.count() res1: Long = 10 But this statement get stuck and doesn't return: scala sc.parallelize(1 to 10).zipWithIndex.repartition(10).count() 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at Option.scala:120 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at Option.scala:120) with 3 output partitions (allowLocal=false) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at Option.scala:120) 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List() 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4 (ParallelCollectionRDD[7] at parallelize at console:13), which has no missing parents 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called with curMem=7616, maxMem=138938941 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1096.0 B, free 132.5 MB) Am I doing something wrong here or is it a bug? Is there some work around? Thanks, Lev. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLIB KMeans Exception
How many features and how many partitions? You set kmeans_clusters to 1. If the feature dimension is large, it would be really expensive. You can check the WebUI and see task failures there. The stack trace you posted is from the driver. Btw, the total memory you have is 64GB * 10, so you can cache about 300GB of data under the default setting, which is not enough for the 1.2TB data you want to process. -Xiangrui On Thu, Nov 20, 2014 at 5:57 AM, Alan Prando a...@scanboo.com.br wrote: Hi Folks! I'm running a Python Spark job on a cluster with 1 master and 10 slaves (64G RAM and 32 cores each machine). This job reads a file with 1.2 terabytes and 1128201847 lines on HDFS and call Kmeans method as following: # SLAVE CODE - Reading features from HDFS def get_features_from_images_hdfs(self, timestamp): def shallow(lista): for row in lista: for col in row: yield col features = self.sc.textFile(hdfs://999.999.99:/FOLDER/) return features.map(lambda row: eval(row)[1]).mapPartitions(shallow) # SLAVE CODE - Extract centroids with Kmeans def extract_centroids_on_slaves(self, features, kmeans_clusters, kmeans_max_iterations, kmeans_mode): #Error line clusters = KMeans.train( features, kmeans_clusters, maxIterations=kmeans_max_iterations, runs=1, initializationMode=kmeans_mode ) return clusters.clusterCenters # MASTER CODE - Main features = get_features_from_images_hdfs(kwargs.get(timestamp)) kmeans_clusters = 1 kmeans_max_interations = 13 kmeans_mode = random centroids = extract_centroids_on_slaves( features, kmeans_clusters, kmeans_max_interations, kmeans_mode ) centroids_rdd = sc.parallelize(centroids) I'm getting the following exception when I call KMeans.train: 14/11/20 13:19:34 INFO TaskSetManager: Starting task 2539.0 in stage 0.0 (TID 2327, ip-172-31-7-120.ec2.internal, NODE_LOCAL, 1649 bytes) 14/11/20 13:19:34 WARN TaskSetManager: Lost task 2486.0 in stage 0.0 (TID 2257, ip-172-31-7-120.ec2.internal): java.io.IOException: Filesystem closed org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:765) org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:783) org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844) java.io.DataInputStream.read(DataInputStream.java:100) org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180) org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216) org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246) org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47) org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:220) org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:189) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1314) 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_4_2174 in memory on ip-172-31-7-121.ec2.internal:57211 (size: 5.3 MB, free: 23.0 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_2_2349 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 47.6 MB, free: 23.5 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_2_2386 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 46.0 MB, free: 23.5 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_2_2341 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 47.3 MB, free: 23.4 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_4_2279 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 5.1 MB, free: 23.4 GB) 14/11/20 13:19:34 INFO BlockManagerInfo: Added rdd_2_2324 in memory on ip-172-31-7-124.ec2.internal:56258 (size: 46.1 MB, free: 23.4 GB) 14/11/20 13:19:34 INFO TaskSetManager: Starting task 2525.0 in stage 0.0 (TID 2328,
Re: Python Logistic Regression error
The data is in LIBSVM format. So this line won't work: values = [float(s) for s in line.split(' ')] Please use the util function in MLUtils to load it as an RDD of LabeledPoint. http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point from pyspark.mllib.util import MLUtils examples = MLUtils.loadLibSVMFile(sc, data/mllib/sample_libsvm_data.txt) -Xiangrui On Sun, Nov 23, 2014 at 11:38 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: Can you please suggest sample data for running the logistic_regression.py? I am trying to use a sample data file at https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt I am running this on CDH5.2 Quickstart VM. [cloudera@quickstart mllib]$ spark-submit logistic_regression.py lr.txt 3 But, getting below error. 14/11/23 11:23:55 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job 14/11/23 11:23:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/11/23 11:23:55 INFO TaskSchedulerImpl: Cancelling stage 0 14/11/23 11:23:55 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296 Traceback (most recent call last): File /usr/lib/spark/examples/lib/mllib/logistic_regression.py, line 50, in module model = LogisticRegressionWithSGD.train(points, iterations) File /usr/lib/spark/python/pyspark/mllib/classification.py, line 110, in train initialWeights) File /usr/lib/spark/python/pyspark/mllib/_common.py, line 430, in _regression_train_wrapper initial_weights = _get_initial_weights(initial_weights, data) File /usr/lib/spark/python/pyspark/mllib/_common.py, line 415, in _get_initial_weights initial_weights = _convert_vector(data.first().features) File /usr/lib/spark/python/pyspark/rdd.py, line 1127, in first rs = self.take(1) File /usr/lib/spark/python/pyspark/rdd.py, line 1109, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /usr/lib/spark/python/pyspark/context.py, line 770, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.139.145): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /usr/lib/spark/python/pyspark/worker.py, line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File /usr/lib/spark/python/pyspark/serializers.py, line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /usr/lib/spark/python/pyspark/serializers.py, line 127, in dump_stream for obj in iterator: File /usr/lib/spark/python/pyspark/serializers.py, line 185, in _batched for item in iterator: File /usr/lib/spark/python/pyspark/rdd.py, line 1105, in takeUpToNumLeft yield next(iterator) File /usr/lib/spark/examples/lib/mllib/logistic_regression.py, line 37, in parsePoint values = [float(s) for s in line.split(' ')] ValueError: invalid literal for float(): 1:0.4551273600657362 Regards, Venkat This communication is the property of CenturyLink and may contain confidential or privileged information. 
Unauthorized use of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender by reply e-mail and destroy all copies of the communication and any attachments. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Mllib native netlib-java/OpenBLAS
Try building Spark with -Pnetlib-lgpl, which includes the JNI library in the Spark assembly jar. This is the simplest approach. If you want to include it as part of your project, make sure the library is inside the assembly jar or you specify it via `--jars` with spark-submit. -Xiangrui On Mon, Nov 24, 2014 at 8:51 AM, agg212 alexander_galaka...@brown.edu wrote: Hi, i'm trying to improve performance for Spark's Mllib, and I am having trouble getting native netlib-java libraries installed/recognized by Spark. I am running on a single machine, Ubuntu 14.04 and here is what I've tried: sudo apt-get install libgfortran3 sudo apt-get install libatlas3-base libopenblas-base (this is how netlib-java's website says to install it) I also double checked and it looks like the libraries are linked correctly in /usr/lib (see below): /usr/lib/libblas.so.3 - /etc/alternatives/libblas.so.3 /usr/lib/liblapack.so.3 - /etc/alternatives/liblapack.so.3 The Dependencies section on Spark's Mllib website also says to include com.github.fommil.netlib:all:1.1.2 as a dependency. I therefore tried adding this to my sbt file like so: libraryDependencies += com.github.fommil.netlib % all % 1.1.2 After all this, i'm still seeing the following error message. Does anyone have more detailed installation instructions? 14/11/24 16:49:29 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/11/24 16:49:29 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is spark streaming +MlLib for online learning?
In 1.2, we added streaming k-means: https://github.com/apache/spark/pull/2942 . -Xiangrui On Mon, Nov 24, 2014 at 5:25 PM, Joanne Contact joannenetw...@gmail.com wrote: Thank you Tobias! On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com wrote: I seemed to read somewhere that spark is still batch learning, but spark streaming could allow online learning. Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently can do online learning only for linear regression https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html#streaming-linear-regression, as far as I know. Tobias - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
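For reference, streaming linear regression (in MLlib since 1.1) looks roughly like this in Scala; ssc, the directories, and numFeatures are placeholders:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

    val trainingStream = ssc.textFileStream("/training/dir").map(LabeledPoint.parse)
    val testStream = ssc.textFileStream("/test/dir").map(LabeledPoint.parse)

    // The model is updated online as each training batch arrives.
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingStream)
    model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()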
Re: K-means clustering
There is a simple example here: https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py . You can take advantage of sparsity by computing the distance via inner products: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2 -Xiangrui On Tue, Nov 25, 2014 at 2:39 AM, amin mohebbi aminn_...@yahoo.com.invalid wrote: I have generated a sparse matrix by python, which has the size of 4000*174000 (.pkl), the following is a small part of this matrix : (0, 45) 1 (0, 413) 1 (0, 445) 1 (0, 107) 4 (0, 80) 2 (0, 352) 1 (0, 157) 1 (0, 191) 1 (0, 315) 1 (0, 395) 4 (0, 282) 3 (0, 184) 1 (0, 403) 1 (0, 169) 1 (0, 267) 1 (0, 148) 1 (0, 449) 1 (0, 241) 1 (0, 303) 1 (0, 364) 1 (0, 257) 1 (0, 372) 1 (0, 73) 1 (0, 64) 1 (0, 427) 1 : : (2, 399) 1 (2, 277) 1 (2, 229) 1 (2, 255) 1 (2, 409) 1 (2, 355) 1 (2, 391) 1 (2, 28) 1 (2, 384) 1 (2, 86) 1 (2, 285) 2 (2, 166) 1 (2, 165) 1 (2, 419) 1 (2, 367) 2 (2, 133) 1 (2, 61) 1 (2, 434) 1 (2, 51) 1 (2, 423) 1 (2, 398) 1 (2, 438) 1 (2, 389) 1 (2, 26) 1 (2, 455) 1 I am new in Spark and would like to cluster this matrix by k-means algorithm. Can anyone explain to me what kind of problems I might be faced. Please note that I do not want to use Mllib and would like to write my own k-means. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
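A sketch of that inner-product trick for a hand-written k-means (plain algebra, not MLlib's internal code): since ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, the distance computation only needs to touch the point's non-zeros once the squared norms are precomputed.

    // Squared Euclidean distance between a sparse point (indices/values) and a
    // dense center, touching only the point's non-zero entries.
    def sparseSquaredDistance(indices: Array[Int], values: Array[Double], xNormSq: Double,
                              center: Array[Double], cNormSq: Double): Double = {
      var dot = 0.0
      var i = 0
      while (i < values.length) {
        dot += values(i) * center(indices(i))
        i += 1
      }
      xNormSq - 2.0 * dot + cNormSq
    }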
Re: MlLib Colaborative filtering factors
It is data-dependent, and hence needs hyper-parameter tuning, e.g., grid search. The first batch is certainly expensive. But after you figure out a small range for each parameter that fits your data, following batches should be not that expensive. There is an example from AMPCamp: http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html -Xiangrui On Tue, Nov 25, 2014 at 4:28 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: HI, I am trying to execute Collaborative filtering using MlLib. Can somebody please suggest how to calculate the following 1. Rank 2. Iterations 3. Lambda I understand these are adjustment factors and they help reduce the MSE in turn defining accuracy of algorithm but then is it all hit and trial or is there a definitive way to calculate them? Thanks!! Regards, Saurabh Agrawal This e-mail, including accompanying communications and attachments, is strictly confidential and only for the intended recipient. Any retention, use or disclosure not expressly authorised by Markit is prohibited. This email is subject to all waivers and other terms at the following link: http://www.markit.com/en/about/legal/email-disclaimer.page Please visit http://www.markit.com/en/about/contact/contact-us.page? for contact information on our offices worldwide. MarkitSERV Limited has its registered office located at Level 4, Ropemaker Place, 25 Ropemaker Street, London, EC2Y 9LY and is authorized and regulated by the Financial Conduct Authority with registration number 207294 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
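A minimal sketch of such a grid search, in the spirit of the AMPCamp exercise (Scala; the parameter grids, the training/validation splits, and the computeRmse helper are illustrative assumptions):

    import org.apache.spark.mllib.recommendation.ALS

    val ranks = Seq(8, 12)
    val lambdas = Seq(0.1, 1.0, 10.0)
    val numIters = Seq(10, 20)

    // Train on the training split, score on a held-out validation split,
    // and keep the combination with the lowest RMSE.
    val results = for (rank <- ranks; lambda <- lambdas; iter <- numIters) yield {
      val model = ALS.train(training, rank, iter, lambda)
      ((rank, lambda, iter), computeRmse(model, validation))
    }
    val ((bestRank, bestLambda, bestIter), bestRmse) = results.minBy(_._2)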
Re: why MatrixFactorizationModel private?
Besides API stability concerns, models constructed directly from users rather than returned by ALS may not work well. The userFeatures and productFeatures are both with partitioners so we can perform quick lookup for prediction. If you save userFeatures and productFeatures and load them back, it is very likely the partitioning info is missing. That being said, we will try to address model export/import in v1.3: https://issues.apache.org/jira/browse/SPARK-4587 . -Xiangrui On Tue, Nov 25, 2014 at 8:26 AM, jamborta jambo...@gmail.com wrote: Hi all, seems that all the mllib models are declared accessible in the package, except MatrixFactorizationModel, which is declared private to mllib. Any reason why? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-MatrixFactorizationModel-private-tp19763.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: RMSE in MovieLensALS increases or stays stable as iterations increase.
The training RMSE may increase due to regularization. Squared loss only represents part of the global loss. If you watch the sum of the squared loss and the regularization, it should be non-increasing. -Xiangrui On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen so...@cloudera.com wrote: I also modified the example to try 1, 5, 9, ... iterations as you did, and also ran with the same default parameters. I used the sample_movielens_data.txt file. Is that what you're using? My result is: Iteration 1 Test RMSE = 1.426079653593016 Train RMSE = 1.5013155094216357 Iteration 5 Test RMSE = 1.405598012724468 Train RMSE = 1.4847078708333596 Iteration 9 Test RMSE = 1.4055990901261632 Train RMSE = 1.484713206769993 Iteration 13 Test RMSE = 1.4055990999738366 Train RMSE = 1.4847132332994588 Iteration 17 Test RMSE = 1.40559910003368 Train RMSE = 1.48471323345531 Iteration 21 Test RMSE = 1.4055991000342158 Train RMSE = 1.4847132334567061 Iteration 25 Test RMSE = 1.4055991000342174 Train RMSE = 1.4847132334567108 Train error is higher than test error, consistently, which could be underfitting. A higher rank=50 gets a reasonable result: Iteration 1 Test RMSE = 1.5981883186995312 Train RMSE = 1.4841671360432005 Iteration 5 Test RMSE = 1.5745145659678204 Train RMSE = 1.4672341345080382 Iteration 9 Test RMSE = 1.5745147110505406 Train RMSE = 1.4672385714907996 Iteration 13 Test RMSE = 1.5745147108258577 Train RMSE = 1.4672385929631868 Iteration 17 Test RMSE = 1.5745147108246424 Train RMSE = 1.4672385930428344 Iteration 21 Test RMSE = 1.5745147108246367 Train RMSE = 1.4672385930431973 Iteration 25 Test RMSE = 1.5745147108246367 Train RMSE = 1.467238593043199 I'm not sure what the difference is. I looked at your modifications and they seem very similar. Is it the data you're using? On Wed, Nov 26, 2014 at 3:34 PM, Kostas Kloudas kklou...@gmail.com wrote: For the training I am using the code in the MovieLensALS example with trainImplicit set to false and for the training RMSE I use the val rmseTr = computeRmse(model, training, params.implicitPrefs). The computeRmse() method is provided in the MovieLensALS class. Thanks a lot, Kostas On Nov 26, 2014, at 2:41 PM, Sean Owen so...@cloudera.com wrote: How are you computing RMSE? and how are you training the model -- not with trainImplicit right? I wonder if you are somehow optimizing something besides RMSE. On Wed, Nov 26, 2014 at 2:36 PM, Kostas Kloudas kklou...@gmail.com wrote: Once again, the error even with the training dataset increases. The results are: Running 1 iterations For 1 iter.: Test RMSE = 1.2447121194304893 Training RMSE = 1.2394166987104076 (34.751317636 s). Running 5 iterations For 5 iter.: Test RMSE = 1.3253957117600659 Training RMSE = 1.3206317416138509 (37.69311802304 s). Running 9 iterations For 9 iter.: Test RMSE = 1.3255293380139364 Training RMSE = 1.3207661218210436 (41.046175661 s). Running 13 iterations For 13 iter.: Test RMSE = 1.3255295352665748 Training RMSE = 1.3207663201865092 (47.763619515 s). Running 17 iterations For 17 iter.: Test RMSE = 1.32552953555787 Training RMSE = 1.3207663204794406 (59.68236110305 s). Running 21 iterations For 21 iter.: Test RMSE = 1.3255295355583026 Training RMSE = 1.3207663204798756 (57.210578232 s). Running 25 iterations For 25 iter.: Test RMSE = 1.325529535558303 Training RMSE = 1.3207663204798765 (65.785485882 s). 
Thanks a lot, Kostas On Nov 26, 2014, at 12:04 PM, Nick Pentreath nick.pentre...@gmail.com wrote: copying user group - I keep replying directly vs reply all :) On Wed, Nov 26, 2014 at 2:03 PM, Nick Pentreath nick.pentre...@gmail.com wrote: ALS will be guaranteed to decrease the squared error (therefore RMSE) in each iteration, on the training set. This does not hold for the test set / cross validation. You would expect the test set RMSE to stabilise as iterations increase, since the algorithm converges - but not necessarily to decrease. On Wed, Nov 26, 2014 at 1:57 PM, Kostas Kloudas kklou...@gmail.com wrote: Hi all, I am getting familiarized with Mllib and a thing I noticed is that running the MovieLensALS example on the movieLens dataset for increasing number of iterations does not decrease the rmse. The results for 0.6% training set and 0.4% test are below. For training set to 0.8%, the results are almost identical. Shouldn’t it be normal to see a decreasing error? Especially going from 1 to 5 iterations. Running 1 iterations Test RMSE for 1 iter. = 1.2452964343277886 (52.75712592704 s). Running 5 iterations Test RMSE for 5 iter. = 1.3258973764470259 (61.183927666 s). Running 9 iterations Test RMSE for 9 iter. = 1.3260308117704385 (61.8494887581 s). Running 13 iterations Test RMSE for 13 iter. = 1.3260310099809915 (73.799510125 s). Running 17 iterations Test RMSE for 17 iter. = 1.3260310102735398 (77.5651218531 s). Running 21 iterations Test RMSE for 21 iter.
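For reference, the quantity that is guaranteed non-increasing on the training set is the full regularized objective, not the squared error alone. With the weighted-lambda regularization ALS uses, a sketch of that objective (an assumption based on the standard formulation, not quoted from the thread):

\[
\min_{U,V} \sum_{(u,i) \in \Omega} \left( r_{ui} - u_u^\top v_i \right)^2
  + \lambda \left( \sum_u n_u \lVert u_u \rVert^2 + \sum_i m_i \lVert v_i \rVert^2 \right)
\]

where n_u and m_i count the ratings of user u and item i. The squared-error term (and hence the RMSE) can go up between iterations while this objective still goes down.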
Re: ALS failure with size Integer.MAX_VALUE
Hi Bharath, You can try setting a small item blocks in this case. 1200 is definitely too large for ALS. Please try 30 or even smaller. I'm not sure whether this could solve the problem because you have 100 items connected with 10^8 users. There is a JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-3735 which I will try to implement in 1.3. I'll ping you when it is ready. Best, Xiangrui On Tue, Dec 2, 2014 at 10:40 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Yes, the issue appears to be due to the 2GB block size limitation. I am hence looking for (user, product) block sizing suggestions to work around the block size limitation. On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: (It won't be that, since you see that the error occur when reading a block from disk. I think this is an instance of the 2GB block size limitation.) On Sun, Nov 30, 2014 at 4:36 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi Bharath – I’m unsure if this is your problem but the MatrixFactorizationModel in MLLIB which is the underlying component for ALS expects your User/Product fields to be integers. Specifically, the input to ALS is an RDD[Rating] and Rating is an (Int, Int, Double). I am wondering if perhaps one of your identifiers exceeds MAX_INT, could you write a quick check for that? I have been running a very similar use case to yours (with more constrained hardware resources) and I haven’t seen this exact problem but I’m sure we’ve seen similar issues. Please let me know if you have other questions. From: Bharath Ravi Kumar reachb...@gmail.com Date: Thursday, November 27, 2014 at 1:30 PM To: user@spark.apache.org user@spark.apache.org Subject: ALS failure with size Integer.MAX_VALUE We're training a recommender with ALS in mllib 1.1 against a dataset of 150M users and 4.5K items, with the total number of training records being 1.2 Billion (~30GB data). The input data is spread across 1200 partitions on HDFS. For the training, rank=10, and we've configured {number of user data blocks = number of item data blocks}. The number of user/item blocks was varied between 50 to 1200. Irrespective of the block size (e.g. at 1200 blocks each), there are atleast a couple of tasks that end up shuffle reading 9.7G each in the aggregate stage (ALS.scala:337) and failing with the following exception: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108) at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124) at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
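A sketch of setting the block count explicitly (Scala; rank = 10 and 30 blocks come from the thread, lambda and the iteration count are illustrative, and ratings is assumed to be an RDD[Rating]):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // In 1.1, setBlocks controls both the user and the product block counts.
    val model = new ALS()
      .setRank(10)
      .setIterations(10)
      .setLambda(0.01)
      .setBlocks(30)
      .run(ratings)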
Re: printing mllib.linalg.vector
You can use the default toString method to get the string representation. If you want to customize the output, check the indices/values fields. -Xiangrui On Fri, Dec 5, 2014 at 7:32 AM, debbie debbielarso...@hotmail.com wrote: Basic question: What is the best way to loop through one of these and print their components? Convert them to an array? Thanks Deb - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
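A small sketch of both options (assumes org.apache.spark.mllib.linalg vectors):

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

    def show(v: Vector): Unit = {
      println(v.toString)   // the default string representation
      v match {
        // Sparse vectors store only the non-zeros in the indices/values fields.
        case sv: SparseVector =>
          sv.indices.zip(sv.values).foreach { case (i, x) => println(s"$i -> $x") }
        // Dense vectors expose all components via the values field (or toArray).
        case dv: DenseVector =>
          dv.values.zipWithIndex.foreach { case (x, i) => println(s"$i -> $x") }
      }
    }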
Re: MLlib(Logistic Regression) + Spark Streaming.
If you want to train offline and predict online, you can use the current LR implementation to train a model and then apply model.predict on the dstream. -Xiangrui On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote: I am new to spark. Lets say i want to develop a machine learning model. which trained on normal method in MLlib. I want to use that model with classifier Logistic regression and predict the streaming data coming from a file or socket. Streaming data - Logistic Regression - binary label prediction. Is it possible? since there is no streaming logistic regression algo like streaming linear regression. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Logistic-Regression-Spark-Streaming-tp20564.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
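A rough sketch of that train-offline/predict-online pattern (Scala; the stream source, parsing, and iteration count are placeholders, and ssc is an existing StreamingContext):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Offline: train once on a static RDD[LabeledPoint].
    val model = LogisticRegressionWithSGD.train(trainingData, 100)

    // Online: parse each incoming line into a feature vector and predict a 0/1 label.
    val predictions = ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .map(features => model.predict(features))
    predictions.print()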
Re: MLLIb: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException
Is it possible that after filtering the feature dimension changed? This may happen if you use LIBSVM format but didn't specify the number of features. -Xiangrui On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I was able to run LinearRegressionwithSGD for a largeer dataset ( 2GB sparse). I have now filtered the data and I am running regression on a subset of it (~ 200 MB). I see this error, which is strange since it was running fine with the superset data. Is this a formatting issue (which I doubt) or is this some other issue in data preparation? I confirmed that there is no empty line in my dataset. Any help with this will be highly appreciated. 14/12/08 20:32:03 WARN TaskSetManager: Lost TID 5 (task 3.0:1) 14/12/08 20:32:03 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: 150323 at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:231) at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:216) at breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60) at breeze.linalg.VectorOps$$anon$178.apply(Vector.scala:391) at breeze.linalg.NumericOps$class.dot(NumericOps.scala:83) at breeze.linalg.DenseVector.dot(DenseVector.scala:47) at org.apache.spark.mllib.optimization.LeastSquaresGradient.compute(Gradient.scala:125) at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:180) at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:179) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838) at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838) at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116) at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
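A sketch of pinning the dimension when loading LIBSVM data, so a filtered subset keeps the same vector size as the full data (the path and the exact count are placeholders; the count must be at least as large as the highest feature index in the original data):

    import org.apache.spark.mllib.util.MLUtils

    val numFeatures = 150324   // placeholder: must cover every feature index in the full data
    val subset = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/filtered.libsvm", numFeatures)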
Re: Why KMeans with mllib is so slow ?
Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai dbt...@dbtsai.com wrote: You just need to use the latest master code without any configuration to get performance improvement from my PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: After some investigation, I learned that I can't compare kmeans in mllib with another kmeans implementation directly. The kmeans|| initialization step takes more time than the algorithm implemented in julia for example. There is also the ability to run multiple runs of kmeans algorithm in mllib even by default the number of runs is 1. DB Tsai can you please tell me the configuration you took for the improvement you mention in your pull request. I'd like to run the same benchmark on mnist8m on my computer. Cheers; On Fri, Dec 5, 2014 at 10:34 PM, DB Tsai dbt...@dbtsai.com wrote: Also, are you using the latest master in this experiment? A PR merged into the master couple days ago will spend up the k-means three times. See https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: The code is really simple : object TestKMeans { def main(args: Array[String]) { val conf = new SparkConf() .setAppName(Test KMeans) .setMaster(local[8]) .set(spark.executor.memory, 8g) val sc = new SparkContext(conf) val numClusters = 500; val numIterations = 2; val data = sc.textFile(sample.csv).map(x = Vectors.dense(x.split(',').map(_.toDouble))) data.cache() val clusters = KMeans.train(data, numClusters, numIterations) println(clusters.clusterCenters.size) val wssse = clusters.computeCost(data) println(serror : $wssse) } } For the testing purpose, I was generating a sample random data with julia and store it in a csv file delimited by comma. The dimensions is 248000 x 384. In the target application, I will have more than 248k data to cluster. On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu dav...@databricks.com wrote: Could you post you script to reproduce the results (also how to generate the dataset)? That will help us to investigate it. On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, here I use spark on local mode on my laptop with 8 cores. The data is on my local filesystem. Event thought, there an overhead due to the distributed computation, I found the difference between the runtime of the two implementations really, really huge. Is there a benchmark on how well the algorithm implemented in mllib performs ? On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen so...@cloudera.com wrote: Spark has much more overhead, since it's set up to distribute the computation. Julia isn't distributed, and so has no such overhead in a completely in-core implementation. You generally use Spark when you have a problem large enough to warrant distributing, or, your data already lives in a distributed store like HDFS. But it's also possible you're not configuring the implementations the same way, yes. There's not enough info here really to say. On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm trying to a run clustering with kmeans algorithm. The size of my data set is about 240k vectors of dimension 384. 
Solving the problem with the kmeans available in julia (kmean++) http://clusteringjl.readthedocs.org/en/latest/kmeans.html take about 8 minutes on a single core. Solving the same problem with spark kmean|| take more than 1.5 hours with 8 cores Either they don't implement the same algorithm either I don't understand how the kmeans in spark works. Is my data not big enough to take full advantage of spark ? At least, I expect to the same runtime. Cheers, Jao - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Stack overflow Error while executing spark SQL
Could you post the full stacktrace? It seems to be some recursive call in parsing. -Xiangrui On Tue, Dec 9, 2014 at 7:44 PM, jishnu.prat...@wipro.com wrote: Hi I am getting Stack overflow Error Exception in main java.lang.stackoverflowerror scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) while executing the following code sqlContext.sql(SELECT text FROM tweetTable LIMIT 10).collect().foreach(println) The complete code is from github https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala import com.google.gson.{GsonBuilder, JsonParser} import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.mllib.clustering.KMeans /** * Examine the collected tweets and trains a model based on them. */ object ExamineAndTrain { val jsonParser = new JsonParser() val gson = new GsonBuilder().setPrettyPrinting().create() def main(args: Array[String]) { // Process program arguments and set properties /*if (args.length 3) { System.err.println(Usage: + this.getClass.getSimpleName + tweetInput outputModelDir numClusters numIterations) System.exit(1) } * */ val outputModelDir=C:\\MLModel val tweetInput=C:\\MLInput val numClusters=10 val numIterations=20 //val Array(tweetInput, outputModelDir, Utils.IntParam(numClusters), Utils.IntParam(numIterations)) = args val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster(local[4]) val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) // Pretty print some of the tweets. val tweets = sc.textFile(tweetInput) println(Sample JSON Tweets---) for (tweet - tweets.take(5)) { println(gson.toJson(jsonParser.parse(tweet))) } val tweetTable = sqlContext.jsonFile(tweetInput).cache() tweetTable.registerTempTable(tweetTable) println(--Tweet table Schema---) tweetTable.printSchema() println(Sample Tweet Text-) sqlContext.sql(SELECT text FROM tweetTable LIMIT 10).collect().foreach(println) println(--Sample Lang, Name, text---) sqlContext.sql(SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000).collect().foreach(println) println(--Total count by languages Lang, count(*)---) sqlContext.sql(SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25).collect.foreach(println) println(--- Training the model and persist it) val texts = sqlContext.sql(SELECT text from tweetTable).map(_.head.toString) // Cache the vectors RDD since it will be used for all the KMeans iterations. val vectors = texts.map(Utils.featurize).cache() vectors.count() // Calls an action on the RDD to populate the vectors cache. 
val model = KMeans.train(vectors, numClusters, numIterations) sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir) val some_tweets = texts.take(100) println(Example tweets from the clusters) for (i - 0 until numClusters) { println(s\nCLUSTER $i:) some_tweets.foreach { t = if (model.predict(Utils.featurize(t)) == i) { println(t) } } } } } Thanks Regards Jishnu Menath Prathap - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Building Desktop application for ALS-MlLib/ Training ALS
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: Hi, I am a new bee in spark and scala world I have been trying to implement Collaborative filtering using MlLib supplied out of the box with Spark and Scala I have 2 problems 1. The best model was trained with rank = 20 and lambda = 5.0, and numIter = 10, and its RMSE on the test set is 25.718710831912485. The best model improves the baseline by 18.29%. Is there a scientific way in which RMSE could be brought down? What is a descent acceptable value for RMSE? The grid search approach used in the AMPCamp tutorial is pretty standard. Whether an RMSE is good or not really depends on your dataset. 2. I picked up the Collaborative filtering algorithm from http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html and executed the given code with my dataset. Now, I want to build a desktop application around it. a. What is the best language to do this Java/ Scala? Any possibility to do this using C#? We support Java/Scala/Python. Start with the one your are most familiar with. C# is not supported. b. Can somebody please share any relevant documents/ source or any helper links to help me get started on this? For ALS, you can check the API documentation. Your help is greatly appreciated Thanks!! Regards, Saurabh Agrawal This e-mail, including accompanying communications and attachments, is strictly confidential and only for the intended recipient. Any retention, use or disclosure not expressly authorised by Markit is prohibited. This email is subject to all waivers and other terms at the following link: http://www.markit.com/en/about/legal/email-disclaimer.page Please visit http://www.markit.com/en/about/contact/contact-us.page? for contact information on our offices worldwide. MarkitSERV Limited has its registered office located at Level 4, Ropemaker Place, 25 Ropemaker Street, London, EC2Y 9LY and is authorized and regulated by the Financial Conduct Authority with registration number 207294 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: ALS failure with size Integer.MAX_VALUE
Unfortunately, it will depends on the Sorter API in 1.2. -Xiangrui On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Xiangrui, The block size limit was encountered even with reduced number of item blocks as you had expected. I'm wondering if I could try the new implementation as a standalone library against a 1.1 deployment. Does it have dependencies on any core API's in the current master? Thanks, Bharath On Wed, Dec 3, 2014 at 10:10 PM, Bharath Ravi Kumar reachb...@gmail.com wrote: Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And yes, I've been following the JIRA for the new ALS implementation. I'll try it out when it's ready for testing. . On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, You can try setting a small item blocks in this case. 1200 is definitely too large for ALS. Please try 30 or even smaller. I'm not sure whether this could solve the problem because you have 100 items connected with 10^8 users. There is a JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-3735 which I will try to implement in 1.3. I'll ping you when it is ready. Best, Xiangrui On Tue, Dec 2, 2014 at 10:40 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Yes, the issue appears to be due to the 2GB block size limitation. I am hence looking for (user, product) block sizing suggestions to work around the block size limitation. On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: (It won't be that, since you see that the error occur when reading a block from disk. I think this is an instance of the 2GB block size limitation.) On Sun, Nov 30, 2014 at 4:36 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi Bharath – I’m unsure if this is your problem but the MatrixFactorizationModel in MLLIB which is the underlying component for ALS expects your User/Product fields to be integers. Specifically, the input to ALS is an RDD[Rating] and Rating is an (Int, Int, Double). I am wondering if perhaps one of your identifiers exceeds MAX_INT, could you write a quick check for that? I have been running a very similar use case to yours (with more constrained hardware resources) and I haven’t seen this exact problem but I’m sure we’ve seen similar issues. Please let me know if you have other questions. From: Bharath Ravi Kumar reachb...@gmail.com Date: Thursday, November 27, 2014 at 1:30 PM To: user@spark.apache.org user@spark.apache.org Subject: ALS failure with size Integer.MAX_VALUE We're training a recommender with ALS in mllib 1.1 against a dataset of 150M users and 4.5K items, with the total number of training records being 1.2 Billion (~30GB data). The input data is spread across 1200 partitions on HDFS. For the training, rank=10, and we've configured {number of user data blocks = number of item data blocks}. The number of user/item blocks was varied between 50 to 1200. Irrespective of the block size (e.g. 
at 1200 blocks each), there are atleast a couple of tasks that end up shuffle reading 9.7G each in the aggregate stage (ALS.scala:337) and failing with the following exception: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108) at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124) at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space
Hi Jay, Please try increasing executor memory (if the available memory is more than 2GB) and reduce numBlocks in ALS. The current implementation stores all subproblems in memory and hence the memory requirement is significant when k is large. You can also try reducing k and see whether the problem is still there. I made a PR that improves the ALS implementation, which generates subproblems one by one. You can try that as well. https://github.com/apache/spark/pull/3720 Best, Xiangrui On Wed, Dec 17, 2014 at 6:57 PM, buring qyqb...@gmail.com wrote: I am not sure this can help you. I have 57 million rating,about 4million user and 4k items. I used 7-14 total-executor-cores,executal-memory 13g,cluster have 4 nodes,each have 4cores,max memory 16g. I found set as follows may help avoid this problem: conf.set(spark.shuffle.memoryFraction,0.65) //default is 0.2 conf.set(spark.storage.memoryFraction,0.3)//default is 0.6 I have to set rank value under 40, otherwise occure this problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-java-lang-OutOfMemoryError-Java-heap-space-tp20584p20755.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Announcing Spark Packages
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. I’d like to invite you to contribute and use Spark Packages and provide feedback! As a disclaimer: Spark Packages is a community index maintained by Databricks and (by design) will include packages outside of the ASF Spark project. We are excited to help showcase and support all of the great work going on in the broader Spark community! Cheers, Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Interpreting MLLib's linear regression o/p
Did you check the indices in the LIBSVM data and the master file? Do they match? -Xiangrui On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote: Hi All, I use LIBSVM format to specify my input feature vector, which uses 1-based indices. When I run regression the output is 0-indexed. I have a master lookup file that maps these indices back to what they stand for. However, I need to add an offset of 2 and not 1 to the regression outcome during the mapping. So for example to map the index of 800 from the regression output file, I look for 802 in my master lookup file and then things make sense. I can understand adding an offset of 1, but not sure why adding an offset of 2 is working fine. Have others seen something like this as well? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
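MLUtils.loadLibSVMFile shifts the 1-based LIBSVM indices down by one internally, so weight position i should map back to LIBSVM feature i + 1; if an offset of 2 is needed, the lookup file itself is probably shifted by one. A tiny sketch of the expected mapping (assumes a trained linear model):

    // Weight at 0-based position i corresponds to 1-based LIBSVM feature i + 1.
    model.weights.toArray.zipWithIndex.foreach { case (w, i) =>
      println(s"libsvm feature ${i + 1} -> weight $w")
    }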
Re: MLLib beginner question
How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote: Hi! I want to try out spark mllib in my spark project, but I got a little problem. I have training data (external file), but the real data comes from another RDD. How can I do that? I tried simply using the same SparkContext to build both RDDs (first I create an RDD using sc.textFile() and then NaiveBayes.train...). After that I want to fetch the real data using the same context and call predict inside the map. But my application never exits (I think it is stuck or something). Why doesn't this solution work? Thanks b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
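As a rough illustration, a sketch of training on one RDD and predicting on another within a single SparkContext (the file path, the otherRdd input, and the toFeatures helper are hypothetical):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// training file with lines like "label,f1 f2 f3" -- hypothetical format
val training = sc.textFile("training.txt").map { line =>
  val parts = line.split(",")
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
}.cache()
val model = NaiveBayes.train(training, lambda = 1.0)

// the "real" data comes from another RDD built with the same SparkContext
val predictions = otherRdd.map(record => model.predict(toFeatures(record))) // toFeatures is hypothetical
predictions.take(10).foreach(println)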
Re: MLlib + Streaming
We have streaming linear regression (since v1.1) and k-means (v1.2) in MLlib. You can check the user guide: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering -Xiangrui On Tue, Dec 23, 2014 at 10:01 AM, Gianmarco De Francisci Morales g...@apache.org wrote: Hi, I have recently seen a demo of Spark where different pieces were put together (training via MLlib + deploying on Spark Streaming). I was wondering if MLlib currently works to directly train on Streaming. And, if so, what are the semantics of the algorithms? If not, would it be interesting to have ML algorithms developed for the streaming setting? Thanks, -- Gianmarco - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
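A minimal sketch of the streaming linear regression API referenced above (the StreamingContext ssc, the two DStreams, and the feature count are hypothetical placeholders):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

// trainingStream and testStream: DStream[LabeledPoint], e.g. built with
// ssc.textFileStream(...).map(LabeledPoint.parse)
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
model.trainOn(trainingStream)   // the model is updated on every batch
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()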
Re: retry in combineByKey at BinaryClassificationMetrics.scala
Sean's PR may be relevant to this issue (https://github.com/apache/spark/pull/3702). As a workaround, you can try to truncate the raw scores to 4 digits (e.g., 0.5643215 - 0.5643) before sending it to BinaryClassificationMetrics. This may not work well if he score distribution is very skewed. See discussion on https://issues.apache.org/jira/browse/SPARK-4547 -Xiangrui On Tue, Dec 23, 2014 at 9:00 AM, Thomas Kwan thomas.k...@manage.com wrote: Hi there, We are using mllib 1.1.1, and doing Logistics Regression with a dataset of about 150M rows. The training part usually goes pretty smoothly without any retries. But during the prediction stage and BinaryClassificationMetrics stage, I am seeing retries with error of fetch failure. The prediction part is just as follows: val predictionAndLabel = testRDD.map { point = val prediction = model.predict(point.features) (prediction, point.label) } ... val metrics = new BinaryClassificationMetrics(predictionAndLabel) The fetch failure happened with the following stack trace: org.apache.spark.rdd.PairRDDFunctions.combineByKey(PairRDDFunctions.scala:515) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3$lzycompute(BinaryClassificationMetrics.scala:101) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3(BinaryClassificationMetrics.scala:96) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:98) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:98) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:142) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:50) org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:60) com.manage.ml.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:14) ... We are doing this in the yarn-client mode. 32 executors, 16G executor memory, and 12 cores as the spark-submit settings. I wonder if anyone has suggestion on how to debug this. thanks in advance thomas - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
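A sketch of the truncation workaround (scoreAndLabel stands for the RDD[(Double, Double)] of raw scores and labels built as in the snippet above):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val truncated = scoreAndLabel.map { case (score, label) =>
  (math.rint(score * 1e4) / 1e4, label)   // keep 4 decimal digits to shrink the number of distinct score keys
}
val metrics = new BinaryClassificationMetrics(truncated)
println(metrics.areaUnderROC())

In later releases the constructor also accepts an optional numBins argument that down-samples the curve for you.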
Re: Using TF-IDF from MLlib
Hopefully the new pipeline API addresses this problem. We have a code example here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala -Xiangrui On Mon, Dec 29, 2014 at 5:22 AM, andy petrella andy.petre...@gmail.com wrote: Here is what I did for this case: https://github.com/andypetrella/tf-idf On Mon, Dec 29, 2014 at 11:31, Sean Owen so...@cloudera.com wrote: Given (label, terms) you can just transform the values to a TF vector, then a TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can make a LabeledPoint from (label, vector) pairs. Is that what you're looking for? On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote: I found the TF-IDF feature extraction and all the MLlib code that works with pure Vector RDDs very difficult to work with, due to the lack of ability to associate vectors back to the original data. Why can't Spark MLlib support LabeledPoint? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
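Along the lines Sean describes, a minimal sketch that keeps the labels attached (docs stands for a hypothetical RDD[(Double, Seq[String])] of label and terms):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val hashingTF = new HashingTF()
val tf = docs.map { case (label, terms) => hashingTF.transform(terms) }.cache()
val tfidf = new IDF().fit(tf).transform(tf)
// tfidf keeps the same ordering and partitioning as docs, so zip realigns the labels
val training = docs.map(_._1).zip(tfidf).map { case (label, vector) => LabeledPoint(label, vector) }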
Re: weights not changed with different reg param
Could you post your code? It sounds like a bug. One thing to check is whether you set regType, which is None by default. -Xiangrui On Tue, Dec 23, 2014 at 3:36 PM, Thomas Kwan thomas.k...@manage.com wrote: Hi there We are on mllib 1.1.1, and trying different regularization parameters. We noticed that the regParam doesn't affect the weights at all. Is setting the reg param via the optimizer the right thing to do? Do we need to set our own updater? Anyone else seeing the same behaviour? thanks again thomas - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
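For reference, a sketch of setting the regularization through the optimizer in the Scala API (training is a hypothetical RDD[LabeledPoint]); if I recall the SGD defaults correctly, regParam has no effect unless an updater with a penalty is also set:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new SquaredL2Updater)   // an updater with an L2 penalty, so regParam actually matters
val model = lr.run(training)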
Re: MLLib beginner question
b0c1, did you apply model.predict to a DStream? Maybe it would help understand your question better if you can post your code. -Xiangrui On Tue, Dec 23, 2014 at 11:54 AM, boci boci.b...@gmail.com wrote: Xiangrui: Hi, I want to using this with streaming and with job too. I using kafka (streaming) and elasticsearch (job) as source and want to calculate sentiment value from the input text. Simon: great, you have any doc how can I embed into my application without using the http interface? (how can I direct call the service?) b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Tue, Dec 23, 2014 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote: How big is the dataset you want to use in prediction? -Xiangrui On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote: Hi! I want to try out spark mllib in my spark project, but I got a little problem. I have training data (external file), but the real data com from another rdd. How can I do that? I try to simple using same SparkContext to boot rdd (first I create rdd using sc.textFile() and after NaiveBayes.train... After that I want to fetch the real data using same context and internal the map using the predict. But My application never exit (I think stucked or something). Why not work this solution? Thanks b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SVDPlusPlus Recommender in MLLib
There is an SVD++ implementation in GraphX. It would be nice if you can compare its performance vs. Mahout. -Xiangrui On Wed, Dec 24, 2014 at 6:46 AM, Prafulla Wani prafulla.w...@gmail.com wrote: hi , Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is implemented in Mahout from this paper - http://research.yahoo.com/files/kdd08koren.pdf Regards, Prafulla. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to use BigInteger for userId and productId in collaborative Filtering?
Hi Nishanth, Just found out where you work:) We had some discussion in https://issues.apache.org/jira/browse/SPARK-2465 . Having long IDs will increase the communication cost, which may not worth the benefit. Not many companies have more than 1 billion users. If they do, maybe they can mirror the implementation for their use cases. I can suggest several possible solutions: 1. Hash user IDs into integers before training. If the collision rate is high and it is crucial for your business, you can recompute user features from product features by solving least squares after training. This works when the product IDs could be mapped to integers. 2. Make type aliases in ALS, so that you can easily mirror the implementation to use long IDs and track future changes. 3. Make ALS implementation use generic ID types. This would be the best solution, but it requires some refactoring of the code. Best, Xiangrui On Wed, Jan 14, 2015 at 1:04 PM, Nishanth P S nishant...@gmail.com wrote: Yes, we are close to having more 2 billion users. In this case what is the best way to handle this. Thanks, Nishanth On Fri, Jan 9, 2015 at 9:50 PM, Xiangrui Meng men...@gmail.com wrote: Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer (check RDD.zipWithUniqueId), use them in ALS, and then join the original bigInt IDs back after training. -Xiangrui On Fri, Jan 9, 2015 at 5:12 PM, nishanthps nishant...@gmail.com wrote: Hi, The userId's and productId's in my data are bigInts, what is the best way to run collaborative filtering on this data. Should I modify MLlib's implementation to support more types? or is there an easy way. Thanks!, Nishanth -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-BigInteger-for-userId-and-productId-in-collaborative-Filtering-tp21072.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
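A sketch of the zipWithUniqueId approach from the quoted reply (bigIntRatings and its shape are hypothetical):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// bigIntRatings: RDD[(BigInt, BigInt, Double)] of (userId, productId, rating) -- hypothetical
val userIds = bigIntRatings.map(_._1).distinct().zipWithUniqueId()     // BigInt -> Long
val productIds = bigIntRatings.map(_._2).distinct().zipWithUniqueId()
// ALS takes Int ids, so this assumes fewer than ~2 billion distinct users and products
val ratings = bigIntRatings.map { case (u, p, r) => (u, (p, r)) }
  .join(userIds).map { case (_, ((p, r), uIdx)) => (p, (uIdx, r)) }
  .join(productIds).map { case (_, ((uIdx, r), pIdx)) => Rating(uIdx.toInt, pIdx.toInt, r) }
val model = ALS.train(ratings, 10, 10, 0.01)
// join userIds / productIds back onto the results to recover the original BigInt ids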
Re: Using a RowMatrix inside a map
Yes, you can only use RowMatrix.multiply() within the driver. We are working on distributed block matrices and linear algebra operations on top of it, which would fit your use cases well. It may take several PRs to finish. You can find the first one here: https://github.com/apache/spark/pull/3200 -Xiangrui On Wed, Jan 14, 2015 at 2:06 PM, Alex Minnaar aminn...@verticalscope.com wrote: I am working with a RowMatrix and I noticed in the multiply() method that the local matrix with which it is being multiplied is being distributed to all of the rows of the RowMatrix. If this is the case, then is it impossible to multiply a row matrix within a map operation? Because this would essentially be creating RDDs within RDDs. For example, If you had an RDD of local matrices and you wanted to perform a map operation where each local matrix is multiplied with a distributed matrix. This does not seem possible since it would require distributing each local matrix in the map when multiplication occurs (i.e. creating an RDD in each element of the original RDD). If this is true then does it mean you can only multiply a RowMatrix within the driver i.e. you cannot parallelize RowMatrix multiplications? Thanks, Alex - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
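To make the driver-side restriction concrete, a small sketch (the row data is made up):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0, 3.0), Vectors.dense(4.0, 5.0, 6.0)))
val A = new RowMatrix(rows)
val B = Matrices.dense(3, 2, Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0))  // a local 3x2 matrix
val product: RowMatrix = A.multiply(B)   // called on the driver; B is shipped to the executors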
Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
The assumption of the implicit feedback model is that the unobserved ratings are more likely to be negative. So you may want to add some negatives for evaluation. Otherwise, the input ratings are all 1 and the test ratings are all 1 as well. The baseline predictor, which uses the average rating (that is 1), could easily give you an RMSE of 0.0. -Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
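A rough sketch of what that evaluation setup can look like (observed, heldOutPositives, and sampleUnobserved are hypothetical placeholders):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// observed: RDD[Rating] with implicit preference 1.0 (e.g. a click) -- hypothetical
val model = ALS.trainImplicit(observed, 10, 10, 0.01, 40.0)   // rank, iterations, lambda, alpha

// include sampled unobserved (user, product) pairs with preference 0.0 in the test set,
// otherwise every test preference is 1 and a constant predictor looks perfect
val test = heldOutPositives.union(sampleUnobserved(observed))  // both hypothetical RDD[Rating]
val predictions = model.predict(test.map(r => (r.user, r.product)))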
Re: How to create distributed matrixes from hive tables.
You can get a SchemaRDD from the Hive table, map it into a RDD of Vectors, and then construct a RowMatrix. The transformations are lazy, so there is no external storage requirement for intermediate data. -Xiangrui On Sun, Jan 18, 2015 at 4:07 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, We have large datasets with data format for Spark MLLib matrix, but there are pre-computed by Hive and stored inside Hive, my question is can we create a distributed matrix such as IndexedRowMatrix directlly from Hive tables, avoiding reading data from Hive tables and feed them into an empty Matrix. Regards - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
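A minimal sketch of that path (the table and column names are hypothetical):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val rows = hiveContext.sql("SELECT c1, c2, c3 FROM features")
  .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1), row.getDouble(2)))
val mat = new RowMatrix(rows)   // nothing is materialized until an action runs on it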
Re: Saving a mllib model in Spark SQL
You can save the cluster centers as a SchemaRDD of two columns (id: Int, center: Array[Double]). When you load it back, you can construct the k-means model from its cluster centers. -Xiangrui On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian lian.cs@gmail.com wrote: This is because KMeanModel is neither a built-in type nor a user defined type recognized by Spark SQL. I think you can write your own UDT version of KMeansModel in this case. You may refer to o.a.s.mllib.linalg.Vector and o.a.s.mllib.linalg.VectorUDT as an example. Cheng On 1/20/15 5:34 AM, Divyansh Jain wrote: Hey people, I have run into some issues regarding saving the k-means mllib model in Spark SQL by converting to a schema RDD. This is what I am doing: case class Model(id: String, model: org.apache.spark.mllib.clustering.KMeansModel) import sqlContext.createSchemaRDD val rowRdd = sc.makeRDD(Seq(id, model)).map(p = Model(id, model)) This is the error that I get : scala.MatchError: org.apache.spark.mllib.classification.ClassificationModel (of class scala.reflect.internal.Types$TypeRef$anon$6) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:53) at org.apache.spark.sql.catalyst.ScalaReflection$anonfun$schemaFor$1.apply(ScalaReflection.scala:64) at org.apache.spark.sql.catalyst.ScalaReflection$anonfun$schemaFor$1.apply(ScalaReflection.scala:62) at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:62) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:50) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:44) at org.apache.spark.sql.execution.ExistingRdd$.fromProductRdd(basicOperators.scala:229) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:94) Any help would be appreciated. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Saving-a-mllib-model-in-Spark-SQL-tp21264.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
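A sketch of that round trip, assuming the KMeansModel constructor that takes an Array[Vector] is accessible in your version; the save path and the element type of the loaded array column are assumptions as well:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

case class Center(id: Int, center: Array[Double])

// save: one row per cluster center
import sqlContext.createSchemaRDD
val centers = sc.parallelize(model.clusterCenters.zipWithIndex.map { case (v, i) => Center(i, v.toArray) })
centers.saveAsParquetFile("kmeans-centers.parquet")   // hypothetical path

// load: rebuild the model from the stored centers
val restored = new KMeansModel(
  sqlContext.parquetFile("kmeans-centers.parquet")
    .map(r => (r.getInt(0), Vectors.dense(r(1).asInstanceOf[Seq[Double]].toArray)))
    .collect().sortBy(_._1).map(_._2))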
Re: Stepsize with Linear Regression
The best step size depends on the condition number of the problem. You can try some conditioning heuristics first, e.g., normalizing the columns, and then try a common step size like 0.01. We should implement line search for linear regression in the future, as in LogisticRegressionWithLBFGS. Line search can determine the step size automatically for you. -Xiangrui On Tue, Feb 10, 2015 at 8:56 AM, Rishi Yadav ri...@infoobjects.com wrote: Are there any rules of thumb on how to set the step size with gradient descent? I am using it for Linear Regression, but I am sure it applies to gradient descent in general. At present I am deriving the number that fits the training set's response variable values most closely. I am sure there is a better way to do it. Thanks and Regards, Rishi @meditativesoul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
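As a sketch of the conditioning idea (training is a hypothetical RDD[LabeledPoint]; withMean is left false to preserve sparsity):

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val scaler = new StandardScaler(withMean = false, withStd = true).fit(training.map(_.features))
val scaled = training.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features))).cache()

val lr = new LinearRegressionWithSGD()
lr.optimizer.setStepSize(0.01).setNumIterations(200)   // start around 0.01 and tune from there
val model = lr.run(scaled)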
Re: high GC in the Kmeans algorithm
Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to avoid GC triggered frequently. Please monitor the memory usage using YourKit or VisualVM. -Xiangrui On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com wrote: I just want to make the best use of CPU, and test the performance of spark if there is a lot of task in a single node. On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen so...@cloudera.com wrote: Good, worth double-checking that's what you got. That's barely 1GB per task though. Why run 48 if you have 24 cores? On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote: I give 50GB to the executor, so it seem that there is no reason the memory is not enough. On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen so...@cloudera.com wrote: Meaning, you have 128GB per machine but how much memory are you giving the executors? On Wed, Feb 11, 2015 at 8:49 AM, lihu lihu...@gmail.com wrote: What do you mean? Yes,I an see there is some data put in the memory from the web ui. On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen so...@cloudera.com wrote: Are you actually using that memory for executors? On Wed, Feb 11, 2015 at 8:17 AM, lihu lihu...@gmail.com wrote: Hi, I run the kmeans(MLlib) in a cluster with 12 workers. Every work own a 128G RAM, 24Core. I run 48 task in one machine. the total data is just 40GB. When the dimension of the data set is about 10^7, for every task the duration is about 30s, but the cost for GC is about 20s. When I reduce the dimension to 10^4, then the gc is small. So why gc is so high when the dimension is larger? or this is the reason caused by MLlib? -- Best Wishes! Li Hu(李浒) | Graduate Student Institute for Interdisciplinary Information Sciences(IIIS) Tsinghua University, China Email: lihu...@gmail.com Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ -- Best Wishes! Li Hu(李浒) | Graduate Student Institute for Interdisciplinary Information Sciences(IIIS) Tsinghua University, China Email: lihu...@gmail.com Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ -- Best Wishes! Li Hu(李浒) | Graduate Student Institute for Interdisciplinary Information Sciences(IIIS) Tsinghua University, China Email: lihu...@gmail.com Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
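For reference, a minimal k-means run with the input cached up front (the path and parameters are placeholders); the storage tab of the UI should show the RDD as fully cached:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///path/to/vectors")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(data, 100, 20)   // k = 100, maxIterations = 20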
Re: feeding DataFrames into predictive algorithms
Hey Sandy, The work should be done by a VectorAssembler, which combines multiple columns (double/int/vector) into a vector column, which becomes the features column for regression. We can going to create JIRAs for each of these standard feature transformers. It would be great if you can help implement some of them. Best, Xiangrui On Wed, Feb 11, 2015 at 7:55 PM, Patrick Wendell pwend...@gmail.com wrote: I think there is a minor error here in that the first example needs a tail after the seq: df.map { row = (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame(label, features) On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust mich...@databricks.com wrote: It sounds like you probably want to do a standard Spark map, that results in a tuple with the structure you are looking for. You can then just assign names to turn it back into a dataframe. Assuming the first column is your label and the rest are features you can do something like this: val df = sc.parallelize( (1.0, 2.3, 2.4) :: (1.2, 3.4, 1.2) :: (1.2, 2.3, 1.2) :: Nil).toDataFrame(a, b, c) df.map { row = (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double])) }.toDataFrame(label, features) df: org.apache.spark.sql.DataFrame = [label: double, features: arraydouble] If you'd prefer to stick closer to SQL you can define a UDF: val createArray = udf((a: Double, b: Double) = Seq(a, b)) df.select('a as 'label, createArray('b,'c) as 'features) df: org.apache.spark.sql.DataFrame = [label: double, features: arraydouble] We'll add createArray as a first class member of the DSL. Michael On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column that I can feed into a regression. So far all the paths I've gone down have led me to internal APIs or convoluted casting in and out of RDD[Row] and DataFrame. Is there a simple way of accomplishing this? any assistance (lookin' at you Xiangrui) much appreciated, Sandy - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
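A restatement of the two approaches above as they look with the released 1.3 API (where toDataFrame became toDF); the column names follow Michael's example:

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val df = sc.parallelize(Seq((1.0, 2.3, 2.4), (1.2, 3.4, 1.2), (1.2, 2.3, 1.2))).toDF("a", "b", "c")

// 1. plain map over rows, then assign names
val labeled = df.map { row =>
  (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
}.toDF("label", "features")

// 2. staying in the DSL, with a UDF that builds the array column
val createArray = udf((b: Double, c: Double) => Seq(b, c))
val labeled2 = df.select(df("a").as("label"), createArray(df("b"), df("c")).as("features"))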
Re: MLib usage on Spark Streaming
JavaDStream.foreachRDD (https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function)) and Statistics.corr (https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/mllib/stat/Statistics.html#corr(org.apache.spark.rdd.RDD)) should be good starting points. -Xiangrui On Mon, Feb 16, 2015 at 6:39 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I'm newbie to Spark and I have the following case study: 1. Client sending at 100ms the following data: {uniqueId, timestamp, measure1, measure2 } 2. Each 30 seconds I would like to correlate the data collected in the window, with some predefined double vector pattern for each given key. The predefined pattern has 300 records. The data should be also sorted by timestamp. 3. When the correlation is greater than a predefined threshold (e.g 0.9) I would like to emit an new message containing {uniqueId, doubleCorrelationValue} 4. For the correlation I would like to use MLlib 5. As a programming language I would like to muse Java 7. Can you please give me some suggestions on how to create the skeleton for the above scenario? Thanks. Regards, Florin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
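A rough sketch of those two pieces put together (the measures DStream and the pattern RDD are hypothetical, and both series must have the same length and ordering for corr to make sense):

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.streaming.Seconds

// measures: DStream[Double], pattern: RDD[Double] with the 300-point reference -- hypothetical
measures.window(Seconds(30)).foreachRDD { rdd =>
  if (rdd.count() == pattern.count()) {
    val corr = Statistics.corr(rdd, pattern, "pearson")
    if (corr > 0.9) println(s"correlation above threshold: $corr")
  }
}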
Re: Large Similarity Job failing
The complexity of DIMSUM is independent of the number of rows but still have quadratic dependency on the number of columns. 1.5M columns may be too large to use DIMSUM. Try to increase the threshold and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am running brute force similarity from RowMatrix on a job with 5M x 1.5M sparse matrix with 800M entries. With 200M entries the job run fine but with 800M I am getting exceptions like too many files open and no space left on device... Seems like I need more nodes or use dimsum sampling ? I am running on 10 nodes where ulimit on each node is set at 65K...Memory is not an issue since I can cache the dataset before similarity computation starts. I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable. Both the jobs failed with FetchFailed msgs. Thanks. Deb - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
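For reference, the threshold is the single argument of columnSimilarities (mat stands for the RowMatrix already built from the data):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val exact = mat.columnSimilarities()        // brute force; cost grows with the square of the number of columns
val approx = mat.columnSimilarities(0.1)    // DIMSUM sampling; a higher threshold prunes more column pairs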
Re: [POWERED BY] Radius Intelligence
Thanks! I added Radius to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. -Xiangrui On Tue, Feb 10, 2015 at 12:02 AM, Alexis Roos alexis.r...@gmail.com wrote: Also long due given our usage of Spark .. Radius Intelligence: URL: radius.com Description: Spark, MLLib Using Scala, Spark and MLLib for Radius Marketing and Sales intelligence platform including data aggregation, data processing, data clustering, data analysis and predictive modeling of all US businesses. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unknown sample in Naive Baye's
If there exists a sample that doesn't not belong to A/B/C, it means that there exists another class D or Unknown besides A/B/C. You should have some of these samples in the training set in order to let naive Bayes learn the priors. -Xiangrui On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am using MLlib's Naive Baye's classifier to classify textual data. I am accessing the posterior probabilities through a hack for each class. Once I have trained the model, I want to remove documents whose confidence of classification is low. Say for a document, if the highest class probability is lesser than a pre-defined threshold(separate for each class), categorize this document as 'unknown'. Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33 respectively defined after training and testing. If I score a sample that belongs to neither of the three categories, I wish to classify it as 'unknown'. But the issue is I can get a probability higher than these thresholds for a document that doesn't belong to the trained categories. Is there any technique which I can apply to segregate documents that belong to untrained classes with certain degree of confidence? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Naive Bayes model fails after a few predictions
Could you share the error log? What do you mean by 500 instead of 200? If this is the number of files, try to use `repartition` before calling naive Bayes, which works the best when the number of partitions matches the number of cores, or even less. -Xiangrui On Tue, Feb 10, 2015 at 10:34 PM, rkgurram rkgur...@gmail.com wrote: Further I have tried HttpBroadcast but that too does not work. It is almost like there is a MemoryLeak because if I increase the input files to 500 instead of 200 the system crashes early. The code is as follows logger.info(Training the model Fold:[+ fold +]) logger.info(Step 1: Split the input into Training and Testing sets) val splits = labeledPointRDD.randomSplit(Array(0.6, 0.4), seed = 11L) logger.info(Step 1: splits successful...) val training = splits(0) val test = splits(1) status = ModelStatus.IN_TRAINING //logger.info(Fold:[ + fold + ] Training count: + training.count() + Testing/Verification count: + test.count()) logger.info(Step 2: Train the NB classifier) model = NaiveBayes.train(training, lambda = 1.0) logger.info(Step 2: NB model training complete Fold:[ + fold + ]) logger.info(Step 3: Testing/Verification of the model) status = ModelStatus.IN_VERIFICATION val predictionAndLabel = test.map(p = (model.predict(p.features), p.label)) val arry = predictionAndLabel.filter(x = x._1 == x._2) val accuracy = 1.0 * predictionAndLabel.filter(x = x._1 == x._2).count() / test.count() logger.info(Step 3: Testing complete) status = ModelStatus.INITIALIZED logger.info(Fold[+ fold +] Accuracy:[ + accuracy + ] Model Status:[ + status + ]) -Ravi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Naive-Bayes-model-fails-after-a-few-predictions-tp21592p21593.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
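The repartition suggestion in a couple of lines (labeledPointRDD is the RDD from the quoted code):

import org.apache.spark.mllib.classification.NaiveBayes

// match the number of partitions to the total number of cores before training
val training = labeledPointRDD.repartition(sc.defaultParallelism).cache()
val model = NaiveBayes.train(training, lambda = 1.0)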
Re: WARN from Similarity Calculation
It may be caused by GC pause. Did you check the GC time in the Spark UI? -Xiangrui On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms exceeds 45000ms Do I need to increase the default 45 s to larger values for cases where we are doing blocked operation or long compute in the mapPartitions ? Thanks. Deb - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unknown sample in Naive Baye's
If you know there are data doesn't belong to any existing category, put them into the training set and make a new category for them. It won't help much if instances from this unknown category are all outliers. In that case, lower the thresholds and tune the parameters to get a lower error rate. -Xiangrui On Thu, Feb 19, 2015 at 8:58 AM, Jatinpreet Singh jatinpr...@gmail.com wrote: Hi Xiangrui, Thanks for the answer. The problem is that in my application, I can not stop user from scoring any type of sample against trained model. So, even if the class of a completely unknown sample has not been trained, the model will put it in one of the categories with high priority. I wish to eliminate this with come kind of probability threshold. Is this possible in any way with Naive Baye's? Can changing the classification algorithm help in this regard? I appreciate any help on this. Thanks, Jatin On Wed, Feb 18, 2015 at 3:07 AM, Xiangrui Meng men...@gmail.com wrote: If there exists a sample that doesn't not belong to A/B/C, it means that there exists another class D or Unknown besides A/B/C. You should have some of these samples in the training set in order to let naive Bayes learn the priors. -Xiangrui On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am using MLlib's Naive Baye's classifier to classify textual data. I am accessing the posterior probabilities through a hack for each class. Once I have trained the model, I want to remove documents whose confidence of classification is low. Say for a document, if the highest class probability is lesser than a pre-defined threshold(separate for each class), categorize this document as 'unknown'. Say there are three classes A, B and C with thresholds 0.35, 0.32 and 0.33 respectively defined after training and testing. If I score a sample that belongs to neither of the three categories, I wish to classify it as 'unknown'. But the issue is I can get a probability higher than these thresholds for a document that doesn't belong to the trained categories. Is there any technique which I can apply to segregate documents that belong to untrained classes with certain degree of confidence? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unknown-sample-in-Naive-Baye-s-tp21594.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Regards, Jatinpreet Singh - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLIB and Openblas library in non-default dir
It might be hard to do that with spark-submit, because the executor JVMs may be already up and running before a user runs spark-submit. You can try to use `System.setProperty` to change the property at runtime, though it doesn't seem to be a good solution. -Xiangrui On Fri, Jan 2, 2015 at 6:28 AM, xhudik xhu...@gmail.com wrote: Hi I have compiled OpenBlas library into nonstandard directory and I want to inform Spark app about it via: -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so which is a standard option in netlib-java (https://github.com/fommil/netlib-java) I tried 2 ways: 1. via *--conf* parameter /bin/spark-submit -v --class org.apache.spark.examples.mllib.LinearRegression *--conf -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so* examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt/ 2. via *--driver-java-options* parameter /bin/spark-submit -v *--driver-java-options -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so* --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt / How can I force spark-submit to propagate info about non-standard placement of openblas library to netlib-java lib? thanks, Tomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-and-Openblas-library-in-non-default-dir-tp20943.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
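A sketch of the System.setProperty route; it has to run in the driver before the first native BLAS call, and for executor JVMs the same -D flag may need to go through spark.executor.extraJavaOptions:

// point netlib-java at the custom OpenBLAS build; the property name is the one from the thread above
System.setProperty("com.github.fommil.netlib.NativeSystemBLAS.natives",
  "/usr/local/lib/libopenblas.so")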
Re: python API for gradient boosting?
I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-5094. Hopefully someone would work on it and make it available in the 1.3 release. -Xiangrui On Sun, Jan 4, 2015 at 6:58 PM, Christopher Thom christopher.t...@quantium.com.au wrote: Hi, I wonder if anyone knows when a python API will be added for Gradient Boosted Trees? I see that java and scala APIs were added for the 1.2 release, and would love to be able to build GBMs in pyspark too. cheers chris Christopher Thom QUANTIUM Level 25, 8 Chifley, 8-12 Chifley Square Sydney NSW 2000 T: +61 2 8222 3577 F: +61 2 9292 6444 W: quantium.com.au linkedin.com/company/quantium facebook.com/QuantiumAustralia twitter.com/QuantiumAU The contents of this email, including attachments, may be confidential information. If you are not the intended recipient, any use, disclosure or copying of the information is unauthorised. If you have received this email in error, we would be grateful if you would notify us immediately by email reply, phone (+ 61 2 9292 6400) or fax (+ 61 2 9292 6444) and delete the message from your system. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
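In the meantime, the Scala API that the JIRA would mirror in Python looks roughly like this (training is a hypothetical RDD[LabeledPoint]):

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 100
boostingStrategy.treeStrategy.maxDepth = 5
val model = GradientBoostedTrees.train(training, boostingStrategy)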
Re: Driver hangs on running mllib word2vec
How big is your dataset, and what is the vocabulary size? -Xiangrui On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi, When we run mllib word2vec(spark-1.1.0), driver get stuck with 100% cup usage. Here is the jstack output: main prio=10 tid=0x40112800 nid=0x46f2 runnable [0x4162e000] java.lang.Thread.State: RUNNABLE at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847) at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778) at java.io.DataOutputStream.writeInt(DataOutputStream.java:182) at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225) at java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) at com.baidu.inf.WordCount$.main(WordCount.scala:31) at com.baidu.inf.WordCount.main(WordCount.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- Best Regards - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: OptionalDataException during Naive Bayes Training
How big is your data? Did you see other error messages from executors? It seems to me like a shuffle communication error. This thread may be relevant: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccalrnvjuvtgae_ag1rqey_cod1nmrlfpesxgsb7g8r21h0bm...@mail.gmail.com%3E -Xiangrui On Fri, Jan 9, 2015 at 3:19 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am using Spark Version 1.1 in standalone mode in the cluster. Sometimes, during Naive Baye's training, I get OptionalDataException at line, map at NaiveBayes.scala:109 I am getting following exception on the console, java.io.OptionalDataException: java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1371) java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) java.util.HashMap.readObject(HashMap.java:1394) sun.reflect.GeneratedMethodAccessor626.invoke(Unknown Source) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:483) java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) What could be the reason behind this? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OptionalDataException-during-Naive-Bayes-Training-tp21059.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to use BigInteger for userId and productId in collaborative Filtering?
Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer (check RDD.zipWithUniqueId), use them in ALS, and then join the original bigInt IDs back after training. -Xiangrui On Fri, Jan 9, 2015 at 5:12 PM, nishanthps nishant...@gmail.com wrote: Hi, The userId's and productId's in my data are bigInts, what is the best way to run collaborative filtering on this data. Should I modify MLlib's implementation to support more types? or is there an easy way. Thanks!, Nishanth -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-BigInteger-for-userId-and-productId-in-collaborative-Filtering-tp21072.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Discrepancy in PCA values
You need to subtract mean values to obtain the covariance matrix (http://en.wikipedia.org/wiki/Covariance_matrix). On Fri, Jan 9, 2015 at 6:41 PM, Upul Bandara upulband...@gmail.com wrote: Hi Xiangrui, Thanks for the reply. Julia code is also using the covariance matrix: (1/n)*X'*X ; Thanks, Upul On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng men...@gmail.com wrote: The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote: Hi All, I tried to do PCA for the Iris dataset [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib [http://spark.apache.org/docs/1.1.1/mllib-dimensionality-reduction.html]. Also, PCA was calculated in Julia using following method: Sigma = (1/numRow(X))*X'*X ; [U, S, V] = svd(Sigma); Ureduced = U(:, 1:k); Z = X*Ureduced; However, I'm seeing a little difference between values given by MLLib and the method shown above . Does anyone have any idea about this difference? Additionally, I have attached two visualizations, related to two approaches. Thanks, Upul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: calculating the mean of SparseVector RDD
colStats() computes the mean values along with several other summary statistics, which makes it slower. How is the performance if you don't use kryo? -Xiangrui On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar rokros...@gmail.com wrote: thanks for the suggestion -- however, looks like this is even slower. With the small data set I'm using, my aggregate function takes ~ 9 seconds and the colStats.mean() takes ~ 1 minute. However, I can't get it to run with the Kyro serializer -- I get the error: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8 is there an easy/obvious fix? On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng men...@gmail.com wrote: There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote: I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector. I've tried doing this with the following (using pyspark, spark v1.2.0): def aggregate_partition_values(vec1, vec2) : vec1[vec2.indices] += vec2.values return vec1 def aggregate_combined_vectors(vec1, vec2) : if all(vec1 == vec2) : # then the vector came from only one partition return vec1 else: return vec1 + vec2 means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values, aggregate_combined_vectors) means = means / nvals This turns out to be really slow -- and doesn't seem to depend on how many vectors there are so there seems to be some overhead somewhere that I'm not understanding. Is there a better way of doing this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/calculating-the-mean-of-SparseVector-RDD-tp21019.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
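For reference, the colStats call mentioned above, written in Scala (vectors is a hypothetical RDD[Vector] of the sparse vectors):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics

val summary = Statistics.colStats(vectors)   // one pass over the data
val mean: Vector = summary.mean              // dense vector of per-column means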
Re: Zipping RDDs of equal size not possible
sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out How do you guarantee that the two RDDs have the same size? -Xiangrui On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke 1wil...@informatik.uni-hamburg.de wrote: Hi Spark community, I have a problem with zipping two RDDs of the same size and same number of partitions. The error message says that zipping is only allowed on RDDs which are partitioned into chunks of exactly the same sizes. How can I assure this? My workaround at the moment is to repartition both RDDs to only one partition but that obviously does not scale. This problem originates from my problem to draw n random tuple pairs (Tuple, Tuple) from an RDD[Tuple]. What I do is to sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out and zipping them together. I would appreciate to read better approaches for both problems. Thanks in advance, Niklas - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
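One commonly used workaround that avoids collapsing to a single partition is to key both RDDs by a position index and join, since join does not require matching partition sizes; a sketch (rdd1 and rdd2 are the two equally sized RDDs):

val a = rdd1.zipWithIndex().map(_.swap)   // (index, element)
val b = rdd2.zipWithIndex().map(_.swap)
val zipped = a.join(b).values             // pairs up elements that share the same index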
Re: Discrepancy in PCA values
Could you compare V directly and tell us more about the difference you saw? The column of V should be the same subject to signs. For example, the first column of V could be either [0.8, -0.6, 0.0] or [-0.8, 0.6, 0.0]. -Xiangrui On Sat, Jan 10, 2015 at 8:08 PM, Upul Bandara upulband...@gmail.com wrote: Hi Xiangrui, Thanks a lot for you answer. So I fixed my Julia code, also calculated PCA using R as well. R Code: - data - read.csv('/home/upul/Desktop/iris.csv'); X - data[,1:4] pca - prcomp(X, center = TRUE, scale=FALSE) transformed - predict(pca, newdata = X) Julia Code (Fixed) -- data = readcsv(/home/upul/temp/iris.csv); X = data[:,1:end-1]; meanX = mean(X,1); m,n = size(X); X = X - repmat(x, m,1); u,s,v = svd(X); transformed = X*v; Now PCA calculated using Julia and R is identical, but still I can see a small difference between PCA values given by Spark and other two. Thanks, Upul On Sat, Jan 10, 2015 at 11:17 AM, Xiangrui Meng men...@gmail.com wrote: You need to subtract mean values to obtain the covariance matrix (http://en.wikipedia.org/wiki/Covariance_matrix). On Fri, Jan 9, 2015 at 6:41 PM, Upul Bandara upulband...@gmail.com wrote: Hi Xiangrui, Thanks for the reply. Julia code is also using the covariance matrix: (1/n)*X'*X ; Thanks, Upul On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng men...@gmail.com wrote: The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote: Hi All, I tried to do PCA for the Iris dataset [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib [http://spark.apache.org/docs/1.1.1/mllib-dimensionality-reduction.html]. Also, PCA was calculated in Julia using following method: Sigma = (1/numRow(X))*X'*X ; [U, S, V] = svd(Sigma); Ureduced = U(:, 1:k); Z = X*Ureduced; However, I'm seeing a little difference between values given by MLLib and the method shown above . Does anyone have any idea about this difference? Additionally, I have attached two visualizations, related to two approaches. Thanks, Upul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
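For comparison on the Spark side, RowMatrix can compute the principal components directly; rows is a hypothetical RDD[Vector] with the four iris measurements per row:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2)   // components derived from the covariance matrix; columns are the top PCs
val projected = mat.multiply(pc)             // projection of the (uncentered) rows onto those components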
Re: calculating the mean of SparseVector RDD
No, colStats() computes all summary statistics in one pass and store the values. It is not lazy. On Mon, Jan 12, 2015 at 4:42 AM, Rok Roskar rokros...@gmail.com wrote: This was without using Kryo -- if I use kryo, I got errors about buffer overflows (see above): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8 Just calling colStats doesn't actually compute those statistics, does it? It looks like the computation is only carried out once you call the .mean() method. On Sat, Jan 10, 2015 at 7:04 AM, Xiangrui Meng men...@gmail.com wrote: colStats() computes the mean values along with several other summary statistics, which makes it slower. How is the performance if you don't use kryo? -Xiangrui On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar rokros...@gmail.com wrote: thanks for the suggestion -- however, looks like this is even slower. With the small data set I'm using, my aggregate function takes ~ 9 seconds and the colStats.mean() takes ~ 1 minute. However, I can't get it to run with the Kyro serializer -- I get the error: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5, required: 8 is there an easy/obvious fix? On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng men...@gmail.com wrote: There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote: I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector. I've tried doing this with the following (using pyspark, spark v1.2.0): def aggregate_partition_values(vec1, vec2) : vec1[vec2.indices] += vec2.values return vec1 def aggregate_combined_vectors(vec1, vec2) : if all(vec1 == vec2) : # then the vector came from only one partition return vec1 else: return vec1 + vec2 means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values, aggregate_combined_vectors) means = means / nvals This turns out to be really slow -- and doesn't seem to depend on how many vectors there are so there seems to be some overhead somewhere that I'm not understanding. Is there a better way of doing this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/calculating-the-mean-of-SparseVector-RDD-tp21019.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: including the spark-mllib in build.sbt
I don't know the root cause. Could you try including only libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1" It should be sufficient because mllib depends on core. -Xiangrui On Mon, Jan 12, 2015 at 2:27 PM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I am trying to build my own scala project using sbt. The project depends on both spark-core and spark-mllib. I included the following two dependencies in my build.sbt file libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" However, when I run the package command in sbt, I got an error message indicating that object mllib is not a member of package org.apache.spark. Did I do anything wrong? Thanks, Jianguo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
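A minimal build.sbt along those lines (the project name and Scala version are placeholders; match scalaVersion to your Spark build):

name := "my-spark-app"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"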
Re: TF-IDF from spark-1.1.0 not working on cluster mode
This is worker log, not executor log. The executor log can be found in folders like /newdisk2/rta/rtauser/workerdir/app-20150109182514-0001/0/ . -Xiangrui On Fri, Jan 9, 2015 at 5:03 AM, Priya Ch learnings.chitt...@gmail.com wrote: Please find the attached worker log. I could see stream closed exception On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote: Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote: Hi All, Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in local mode and not on distributed mode. Null pointer exception has been thrown. Is this a bug in spark-1.1.0 ? Following is the code: def main(args:Array[String]) { val conf=new SparkConf val sc=new SparkContext(conf) val documents=sc.textFile(hdfs://IMPETUS-DSRV02:9000/nlp/sampletext).map(_.split( ).toSeq) val hashingTF = new HashingTF() val tf= hashingTF.transform(documents) tf.cache() val idf = new IDF().fit(tf) val tfidf = idf.transform(tf) val rdd=tfidf.map { vec = println(vector is+vec) (10) } rdd.saveAsTextFile(/home/padma/usecase) } Exception thrown: 15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0 15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes) 15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes) 15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM 15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888] 15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130] 15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending 15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB) 15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB) 15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB) 15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB) 15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException: org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) java.lang.Thread.run(Thread.java:722) Thanks, Padma Ch - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLIB and Openblas library in non-default dir
spark-submit may not share the same JVM with Spark master and executors. On Tue, Jan 6, 2015 at 11:40 AM, Tomas Hudik xhu...@gmail.com wrote: thanks Xiangrui I'll try it. BTW: spark-submit is a standalone program (bin/spark-submit). Therefore, JVM has to be executed after spark-submit script Am I correct? On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng men...@gmail.com wrote: It might be hard to do that with spark-submit, because the executor JVMs may be already up and running before a user runs spark-submit. You can try to use `System.setProperty` to change the property at runtime, though it doesn't seem to be a good solution. -Xiangrui On Fri, Jan 2, 2015 at 6:28 AM, xhudik xhu...@gmail.com wrote: Hi I have compiled OpenBlas library into nonstandard directory and I want to inform Spark app about it via: -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so which is a standard option in netlib-java (https://github.com/fommil/netlib-java) I tried 2 ways: 1. via *--conf* parameter /bin/spark-submit -v --class org.apache.spark.examples.mllib.LinearRegression *--conf -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so* examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt/ 2. via *--driver-java-options* parameter /bin/spark-submit -v *--driver-java-options -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so* --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt / How can I force spark-submit to propagate info about non-standard placement of openblas library to netlib-java lib? thanks, Tomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-and-Openblas-library-in-non-default-dir-tp20943.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: [MLLib] storageLevel in ALS
Which Spark version are you using? We made this configurable in 1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202 -Xiangrui On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. fot...@gmail.com wrote: Hi, I was doing some tests with ALS and I noticed that if I persist the inner RDDs from a MatrixFactorizationModel the RDD is not replicated; it seems like the storage level is hardcoded to MEMORY_AND_DISK. Do you think it makes sense to make that configurable? [image: Inline image 1]
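For reference, a minimal sketch of overriding the storage level on the RDD-based ALS estimator, assuming a Spark version (1.1+) that exposes setIntermediateRDDStorageLevel and an existing ratings RDD[Rating]:
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

// ratings: RDD[Rating] prepared elsewhere
val als = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_2) // replicate the intermediate RDDs
val model = als.run(ratings)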
Re: confidence/probability for prediction in MLlib
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789. In the new pipeline API, we can simply output two columns, one for the best predicted class, and the other for probabilities or confidence scores for each class. -Xiangrui On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, A while ago, somebody asked about getting a confidence value for a prediction with MLlib's implementation of Naive Bayes classification. I was wondering if there is any plan in the near future for the predict function to return both a label and a confidence/probability? Or could the private variables in the various machine learning models be exposed so we could write our own functions which return both? Having a confidence/probability could be very useful in real applications. For one thing, you can choose to trust the predicted label only if it has a high confidence level. Also, if you want to combine the results from multiple classifiers, the confidence/probability could be used as a weight when combining them. Thanks, Jianguo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
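To illustrate the pipeline behaviour Xiangrui describes, here is a rough sketch with the spark.ml LogisticRegression estimator (available from Spark 1.3); training and test are assumed to be DataFrames with "label" and "features" columns, and which classifiers expose a probability column depends on the Spark version:
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(50)
val model = lr.fit(training)
// "prediction" holds the best class, "probability" the per-class probabilities
model.transform(test).select("label", "prediction", "probability").show()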
Re: TF-IDF from spark-1.1.0 not working on cluster mode
Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote: Hi All, Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in local mode and not in distributed mode. A null pointer exception has been thrown. Is this a bug in spark-1.1.0? Following is the code:
def main(args: Array[String]) {
  val conf = new SparkConf
  val sc = new SparkContext(conf)
  val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext").map(_.split(" ").toSeq)
  val hashingTF = new HashingTF()
  val tf = hashingTF.transform(documents)
  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)
  val rdd = tfidf.map { vec => println("vector is " + vec); (10) }
  rdd.saveAsTextFile("/home/padma/usecase")
}
Exception thrown: 15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0 15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes) 15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes) 15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM 15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888] 15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130] 15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending 15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB) 15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB) 15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB) 15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB) 15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException: org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) java.lang.Thread.run(Thread.java:722) Thanks, Padma Ch - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib: feature standardization
`mean()` and `variance()` are not defined in `Vector`. You can use the mean and variance implementation from commons-math3 (http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html) if you don't want to implement them. -Xiangrui On Fri, Feb 6, 2015 at 12:50 PM, SK skrishna...@gmail.com wrote: Hi, I have a dataset in csv format and I am trying to standardize the features before using k-means clustering. The data does not have any labels but has the following format: s1, f12,f13,... s2, f21,f22,... where s is a string id, and f is a floating point feature value. To perform feature standardization, I need to compute the mean and variance/std deviation of the feature values in each element of the RDD (i.e. each row). However, the summary statistics library in Spark MLlib provides only a colStats() method that provides column-wise mean and variance. I tried to compute the mean and variance per row using the code below, but got a compilation error that there is no mean() or variance() method for a tuple or Vector object. Is there a Spark library to compute the row-wise mean and variance for an RDD, where each row (i.e. element) of the RDD is a Vector or tuple of N feature values? Thanks. My code for standardization is as follows:
// read the data
val data = sc.textFile(file_name).map(_.split(","))
// extract the features. For this example I am using only 2 features, but the data has more features
val features = data.map(d => Vectors.dense(d(1).toDouble, d(2).toDouble))
val std_features = features.map(f => {
  val fmean = f.mean() // Error: NO mean() for a Vector or Tuple object
  val fstd = scala.math.sqrt(f.variance()) // Error: NO variance() for a Vector or Tuple object
  for (i <- 0 to f.length) { // standardize the features
    var fs = 0.0
    if (fstd > 0.0) fs = (f(i) - fmean) / fstd
    fs
  }
})
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-feature-standardization-tp21539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
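As a rough sketch of the workaround Xiangrui suggests, the per-row statistics can be computed with commons-math3's DescriptiveStatistics (class and method names below are from commons-math3 3.x, and `features` is the RDD of Vectors from the question); note DescriptiveStatistics uses the sample standard deviation:
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.apache.spark.mllib.linalg.Vectors

val std_features = features.map { f =>
  val values = f.toArray
  val stats = new DescriptiveStatistics(values)
  val fmean = stats.getMean
  val fstd = stats.getStandardDeviation
  // standardize each feature; leave the row unchanged if its std deviation is 0
  if (fstd > 0.0) Vectors.dense(values.map(v => (v - fmean) / fstd)) else f
}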
Re: Number of goals to win championship
Logistic regression outputs probabilities if the data fits the model assumption. Otherwise, you might need to calibrate its output to read it correctly. You may be interested in reading this: http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/. We have isotonic regression implemented in Spark 1.3. Another problem with your input is that the dataset is too small. Try adding more points and see the result. Also, use LogisticRegressionWithLBFGS, which is better than the SGD implementation. -Xiangrui On Thu, Feb 5, 2015 at 10:40 AM, jvuillermet jeremy.vuiller...@gmail.com wrote: I want to find the minimum number of goals for a player that likely allows its team to win the championship. My data (goals, win/lose):
25 1
5 0
10 1
20 0
After some reading and courses, I think I need a logistic regression model for this data. I create my LabeledPoints with this data (1/0 being the label) and use val model = LogisticRegressionWithSGD.train(...) followed by model.clearThreshold(). I then try model.predict(Vectors.dense(10)) but don't understand the output. All the results are 0.5 and I'm not even sure how to use the predicted value. Am I using the right model? How do I read the predicted value? What more do I need to find the number of goals from which a team is likely to win the championship, or say it has a 3/4 chance of winning it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-goals-to-win-championship-tp21519.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
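To make the suggestion concrete, here is a rough sketch using LogisticRegressionWithLBFGS on the four points from the question; the tiny dataset is kept only for illustration, so the probabilities it produces will not be meaningful:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(25.0)),
  LabeledPoint(0.0, Vectors.dense(5.0)),
  LabeledPoint(1.0, Vectors.dense(10.0)),
  LabeledPoint(0.0, Vectors.dense(20.0))))

val model = new LogisticRegressionWithLBFGS()
  .setIntercept(true) // fit an intercept so the decision boundary is not forced through 0 goals
  .run(training)
model.clearThreshold() // return class-1 probabilities instead of 0/1 labels
println(model.predict(Vectors.dense(15.0)))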
Re: no option to add intercepts for StreamingLinearAlgorithm
No particular reason. We didn't add it in the first version. Let's add it in 1.4. -Xiangrui On Thu, Feb 5, 2015 at 3:44 PM, jamborta jambo...@gmail.com wrote: hi all, just wondering if there is a reason why it is not possible to add intercepts for streaming regression models? I understand that the run method in the underlying GeneralizedLinearModel does not take an intercept parameter either. Any reason for that? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/no-option-to-add-intercepts-for-StreamingLinearAlgorithm-tp21526.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
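As a stop-gap for batch training, the non-streaming estimator does let you turn on an intercept; a minimal sketch, assuming Spark 1.x's RDD-based API and an existing trainingData: RDD[LabeledPoint]:
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val lr = new LinearRegressionWithSGD()
lr.setIntercept(true) // fit an intercept term in addition to the weights
lr.optimizer.setNumIterations(100).setStepSize(0.01)
val model = lr.run(trainingData)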
Re: [MLlib] Performance issues when building GBM models
Could you check the Spark UI and see whether there are RDDs being kicked out during the computation? We cache the residual RDD after each iteration. If we don't have enough memory/disk, it gets recomputed, resulting in something like `t(n) = t(n-1) + const`. We might cache the features multiple times, which could be improved. -Xiangrui On Sun, Feb 8, 2015 at 5:32 PM, Christopher Thom christopher.t...@quantium.com.au wrote: Hi All, I wonder if anyone else has some experience building a Gradient Boosted Trees model using spark/mllib? I have noticed when building decent-size models that the process slows down over time. We observe that the time to build tree n is approximately a constant time longer than the time to build tree n-1, i.e. t(n) = t(n-1) + const. The implication is that the total build time goes as something like N^2, where N is the total number of trees. I would expect the algorithm to be approximately linear in total time (i.e. each boosting iteration takes roughly the same time to complete). So I have a couple of questions: 1. Is this behaviour expected, or consistent with what others are seeing? 2. Does anyone know if there are tuning parameters (e.g. in the boosting strategy or tree strategy) that may be impacting this? All aspects of the build seem to slow down as I go. Here's a random example culled from the logs, from the beginning and end of the model build: 15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count at DecisionTreeMetadata.scala:111, took 0.077957 s 15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished: count at DecisionTreeMetadata.scala:111, took 5.495166 s Any thoughts or advice, or even suggestions on where to dig for more info would be welcome. thanks chris Christopher Thom QUANTIUM Level 25, 8 Chifley, 8-12 Chifley Square Sydney NSW 2000 T: +61 2 8222 3577 F: +61 2 9292 6444 W: www.quantium.com.au www.linkedin.com/company/quantium www.facebook.com/QuantiumAustralia www.twitter.com/QuantiumAU The contents of this email, including attachments, may be confidential information. If you are not the intended recipient, any use, disclosure or copying of the information is unauthorised. If you have received this email in error, we would be grateful if you would notify us immediately by email reply, phone (+ 61 2 9292 6400) or fax (+ 61 2 9292 6444) and delete the message from your system. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
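For anyone wanting to experiment, a rough sketch of a GBT run with the training data explicitly cached, using the RDD-based tree API from Spark 1.2+; trainingData: RDD[LabeledPoint] is assumed, and caching the input alone may not remove the per-iteration growth Xiangrui describes, since the residual RDDs cached internally can still be evicted:
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.storage.StorageLevel

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 100
boostingStrategy.treeStrategy.maxDepth = 5

// keep the features in memory (spilling to disk) so each boosting iteration
// does not have to re-read or recompute the input
val cachedInput = trainingData.persist(StorageLevel.MEMORY_AND_DISK)
val model = GradientBoostedTrees.train(cachedInput, boostingStrategy)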