ScaledML 2020 Spark Speakers and Promo

2019-12-01 Thread Reza Zadeh
Spark Users,

You are all welcome to join us at ScaledML 2020: http://scaledml.org

A very steep discount is available for this list, using this link.

We'd love to see you there.

Best,
Reza


Re: [MLlib] DIMSUM row similarity?

2015-08-31 Thread Reza Zadeh
This is ongoing work tracked by SPARK-4823, with a PR for it here: PR 6213.
Unfortunately, the PR submitter didn't make it in time for Spark 1.5.
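
Until that lands, here is a rough workaround sketch (not the eventual SPARK-4823
implementation; it assumes the number of documents is small enough that they can
be treated as columns): build a CoordinateMatrix from the TF-IDF vectors with the
indices swapped, so documents become columns, then call columnSimilarities().

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
import org.apache.spark.rdd.RDD

// tfidf: one vector per document, i.e. documents are rows.
def transposeToRowMatrix(tfidf: RDD[Vector]): RowMatrix = {
  val entries = tfidf.zipWithIndex().flatMap { case (vec, docId) =>
    // toArray densifies each row; fine for a sketch, use foreachActive for big vocabularies.
    vec.toArray.zipWithIndex.collect {
      // Swapping (docId, termId) to (termId, docId) is the transpose.
      case (value, termId) if value != 0.0 => MatrixEntry(termId, docId, value)
    }
  }
  new CoordinateMatrix(entries).toRowMatrix()
}

// Documents are now columns, so this compares documents:
// val sims = transposeToRowMatrix(tfidf).columnSimilarities(0.1)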

On Mon, Aug 31, 2015 at 4:17 AM, Maandy  wrote:

> I've been trying to implement cosine similarity using Spark and stumbled
> upon
> this article
>
> https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
> The only problem I have with it is that it seems that they assume that in
> my
> input file each *column* is a separate tweet, hence by computing column
> similarity we will find similar tweets. Hope I understood it correctly?
>
> Since it's a Twitter-based method I guess there has to be some logic behind
> it, but how would one go about preparing such a dataset? Personally, after
> doing TF-IDF on a document I get an RDD of Vectors which I can then
> transform into a RowMatrix but this RowMatrix has each document as a
> separate row not column hence the aforementioned method won't really help
> me. I tried looking for some transpose methods but failed to find any. Am I
> missing something or just misusing the available methods?
>
> Mateusz
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-DIMSUM-row-similarity-tp24518.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: Duplicate entries in output of mllib column similarities

2015-05-12 Thread Reza Zadeh
Great! Reza

On Tue, May 12, 2015 at 7:42 AM, Richard Bolkey rbol...@gmail.com wrote:

 Hi Reza,

 That was the fix we needed. After sorting, the transposed entries are gone!

 Thanks a bunch,
 rick

 On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Richard,
 One reason that could be happening is that the rows of your matrix are
 using SparseVectors, but the entries in your vectors aren't sorted by
 index. Is that the case? Sparse Vectors
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
 need sorted indices.
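
 For example, a minimal sketch of building a SparseVector from unsorted
 (index, value) pairs by sorting them first (the sizes here are illustrative):

 import org.apache.spark.mllib.linalg.Vectors

 // Sort the pairs by index before building the vector.
 val pairs = Seq((4, 0.5), (1, 2.0), (7, 1.0)).sortBy(_._1)
 val vec = Vectors.sparse(10, pairs.map(_._1).toArray, pairs.map(_._2).toArray)
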
 Reza

 On Sat, May 9, 2015 at 8:51 AM, Richard Bolkey rbol...@gmail.com wrote:

 Hi Reza,

 After a bit of digging, I had my previous issue a little bit wrong.
 We're not getting duplicate (i,j) entries, but we are getting transposed
 entries (i,j) and (j,i) with potentially different scores. We assumed the
 output would be a triangular matrix. Still, let me know if that's expected.
 A transposed entry occurs for about 5% of our output entries.

 scala> matrix.entries.filter(x => (x.i, x.j) == (22769,539029)).collect()
 res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
 Array(MatrixEntry(22769,539029,0.00453050595770095))

 scala> matrix.entries.filter(x => (x.i, x.j) == (539029,22769)).collect()
 res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
 Array(MatrixEntry(539029,22769,0.002265252978850475))

 I saved a subset of vectors to object files that replicates the issue.
 It's about 300mb. Should I try to whittle that down some more? What would
 be the best way to get that to you?

 Many thanks,
 Rick

 On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh r...@databricks.com wrote:

 This shouldn't be happening, do you have an example to reproduce it?

 On Thu, May 7, 2015 at 4:17 PM, rbolkey rbol...@gmail.com wrote:

 Hi,

 I have a question regarding one of the oddities we encountered while
 running
 mllib's column similarities operation. When we examine the output, we
 find
 duplicate matrix entries (the same i,j). Sometimes the entries have
 the same
 value/similarity score, but they're frequently different too.

 Is this a known issue? An artifact of the probabilistic nature of the
 output? Which output score should we trust (lower vs higher one when
 different)? We're using a threshold of 0.3, and running Spark 1.3.1 on
 a 10
 node cluster.

 Thanks
 Rick



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-entries-in-output-of-mllib-column-similarities-tp22807.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.









Re: Duplicate entries in output of mllib column similarities

2015-05-07 Thread Reza Zadeh
This shouldn't be happening, do you have an example to reproduce it?

On Thu, May 7, 2015 at 4:17 PM, rbolkey rbol...@gmail.com wrote:

 Hi,

 I have a question regarding one of the oddities we encountered while
 running
 mllib's column similarities operation. When we examine the output, we find
 duplicate matrix entries (the same i,j). Sometimes the entries have the
 same
 value/similarity score, but they're frequently different too.

 Is this a known issue? An artifact of the probabilistic nature of the
 output? Which output score should we trust (lower vs higher one when
 different)? We're using a threshold of 0.3, and running Spark 1.3.1 on a 10
 node cluster.

 Thanks
 Rick



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-entries-in-output-of-mllib-column-similarities-tp22807.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Understanding Spark/MLlib failures

2015-04-23 Thread Reza Zadeh
Hi Andrew,

The computePrincipalComponents method of RowMatrix is currently constrained to
tall-and-skinny matrices. Your matrix is barely above the skinny
requirement (10k columns), though the number of rows is fine.

What are you looking to do with the principal components? If unnormalized
PCA is OK for your application, you can instead run RowMatrix.computeSVD,
and use the 'V' matrix, which can be used the same way as the principal
components. The computeSVD method can handle square matrices, so it should
be able to handle your matrix.
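
A minimal sketch of that route (the toy data and the choice of k are just
illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy stand-in for the real 200,000 x 10,000 matrix.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0, 4.0),
  Vectors.dense(2.0, 4.0, 6.0, 8.0),
  Vectors.dense(1.0, 0.0, 1.0, 0.0)))
val mat = new RowMatrix(rows)

// Keep the top k singular vectors; U is not needed to get the components.
val svd = mat.computeSVD(2, computeU = false)

// svd.V (numCols x k) plays the role of unnormalized principal components;
// project the data onto them just as you would with computePrincipalComponents.
val projected = mat.multiply(svd.V)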

Reza

On Thu, Apr 23, 2015 at 4:11 PM, aleverentz andylevere...@fico.com wrote:

 [My apologies if this is a re-post.  I wasn't subscribed the first time I
 sent this message, and I'm hoping this second message will get through.]

 I’ve been using Spark 1.3.0 and MLlib for some machine learning tasks.  In
 a
 fit of blind optimism, I decided to try running MLlib’s Principal
 Components
 Analayis (PCA) on a dataset with approximately 10,000 columns and 200,000
 rows.

 The Spark job has been running for about 5 hours on a small cluster, and it
 has been stuck on a particular job (treeAggregate at RowMatrix.scala:119)
 for most of that time.  The treeAggregate job is now on retry 5, and
 after
 each failure it seems that the next retry uses a smaller number of tasks.
 (Initially, there were around 80 tasks; later it was down to 50, then 42;
 now it’s down to 16.)  The web UI shows the following error under failed
 stages:  org.apache.spark.shuffle.MetadataFetchFailedException: Missing
 an
 output location for shuffle 1.

 This raises a few questions:

 1. What does missing an output location for shuffle 1 mean?  I’m guessing
 this cryptic error message is indicative of some more fundamental problem
 (out of memory? out of disk space?), but I’m not sure how to diagnose it.

 2. Why do subsequent retries use fewer and fewer tasks?  Does this mean
 that
 the algorithm is actually making progress?  Or is the scheduler just
 performing some kind of repartitioning and starting over from scratch?
 (Also, If the algorithm is in fact making progress, should I expect it to
 finish eventually?  Or do repeated failures generally indicate that the
 cluster is too small to perform the given task?)

 3. Is it reasonable to expect that I could get PCA to run on this dataset
 using the same cluster simply by changing some configuration parameters?
 Or
 is a larger cluster with significantly more resources per node the only way
 around this problem?

 4. In general, are there any tips for diagnosing performance issues like
 the
 one above?  I've spent some time trying to get a few different algorithms
 to
 scale to larger and larger datasets, and whenever I run into a failure, I'd
 like to be able to identify the bottleneck that is preventing further
 scaling.  Any general advice for doing that kind of detective work would be
 much appreciated.

 Thanks,

 ~ Andrew






 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-Spark-MLlib-failures-tp22641.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Benchmarking col vs row similarities

2015-04-10 Thread Reza Zadeh
You should pull in this PR: https://github.com/apache/spark/pull/5364
It should resolve that. It is in master.
Best,
Reza

On Fri, Apr 10, 2015 at 8:32 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi,

 I am benchmarking row vs col similarity flow on 60M x 10M matrices...

 Details are in this JIRA:

 https://issues.apache.org/jira/browse/SPARK-4823

 For testing I am using Netflix data since the structure is very similar:
 50k x 17K near dense similarities..

 Items are 17K and so I did not activate threshold in colSimilarities yet
 (it's at 1e-4)

 Running Spark on YARN with 20 nodes, 4 cores, 16 gb, shuffle threshold 0.6

 I keep getting these from the col similarity code in the 1.2 branch. Should I
 use master?

 15/04/10 11:08:36 WARN BlockManagerMasterActor: Removing BlockManager
 BlockManagerId(5, tblpmidn36adv-hdp.tdc.vzwcorp.com, 44410) with no
 recent heart beats: 50315ms exceeds 45000ms

 15/04/10 11:09:12 ERROR ContextCleaner: Error cleaning broadcast 1012

 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)

 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)

 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)

 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)

 at scala.concurrent.Await$.result(package.scala:107)

 at
 org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)

 at
 org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)

 at
 org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)

 at
 org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)

 at
 org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)

 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)

 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1468)

 at org.apache.spark.ContextCleaner.org
 $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)

 at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)

 I know how to increase the 45000 ms timeout to something higher, as it is a
 compute-heavy job, but in YARN I am not sure how to set that config.

 But in any case that's a warning and should not affect the job...

 Any idea how to improve the runtime other than increasing threshold to
 1e-2 ? I will do that next

 Was the Netflix dataset benchmarked for the col-based similarity flow before?
 The similarity output from this dataset becomes near dense, so it is
 interesting for stress testing...

 Thanks.

 Deb



Re: Using DIMSUM with ids

2015-04-06 Thread Reza Zadeh
Right now dimsum is meant to be used for tall and skinny matrices, and so
columnSimilarities() returns similar columns, not rows. We are working on
adding an efficient row similarity as well, tracked by this JIRA:
https://issues.apache.org/jira/browse/SPARK-4823
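
In the meantime, whichever routine ends up producing the similarities, you can
carry the ids along yourself. A rough sketch (the (id, vector) data is
illustrative, and it assumes the id-to-index map fits on the driver):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix

// Toy (id, vector) input, as in the question.
val data = sc.parallelize(Seq(
  ("id1", Vectors.dense(1.0, 0.0)),
  ("id2", Vectors.dense(0.9, 0.1)),
  ("id3", Vectors.dense(0.0, 1.0))))

// Remember which positional index each id got.
val idByIdx = sc.broadcast(
  data.zipWithIndex().map { case ((id, _), idx) => (idx, id) }.collectAsMap())

// If sims is a CoordinateMatrix of similarities whose i/j refer to those
// positional indices, map the indices back to ids:
def withIds(sims: CoordinateMatrix) =
  sims.entries.map(e => (idByIdx.value(e.i), idByIdx.value(e.j), e.value))
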
Reza

On Mon, Apr 6, 2015 at 6:08 AM, James alcaid1...@gmail.com wrote:

 The example below illustrates how to use the DIMSUM algorithm to calculate
 the similarity between each pair of rows and output row pairs with cosine
 similarity that is not less than a threshold.


 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala


 But what if I hope to hold an Id of each row, which means the input file
 is:

 id1 vector1
 id2 vector2
 id3 vector3
 ...

 And we hope to output

 id1 id2 sim(id1, id2)
 id1 id3 sim(id1, id3)
 ...


 Alcaid



Re: Need a spark mllib tutorial

2015-04-02 Thread Reza Zadeh
Here's one:
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
Reza

On Thu, Apr 2, 2015 at 12:51 PM, Phani Yadavilli -X (pyadavil) 
pyada...@cisco.com wrote:

  Hi,



 I am new to Spark MLlib and I was browsing through the internet for
 good tutorials more advanced than the Spark documentation examples, but I
 could not find any. Need help.



 Regards

 Phani Kumar



Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Reza Zadeh
How many dimensions does your data have? The size of the k-means model is k
* d, where d is the dimension of the data.

Since you're using k=1000, if your data has dimension higher than say,
10,000, you will have trouble, because k*d doubles have to fit in the
driver.
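
It is also worth checking that the input is actually spread across the cluster
before training. A minimal sketch, with a placeholder path, partition count and
parameters:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load with an explicit partition count (roughly 2-4x the total cores) and
// repartition before training so all executors get work.
val numPartitions = 32
val data = sc.textFile("hdfs:///data/points.txt", numPartitions)
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .repartition(numPartitions)
  .cache()

// k * d doubles (5000 * d for the 5000 clusters mentioned) must still fit in the driver.
val model = KMeans.train(data, 5000, 20)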

Reza

On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen davidshe...@gmail.com wrote:

 I have put more detail of my problem at
 http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed

 It would be really appreciated if you could help me take a look at this
 problem. I have tried various settings and ways to load/partition my data,
 but I just cannot get rid of that long pause.


 Thanks,
 David






 On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen davidshe...@gmail.com wrote:

 Yes, I have done repartition.

 I tried to repartition to the number of cores in my cluster. Not
 helping...
 I tried to repartition to the number of centroids (k value). Not
 helping...


 On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley jos...@databricks.com
 wrote:

 Can you try specifying the number of partitions when you load the data
 to equal the number of executors?  If your ETL changes the number of
 partitions, you can also repartition before calling KMeans.


 On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have a large data set, and I expects to get 5000 clusters.

 I load the raw data, convert them into DenseVector; then I did
 repartition and cache; finally I give the RDD[Vector] to KMeans.train().

 Now the job is running, and data are loaded. But according to the Spark
 UI, all data are loaded onto one executor. I checked that executor, and its
 CPU workload is very low. I think it is using only 1 of the 8 cores. And
 all other 3 executors are at rest.

 Did I miss something? Is it possible to distribute the workload to all
 4 executors?


 Thanks,
 David






Re: difference in PCA of MLlib vs H2o in R

2015-03-24 Thread Reza Zadeh
Great!

On Tue, Mar 24, 2015 at 2:53 PM, roni roni.epi...@gmail.com wrote:

 Reza,
 That SVD.v matches the H2o and R prComp (non-centered)
 Thanks
 -R

 On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote:

 (Oh sorry, I've only been thinking of TallSkinnySVD)

 On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote:
  If you want to do a nonstandard (or uncentered) PCA, you can call
  computeSVD on RowMatrix, and look at the resulting 'V' Matrix.
 
  That should match the output of the other two systems.
 
  Reza
 
  On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote:
 
  Those implementations are computing an SVD of the input matrix
  directly, and while you generally need the columns to have mean 0, you
  can turn that off with the options you cite.
 
  I don't think this is possible in the MLlib implementation, since it
  is computing the principal components by computing eigenvectors of the
  covariance matrix. The means inherently don't matter either way in
  this computation.
 
  On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote:
   I am trying to compute PCA  using  computePrincipalComponents.
   I also computed PCA using h2o in R and R's prcomp. The answers I get from
   H2o and R's prcomp (non h2o) are the same when I set the options for H2o as
   standardized=FALSE and for R's prcomp as center = FALSE.

   How do I make sure that the settings for MLlib PCA are the same as I am
   using for H2o or prcomp?
  
   Thanks
   Roni
 
 
 





Re: difference in PCA of MLlib vs H2o in R

2015-03-24 Thread Reza Zadeh
If you want to do a nonstandard (or uncentered) PCA, you can call
computeSVD on RowMatrix, and look at the resulting 'V' Matrix.

That should match the output of the other two systems.

Reza

On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote:

 Those implementations are computing an SVD of the input matrix
 directly, and while you generally need the columns to have mean 0, you
 can turn that off with the options you cite.

 I don't think this is possible in the MLlib implementation, since it
 is computing the principal components by computing eigenvectors of the
 covariance matrix. The means inherently don't matter either way in
 this computation.

 On Tue, Mar 24, 2015 at 6:13 AM, roni roni.epi...@gmail.com wrote:
  I am trying to compute PCA  using  computePrincipalComponents.
   I also computed PCA using h2o in R and R's prcomp. The answers I get from
   H2o and R's prcomp (non h2o) are the same when I set the options for H2o as
   standardized=FALSE and for R's prcomp as center = FALSE.

   How do I make sure that the settings for MLlib PCA are the same as I am
   using for H2o or prcomp?
 
  Thanks
  Roni





Re: How to do nested foreach with RDD

2015-03-22 Thread Reza Zadeh
You can do this with the 'cartesian' product method on RDD. For example:

val rdd1 = ...
val rdd2 = ...

val combinations = rdd1.cartesian(rdd2).filter { case (a, b) => a < b }

Reza

On Sat, Mar 21, 2015 at 10:37 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have two big RDDs, and I need to do some math against each pair of them.
 Traditionally, it is like a nested for-loop. But for RDDs, it causes a nested
 RDD, which is prohibited.

 Currently, I am collecting one of them, then doing a nested for-loop to
 avoid nesting RDDs. But I would like to know if there's a Spark way to do this.


 Thanks,
 David




Re: Column Similarity using DIMSUM

2015-03-19 Thread Reza Zadeh
Hi Manish,
With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn.
When a single row is dense, that can end up overwhelming a machine. You can
push that up with more RAM, but note that DIMSUM is meant for tall and
skinny matrices: it scales linearly (and across the cluster) with the number of
rows, but still quadratically with the number of columns. I will be updating the
documentation to make this clear.
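
For reference, a minimal sketch of the call being discussed (the toy matrix
stands in for the real 43345 x 56431 one):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0, 0.0),
  Vectors.dense(1.0, 0.9, 0.1),
  Vectors.dense(0.0, 0.1, 1.0)))
val mat = new RowMatrix(rows)

// With a threshold, DIMSUM samples away column pairs expected to fall below
// it, which bounds the shuffle and output size.
val sims = mat.columnSimilarities(0.5)

// The result is a sparse, upper-triangular CoordinateMatrix of (i, j, cosine).
sims.entries.collect().foreach(println)
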
Best,
Reza

On Thu, Mar 19, 2015 at 3:46 AM, Manish Gupta 8 mgupt...@sapient.com
wrote:

  Hi Reza,



 *Behavior*:

 · I tried running the job with different thresholds - 0.1, 0.5,
 5, 20 and 100. Every time, the job got stuck at mapPartitionsWithIndex at
 RowMatrix.scala:522 with all workers running at 100% CPU. There is hardly any
 shuffle read/write happening. And after some time, “ERROR
 YarnClientClusterScheduler: Lost executor” starts showing (maybe because the
 nodes are running at 100% CPU).

 · For threshold 200+ (tried up to 1000) it gave an error (here
  was different for different thresholds)

 Exception in thread main java.lang.IllegalArgumentException: requirement
 failed: Oversampling should be greater than 1: 0.

 at scala.Predef$.require(Predef.scala:233)

 at
 org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)

 at
 org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)

 at
 EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)

 at EntitySimilarity$.main(EntitySimilarity.scala:80)

 at EntitySimilarity.main(EntitySimilarity.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
 Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:606)

 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)

 at
 org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)

 at
 org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 · If I get rid of frequently occurring attributes and keep only
 those attributes which occur in less than 2% of entities, then the job doesn’t
 get stuck / fail.



 *Data  environment*:

 · RowMatrix of size 43345 X 56431

 · In the matrix there are a couple of rows whose value is the same in
 up to 50% of the columns (frequently occurring attributes).

 · I am running this on one of our Dev clusters, running CDH
 5.3.0 with 5 data nodes (each 4-core and 16GB RAM).



 My question – Do you think this is a hardware size issue and we should
 test it on larger machines?



 Regards,

 Manish



 *From:* Manish Gupta 8 [mailto:mgupt...@sapient.com]
 *Sent:* Wednesday, March 18, 2015 11:20 PM
 *To:* Reza Zadeh
 *Cc:* user@spark.apache.org
 *Subject:* RE: Column Similarity using DIMSUM



 Hi Reza,



 I have tried threshold to be only in the range of 0 to 1. I was not aware
 that threshold can be set to above 1.

 Will try and update.



 Thank You



 - Manish



 *From:* Reza Zadeh [mailto:r...@databricks.com r...@databricks.com]
 *Sent:* Wednesday, March 18, 2015 10:55 PM
 *To:* Manish Gupta 8
 *Cc:* user@spark.apache.org
 *Subject:* Re: Column Similarity using DIMSUM



 Hi Manish,

 Did you try calling columnSimilarities(threshold) with different threshold
 values? Try threshold values of 0.1, 0.5, 1, 20, and higher.

 Best,

 Reza



 On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 mgupt...@sapient.com
 wrote:

   Hi,



 I am running Column Similarity (All Pairs Similarity using DIMSUM) in
 Spark on a dataset that looks like (Entity, Attribute, Value) after
 transforming the same to a row-oriented dense matrix format (one line per
 Attribute, one column per Entity, each cell with normalized value – between
 0 and 1).



 It runs extremely fast in computing similarities between Entities in most
 of the cases, but if there is even a single attribute which is frequently
 occurring across the entities (say in 30% of entities), the job falls apart.
 The whole job gets stuck and worker nodes start running at 100% CPU without
 making any progress on the job stage. If the dataset is very small (in the
 range of 1000 Entities X 500 attributes (some frequently occurring)) the
 job finishes but takes too long (sometimes it gives GC errors too).



 If none of the attributes is frequently occurring (all < 2%), then the job runs
 in a lightning-fast manner (even for 100 Entities X 1 attributes)
 and results are very accurate.



 I am running Spark 1.2.0-cdh5.3.0 on 11 node cluster each having 4 cores
 and 16GB of RAM.

Re: Column Similarity using DIMSUM

2015-03-18 Thread Reza Zadeh
Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold
values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 mgupt...@sapient.com
wrote:

   Hi,



 I am running Column Similarity (All Pairs Similarity using DIMSUM) in
 Spark on a dataset that looks like (Entity, Attribute, Value) after
 transforming the same to a row-oriented dense matrix format (one line per
 Attribute, one column per Entity, each cell with normalized value – between
 0 and 1).



 It runs extremely fast in computing similarities between Entities in most
 of the cases, but if there is even a single attribute which is frequently
 occurring across the entities (say in 30% of entities), the job falls apart.
 The whole job gets stuck and worker nodes start running at 100% CPU without
 making any progress on the job stage. If the dataset is very small (in the
 range of 1000 Entities X 500 attributes (some frequently occurring)) the
 job finishes but takes too long (sometimes it gives GC errors too).



 If none of the attributes is frequently occurring (all < 2%), then the job runs
 in a lightning-fast manner (even for 100 Entities X 1 attributes)
 and results are very accurate.



 I am running Spark 1.2.0-cdh5.3.0 on 11 node cluster each having 4 cores
 and 16GB of RAM.



 My question is - *Is this behavior expected for datasets where some
 Attributes frequently occur*?



 Thanks,

 Manish Gupta







Re: SVD transform of large matrix with MLlib

2015-03-11 Thread Reza Zadeh
Answers:
databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html
Reza

On Wed, Mar 11, 2015 at 2:33 PM, sergunok ser...@gmail.com wrote:

 Has somebody used SVD from MLlib for a very large (like 10^6 x 10^7) sparse
 matrix?
 How much time did it take?
 What implementation of SVD is used in MLlib?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SVD-transform-of-large-matrix-with-MLlib-tp22005.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Reza Zadeh
Hi Sab,
The current method is optimized for having many rows and few columns. In
your case it is exactly the opposite. We are working on your case, tracked
by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823
Your case is very common, so I will put some time into building it.

In the meantime, if you're looking for groups of similar points, consider
using K-means - it will get you clusters of similar rows with euclidean
distance.

Best,
Reza


On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.com wrote:

 ​Hi Reza
 ​​
 I see that ((int, int), double) pairs are generated for any combination
 that meets the criteria controlled by the threshold. But assuming a simple
 1x10K matrix, that means I would need at least 12GB of memory per executor for
 the flat map just for these pairs, excluding any other overhead. Is that
 correct? How can we make this scale for even larger n (when m stays small),
 like 100 x 5 million? One way is by using higher thresholds. The other is that
 I use a SparseVector to begin with. Are there any other optimizations I can
 take advantage of?

 ​Thanks
 Sab




Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sab,
In this dense case, the output will contain 10,000 x 10,000 entries, i.e. 100
million doubles, which doesn't fit in 1GB with overheads.
For a dense matrix, similarColumns() scales quadratically in the number of
columns, so you need more memory across the cluster.
Reza


On Sun, Mar 1, 2015 at 7:06 PM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.com wrote:

 Sorry, I actually meant 30 x 1 matrix (missed a 0)


 Regards
 Sab




Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sabarish,

Works fine for me with less than those settings (30x1000 dense matrix, 1GB
driver, 1GB executor):

bin/spark-shell --driver-memory 1G --executor-memory 1G

Then running the following finished without trouble and in a few seconds.
Are you sure your driver is actually getting the RAM you think you gave it?

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Create 30x1000 matrix of random values
val rows = sc.parallelize(1 to 30, 4).map { line =>
  val values = Array.tabulate(1000)(x => scala.math.random)
  Vectors.dense(values)
}.cache()
val mat = new RowMatrix(rows)

// Compute similar columns perfectly, with brute force.
val exact = mat.columnSimilarities().entries.map(x => x.value).sum()



On Sun, Mar 1, 2015 at 3:31 PM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.com wrote:

 I am trying to compute column similarities on a 30x1000 RowMatrix of
 DenseVectors. The size of the input RDD is 3.1MB and it's all in one
 partition. I am running on a single node of 15G and giving the driver 1G
 and the executor 9G. This is on a single-node hadoop. In the first attempt
 the BlockManager doesn't respond within the heartbeat interval. In the
 second attempt I am seeing a GC overhead limit exceeded error. And it is
 almost always in the RowMatrix.columnSimilaritiesDIMSUM ->
 mapPartitionsWithIndex (line 570)

 java.lang.OutOfMemoryError: GC overhead limit exceeded
 at
 org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$19$$anonfun$apply$2.apply(RowMatrix.scala:570)
 at
 org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$19$$anonfun$apply$2.apply(RowMatrix.scala:528)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)


 It also really seems to be running out of memory. I am seeing the
 following in the attempt log
 Heap
  PSYoungGen  total 2752512K, used 2359296K
   eden space 2359296K, 100% used
   from space 393216K, 0% used
   to   space 393216K, 0% used
  ParOldGen   total 6291456K, used 6291376K [0x00058000,
 0x0007, 0x0007)
   object space 6291456K, 99% used
  Metaspace   used 39225K, capacity 39558K, committed 39904K, reserved
 1083392K
   class spaceused 5736K, capacity 5794K, committed 5888K, reserved
 1048576K​

 ​What could be going wrong?

 Regards
 Sab



Re: Is spark streaming +MlLib for online learning?

2015-02-18 Thread Reza Zadeh
This feature request is already being tracked:
https://issues.apache.org/jira/browse/SPARK-4981
Aiming for 1.4
Best,
Reza

On Wed, Feb 18, 2015 at 2:40 AM, mucaho muc...@yahoo.com wrote:

 Hi

 What is the general consensus/roadmap for implementing additional online /
 streamed trainable models?

 Apache Spark 1.2.1 currently supports streaming linear regression and
 clustering, although other streaming linear methods are planned according to
 the issue tracker.
 However, I cannot find any details on the issue tracker about online
 training of a collaborative filter. Judging from another mailing list
 discussion
 http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ce07aa61e-eeb9-4ded-be3e-3f04003e4...@storefront.be%3E
 incremental training should be possible for ALS. Any plans for the future?

 Regards
 mucaho



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-spark-streaming-MlLib-for-online-learning-tp19701p21698.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: what is behind matrix multiplications?

2015-02-11 Thread Reza Zadeh
Yes, the local matrix is broadcast to each worker. Here is the code:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L407

In 1.3 we will have Block matrix multiplication too, which will allow
distributed matrix multiplication. This is thanks to Burak Yavuz from
Stanford and now at Databricks.
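
A minimal sketch of that call, with toy sizes:

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Distributed 3 x 3 matrix, one vector per row.
val x = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0))))

// Local 3 x 2 matrix (values in column-major order); this is what gets broadcast.
val w = Matrices.dense(3, 2, Array(1.0, 0.0, 0.0, 0.0, 1.0, 0.0))

// The product is again a distributed RowMatrix (3 x 2).
val result = x.multiply(w)
result.rows.collect().foreach(println)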

On Wed, Feb 11, 2015 at 4:12 AM, Donbeo lucapug...@gmail.com wrote:

 In Spark it is possible to multiply a distributed matrix x and a local
 matrix w:

 val x = new RowMatrix(distributed_data)
 val w: Matrix = Matrices.dense(local_data)
 val result = x.multiply(w)

 What is the process behind this command?  Is the matrix w replicated on
 each
 worker?  Is there a reference that I can use for this?

 Thanks  a lot!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/what-is-behind-matrix-multiplications-tp21599.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: foreachActive functionality

2015-01-25 Thread Reza Zadeh
The idea is to unify the code path for dense and sparse vector operations,
which makes the codebase easier to maintain. By handling (index, value)
tuples, you can let the foreachActive method take care of checking if the
vector is sparse or dense, and running a foreach over the values.

On Sun, Jan 25, 2015 at 8:18 AM, kundan kumar iitr.kun...@gmail.com wrote:

 Can someone help me understand the usage of the foreachActive function
 introduced for Vectors?

 I am trying to understand its usage in MultivariateOnlineSummarizer class
 for summary statistics.


 sample.foreachActive { (index, value) =>
   if (value != 0.0) {
     if (currMax(index) < value) {
       currMax(index) = value
     }
     if (currMin(index) > value) {
       currMin(index) = value
     }

     val prevMean = currMean(index)
     val diff = value - prevMean
     currMean(index) = prevMean + diff / (nnz(index) + 1.0)
     currM2n(index) += (value - currMean(index)) * diff
     currM2(index) += value * value
     currL1(index) += math.abs(value)

     nnz(index) += 1.0
   }
 }

 Regards,
 Kundan





Re: Row similarities

2015-01-17 Thread Reza Zadeh
Pat, columnSimilarities is what that blog post is about, and is already
part of Spark 1.2.

rowSimilarities in a RowMatrix is a little more tricky because you can't
transpose a RowMatrix easily, and is being tracked by this JIRA:
https://issues.apache.org/jira/browse/SPARK-4823

Andrew, sometimes (not always) it's OK to transpose a RowMatrix, if for
example the number of rows in your RowMatrix is less than 1m, you can
transpose it and use rowSimilarities.


On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel p...@occamsmachete.com wrote:

 BTW it looks like row and column similarities (cosine based) are coming to
 MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the
 master yet. Does anyone know the status?

 See:
 https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

 Also the method for computation reduction (make it less than O(n^2)) seems
 rooted in cosine. A different computation reduction method is used in the
 Mahout code tied to LLR. Seems like we should get these together.

 On Jan 17, 2015, at 9:37 AM, Andrew Musselman andrew.mussel...@gmail.com
 wrote:

 Excellent, thanks Pat.

 On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote:

 Mahout’s Spark implementation of rowsimilarity is in the Scala
 SimilarityAnalysis class. It actually does either row or column similarity
 but only supports LLR at present. It does [AA’] for columns or [A’A] for
 rows first then calculates the distance (LLR) for non-zero elements. This
 is a major optimization for sparse matrices. As I recall the old hadoop
 code only did this for half the matrix since it’s symmetric but that
 optimization isn’t in the current code because the downsampling is done as
 LLR is calculated, so the entire similarity matrix is never actually
 calculated unless you disable downsampling.

 The primary use is for recommenders but I’ve used it (in the test suite)
 for row-wise text token similarity too.

 On Jan 17, 2015, at 9:00 AM, Andrew Musselman andrew.mussel...@gmail.com
 wrote:

 Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259
 and poking around to see how to do things.

 https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259

 On Jan 17, 2015, at 8:35 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

 Andrew, u would be better off using Mahout's RowSimilarityJob for what u r
 trying to accomplish.

  1.  It does give u pair-wise distances
  2.  U can specify the Distance measure u r looking to use
  3.  There's the old MapReduce impl and the Spark DSL impl per ur
 preference.

   --
  *From:* Andrew Musselman andrew.mussel...@gmail.com
 *To:* Reza Zadeh r...@databricks.com
 *Cc:* user user@spark.apache.org
 *Sent:* Saturday, January 17, 2015 11:29 AM
 *Subject:* Re: Row similarities

 Thanks Reza, interesting approach.  I think what I actually want is to
 calculate pair-wise distance, on second thought.  Is there a pattern for
 that?



 On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:

 You can use K-means
 https://spark.apache.org/docs/latest/mllib-clustering.html with a
 suitably large k. Each cluster should correspond to rows that are similar
 to one another.

 On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:

 What's a good way to calculate similarities between all vector-rows in a
 matrix or RDD[Vector]?

 I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
 going down a good path to transpose a matrix in order to run that.









Re: Row similarities

2015-01-17 Thread Reza Zadeh
We're focused on providing block matrices, which makes transposition
simple: https://issues.apache.org/jira/browse/SPARK-3434

On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel p...@occamsmachete.com wrote:

 In the Mahout Spark R-like DSL [A’A] and [AA’] doesn’t actually do a
 transpose—it’s optimized out. Mahout has had a stand alone row matrix
 transpose since day 1 and supports it in the Spark version. Can’t really do
 matrix algebra without it even though it’s often possible to optimize it
 away.

 Row similarity with LLR is much simpler than cosine since you only need
 non-zero sums for column, row, and matrix elements so rowSimilarity is
 implemented in Mahout for Spark. Full blown row similarity including all
 the different similarity methods (long since implemented in hadoop
 mapreduce) hasn’t been moved to spark yet.

 Yep, rows are not covered in the blog, my mistake. Too bad it has a lot of
 uses and can at very least be optimized for output matrix symmetry.

 On Jan 17, 2015, at 11:44 AM, Andrew Musselman andrew.mussel...@gmail.com
 wrote:

 Yeah okay, thanks.

 On Jan 17, 2015, at 11:15 AM, Reza Zadeh r...@databricks.com wrote:

 Pat, columnSimilarities is what that blog post is about, and is already
 part of Spark 1.2.

 rowSimilarities in a RowMatrix is a little more tricky because you can't
 transpose a RowMatrix easily, and is being tracked by this JIRA:
 https://issues.apache.org/jira/browse/SPARK-4823

 Andrew, sometimes (not always) it's OK to transpose a RowMatrix, if for
 example the number of rows in your RowMatrix is less than 1m, you can
 transpose it and use rowSimilarities.


 On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel p...@occamsmachete.com
 wrote:

 BTW it looks like row and column similarities (cosine based) are coming
 to MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in
 the master yet. Does anyone know the status?

 See:
 https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

 Also the method for computation reduction (make it less than O(n^2))
 seems rooted in cosine. A different computation reduction method is used in
 the Mahout code tied to LLR. Seems like we should get these together.

 On Jan 17, 2015, at 9:37 AM, Andrew Musselman andrew.mussel...@gmail.com
 wrote:

 Excellent, thanks Pat.

 On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote:

 Mahout’s Spark implementation of rowsimilarity is in the Scala
 SimilarityAnalysis class. It actually does either row or column similarity
 but only supports LLR at present. It does [AA’] for columns or [A’A] for
 rows first then calculates the distance (LLR) for non-zero elements. This
 is a major optimization for sparse matrices. As I recall the old hadoop
 code only did this for half the matrix since it’s symmetric but that
 optimization isn’t in the current code because the downsampling is done as
 LLR is calculated, so the entire similarity matrix is never actually
 calculated unless you disable downsampling.

 The primary use is for recommenders but I’ve used it (in the test suite)
 for row-wise text token similarity too.

 On Jan 17, 2015, at 9:00 AM, Andrew Musselman andrew.mussel...@gmail.com
 wrote:

 Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259
 and poking around to see how to do things.

 https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259

 On Jan 17, 2015, at 8:35 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

 Andrew, u would be better off using Mahout's RowSimilarityJob for what u
 r trying to accomplish.

  1.  It does give u pair-wise distances
  2.  U can specify the Distance measure u r looking to use
  3.  There's the old MapReduce impl and the Spark DSL impl per ur
 preference.

   --
  *From:* Andrew Musselman andrew.mussel...@gmail.com
 *To:* Reza Zadeh r...@databricks.com
 *Cc:* user user@spark.apache.org
 *Sent:* Saturday, January 17, 2015 11:29 AM
 *Subject:* Re: Row similarities

 Thanks Reza, interesting approach.  I think what I actually want is to
 calculate pair-wise distance, on second thought.  Is there a pattern for
 that?



 On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:

 You can use K-means
 https://spark.apache.org/docs/latest/mllib-clustering.html with a
 suitably large k. Each cluster should correspond to rows that are similar
 to one another.

 On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:

 What's a good way to calculate similarities between all vector-rows in a
 matrix or RDD[Vector]?

 I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
 going down a good path to transpose a matrix in order to run that.











Re: Row similarities

2015-01-16 Thread Reza Zadeh
You can use K-means
https://spark.apache.org/docs/latest/mllib-clustering.html with a
suitably large k. Each cluster should correspond to rows that are similar
to one another.

On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:

 What's a good way to calculate similarities between all vector-rows in a
 matrix or RDD[Vector]?

 I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
 going down a good path to transpose a matrix in order to run that.



Re: Broadcast joins on RDD

2015-01-12 Thread Reza Zadeh
First, you should collect().toMap() the small RDD, then you should use
broadcast followed by a map to do a map-side join
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
(slide
10 has an example).

Spark SQL also does it by default for tables that are smaller than the
spark.sql.autoBroadcastJoinThreshold setting (by default 10 KB, which is
really small, but you can bump this up with set
spark.sql.autoBroadcastJoinThreshold=100 for example).
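
A minimal sketch of the map-side join with toy data (an inner join is shown):

// Toy RDDs keyed by Int.
val largeRdd = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (2, "d")))
val smallRdd = sc.parallelize(Seq((1, "x"), (2, "y")))

// Collect the small side to the driver and broadcast it to every executor.
val smallMap = sc.broadcast(smallRdd.collect().toMap)

// Map-side inner join: the large RDD is never shuffled.
val joined = largeRdd.flatMap { case (k, v) =>
  smallMap.value.get(k).map(w => (k, (v, w)))
}
joined.collect().foreach(println)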

On Mon, Jan 12, 2015 at 3:15 PM, Pala M Muthaia mchett...@rocketfuelinc.com
 wrote:

 Hi,


 How do i do broadcast/map join on RDDs? I have a large RDD that i want to
 inner join with a small RDD. Instead of having the large RDD repartitioned
 and shuffled for join, i would rather send a copy of a small RDD to each
 task, and then perform the join locally.

 How would i specify this in Spark code? I didn't find much documentation
 online. I attempted to create a broadcast variable out of the small RDD and
 then access that in the join operator:

 largeRdd.join(smallRddBroadCastVar.value)

 but that didn't work as expected ( I found that all rows with same key
 were on same task)

 I am using Spark version 1.0.1


 Thanks,
 pala





Re: [mllib] GradientDescent requires huge memory for storing weight vector

2015-01-12 Thread Reza Zadeh
I guess you're not using too many features (e.g. < 10m), just that hashing
the index makes it look that way, is that correct?

If so, the simple dictionary that maps your feature index -> rank can be
broadcast and used everywhere, so you can pass mllib just the feature's
rank as its index.
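
A rough sketch of that remapping; the toy data and names are illustrative:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Toy (label, (hashedFeatureIndex, value)) data with huge, sparse indices.
val raw = sc.parallelize(Seq(
  (1.0, Seq((1234567, 1.0), (98765432, 2.0))),
  (0.0, Seq((98765432, 1.5)))))

// Build the hashedIndex -> rank dictionary on the driver and broadcast it.
val ranks = raw.flatMap(_._2.map(_._1)).distinct().collect().sorted.zipWithIndex.toMap
val ranksBc = sc.broadcast(ranks)
val numFeatures = ranks.size

// Re-index features so the weight vector only needs numFeatures entries.
val points = raw.map { case (label, feats) =>
  val reindexed = feats.map { case (i, v) => (ranksBc.value(i), v) }
  LabeledPoint(label, Vectors.sparse(numFeatures, reindexed))
}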

Reza

On Mon, Jan 12, 2015 at 4:26 PM, Tianshuo Deng td...@twitter.com.invalid
wrote:

 Hi,
 Currently in GradientDescent.scala, weights is constructed as a dense
 vector:

 initialWeights = Vectors.dense(new Array[Double](numFeatures))

 And the numFeatures is determined in the loadLibSVMFile as the max index
 of features.

 But in the case of using a hash function to compute feature indices, this
 results in a huge dense vector being generated, taking lots of memory space.

 Any suggestions?




Re: RowMatrix.multiply() ?

2015-01-09 Thread Reza Zadeh
Yes that is the correct JIRA. It should make it to 1.3.
Best,
Reza

On Fri, Jan 9, 2015 at 11:13 AM, Adrian Mocanu amoc...@verticalscope.com
wrote:

  I’m resurrecting this thread because I’m interested in doing transpose
 on a RowMatrix.

 There is this other thread too:
 http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-td12562.html

 Which presents https://issues.apache.org/jira/browse/SPARK-3434, which is
 still in progress at this time.

 Is this the correct Jira issue for the transpose operation? ETA?



 Thanks a lot!

 -A



 *From:* Reza Zadeh [mailto:r...@databricks.com]
 *Sent:* October-15-14 1:48 PM
 *To:* ll
 *Cc:* u...@spark.incubator.apache.org
 *Subject:* Re: RowMatrix.multiply() ?



 Hi,

 We are currently working on distributed matrix operations. Two RowMatrices
 cannot currently be multiplied together. Neither can they be added. This
 functionality will be added soon.



 You can of course achieve this yourself by using IndexedRowMatrix and
 doing one join per operation you requested.
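
 A rough sketch of that join-based idea, written here with CoordinateMatrix
 entries rather than IndexedRowMatrix rows; everything below is illustrative:

 import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

 // A (m x k) and B (k x n) as coordinate entries; join on the shared dimension k.
 val aEntries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 0, 2.0)))
 val bEntries = sc.parallelize(Seq(MatrixEntry(0, 0, 3.0), MatrixEntry(0, 1, 4.0)))

 val productEntries = aEntries.map(e => (e.j, (e.i, e.value)))
   .join(bEntries.map(e => (e.i, (e.j, e.value))))
   .map { case (_, ((i, av), (j, bv))) => ((i, j), av * bv) }
   .reduceByKey(_ + _)
   .map { case ((i, j), v) => MatrixEntry(i, j, v) }

 val product = new CoordinateMatrix(productEntries) // = A * B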



 Best,

 Reza



 On Wed, Oct 15, 2014 at 8:50 AM, ll duy.huynh@gmail.com wrote:

 hi.. it looks like RowMatrix.multiply() takes a local Matrix as a parameter
 and returns the result as a distributed RowMatrix.

 how do you perform this series of multiplications if A, B, C, and D are all
 RowMatrix?

 ((A x B) x C) x D

 thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/RowMatrix-multiply-tp16509.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-02 Thread Reza Zadeh
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4981

On Fri, Jan 2, 2015 at 8:28 PM, Peng Cheng rhw...@gmail.com wrote:

 I was under the impression that ALS wasn't designed for it :-) The famous
 eBay online recommender uses SGD.
 However, you can try using the previous model as a starting point, and
 gradually reduce the number of iterations after the model stabilizes. I never
 verified this idea, so you need to at least cross-validate it before putting
 it into production.

 On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be
 wrote:

 Hi all,

 I'm curious about MLlib and if it is possible to do incremental training
 on
 the ALSModel.

 Usually training is run first, and then you can query. But in my case,
 data
 is collected in real-time and I want the predictions of my ALSModel to
 consider the latest data without a complete re-training phase.

 I've checked out these resources, but could not find any info on how to
 solve this:
 https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

 http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html

 My question fits in a larger picture where I'm using Prediction IO, and
 this
 in turn is based on Spark.

 Thanks in advance for any advice!

 Wouter



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-do-incremental-training-using-ALSModel-MLlib-tp20942.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: how to do incremental model updates using spark streaming and mllib

2014-12-26 Thread Reza Zadeh
As of Spark 1.2 you can do Streaming k-means, see examples here:
http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1
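
A minimal sketch of the streaming variant; the path and parameters are
placeholders:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
val trainingData = ssc.textFileStream("hdfs:///streaming/train").map(Vectors.parse)

val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)       // 1.0 keeps all past data; lower it to forget old batches
  .setRandomCenters(5, 0.0)  // dimension 5, initial weight 0.0

// Cluster centers are updated incrementally as each batch arrives.
model.trainOn(trainingData)
ssc.start()
ssc.awaitTermination()
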
Best,
Reza

On Fri, Dec 26, 2014 at 1:36 AM, vishnu johnfedrickena...@gmail.com wrote:

 Hi,

 Say I have created a clustering model using KMeans for 100 million
 transactions at time t1. I am using streaming and, say, every 1 hour I
 need to update my existing model. How do I do it? Should it include all the
 data every time, or can it be incrementally updated?

 If I can do incremental updating, how do I do it?

 Thanks,
 Vishnu



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/how-to-do-incremental-model-updates-using-spark-streaming-and-mllib-tp20862.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Reza Zadeh
As Sean mentioned, you would be computing similar features then.

If you want to find similar users, I suggest running k-means with some
fixed number of clusters. It's not reasonable to try and compute all pairs
of similarities between 1bn items, so k-means with fixed k is more suitable
here.

Best,
Reza

On Wed, Dec 10, 2014 at 10:39 AM, Sean Owen so...@cloudera.com wrote:

 Well, you're computing similarity of your features then. Whether it is
 meaningful depends a bit on the nature of your features and more on
 the similarity algorithm.

 On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:
  Dear all,
 
  I'm trying to understand what is the correct use case of ColumnSimilarity
  implemented in RowMatrix.
 
  As far as I know, this function computes the similarity of the columns of a
  given matrix. The DIMSUM paper says that it's efficient for large m (rows)
  and small n (columns). In this case the output will be an n by n matrix.
 
  Now, suppose I want to compute the similarity of several users, say m =
  billions. Each user is described by a high dimensional feature vector, say
  n = 1. In my dataset, one row represents one user. So in that case
  computing the column similarity of my matrix is not the same as computing
  the similarity of all users. Then, what does it mean to compute the
  similarity of the columns of my matrix in this case?
 
  Best regards,
 
  Jao





Re: sparse x sparse matrix multiplication

2014-11-07 Thread Reza Zadeh
If you have a very large and very sparse matrix represented as (i, j,
value) entries, then you can try the algorithms mentioned in the post
https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA brought
up earlier.
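
As a rough sketch of the block approach described further down in this thread
(toy block sizes and data; breeze does the local sparse products):

import breeze.linalg.CSCMatrix

def sparseBlock(rows: Int, cols: Int, entries: Seq[(Int, Int, Double)]): CSCMatrix[Double] = {
  val builder = new CSCMatrix.Builder[Double](rows, cols)
  entries.foreach { case (i, j, v) => builder.add(i, j, v) }
  builder.result()
}

// A and B split into 2x2 blocks keyed by (blockRow, blockCol).
val aBlocks = sc.parallelize(Seq(((0, 0), sparseBlock(2, 2, Seq((0, 0, 1.0), (1, 1, 2.0))))))
val bBlocks = sc.parallelize(Seq(((0, 0), sparseBlock(2, 2, Seq((0, 1, 3.0))))))

// Join on the shared block dimension, multiply locally, and sum per output block.
val cBlocks = aBlocks.map { case ((bi, bk), m) => (bk, (bi, m)) }
  .join(bBlocks.map { case ((bk, bj), m) => (bk, (bj, m)) })
  .map { case (_, ((bi, a), (bj, b))) => ((bi, bj), a * b) }
  .reduceByKey(_ + _)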

Reza

On Fri, Nov 7, 2014 at 8:31 AM, Duy Huynh duy.huynh@gmail.com wrote:

 thanks reza.  i'm not familiar with the block matrix multiplication, but
 is it a good fit for very large dimension, but extremely sparse matrix?

 if not, what is your recommendation on implementing matrix multiplication
 in spark on very large dimension, but extremely sparse matrix?




 On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh r...@databricks.com wrote:

 See this thread for examples of sparse matrix x sparse matrix:
 https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA

 We thought about providing matrix multiplies on CoordinateMatrix,
 however, the matrices have to be very dense for the overhead of having many
 little (i, j, value) objects to be worth it. For this reason, we are
 focused on doing block matrix multiplication first. The goal is version 1.3.

 Best,
 Reza

 On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan w...@us.ibm.com wrote:

 I think Xiangrui's ALS code implements certain aspects of it. You may want
 to check it out.
 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center



 From: Xiangrui Meng men...@gmail.com
 To: Duy Huynh duy.huynh@gmail.com
 Cc: user u...@spark.incubator.apache.org
 Date: 11/05/2014 01:13 PM
 Subject: Re: sparse x sparse matrix multiplication
 --



 You can use breeze for local sparse-sparse matrix multiplication and
 then define an RDD of sub-matrices

 RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)

 and then use join and aggregateByKey to implement this feature, which
 is the same as in MapReduce.

 -Xiangrui








Re: sparse x sparse matrix multiplication

2014-11-06 Thread Reza Zadeh
See this thread for examples of sparse matrix x sparse matrix:
https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA

We thought about providing matrix multiplies on CoordinateMatrix, however,
the matrices have to be very dense for the overhead of having many little
(i, j, value) objects to be worth it. For this reason, we are focused on
doing block matrix multiplication first. The goal is version 1.3.

Best,
Reza

On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan w...@us.ibm.com wrote:

 I think Xiangrui's ALS code implements certain aspects of it. You may want
 to check it out.
 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center



 From: Xiangrui Meng men...@gmail.com
 To: Duy Huynh duy.huynh@gmail.com
 Cc: user u...@spark.incubator.apache.org
 Date: 11/05/2014 01:13 PM
 Subject: Re: sparse x sparse matrix multiplication
 --



 You can use breeze for local sparse-sparse matrix multiplication and
 then define an RDD of sub-matrices

 RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)

 and then use join and aggregateByKey to implement this feature, which
 is the same as in MapReduce.

 -Xiangrui






Re: Measuring execution time

2014-10-24 Thread Reza Zadeh
The Spark UI has timing information. When running locally, it is at
http://localhost:4040
Otherwise the url to the UI is printed out onto the console when you
startup spark shell or run a job.
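
If you would rather log timings than read them off the UI, a minimal sketch
using a listener registered on the running SparkContext:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Print per-stage wall-clock time as stages finish.
sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms")
    }
  }
})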

Reza

On Fri, Oct 24, 2014 at 5:51 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I just wonder if there is any built-in function to get the execution time
 for each of the jobs/tasks? In simple words, how can I find out how much
 time is spent on the loading/mapping/filtering/reducing parts of a job? I can
 see printouts in the logs, but since there is no clear presentation of the
 underlying DAG and associated tasks it is hard to find what I am looking
 for.

 best,
 /Shahab



Re: mllib CoordinateMatrix

2014-10-14 Thread Reza Zadeh
Hello,

CoordinateMatrix is in its infancy, and right now is only a placeholder.

To get/set the value at (i,j), you should map the entries rdd using the
usual rdd map operation, and change the relevant entries.

To get the values on a specific row, you can call toIndexedRowMatrix(),
which returns a RowMatrix
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
with indices.

Best,
Reza
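
A minimal sketch of all three operations, assuming an existing SparkContext `sc`; the entries and values are made up:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries = sc.parallelize(Seq(MatrixEntry(0, 1, 2.0), MatrixEntry(2, 0, 3.0)))
val mat = new CoordinateMatrix(entries)
val (i, j) = (2L, 0L)

// 1. get the value at (i, j): filter the entries RDD (0.0 if the entry is absent)
val vij = mat.entries
  .filter(e => e.i == i && e.j == j)
  .map(_.value)
  .collect()
  .headOption.getOrElse(0.0)

// 2. set/update the value at (i, j): map to a new entries RDD
val updated = new CoordinateMatrix(
  mat.entries.map(e => if (e.i == i && e.j == j) MatrixEntry(i, j, 42.0) else e))

// 3. all values of row i, as an indexed (sparse) row
val rowI = mat.toIndexedRowMatrix().rows.filter(_.index == i).collect()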


On Tue, Oct 14, 2014 at 1:18 PM, ll duy.huynh@gmail.com wrote:

 after creating a coordinate matrix from my rdd[matrixentry]...

 1.  how can i get/query the value at coordinate (i, j)?

 2.  how can i set/update the value at coordinate (i, j)?

 3.  how can i get all the values on a specific row i, ideally as a vector?

 thanks!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/mllib-CoordinateMatrix-tp16412.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Huge matrix

2014-09-18 Thread Reza Zadeh
Hi Deb,

I am not templating RowMatrix/CoordinateMatrix since that would be a big
deviation from the PR. We can add jaccard and other similarity measures in
later PRs.

In the meantime, you can un-normalize the cosine similarities to get the
dot product, and then compute the other similarity measures from the dot
product.

Best,
Reza
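
For example, using the method name this work eventually shipped as (columnSimilarities, called similarColumns in the PR), a sketch for 0/1 data, where a column's squared norm equals its non-zero count; the column-norm computation below is illustrative and assumes a modest number of columns:

import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
import org.apache.spark.rdd.RDD

def jaccardFromCosine(mat: RowMatrix): RDD[MatrixEntry] = {
  // squared L2 norm of every column
  val colSqNorms = mat.rows
    .map(_.toArray.map(x => x * x))
    .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  mat.columnSimilarities().entries.map { e =>
    val ni  = colSqNorms(e.i.toInt)
    val nj  = colSqNorms(e.j.toInt)
    val dot = e.value * math.sqrt(ni) * math.sqrt(nj)   // un-normalize the cosine
    MatrixEntry(e.i, e.j, dot / (ni + nj - dot))        // Jaccard = |A intersect B| / |A union B|
  }
}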


On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi Reza,

 In similarColumns, it seems with cosine similarity I also need other
 numbers such as intersection, jaccard and other measures...

 Right now I modified the code to generate jaccard but I had to run it
 twice due to the design of RowMatrix / CoordinateMatrix...I feel we should
 modify RowMatrix and CoordinateMatrix to be templated on the value...

 Are you considering this in your design ?

 Thanks.
 Deb


 On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh r...@databricks.com wrote:

 Better to do it in a PR of your own, it's not sufficiently related to
 dimsum

 On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Cool...can I add loadRowMatrix in your PR ?

 Thanks.
 Deb

 On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 Did you mean to message me instead of Xiangrui?

 For TS matrices, dimsum with positiveinfinity and computeGramian have
 the same cost, so you can do either one. For dense matrices with say, 1m
 columns this won't be computationally feasible and you'll want to start
 sampling with dimsum.

 It would be helpful to have a loadRowMatrix function, I would use it.

 Best,
 Reza

 On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das debasish.da...@gmail.com
  wrote:

 Hi Xiangrui,

 For tall skinny matrices, if I can pass a similarityMeasure to
 computeGrammian, I could re-use the SVD's computeGrammian for similarity
 computation as well...

 Do you recommend using this approach for tall skinny matrices or just
 use the dimsum's routines ?

 Right now RowMatrix does not have a loadRowMatrix function like the
 one available in LabeledPoint...should I add one ? I want to export the
 matrix out from my stable code and then test dimsum...

 Thanks.
 Deb



 On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com
 wrote:

 I will add dice, overlap, and jaccard similarity in a future PR,
 probably still for 1.2


 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard is
 something that will be useful) ? I guess it makes sense to add some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com
 wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity
 turns it into the usual brute force algorithm for cosine similarity, 
 there
 is no sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 I looked at the code: similarColumns(Double.posInf) is generating
 the brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the
 exact same result as doing cartesian products of RDD[(product,
 vector)] and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the
 weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com
 wrote:

 For 60M x 10K brute force and dimsum thresholding should be fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is
 will depend on your application and domain, it's worth trying if the 
 direct
 computation doesn't work.

 You can also try running KMeans clustering (perhaps after
 dimensionality reduction) if your goal is to find batches of similar 
 points
 instead of all pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am
 considering running a matrix factorization to reduce the dimension 
 to say
 ~60M x 50 and then run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where
 rows are ~ 60M and columns are 10M say with billion data points...

 I have another version that's around 60M and ~ 10K...

 I guess for the second one both all pair and dimsum will run
 fine...

 But for tall and wide, what do you suggest ? can dimsum handle
 it ?

 I might need jaccard as well...can I plug that in the PR ?



 On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh r...@databricks.com
  wrote:

 You might want to wait until Wednesday since the interface
 will be changing in that PR before Wednesday, probably over the 
 weekend, so
 that you don't have

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
Hi Deb,
I am currently seeding the algorithm to be pseudo-random; this is an issue
being addressed in the PR. If you pull the current version it will be
deterministic, but potentially not pseudo-random. The PR will be updated
today.
Best,
Reza
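
Purely as an illustration of "deterministic yet pseudo-random" (this is not the MLlib code): deriving each row's RNG seed from a fixed global seed and the row index makes the sampled columns reproducible across runs and independent of partitioning:

import scala.util.Random

// keep column j of this row with probability keepProb(idx); same output on every run
def sampleColumns(rowIndex: Long, colIndices: Array[Int],
                  keepProb: Array[Double], globalSeed: Long): Array[Int] = {
  val rng = new Random(globalSeed + rowIndex)   // per-row seed; a better mixer would be used in practice
  colIndices.zipWithIndex.collect {
    case (j, idx) if rng.nextDouble() < keepProb(idx) => j
  }
}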

On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi Reza,

 Have you tested if different runs of the algorithm produce different
 similarities (basically if the algorithm is deterministic) ?

 This number does not look like a Monoid aggregation...iVal * jVal /
 (math.min(sg, colMags(i)) * math.min(sg, colMags(j)))

 I am noticing some weird behavior as different runs are changing the
 results...

 Also can columnMagnitudes produce non-deterministic results ?

 Thanks.

 Deb

 On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 I am not templating RowMatrix/CoordinateMatrix since that would be a big
 deviation from the PR. We can add jaccard and other similarity measures in
 later PRs.

 In the meantime, you can un-normalize the cosine similarities to get the
 dot product, and then compute the other similarity measures from the dot
 product.

 Best,
 Reza


 On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Reza,

 In similarColumns, it seems with cosine similarity I also need other
 numbers such as intersection, jaccard and other measures...

 Right now I modified the code to generate jaccard but I had to run it
 twice due to the design of RowMatrix / CoordinateMatrix...I feel we should
 modify RowMatrix and CoordinateMatrix to be templated on the value...

 Are you considering this in your design ?

 Thanks.
 Deb


 On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh r...@databricks.com wrote:

 Better to do it in a PR of your own, it's not sufficiently related to
 dimsum

 On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Cool...can I add loadRowMatrix in your PR ?

 Thanks.
 Deb

 On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com
 wrote:

 Hi Deb,

 Did you mean to message me instead of Xiangrui?

 For TS matrices, dimsum with positiveinfinity and computeGramian have
 the same cost, so you can do either one. For dense matrices with say, 1m
 columns this won't be computationally feasible and you'll want to start
 sampling with dimsum.

 It would be helpful to have a loadRowMatrix function, I would use it.

 Best,
 Reza

 On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi Xiangrui,

 For tall skinny matrices, if I can pass a similarityMeasure to
 computeGrammian, I could re-use the SVD's computeGrammian for similarity
 computation as well...

 Do you recommend using this approach for tall skinny matrices or
 just use the dimsum's routines ?

 Right now RowMatrix does not have a loadRowMatrix function like the
 one available in LabeledPoint...should I add one ? I want to export the
 matrix out from my stable code and then test dimsum...

 Thanks.
 Deb



 On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com
 wrote:

 I will add dice, overlap, and jaccard similarity in a future PR,
 probably still for 1.2


 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard
 is something that will be useful) ? I guess it makes sense to add some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com
 wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity
 turns it into the usual brute force algorithm for cosine similarity, 
 there
 is no sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 I looked at the code: similarColumns(Double.posInf) is
 generating the brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the
 exact same result as doing cartesian products of RDD[(product,
 vector)] and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the
 weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com
 wrote:

 For 60M x 10K brute force and dimsum thresholding should be
 fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with 
 appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is
 will depend on your application and domain, it's worth trying if 
 the direct
 computation doesn't work.

 You can also try running KMeans clustering (perhaps after
 dimensionality reduction) if your goal is to find batches of 
 similar points
 instead of all pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am
 considering running a matrix factorization

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Hi Deb,

Did you mean to message me instead of Xiangrui?

For TS matrices, dimsum with positiveinfinity and computeGramian have the
same cost, so you can do either one. For dense matrices with say, 1m
columns this won't be computationally feasible and you'll want to start
sampling with dimsum.

It would be helpful to have a loadRowMatrix function, I would use it.

Best,
Reza
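
A sketch of the tall-and-skinny route (the released method is named computeGramianMatrix): the k x k Gramian A^T A fits on the driver, and cosine similarity falls out of it directly.

import org.apache.spark.mllib.linalg.distributed.RowMatrix

def cosineFromGramian(mat: RowMatrix): Array[Array[Double]] = {
  val g = mat.computeGramianMatrix()                 // local k x k Matrix, k = number of columns
  val k = g.numCols
  val norms = Array.tabulate(k)(i => math.sqrt(g(i, i)))   // column L2 norms; assumes no all-zero columns
  Array.tabulate(k, k)((i, j) => g(i, j) / (norms(i) * norms(j)))
}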

On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi Xiangrui,

 For tall skinny matrices, if I can pass a similarityMeasure to
 computeGrammian, I could re-use the SVD's computeGrammian for similarity
 computation as well...

 Do you recommend using this approach for tall skinny matrices or just use
 the dimsum's routines ?

 Right now RowMatrix does not have a loadRowMatrix function like the one
 available in LabeledPoint...should I add one ? I want to export the matrix
 out from my stable code and then test dimsum...

 Thanks.
 Deb



 On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com wrote:

 I will add dice, overlap, and jaccard similarity in a future PR, probably
 still for 1.2


 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard is
 something that will be useful) ? I guess it makes sense to add some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity turns
 it into the usual brute force algorithm for cosine similarity, there is no
 sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 I looked at the code: similarColumns(Double.posInf) is generating the
 brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the exact
 same result as doing cartesian products of RDD[(product, vector)] and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com
 wrote:

 For 60M x 10K brute force and dimsum thresholding should be fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is will
 depend on your application and domain, it's worth trying if the direct
 computation doesn't work.

 You can also try running KMeans clustering (perhaps after
 dimensionality reduction) if your goal is to find batches of similar 
 points
 instead of all pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am considering
 running a matrix factorization to reduce the dimension to say ~60M x 50 
 and
 then run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where rows
 are ~ 60M and columns are 10M say with billion data points...

 I have another version that's around 60M and ~ 10K...

 I guess for the second one both all pair and dimsum will run fine...

 But for tall and wide, what do you suggest ? can dimsum handle it ?

 I might need jaccard as well...can I plug that in the PR ?



 On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh r...@databricks.com
 wrote:

 You might want to wait until Wednesday since the interface will be
 changing in that PR before Wednesday, probably over the weekend, so 
 that
 you don't have to redo your code. Your call if you need it before a 
 week.
 Reza


 On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ohh coolall-pairs brute force is also part of this PR ? Let
 me pull it in and test on our dataset...

 Thanks.
 Deb


 On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com
 wrote:

 Hi Deb,

 We are adding all-pairs and thresholded all-pairs via dimsum in
 this PR: https://github.com/apache/spark/pull/1778

 Your question wasn't entirely clear - does this answer it?

 Best,
 Reza


 On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi Reza,

 Have you compared with the brute force algorithm for similarity
 computation with something like the following in Spark ?

 https://github.com/echen/scaldingale

 I am adding cosine similarity computation but I do want to
 compute an all pair similarities...

 Note that the data is sparse for me (the data that goes to
 matrix factorization) so I don't think joining and group-by on
 (product,product) will be a big issue for me...

 Does it make sense to add all pair similarities as well with
 dimsum based similarity ?

 Thanks.
 Deb






 On Fri, Apr 11, 2014 at 9:21 PM, Reza

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Better to do it in a PR of your own, it's not sufficiently related to dimsum
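
For the loadRowMatrix idea discussed below, a hypothetical standalone helper (not an MLlib API) that reads whitespace-separated dense rows could look like:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def loadRowMatrix(sc: SparkContext, path: String): RowMatrix = {
  val rows = sc.textFile(path)
    .map(_.trim)
    .filter(_.nonEmpty)
    .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
  new RowMatrix(rows)
}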

On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Cool...can I add loadRowMatrix in your PR ?

 Thanks.
 Deb

 On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 Did you mean to message me instead of Xiangrui?

 For TS matrices, dimsum with positiveinfinity and computeGramian have the
 same cost, so you can do either one. For dense matrices with say, 1m
 columns this won't be computationally feasible and you'll want to start
 sampling with dimsum.

 It would be helpful to have a loadRowMatrix function, I would use it.

 Best,
 Reza

 On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Xiangrui,

 For tall skinny matrices, if I can pass a similarityMeasure to
 computeGrammian, I could re-use the SVD's computeGrammian for similarity
 computation as well...

 Do you recommend using this approach for tall skinny matrices or just
 use the dimsum's routines ?

 Right now RowMatrix does not have a loadRowMatrix function like the one
 available in LabeledPoint...should I add one ? I want to export the matrix
 out from my stable code and then test dimsum...

 Thanks.
 Deb



 On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com wrote:

 I will add dice, overlap, and jaccard similarity in a future PR,
 probably still for 1.2


 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard is
 something that will be useful) ? I guess it makes sense to add some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com
 wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity turns
 it into the usual brute force algorithm for cosine similarity, there is 
 no
 sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 I looked at the code: similarColumns(Double.posInf) is generating
 the brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the
 exact same result as doing cartesian products of RDD[(product, vector)]
 and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com
 wrote:

 For 60M x 10K brute force and dimsum thresholding should be fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is will
 depend on your application and domain, it's worth trying if the direct
 computation doesn't work.

 You can also try running KMeans clustering (perhaps after
 dimensionality reduction) if your goal is to find batches of similar 
 points
 instead of all pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am considering
 running a matrix factorization to reduce the dimension to say ~60M x 
 50 and
 then run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where rows
 are ~ 60M and columns are 10M say with billion data points...

 I have another version that's around 60M and ~ 10K...

 I guess for the second one both all pair and dimsum will run
 fine...

 But for tall and wide, what do you suggest ? can dimsum handle it
 ?

 I might need jaccard as well...can I plug that in the PR ?



 On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh r...@databricks.com
 wrote:

 You might want to wait until Wednesday since the interface will
 be changing in that PR before Wednesday, probably over the weekend, 
 so that
 you don't have to redo your code. Your call if you need it before a 
 week.
 Reza


 On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ohh coolall-pairs brute force is also part of this PR ? Let
 me pull it in and test on our dataset...

 Thanks.
 Deb


 On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com
  wrote:

 Hi Deb,

 We are adding all-pairs and thresholded all-pairs via dimsum
 in this PR: https://github.com/apache/spark/pull/1778

 Your question wasn't entirely clear - does this answer it?

 Best,
 Reza


 On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi Reza,

 Have you compared with the brute force algorithm for
 similarity computation with something like the following in 
 Spark ?

 https://github.com/echen/scaldingale

 I am adding cosine similarity computation but I do want to
 compute an all pair similarities...

 Note

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Hi Deb,

We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
https://github.com/apache/spark/pull/1778

Your question wasn't entirely clear - does this answer it?

Best,
Reza


On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi Reza,

 Have you compared with the brute force algorithm for similarity
 computation with something like the following in Spark ?

 https://github.com/echen/scaldingale

 I am adding cosine similarity computation but I do want to compute an all
 pair similarities...

 Note that the data is sparse for me (the data that goes to matrix
 factorization) so I don't think joining and group-by on (product,product)
 will be a big issue for me...

 Does it make sense to add all pair similarities as well with dimsum based
 similarity ?

 Thanks.
 Deb






 On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Xiaoli,

 There is a PR currently in progress to allow this, via the sampling
 scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

 The PR is at https://github.com/apache/spark/pull/336 though it will
 need refactoring given the recent changes to matrix interface in MLlib. You
 may implement the sampling scheme for your own app since it's not much code.

 Best,
 Reza


 On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li lixiaolima...@gmail.com
 wrote:

 Hi Andrew,

 Thanks for your suggestion. I have tried the method. I used 8 nodes and
 every node has 8G memory. The program just stopped at a stage for about
 several hours without any further information. Maybe I need to find
 out a more efficient way.


 On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash and...@andrewash.com
 wrote:

 The naive way would be to put all the users and their attributes into
 an RDD, then cartesian product that with itself.  Run the similarity score
 on every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) and
 take the .top(k) for each user.

 I doubt that you'll be able to take this approach with the 1T pairs
 though, so it might be worth looking at the literature for recommender
 systems to see what else is out there.


 On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li lixiaolima...@gmail.com
 wrote:

 Hi all,

 I am implementing an algorithm using Spark. I have one million users.
 I need to compute the similarity between each pair of users using some
 user's attributes.  For each user, I need to get top k most similar users.
 What is the best way to implement this?


 Thanks.








Re: Huge matrix

2014-09-05 Thread Reza Zadeh
You might want to wait until Wednesday since the interface will be changing
in that PR before Wednesday, probably over the weekend, so that you don't
have to redo your code. Your call if you need it before a week.
Reza


On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Ohh coolall-pairs brute force is also part of this PR ? Let me pull it
 in and test on our dataset...

 Thanks.
 Deb


 On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
 https://github.com/apache/spark/pull/1778

 Your question wasn't entirely clear - does this answer it?

 Best,
 Reza


 On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi Reza,

 Have you compared with the brute force algorithm for similarity
 computation with something like the following in Spark ?

 https://github.com/echen/scaldingale

 I am adding cosine similarity computation but I do want to compute an
 all pair similarities...

 Note that the data is sparse for me (the data that goes to matrix
 factorization) so I don't think joining and group-by on (product,product)
 will be a big issue for me...

 Does it make sense to add all pair similarities as well with dimsum
 based similarity ?

 Thanks.
 Deb






 On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Xiaoli,

 There is a PR currently in progress to allow this, via the sampling
 scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

 The PR is at https://github.com/apache/spark/pull/336 though it will
 need refactoring given the recent changes to matrix interface in MLlib. You
 may implement the sampling scheme for your own app since it's not much code.

 Best,
 Reza


 On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li lixiaolima...@gmail.com
 wrote:

 Hi Andrew,

 Thanks for your suggestion. I have tried the method. I used 8 nodes
 and every node has 8G memory. The program just stopped at a stage for 
 about
 several hours without any further information. Maybe I need to find
 out a more efficient way.


 On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash and...@andrewash.com
 wrote:

 The naive way would be to put all the users and their attributes into
 an RDD, then cartesian product that with itself.  Run the similarity 
 score
 on every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) 
 and
 take the .top(k) for each user.

 I doubt that you'll be able to take this approach with the 1T pairs
 though, so it might be worth looking at the literature for recommender
 systems to see what else is out there.


 On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li lixiaolima...@gmail.com
 wrote:

 Hi all,

 I am implementing an algorithm using Spark. I have one million
 users. I need to compute the similarity between each pair of users using
 some user's attributes.  For each user, I need to get top k most similar
 users. What is the best way to implement this?


 Thanks.










Re: Huge matrix

2014-09-05 Thread Reza Zadeh
For 60M x 10K brute force and dimsum thresholding should be fine.

For 60M x 10M probably brute force won't work depending on the cluster's
power, and dimsum thresholding should work with appropriate threshold.

Dimensionality reduction should help, and how effective it is will depend
on your application and domain, it's worth trying if the direct computation
doesn't work.

You can also try running KMeans clustering (perhaps after dimensionality
reduction) if your goal is to find batches of similar points instead of all
pairs above a threshold.
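
A sketch of the "reduce dimension, then cluster" route (rank, cluster count and iteration count are made-up values):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def clusterAfterSVD(mat: RowMatrix, rank: Int, numClusters: Int) = {
  val svd = mat.computeSVD(rank, computeU = true)
  val s = svd.s.toArray
  // project each row: U's columns scaled by the singular values
  val projected = svd.U.rows.map { row =>
    Vectors.dense(row.toArray.zip(s).map { case (u, si) => u * si })
  }.cache()
  val model = KMeans.train(projected, numClusters, 20)   // 20 iterations
  (model, model.predict(projected))                      // cluster id per row
}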




On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am considering running
 a matrix factorization to reduce the dimension to say ~60M x 50 and then
 run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where rows are ~
 60M and columns are 10M say with billion data points...

 I have another version that's around 60M and ~ 10K...

 I guess for the second one both all pair and dimsum will run fine...

 But for tall and wide, what do you suggest ? can dimsum handle it ?

 I might need jaccard as well...can I plug that in the PR ?



 On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh r...@databricks.com wrote:

 You might want to wait until Wednesday since the interface will be
 changing in that PR before Wednesday, probably over the weekend, so that
 you don't have to redo your code. Your call if you need it before a week.
 Reza


 On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Ohh coolall-pairs brute force is also part of this PR ? Let me pull
 it in and test on our dataset...

 Thanks.
 Deb


 On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 We are adding all-pairs and thresholded all-pairs via dimsum in this
 PR: https://github.com/apache/spark/pull/1778

 Your question wasn't entirely clear - does this answer it?

 Best,
 Reza


 On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das debasish.da...@gmail.com
  wrote:

 Hi Reza,

 Have you compared with the brute force algorithm for similarity
 computation with something like the following in Spark ?

 https://github.com/echen/scaldingale

 I am adding cosine similarity computation but I do want to compute an
 all pair similarities...

 Note that the data is sparse for me (the data that goes to matrix
 factorization) so I don't think joining and group-by on (product,product)
 will be a big issue for me...

 Does it make sense to add all pair similarities as well with dimsum
 based similarity ?

 Thanks.
 Deb






 On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh r...@databricks.com
 wrote:

 Hi Xiaoli,

 There is a PR currently in progress to allow this, via the sampling
 scheme described in this paper:
 stanford.edu/~rezab/papers/dimsum.pdf

 The PR is at https://github.com/apache/spark/pull/336 though it
 will need refactoring given the recent changes to matrix interface in
 MLlib. You may implement the sampling scheme for your own app since it's
 not much code.

 Best,
 Reza


 On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li lixiaolima...@gmail.com
 wrote:

 Hi Andrew,

 Thanks for your suggestion. I have tried the method. I used 8 nodes
 and every node has 8G memory. The program just stopped at a stage for 
 about
 several hours without any further information. Maybe I need to find
 out a more efficient way.


 On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash and...@andrewash.com
 wrote:

 The naive way would be to put all the users and their attributes
 into an RDD, then cartesian product that with itself.  Run the 
 similarity
 score on every pair (1M * 1M = 1T scores), map to (user, (score,
 otherUser)) and take the .top(k) for each user.

 I doubt that you'll be able to take this approach with the 1T
 pairs though, so it might be worth looking at the literature for
 recommender systems to see what else is out there.


 On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li 
 lixiaolima...@gmail.com wrote:

 Hi all,

 I am implementing an algorithm using Spark. I have one million
 users. I need to compute the similarity between each pair of users 
 using
 some user's attributes.  For each user, I need to get top k most 
 similar
 users. What is the best way to implement this?


 Thanks.













Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Yes you're right, calling dimsum with gamma as PositiveInfinity turns it
into the usual brute force algorithm for cosine similarity, there is no
sampling. This is by design.
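
Concretely, with the method name the merged version exposes (columnSimilarities rather than similarColumns), and assuming mat is an existing RowMatrix with a made-up threshold:

val exact  = mat.columnSimilarities()      // gamma = infinity: brute force, no sampling
val approx = mat.columnSimilarities(0.1)   // DIMSUM sampling; pairs well below 0.1 may be missed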


On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das debasish.da...@gmail.com
wrote:

 I looked at the code: similarColumns(Double.posInf) is generating the
 brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the exact
 same result as doing cartesian products of RDD[(product, vector)] and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com wrote:

 For 60M x 10K brute force and dimsum thresholding should be fine.

 For 60M x 10M probably brute force won't work depending on the cluster's
 power, and dimsum thresholding should work with appropriate threshold.

 Dimensionality reduction should help, and how effective it is will depend
 on your application and domain, it's worth trying if the direct computation
 doesn't work.

 You can also try running KMeans clustering (perhaps after dimensionality
 reduction) if your goal is to find batches of similar points instead of all
 pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am considering
 running a matrix factorization to reduce the dimension to say ~60M x 50 and
 then run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where rows are ~
 60M and columns are 10M say with billion data points...

 I have another version that's around 60M and ~ 10K...

 I guess for the second one both all pair and dimsum will run fine...

 But for tall and wide, what do you suggest ? can dimsum handle it ?

 I might need jaccard as well...can I plug that in the PR ?



 On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh r...@databricks.com wrote:

 You might want to wait until Wednesday since the interface will be
 changing in that PR before Wednesday, probably over the weekend, so that
 you don't have to redo your code. Your call if you need it before a week.
 Reza


 On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das debasish.da...@gmail.com
  wrote:

 Ohh coolall-pairs brute force is also part of this PR ? Let me
 pull it in and test on our dataset...

 Thanks.
 Deb


 On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com
 wrote:

 Hi Deb,

 We are adding all-pairs and thresholded all-pairs via dimsum in this
 PR: https://github.com/apache/spark/pull/1778

 Your question wasn't entirely clear - does this answer it?

 Best,
 Reza


 On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi Reza,

 Have you compared with the brute force algorithm for similarity
 computation with something like the following in Spark ?

 https://github.com/echen/scaldingale

 I am adding cosine similarity computation but I do want to compute
 an all pair similarities...

 Note that the data is sparse for me (the data that goes to matrix
 factorization) so I don't think joining and group-by on 
 (product,product)
 will be a big issue for me...

 Does it make sense to add all pair similarities as well with dimsum
 based similarity ?

 Thanks.
 Deb






 On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh r...@databricks.com
 wrote:

 Hi Xiaoli,

 There is a PR currently in progress to allow this, via the
 sampling scheme described in this paper:
 stanford.edu/~rezab/papers/dimsum.pdf

 The PR is at https://github.com/apache/spark/pull/336 though it
 will need refactoring given the recent changes to matrix interface in
 MLlib. You may implement the sampling scheme for your own app since 
 it's
 much code.

 Best,
 Reza


 On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li 
 lixiaolima...@gmail.com wrote:

 Hi Andrew,

 Thanks for your suggestion. I have tried the method. I used 8
 nodes and every node has 8G memory. The program just stopped at a 
 stage for
 about several hours without any further information. Maybe I need to 
 find
 out a more efficient way.


 On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash and...@andrewash.com
  wrote:

 The naive way would be to put all the users and their attributes
 into an RDD, then cartesian product that with itself.  Run the 
 similarity
 score on every pair (1M * 1M = 1T scores), map to (user, (score,
 otherUser)) and take the .top(k) for each user.

 I doubt that you'll be able to take this approach with the 1T
 pairs though, so it might be worth looking at the literature for
 recommender systems to see what else is out there.


 On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li 
 lixiaolima...@gmail.com wrote:

 Hi all,

 I am implementing an algorithm using Spark. I have one million
 users. I need to compute the similarity between each

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
I will add dice, overlap, and jaccard similarity in a future PR, probably
still for 1.2
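
For reference, with 0/1 column data these measures all follow from the dot product, where dot = |A intersect B| and nA, nB are the columns' non-zero counts (a sketch, not the PR's code):

def cosine (dot: Double, nA: Double, nB: Double) = dot / math.sqrt(nA * nB)
def jaccard(dot: Double, nA: Double, nB: Double) = dot / (nA + nB - dot)
def dice   (dot: Double, nA: Double, nB: Double) = 2.0 * dot / (nA + nB)
def overlap(dot: Double, nA: Double, nB: Double) = dot / math.min(nA, nB)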


On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Awesome...Let me try it out...

 Any plans of putting other similarity measures in future (jaccard is
 something that will be useful) ? I guess it makes sense to add some
 similarity measures in mllib...


 On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com wrote:

 Yes you're right, calling dimsum with gamma as PositiveInfinity turns it
 into the usual brute force algorithm for cosine similarity, there is no
 sampling. This is by design.


 On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 I looked at the code: similarColumns(Double.posInf) is generating the
 brute force...

 Basically dimsum with gamma as PositiveInfinity will produce the exact
 same result as doing cartesian products of RDD[(product, vector)] and
 computing similarities or there will be some approximation ?

 Sorry I have not read your paper yet. Will read it over the weekend.



 On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com wrote:

 For 60M x 10K brute force and dimsum thresholding should be fine.

 For 60M x 10M probably brute force won't work depending on the
 cluster's power, and dimsum thresholding should work with appropriate
 threshold.

 Dimensionality reduction should help, and how effective it is will
 depend on your application and domain, it's worth trying if the direct
 computation doesn't work.

 You can also try running KMeans clustering (perhaps after
 dimensionality reduction) if your goal is to find batches of similar points
 instead of all pairs above a threshold.




 On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Also for tall and wide (rows ~60M, columns 10M), I am considering
 running a matrix factorization to reduce the dimension to say ~60M x 50 
 and
 then run all pair similarity...

 Did you also try similar ideas and saw positive results ?



 On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das debasish.da...@gmail.com
  wrote:

 Ok...just to make sure I have RowMatrix[SparseVector] where rows are
 ~ 60M and columns are 10M say with billion data points...

 I have another version that's around 60M and ~ 10K...

 I guess for the second one both all pair and dimsum will run fine...

 But for tall and wide, what do you suggest ? can dimsum handle it ?

 I might need jaccard as well...can I plug that in the PR ?



 On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh r...@databricks.com
 wrote:

 You might want to wait until Wednesday since the interface will be
 changing in that PR before Wednesday, probably over the weekend, so that
 you don't have to redo your code. Your call if you need it before a 
 week.
 Reza


 On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Ohh coolall-pairs brute force is also part of this PR ? Let me
 pull it in and test on our dataset...

 Thanks.
 Deb


 On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com
 wrote:

 Hi Deb,

 We are adding all-pairs and thresholded all-pairs via dimsum in
 this PR: https://github.com/apache/spark/pull/1778

 Your question wasn't entirely clear - does this answer it?

 Best,
 Reza


 On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das 
 debasish.da...@gmail.com wrote:

 Hi Reza,

 Have you compared with the brute force algorithm for similarity
 computation with something like the following in Spark ?

 https://github.com/echen/scaldingale

 I am adding cosine similarity computation but I do want to
 compute an all pair similarities...

 Note that the data is sparse for me (the data that goes to matrix
 factorization) so I don't think joining and group-by on 
 (product,product)
 will be a big issue for me...

 Does it make sense to add all pair similarities as well with
 dimsum based similarity ?

 Thanks.
 Deb






 On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh r...@databricks.com
 wrote:

 Hi Xiaoli,

 There is a PR currently in progress to allow this, via the
 sampling scheme described in this paper:
 stanford.edu/~rezab/papers/dimsum.pdf

 The PR is at https://github.com/apache/spark/pull/336 though it
 will need refactoring given the recent changes to matrix interface 
 in
 MLlib. You may implement the sampling scheme for your own app since 
 it's
 much code.

 Best,
 Reza


 On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li 
 lixiaolima...@gmail.com wrote:

 Hi Andrew,

 Thanks for your suggestion. I have tried the method. I used 8
 nodes and every node has 8G memory. The program just stopped at a 
 stage for
 about several hours without any further information. Maybe I need 
 to find
 out a more efficient way.


 On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash 
 and...@andrewash.com wrote:

 The naive way would be to put all the users and their
 attributes into an RDD, then cartesian product that with itself.  
 Run the
 similarity score on every pair (1M * 1M = 1T scores

Re: Does MLlib in spark 1.0.2 only work for tall-and-skinny matrix?

2014-08-10 Thread Reza Zadeh
Hi Andy,
That is the case in Spark 1.0, yes. However, as of Spark 1.1 which is
coming out very soon, you will be able to run SVD on non-TS matrices.

If you try to apply the current algorithm to a matrix with more than 10,000
columns, you will overburden the master node, which has to compute a 10k x
10k local SVD by itself.

This bottleneck has been removed in Spark 1.1, and if you really want it
now you can pull the current master from github.

Best,
Reza
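
A toy usage sketch of the tall-and-skinny path (assumes an existing SparkContext `sc`; the values are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// many rows, only 3 columns
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0),
  Vectors.dense(1.0, 0.0, 1.0)))
val mat = new RowMatrix(rows)

val svd = mat.computeSVD(2, computeU = true)  // top 2 singular values
println(svd.s)   // singular values, a small local Vector
println(svd.V)   // 3 x 2 local Matrix of right singular vectors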


On Sun, Aug 10, 2014 at 9:35 PM, Andy Zhao andyrao1...@gmail.com wrote:

 Hi guys

  I'm considering applying MLlib SVD in my work. But I find that in the
 document, it says: "In this release, we provide SVD computation to
 row-oriented matrices that have only a few columns, say, less than 1000,
 but many rows, which we call tall-and-skinny." Does that mean this SVD will
 not work for a matrix which has a lot of columns, say more than 1? What
 will happen if this kind of matrix is applied to SVD?

 Thank you ,
 Andy Zhao




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Does-MLlib-in-spark-1-0-2-only-work-for-tall-and-skinny-matrix-tp11869.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: No Intercept for Python

2014-06-18 Thread Reza Zadeh
Hi Naftali,

Yes you're right. For now please add a column of ones. We are working on
adding a weighted regularization term, and exposing the scala intercept
option in the python binding.

Best,
Reza


On Mon, Jun 16, 2014 at 12:19 PM, Naftali Harris naft...@affirm.com wrote:

 Hi everyone,

 The Python LogisticRegressionWithSGD does not appear to estimate an
 intercept.  When I run the following, the returned weights and intercept
 are both 0.0:

 from pyspark import SparkContext
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD

 def main():
     sc = SparkContext(appName="NoIntercept")

     train = sc.parallelize([LabeledPoint(0, [0]), LabeledPoint(1, [0]),
                             LabeledPoint(1, [0])])

     model = LogisticRegressionWithSGD.train(train, iterations=500, step=0.1)
     print "Final weights: " + str(model.weights)
     print "Final intercept: " + str(model.intercept)

 if __name__ == "__main__":
     main()


 Of course, one can fit an intercept with the simple expedient of adding a
 column of ones, but that's kind of annoying.  Moreover, it looks like the
 scala version has an intercept option.

 Am I missing something? Should I just add the column of ones? If I
 submitted a PR doing that, is that the sort of thing you guys would accept?

 Thanks! :-)

 Naftali



Re: Huge matrix

2014-04-11 Thread Reza Zadeh
Hi Xiaoli,

There is a PR currently in progress to allow this, via the sampling scheme
described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

The PR is at https://github.com/apache/spark/pull/336 though it will need
refactoring given the recent changes to matrix interface in MLlib. You may
implement the sampling scheme for your own app since it's not much code.

Best,
Reza


On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li lixiaolima...@gmail.com wrote:

 Hi Andrew,

 Thanks for your suggestion. I have tried the method. I used 8 nodes and
 every node has 8G memory. The program just stopped at a stage for about
 several hours without any further information. Maybe I need to find
 out a more efficient way.


 On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash and...@andrewash.com wrote:

 The naive way would be to put all the users and their attributes into an
 RDD, then cartesian product that with itself.  Run the similarity score on
 every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) and
 take the .top(k) for each user.

 I doubt that you'll be able to take this approach with the 1T pairs
 though, so it might be worth looking at the literature for recommender
 systems to see what else is out there.


 On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li lixiaolima...@gmail.com wrote:

 Hi all,

 I am implementing an algorithm using Spark. I have one million users. I
 need to compute the similarity between each pair of users using some user's
 attributes.  For each user, I need to get top k most similar users. What is
 the best way to implement this?


 Thanks.
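
A minimal sketch of the naive cartesian / top-k route discussed in this thread (names are illustrative; at 1M users the resulting 1T pairs make this impractical, which is exactly the point above):

import org.apache.spark.rdd.RDD

case class User(id: Long, features: Array[Double])

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot  = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norm == 0.0) 0.0 else dot / norm
}

def topKSimilar(users: RDD[User], k: Int): RDD[(Long, Seq[(Double, Long)])] = {
  users.cartesian(users)
    .filter { case (u, v) => u.id != v.id }
    .map { case (u, v) => (u.id, (cosine(u.features, v.features), v.id)) }
    .groupByKey()                                   // fine for toy data, hopeless at 1M x 1M
    .mapValues(_.toSeq.sortBy(-_._1).take(k))       // top k most similar users per user
}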