Pat, columnSimilarities is what that blog post is about, and is already part of Spark 1.2.
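For readers joining the thread: columnSimilarities returns the cosine similarity between every pair of columns of the matrix. A minimal pure-Python sketch of that computation, leaving out DIMSUM's sampling/thresholding; the function name and toy data are illustrative, not Spark's API:

```python
import math

# Toy matrix stored as a list of rows, as in an RDD[Vector] of rows.
rows = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [2.0, 1.0, 0.0],
]

def column_cosine_similarities(rows):
    """All-pairs cosine similarity between columns -- the quantity
    columnSimilarities returns, computed here by brute force."""
    n_cols = len(rows[0])
    # Materialize columns; at scale this is exactly what you want to avoid.
    cols = [[r[j] for r in rows] for j in range(n_cols)]
    norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
    sims = {}
    # Only the upper triangle, since cosine similarity is symmetric.
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims
```

DIMSUM gets the same upper-triangle output without the all-pairs work by sampling, which is what makes it tractable on wide matrices.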
rowSimilarities in a RowMatrix is a little trickier because you can't
transpose a RowMatrix easily; it is being tracked by this JIRA:
https://issues.apache.org/jira/browse/SPARK-4823

Andrew, sometimes (not always) it's OK to transpose a RowMatrix: if, for
example, the number of rows in your RowMatrix is less than 1m, you can
transpose it and use columnSimilarities.

On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> BTW it looks like row and column similarities (cosine based) are coming
> to MLlib through DIMSUM. Andrew said rowSimilarity doesn't seem to be in
> the master yet. Does anyone know the status?
>
> See:
> https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
>
> Also, the method for computation reduction (making it less than O(n^2))
> seems rooted in cosine. A different computation-reduction method, tied to
> LLR, is used in the Mahout code. Seems like we should bring these
> together.
>
> On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.mussel...@gmail.com>
> wrote:
>
> Excellent, thanks Pat.
>
> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> Mahout's Spark implementation of rowsimilarity is in the Scala
> SimilarityAnalysis class. It actually does either row or column
> similarity but only supports LLR at present. It does [A'A] for columns or
> [AA'] for rows first, then calculates the distance (LLR) for the non-zero
> elements. This is a major optimization for sparse matrices. As I recall,
> the old Hadoop code only did this for half the matrix since it's
> symmetric, but that optimization isn't in the current code because the
> downsampling is done as LLR is calculated, so the entire similarity
> matrix is never actually calculated unless you disable downsampling.
>
> The primary use is for recommenders, but I've used it (in the test suite)
> for row-wise text token similarity too.
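To make Pat's LLR point concrete: Mahout scores each co-occurrence with Dunning's log-likelihood ratio over a 2x2 contingency table, which is roughly what Mahout's LogLikelihood.logLikelihoodRatio computes. A small pure-Python sketch (function names are mine, not Mahout's):

```python
import math

def _entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts:
    N*log(N) - sum(k*log(k)), skipping zero cells."""
    total = sum(counts)
    return sum(-k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (Dunning's G^2) for a 2x2 co-occurrence table:
    k11 = both events, k12/k21 = one event only, k22 = neither."""
    row_entropy = _entropy(k11 + k12, k21 + k22)
    col_entropy = _entropy(k11 + k21, k12 + k22)
    mat_entropy = _entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)
```

Independent counts score near zero, strongly associated counts score high, so downsampling by keeping only the top-scoring pairs (as the SimilarityAnalysis code does) avoids ever materializing the full similarity matrix.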
> On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.mussel...@gmail.com>
> wrote:
>
> Yeah, that's the kind of thing I'm looking for; I was looking at
> SPARK-4259 and poking around to see how to do things.
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
>
> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com>
> wrote:
>
> Andrew, you would be better off using Mahout's RowSimilarityJob for what
> you are trying to accomplish:
>
> 1. It gives you pair-wise distances.
> 2. You can specify the distance measure you want to use.
> 3. There are both the old MapReduce implementation and the Spark DSL
> implementation, per your preference.
>
> ------------------------------
> *From:* Andrew Musselman <andrew.mussel...@gmail.com>
> *To:* Reza Zadeh <r...@databricks.com>
> *Cc:* user <user@spark.apache.org>
> *Sent:* Saturday, January 17, 2015 11:29 AM
> *Subject:* Re: Row similarities
>
> Thanks Reza, interesting approach. On second thought, I think what I
> actually want is to calculate pair-wise distance. Is there a pattern for
> that?
>
> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>
> You can use K-means
> <https://spark.apache.org/docs/latest/mllib-clustering.html> with a
> suitably large k. Each cluster should correspond to rows that are
> similar to one another.
>
> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
> What's a good way to calculate similarities between all vector-rows in a
> matrix or RDD[Vector]?
>
> I'm seeing RowMatrix has a columnSimilarities method, but I'm not sure
> I'm going down a good path by transposing a matrix in order to run that.
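Since Andrew's underlying question is all-pairs distance between rows, the brute-force version everyone above is trying to avoid is just this (pure-Python sketch, names mine; at scale this is exactly the O(n^2) pair enumeration that DIMSUM's sampling and Mahout's LLR downsampling cut down):

```python
import math

def pairwise_distances(rows):
    """All-pairs Euclidean distance between row vectors, keyed by the
    (i, j) index pair with i < j. Quadratic in the number of rows."""
    dists = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(rows[i], rows[j])))
            dists[(i, j)] = d
    return dists

# Three points on a line through the origin.
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
```

Reza's K-means suggestion sidesteps the quadratic blowup: with a suitably large k, rows that land in the same cluster are the "similar" pairs, and you only ever compare rows against k centroids.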