Excellent, thanks Pat.
> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> Mahout’s Spark implementation of rowsimilarity is in the Scala
> SimilarityAnalysis class. It actually does either row or column similarity
> but only supports LLR at present. It computes [AA’] for rows or [A’A] for
> columns first, then calculates the distance (LLR) only for the non-zero
> elements. This is a major optimization for sparse matrices. As I recall, the
> old Hadoop code only did this for half the matrix since it’s symmetric, but
> that optimization isn’t in the current code because the downsampling is done
> as LLR is calculated, so the entire similarity matrix is never actually
> calculated unless you disable downsampling.
>
> The primary use is for recommenders but I’ve used it (in the test suite) for
> row-wise text token similarity too.
>
> On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.mussel...@gmail.com>
> wrote:
>
> Yeah, that's the kind of thing I'm looking for; I was looking at SPARK-4259
> and poking around to see how to do things.
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
>
>> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>>
>> Andrew, you would be better off using Mahout's RowSimilarityJob for what
>> you are trying to accomplish.
>>
>> 1. It does give you pair-wise distances.
>> 2. You can specify the distance measure you want to use.
>> 3. There's the old MapReduce impl and the Spark DSL impl, per your
>>    preference.
>>
>> From: Andrew Musselman <andrew.mussel...@gmail.com>
>> To: Reza Zadeh <r...@databricks.com>
>> Cc: user <user@spark.apache.org>
>> Sent: Saturday, January 17, 2015 11:29 AM
>> Subject: Re: Row similarities
>>
>> Thanks Reza, interesting approach. I think what I actually want is to
>> calculate pair-wise distance, on second thought. Is there a pattern for
>> that?
>>
>>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>> You can use K-means with a suitably large k. Each cluster should
>>> correspond to rows that are similar to one another.
>>>
>>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman
>>> <andrew.mussel...@gmail.com> wrote:
>>> What's a good way to calculate similarities between all vector-rows in a
>>> matrix or RDD[Vector]?
>>>
>>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
>>> going down a good path to transpose a matrix in order to run that.
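
For anyone reading this thread later: the approach described above (count
co-occurrences first, then score only the non-zero entries with LLR) can be
sketched in a few lines of plain Python. This is only an illustration under
some assumptions, not Mahout's or Spark's API: the matrix is treated as
binary, each row is given as a set of column indices, there is no
downsampling, and the names row_similarities, llr, entropy and x_log_x are
made up for this sketch. For a binary matrix A, the co-occurrence counts are
the non-zero off-diagonal entries of AA'.

```python
from itertools import combinations
from math import log


def x_log_x(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a collection of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def row_similarities(rows, num_cols):
    """Score every pair of rows that shares at least one column.

    rows is a list of sets of column indices (a binary sparse matrix).
    Returns a dict {(i, j): score} with i < j. Pairs with no shared
    column are never visited, which is where the sparse saving comes from.
    """
    # Invert the matrix: column -> rows that contain it (in ascending order).
    by_col = {}
    for i, cols in enumerate(rows):
        for c in cols:
            by_col.setdefault(c, []).append(i)

    # Co-occurrence counts: number of columns shared by rows i and j.
    cooc = {}
    for members in by_col.values():
        for i, j in combinations(members, 2):
            cooc[(i, j)] = cooc.get((i, j), 0) + 1

    scores = {}
    for (i, j), k11 in cooc.items():
        k12 = len(rows[i]) - k11            # columns in row i only
        k21 = len(rows[j]) - k11            # columns in row j only
        k22 = num_cols - k11 - k12 - k21    # columns in neither row
        scores[(i, j)] = llr(k11, k12, k21, k22)
    return scores


rows = [{0, 1, 2}, {0, 1, 2}, {3, 4}, {2, 5}]
scores = row_similarities(rows, num_cols=6)
```

Row 2 shares no column with any other row, so it does not appear in scores
at all. On a large sparse matrix most pairs fall into that category, and
none of them cost any work.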