Mahout’s Spark implementation of row similarity is in the Scala 
SimilarityAnalysis class. It actually does either row or column similarity but 
only supports LLR at present. It computes [AA’] for rows or [A’A] for columns 
first, then calculates the similarity score (LLR) only for the non-zero 
elements. This is a major optimization for sparse matrices. As I recall, the 
old Hadoop code only did this for half the matrix since it’s symmetric, but 
that optimization isn’t in the current code because the downsampling is done 
as LLR is calculated, so the entire similarity matrix is never actually 
calculated unless you disable downsampling.
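
To make the idea concrete, here is a minimal sketch in plain Python. It is not 
Mahout's code: the names row_similarity and max_per_row are made up for 
illustration, the input is assumed to be a binary sparse matrix given as a 
dict of row id to the set of non-zero column indices, and the top-k cut is 
only a stand-in for the downsampling mentioned above. It counts co-occurrences 
only for row pairs that share at least one non-zero column, then scores those 
pairs with the log-likelihood ratio.

```python
import math
from collections import defaultdict
from itertools import combinations


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))


def row_similarity(rows, n_cols, max_per_row=None):
    # rows: dict of row id -> set of column indices with non-zero entries
    # (hypothetical input format, chosen for this sketch)
    by_col = defaultdict(set)
    for r, cols in rows.items():
        for c in cols:
            by_col[c].add(r)

    # Co-occurrence counts, built only for row pairs that share a column,
    # so pairs with a zero count are never touched
    cooc = defaultdict(int)
    for members in by_col.values():
        for a, b in combinations(sorted(members), 2):
            cooc[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), k11 in cooc.items():
        k12 = len(rows[a]) - k11          # columns in a only
        k21 = len(rows[b]) - k11          # columns in b only
        k22 = n_cols - k11 - k12 - k21    # columns in neither
        score = llr(k11, k12, k21, k22)
        sims[a][b] = score
        sims[b][a] = score

    if max_per_row is not None:
        # Keep only the strongest max_per_row similarities for each row
        return {
            r: dict(sorted(s.items(), key=lambda kv: kv[1],
                           reverse=True)[:max_per_row])
            for r, s in sims.items()
        }
    return dict(sims)


rows = {"a": {0, 1, 2}, "b": {0, 1, 2}, "c": {3, 4}, "d": {2, 3}}
sims = row_similarity(rows, n_cols=6)
for row_id in sorted(sims):
    print(row_id, sims[row_id])
```

Rows "a" and "c" share no column, so that pair never gets a score at all, 
which is where the savings on sparse data come from.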

The primary use is for recommenders but I’ve used it (in the test suite) for 
row-wise text token similarity too.  

On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
wrote:

Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259 and 
poking around to see how to do things.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259 

On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

> Andrew, you would be better off using Mahout's RowSimilarityJob for what you 
> are trying to accomplish.
> 
>  1.  It does give you pair-wise distances.
>  2.  You can specify the distance measure you want to use.
>  3.  There's the old MapReduce impl and the Spark DSL impl, per your 
>      preference.
> 
> From: Andrew Musselman <andrew.mussel...@gmail.com>
> To: Reza Zadeh <r...@databricks.com>
> Cc: user <user@spark.apache.org>
> Sent: Saturday, January 17, 2015 11:29 AM
> Subject: Re: Row similarities
> 
> Thanks Reza, interesting approach.  I think what I actually want is to 
> calculate pair-wise distance, on second thought.  Is there a pattern for that?
> 
> 
> 
> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
> 
>> You can use K-means 
>> <https://spark.apache.org/docs/latest/mllib-clustering.html> with a suitably 
>> large k. Each cluster should correspond to rows that are similar to one 
>> another.
>> 
>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
>> <andrew.mussel...@gmail.com> wrote:
>> What's a good way to calculate similarities between all vector-rows in a 
>> matrix or RDD[Vector]?
>> 
>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm 
>> going down a good path to transpose a matrix in order to run that.
>> 
> 
> 
