Re: Row similarities

Andrew Musselman Sat, 17 Jan 2015 09:39:11 -0800

Excellent, thanks Pat.


> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
> Mahout’s Spark implementation of row similarity is in the Scala 
> SimilarityAnalysis class. It actually does either row or column similarity 
> but only supports LLR at present. It does [A’A] for columns or [AA’] for rows 
> first, then calculates the similarity (LLR) for non-zero elements. This is a 
> major optimization for sparse matrices. As I recall, the old Hadoop code only 
> did this for half the matrix since it’s symmetric, but that optimization isn’t 
> in the current code because the downsampling is done as LLR is calculated, so 
> the entire similarity matrix is never actually calculated unless you disable 
> downsampling. 
> 
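A minimal sketch of the idea described above, in plain Python rather than Mahout or Spark: count co-occurrences between rows first, then compute an LLR score only for pairs whose count is non-zero. The function names (`llr`, `row_similarity_llr`) and the set-of-column-indices input format are assumptions made for illustration, not Mahout's API. The LLR formula is the usual entropy-based G² statistic on a 2x2 contingency table. No downsampling is done here, and only the upper half of the symmetric result is produced.

```python
from math import log


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def row_similarity_llr(rows, num_cols):
    """rows: list of sets, each holding the column indices that are non-zero."""
    # Invert to column -> rows so only co-occurring row pairs are ever touched.
    by_col = {}
    for i, cols in enumerate(rows):
        for c in cols:
            by_col.setdefault(c, []).append(i)

    # Co-occurrence counts for row pairs (i, j) with i < j.
    cooc = {}
    for members in by_col.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                pair = (members[a], members[b])
                cooc[pair] = cooc.get(pair, 0) + 1

    # LLR only for the non-zero co-occurrence entries.
    sims = {}
    for (i, j), k11 in cooc.items():
        k12 = len(rows[i]) - k11          # columns in row i only
        k21 = len(rows[j]) - k11          # columns in row j only
        k22 = num_cols - k11 - k12 - k21  # columns in neither row
        sims[(i, j)] = llr(k11, k12, k21, k22)
    return sims


rows = [{0, 1, 2}, {0, 1, 2}, {3, 4}]
sims = row_similarity_llr(rows, num_cols=6)
```

Rows 0 and 1 share columns and get a score. Row 2 shares nothing with the others, so no pair involving it is ever visited, which is where the saving on sparse data comes from.
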
> The primary use is for recommenders but I’ve used it (in the test suite) for 
> row-wise text token similarity too.  
> 
> On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
> wrote:
> 
> Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259 and 
> poking around to see how to do things.
> 
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
> 
>> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>> 
>> Andrew, you would be better off using Mahout's RowSimilarityJob for what you 
>> are trying to accomplish.
>> 
>>  1.  It does give you pair-wise distances
>>  2.  You can specify the distance measure you are looking to use
>>  3.  There's the old MapReduce impl and the Spark DSL impl, per your preference.
>> 
>> From: Andrew Musselman <andrew.mussel...@gmail.com>
>> To: Reza Zadeh <r...@databricks.com> 
>> Cc: user <user@spark.apache.org> 
>> Sent: Saturday, January 17, 2015 11:29 AM
>> Subject: Re: Row similarities
>> 
>> Thanks Reza, interesting approach.  I think what I actually want is to 
>> calculate pair-wise distance, on second thought.  Is there a pattern for 
>> that?
>> 
>> 
>> 
>>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>> 
>>> You can use K-means with a suitably large k. Each cluster should correspond 
>>> to rows that are similar to one another.
>>> 
>>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
>>> <andrew.mussel...@gmail.com> wrote:
>>> What's a good way to calculate similarities between all vector-rows in a 
>>> matrix or RDD[Vector]?
>>> 
>>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm 
>>> going down a good path to transpose a matrix in order to run that.
>
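
The transpose idea from the original question can be sketched in plain Python, not Spark: the similarities between the rows of a matrix are the similarities between the columns of its transpose. Spark's `RowMatrix.columnSimilarities` returns cosine similarities between columns, so cosine is used here too. The helper names below are hypothetical, and this brute-force version compares every pair, so it only suits small dense inputs.

```python
from math import sqrt


def transpose(matrix):
    # Turn a list of rows into a list of columns.
    return [list(col) for col in zip(*matrix)]


def column_cosine_similarities(matrix):
    """Cosine similarity for every column pair (i, j) with i < j."""
    cols = transpose(matrix)
    norms = [sqrt(sum(v * v for v in c)) for c in cols]
    sims = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if norms[i] == 0 or norms[j] == 0:
                continue  # cosine is undefined for an all-zero vector
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


def row_cosine_similarities(matrix):
    # Rows of the matrix are the columns of its transpose.
    return column_cosine_similarities(transpose(matrix))


matrix = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
row_sims = row_cosine_similarities(matrix)
```

If a distance is wanted rather than a similarity, `1 - similarity` gives the cosine distance for each pair.
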
