Yeah okay, thanks.

> On Jan 17, 2015, at 11:15 AM, Reza Zadeh <r...@databricks.com> wrote:
> 
> Pat, columnSimilarities is what that blog post is about, and is already part 
> of Spark 1.2.
> 
> rowSimilarities in a RowMatrix is a little trickier because you can't 
> transpose a RowMatrix easily; it is being tracked by this JIRA: 
> https://issues.apache.org/jira/browse/SPARK-4823
> 
> Andrew, sometimes (not always) it's OK to transpose a RowMatrix: if, for 
> example, the number of rows in your RowMatrix is less than 1m, you can 
> transpose it and use columnSimilarities to get row similarities.
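For example, a minimal sketch of the transpose-then-columnSimilarities trick 
described above (names are illustrative, and it assumes the original row count 
is small enough to become the column count):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
    import org.apache.spark.rdd.RDD

    // rows: RDD[Vector] with a modest number of rows (say < 1m), so the
    // transposed matrix has few enough columns for columnSimilarities.
    def rowSimilaritiesViaTranspose(rows: RDD[Vector]): CoordinateMatrix = {
      // Emit each non-zero entry with its (row, col) indices swapped,
      // i.e. an explicit transpose of the original matrix.
      val transposedEntries = rows.zipWithIndex().flatMap { case (vec, rowIdx) =>
        vec.toArray.zipWithIndex.collect {
          case (value, colIdx) if value != 0.0 => MatrixEntry(colIdx.toLong, rowIdx, value)
        }
      }
      // Columns of the transposed matrix are the original rows, so
      // columnSimilarities() now gives cosine similarities between original rows.
      new CoordinateMatrix(transposedEntries).toRowMatrix().columnSimilarities()
    }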
> 
> 
>> On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> BTW it looks like row and column similarities (cosine-based) are coming to 
>> MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in 
>> master yet. Does anyone know the status?
>> 
>> See: 
>> https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
>> 
>> Also, the method for computation reduction (making it less than O(n^2)) seems 
>> rooted in cosine. A different computation-reduction method, tied to LLR, is 
>> used in the Mahout code. Seems like we should get these together.
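For reference, a small sketch of the DIMSUM-style reduction as exposed in Spark 
1.2 (the matrix and the threshold value here are illustrative):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // mat: RowMatrix whose column similarities we want.
    def dimsumVsExact(mat: RowMatrix) = {
      // With a positive threshold, DIMSUM sampling kicks in: column pairs whose
      // cosine similarity is likely below the threshold are skipped, which is
      // the sub-O(n^2) reduction the blog post describes.
      val approx = mat.columnSimilarities(0.1)
      // Exact (brute-force) all-pairs version, for comparison.
      val exact = mat.columnSimilarities()
      (approx, exact)
    }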
>>  
>> On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
>> wrote:
>> 
>> Excellent, thanks Pat.
>> 
>> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> 
>>> Mahout’s Spark implementation of rowsimilarity is in the Scala 
>>> SimilarityAnalysis class. It actually does either row or column similarity 
>>> but only supports LLR at present. It computes [A’A] for columns or [AA’] 
>>> for rows first, then calculates the distance (LLR) only for non-zero 
>>> elements. This is a major optimization for sparse matrices. As I recall, 
>>> the old Hadoop code only did this for half the matrix, since it’s 
>>> symmetric, but that optimization isn’t in the current code: the 
>>> downsampling is done as LLR is calculated, so the entire similarity matrix 
>>> is never actually calculated unless you disable downsampling. 
>>> 
>>> The primary use is for recommenders but I’ve used it (in the test suite) 
>>> for row-wise text token similarity too.  
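As an aside, a minimal sketch of the LLR score mentioned above, using the 
standard entropy formulation (the same one Mahout's LogLikelihood class uses); 
this is illustrative and not lifted from SimilarityAnalysis itself:

    // k11: co-occurrences of A and B, k12: B without A,
    // k21: A without B,               k22: neither A nor B.
    object Llr {
      private def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

      private def entropy(counts: Long*): Double =
        xLogX(counts.sum) - counts.map(xLogX).sum

      def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
        val rowEntropy    = entropy(k11 + k12, k21 + k22)
        val columnEntropy = entropy(k11 + k21, k12 + k22)
        val matrixEntropy = entropy(k11, k12, k21, k22)
        // Guard against tiny negative values from floating-point round-off.
        math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
      }
    }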
>>> 
>>> On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
>>> wrote:
>>> 
>>> Yeah, that's the kind of thing I'm looking for; I was looking at SPARK-4259 
>>> and poking around to see how to do things.
>>> 
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
>>> 
>>>> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>>>> 
>>>> Andrew, you would be better off using Mahout's RowSimilarityJob for what 
>>>> you are trying to accomplish.
>>>> 
>>>>  1.  It does give you pair-wise distances
>>>>  2.  You can specify the distance measure you are looking to use
>>>>  3.  There's the old MapReduce implementation and the Spark DSL 
>>>> implementation, per your preference.
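For the Spark DSL route, something along these lines should work; this is a 
rough sketch, and the exact method name and signature on SimilarityAnalysis 
should be checked against the Mahout version you are running, so treat it as an 
assumption rather than a verified API:

    import org.apache.mahout.math.cf.SimilarityAnalysis
    import org.apache.mahout.math.drm.DrmLike

    // drmA: a distributed row matrix of interactions (e.g. users x items).
    // rowSimilarity is assumed to return an LLR-based row-row similarity DRM.
    def rowSims(drmA: DrmLike[Int]): DrmLike[Int] =
      SimilarityAnalysis.rowSimilarity(drmA)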
>>>> 
>>>> From: Andrew Musselman <andrew.mussel...@gmail.com>
>>>> To: Reza Zadeh <r...@databricks.com> 
>>>> Cc: user <user@spark.apache.org> 
>>>> Sent: Saturday, January 17, 2015 11:29 AM
>>>> Subject: Re: Row similarities
>>>> 
>>>> Thanks Reza, interesting approach.  On second thought, I think what I 
>>>> actually want is to calculate pair-wise distances.  Is there a pattern for 
>>>> that?
>>>> 
>>>> 
>>>> 
>>>>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>> 
>>>>> You can use K-means with a suitably large k. Each cluster should 
>>>>> correspond to rows that are similar to one another.
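A minimal sketch of that suggestion (k and the iteration count are placeholders 
to tune for your data):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Rows that land in the same cluster are "similar" in the sense described above.
    def clusterRows(data: RDD[Vector]): RDD[Int] = {
      val model = KMeans.train(data, 1000, 20)  // k = 1000, 20 iterations
      data.map(v => model.predict(v))           // cluster id for each row
    }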
>>>>> 
>>>>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
>>>>> <andrew.mussel...@gmail.com> wrote:
>>>>> What's a good way to calculate similarities between all vector-rows in a 
>>>>> matrix or RDD[Vector]?
>>>>> 
>>>>> I see that RowMatrix has a columnSimilarities method, but I'm not sure 
>>>>> that transposing a matrix in order to run it is a good path.
> 
