Re: Row similarities

Andrew Musselman Sat, 17 Jan 2015 19:31:25 -0800

Makes sense.


> On Jan 17, 2015, at 6:27 PM, Reza Zadeh <r...@databricks.com> wrote:
> 
> We're focused on providing block matrices, which makes transposition simple: 
> https://issues.apache.org/jira/browse/SPARK-3434
> 
>> On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> In the Mahout Spark R-like DSL, [A’A] and [AA’] don’t actually do a 
>> transpose; it’s optimized out. Mahout has had a standalone row matrix 
>> transpose since day 1 and supports it in the Spark version. Can’t really do 
>> matrix algebra without it even though it’s often possible to optimize it 
>> away. 
>> 
>> Row similarity with LLR is much simpler than cosine since you only need 
>> non-zero sums for column, row, and matrix elements, so rowSimilarity is 
>> implemented in Mahout for Spark. Full-blown row similarity including all the 
>> different similarity methods (long since implemented in Hadoop MapReduce) 
>> hasn’t been moved to Spark yet.
>> 
>> Yep, rows are not covered in the blog, my mistake. Too bad, since row 
>> similarity has a lot of uses and can at the very least be optimized for 
>> output matrix symmetry.
>> 
>> On Jan 17, 2015, at 11:44 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
>> wrote:
>> 
>> Yeah okay, thanks.
>> 
>> On Jan 17, 2015, at 11:15 AM, Reza Zadeh <r...@databricks.com> wrote:
>> 
>>> Pat, columnSimilarities is what that blog post is about, and is already 
>>> part of Spark 1.2.
>>> 
>>> rowSimilarities in a RowMatrix is a little more tricky because you can't 
>>> transpose a RowMatrix easily, and is being tracked by this JIRA: 
>>> https://issues.apache.org/jira/browse/SPARK-4823
>>> 
>>> Andrew, sometimes (not always) it's OK to transpose a RowMatrix: if, for 
>>> example, the number of rows in your RowMatrix is less than 1 million, you 
>>> can transpose it and use columnSimilarities on the result.
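
The transpose-then-compare idea above can be sketched in plain Python. This is only an illustration on small in-memory lists, not Spark code: `transpose`, `column_similarities`, and `row_similarities` are hypothetical helper names, and the brute-force loop stands in for (and is not) the sampling-based DIMSUM computation. It visits only pairs with i < j, since the similarity matrix is symmetric.

```python
import math


def transpose(rows):
    """Turn a list of equal-length rows into a list of columns."""
    return [list(col) for col in zip(*rows)]


def column_similarities(matrix):
    """Brute-force cosine similarity for every column pair (i, j) with i < j."""
    cols = transpose(matrix)
    norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
    sims = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if norms[i] == 0 or norms[j] == 0:
                continue  # cosine is undefined for an all-zero column
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


def row_similarities(matrix):
    """Row similarities are the column similarities of the transpose."""
    return column_similarities(transpose(matrix))


if __name__ == "__main__":
    data = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    for pair, sim in sorted(row_similarities(data).items()):
        print(pair, sim)
```

The result is keyed by (i, j) row-index pairs, so each pair is computed once rather than twice.
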
>>> 
>>> 
>>>> On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <p...@occamsmachete.com> 
>>>> wrote:
>>>> BTW it looks like row and column similarities (cosine based) are coming to 
>>>> MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the 
>>>> master yet. Does anyone know the status?
>>>> 
>>>> See: 
>>>> https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
>>>> 
>>>> Also the method for computation reduction (make it less than O(n^2)) seems 
>>>> rooted in cosine. A different computation reduction method is used in the 
>>>> Mahout code tied to LLR. Seems like we should get these together.
>>>>  
>>>> On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
>>>> wrote:
>>>> 
>>>> Excellent, thanks Pat.
>>>> 
>>>> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>> 
>>>>> Mahout’s Spark implementation of rowsimilarity is in the Scala 
>>>>> SimilarityAnalysis class. It actually does either row or column 
>>>>> similarity but only supports LLR at present. It does [A’A] for columns or 
>>>>> [AA’] for rows first, then calculates the distance (LLR) for non-zero 
>>>>> elements. This is a major optimization for sparse matrices. As I recall 
>>>>> the old hadoop code only did this for half the matrix since it’s 
>>>>> symmetric but that optimization isn’t in the current code because the 
>>>>> downsampling is done as LLR is calculated, so the entire similarity 
>>>>> matrix is never actually calculated unless you disable downsampling. 
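
A minimal sketch of the LLR scoring step described above, in plain Python. It uses the standard entropy-based log-likelihood ratio on a 2x2 contingency table; whether this matches the Mahout code line for line is an assumption. The function names (`llr`, `row_llr`) are hypothetical, and downsampling is left out. Only the counts of non-zero entries are needed, not the values themselves.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) is 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of co-occurrence counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    if row + col < mat:
        return 0.0  # guard against small negative values from rounding
    return 2.0 * (row + col - mat)


def row_llr(a, b):
    """LLR between two rows, based only on which columns are non-zero."""
    nz_a = {i for i, x in enumerate(a) if x != 0}
    nz_b = {i for i, x in enumerate(b) if x != 0}
    k11 = len(nz_a & nz_b)        # columns non-zero in both rows
    k12 = len(nz_a) - k11         # non-zero in a only
    k21 = len(nz_b) - k11         # non-zero in b only
    k22 = len(a) - k11 - k12 - k21  # zero in both
    return llr(k11, k12, k21, k22)


if __name__ == "__main__":
    print(row_llr([1, 0, 1, 0], [1, 0, 0, 0]))
```

A score near zero means the two rows co-occur about as often as chance would predict; larger scores mean the overlap (or lack of it) is more surprising.
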
>>>>> 
>>>>> The primary use is for recommenders but I’ve used it (in the test suite) 
>>>>> for row-wise text token similarity too. 
>>>>> 
>>>>> On Jan 17, 2015, at 9:00 AM, Andrew Musselman 
>>>>> <andrew.mussel...@gmail.com> wrote:
>>>>> 
>>>>> Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259 
>>>>> and poking around to see how to do things.
>>>>> 
>>>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
>>>>> 
>>>>>> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> Andrew, u would be better off using Mahout's RowSimilarityJob for what u 
>>>>>> r trying to accomplish.
>>>>>> 
>>>>>>  1.  It does give u pair-wise distances
>>>>>>  2.  U can specify the Distance measure u r looking to use
>>>>>>  3.  There's the old MapReduce impl and the Spark DSL impl per ur 
>>>>>> preference.
>>>>>> 
>>>>>> From: Andrew Musselman <andrew.mussel...@gmail.com>
>>>>>> To: Reza Zadeh <r...@databricks.com> 
>>>>>> Cc: user <user@spark.apache.org> 
>>>>>> Sent: Saturday, January 17, 2015 11:29 AM
>>>>>> Subject: Re: Row similarities
>>>>>> 
>>>>>> Thanks Reza, interesting approach.  I think what I actually want is to 
>>>>>> calculate pair-wise distance, on second thought.  Is there a pattern for 
>>>>>> that?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>> 
>>>>>>> You can use K-means with a suitably large k. Each cluster should 
>>>>>>> correspond to rows that are similar to one another.
>>>>>>> 
>>>>>>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
>>>>>>> <andrew.mussel...@gmail.com> wrote:
>>>>>>> What's a good way to calculate similarities between all vector-rows in 
>>>>>>> a matrix or RDD[Vector]?
>>>>>>> 
>>>>>>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure 
>>>>>>> I'm going down a good path to transpose a matrix in order to run that.
>
