Right, done with matrix blocks. Seems like a lot of duplicate effort, but 
that’s the way of OSS sometimes. 

I didn’t see transpose in the Jira. Are there plans for transpose, or for 
rowSimilarity without transpose? The latter seems easier than columnSimilarity 
in the general/naive case. Thresholds could also be used there.

A threshold for downsampling is going to be extremely hard to use in practice, 
but I’m not sure what the threshold applies to, so I must have skimmed too 
fast. If the threshold were expressed as some number of sigmas it would be 
more usable and could extend to non-cosine similarities. That would add a 
non-trivial step to the calc, of course. But without it I’ve never seen an 
absolute threshold used effectively, and I’ve tried; it’s too dependent on 
the dataset.
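To make the sigma idea concrete, here’s a toy sketch in plain Python (toy data, nothing from MLlib or Mahout): instead of an absolute cutoff, keep pairs more than k sigmas above the mean pairwise similarity, so the scale adapts to the dataset.

```python
import math
import statistics
from itertools import combinations

def cosine(u, v):
    # Plain cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy row vectors standing in for a similarity job's input.
rows = [[1, 0, 2], [2, 0, 4], [0, 3, 1], [1, 1, 1]]

sims = {(i, j): cosine(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)}

# Keep pairs more than k sigmas above the mean similarity;
# k is a tunable, but its scale no longer depends on the dataset.
mu = statistics.mean(sims.values())
sigma = statistics.pstdev(sims.values())
k = 1.0
kept = {pair: s for pair, s in sims.items() if s > mu + k * sigma}
```

In this toy run only the colinear pair (rows 0 and 1, similarity 1.0) clears the bar, while an absolute cutoff would need re-tuning for every dataset.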

Observation as a question: one of the reasons I use the Mahout Spark R-like DSL 
is for row and column similarity (used in cooccurrence recommenders), and you 
can count on a robust full linear algebra implementation. Once you have full 
linear algebra on top of an optimizer, many of the factorization methods are 
dead simple. It seems MLlib has chosen to do optimized higher-level 
algorithms first instead of full linear algebra first?

On Jan 17, 2015, at 6:27 PM, Reza Zadeh <r...@databricks.com> wrote:

We're focused on providing block matrices, which makes transposition simple: 
https://issues.apache.org/jira/browse/SPARK-3434

On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
In the Mahout Spark R-like DSL, [A’A] and [AA’] don’t actually do a 
transpose; it’s optimized out. Mahout has had a standalone row matrix transpose 
since day 1 and supports it in the Spark version. You can’t really do matrix 
algebra without it, even though it’s often possible to optimize it away. 
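For intuition, here’s a toy sketch in plain Python (not Mahout’s actual implementation) of why the transpose can be optimized out: column cosine similarities fall out of the gram matrix A’A, and A’A can be accumulated one row at a time, so the transposed matrix is never materialized.

```python
import math

# Toy matrix as a list of rows (what a distributed row matrix holds).
A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0],
     [4.0, 0.0, 0.0]]

n = len(A[0])

# Gram matrix G = A'A accumulated row by row: each row contributes
# its own outer product, so no transpose is ever built.
G = [[0.0] * n for _ in range(n)]
for row in A:
    for i in range(n):
        if row[i] != 0.0:  # sparse skip: zero entries contribute nothing
            for j in range(n):
                G[i][j] += row[i] * row[j]

def col_cosine(i, j):
    # Cosine between columns i and j, read directly off G:
    # G[i][j] is the dot product, the diagonal holds squared norms.
    return G[i][j] / math.sqrt(G[i][i] * G[j][j])
```

The same trick done on the rows of A’ (i.e. accumulating AA’) gives row similarities, which is why a transpose-free rowSimilarity is plausible even where a general transpose is awkward.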

Row similarity with LLR is much simpler than cosine, since you only need 
non-zero sums for column, row, and matrix elements, so rowSimilarity is 
implemented in Mahout for Spark. Full-blown row similarity, including all the 
different similarity methods (long since implemented in Hadoop MapReduce), 
hasn’t been moved to Spark yet.
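For reference, the LLR score itself needs only those counts. A toy version in plain Python, following Dunning’s entropy formulation (as popularized by Mahout; this is a sketch, not the actual Spark code):

```python
import math

def x_log_x(x):
    # x * ln(x), with the conventional 0 * ln(0) = 0.
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy over a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: both events cooccur; k12/k21: one event only; k22: neither.
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)
```

Independent counts score near zero and strong cooccurrence scores high, which is why only the non-zero sums are ever needed.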

Yep, rows are not covered in the blog, my mistake. Too bad, since row 
similarity has a lot of uses and can at the very least be optimized for 
output matrix symmetry.

On Jan 17, 2015, at 11:44 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

Yeah okay, thanks.

On Jan 17, 2015, at 11:15 AM, Reza Zadeh <r...@databricks.com> wrote:

> Pat, columnSimilarities is what that blog post is about, and is already part 
> of Spark 1.2.
> 
> rowSimilarities on a RowMatrix is a little more tricky because you can't 
> transpose a RowMatrix easily; it is being tracked by this JIRA: 
> https://issues.apache.org/jira/browse/SPARK-4823
> 
> Andrew, sometimes (not always) it's OK to transpose a RowMatrix: if, for 
> example, the number of rows in your RowMatrix is less than 1M, you can 
> transpose it and use columnSimilarities on the result.
> 
> 
> On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> BTW it looks like row and column similarities (cosine-based) are coming to 
> MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in 
> master yet. Does anyone know the status?
> 
> See: 
> https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
> 
> Also, the method for computation reduction (making it less than O(n^2)) seems 
> rooted in cosine. A different computation-reduction method, tied to LLR, is 
> used in the Mahout code. Seems like we should get these together.
>  
> On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
> 
> Excellent, thanks Pat.
> 
> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
>> Mahout’s Spark implementation of rowSimilarity is in the Scala 
>> SimilarityAnalysis class. It actually does either row or column similarity 
>> but only supports LLR at present. It does [AA’] for rows or [A’A] for 
>> columns first, then calculates the distance (LLR) for non-zero elements. 
>> This is a major optimization for sparse matrices. As I recall, the old 
>> Hadoop code only did this for half the matrix since it’s symmetric, but that 
>> optimization isn’t in the current code because the downsampling is done as 
>> LLR is calculated, so the entire similarity matrix is never actually 
>> calculated unless you disable downsampling. 
>> 
>> The primary use is for recommenders, but I’ve used it (in the test suite) for 
>> row-wise text token similarity too.  
>> 
>> On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>> 
>> Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259 and 
>> poking around to see how to do things.
>> 
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
>> 
>> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>> 
>>> Andrew, you would be better off using Mahout's RowSimilarityJob for what 
>>> you're trying to accomplish.
>>> 
>>>  1.  It does give you pair-wise distances
>>>  2.  You can specify the distance measure you're looking to use
>>>  3.  There's the old MapReduce impl and the Spark DSL impl, per your 
>>> preference.
>>> 
>>> From: Andrew Musselman <andrew.mussel...@gmail.com>
>>> To: Reza Zadeh <r...@databricks.com> 
>>> Cc: user <user@spark.apache.org> 
>>> Sent: Saturday, January 17, 2015 11:29 AM
>>> Subject: Re: Row similarities
>>> 
>>> Thanks Reza, interesting approach.  I think what I actually want is to 
>>> calculate pair-wise distance, on second thought.  Is there a pattern for 
>>> that?
>>> 
>>> 
>>> 
>>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>> 
>>>> You can use K-means 
>>>> (https://spark.apache.org/docs/latest/mllib-clustering.html) with a 
>>>> suitably large k. Each cluster should correspond to rows that are similar 
>>>> to one another.
>>>> 
>>>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman 
>>>> <andrew.mussel...@gmail.com> wrote:
>>>> What's a good way to calculate similarities between all vector-rows in a 
>>>> matrix or RDD[Vector]?
>>>> 
>>>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm 
>>>> going down a good path to transpose a matrix in order to run that.
>>>> 
>>> 
>>> 
>> 
> 
> 


