Re: Row similarities

Reza Zadeh Sat, 17 Jan 2015 18:30:07 -0800

We're focused on providing block matrices, which makes transposition
simple: https://issues.apache.org/jira/browse/SPARK-3434
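
As a rough illustration of why transposition matters for this thread: once a
matrix can be transposed, cosine similarities between its rows are simply the
column similarities of the transpose. The sketch below is plain Python on a
small dense matrix, not MLlib code, and the helper names (transpose,
column_similarities, row_similarities) are made up for this example.

```python
import math


def transpose(rows):
    """Swap rows and columns of a dense matrix given as equal-length lists."""
    return [list(col) for col in zip(*rows)]


def column_similarities(rows):
    """Cosine similarity for every pair of columns, as {(i, j): sim} with i < j.

    Pairs with a zero dot product (or a zero-norm column) are omitted.
    """
    cols = transpose(rows)
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    sims = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if norms[i] == 0 or norms[j] == 0:
                continue
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            if dot != 0:
                sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


def row_similarities(rows):
    """Row similarities are the column similarities of the transposed matrix."""
    return column_similarities(transpose(rows))


matrix = [
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
]
print(row_similarities(matrix))
```

This brute-force version is only meant to show the relationship; it compares
every pair and would not scale to the matrix sizes discussed in this thread.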


On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> In the Mahout Spark R-like DSL, [A’A] and [AA’] don’t actually perform a
> transpose; it’s optimized out. Mahout has had a standalone row matrix
> transpose since day 1 and supports it in the Spark version. You can’t really
> do matrix algebra without it, even though it’s often possible to optimize it
> away.
>
> Row similarity with LLR is much simpler than cosine, since you only need
> non-zero sums for column, row, and matrix elements, so rowSimilarity is
> implemented in Mahout for Spark. Full-blown row similarity, including all
> the different similarity methods (long since implemented in Hadoop
> MapReduce), hasn’t been moved to Spark yet.
>
> Yep, rows are not covered in the blog, my mistake. That’s too bad, since row
> similarity has a lot of uses and can at the very least be optimized for
> output matrix symmetry.
>
> On Jan 17, 2015, at 11:44 AM, Andrew Musselman <andrew.mussel...@gmail.com>
> wrote:
>
> Yeah okay, thanks.
>
> On Jan 17, 2015, at 11:15 AM, Reza Zadeh <r...@databricks.com> wrote:
>
> Pat, columnSimilarities is what that blog post is about, and is already
> part of Spark 1.2.
>
> rowSimilarities in a RowMatrix is a little more tricky because you can't
> transpose a RowMatrix easily, and is being tracked by this JIRA:
> https://issues.apache.org/jira/browse/SPARK-4823
>
> Andrew, sometimes (not always) it's OK to transpose a RowMatrix. If, for
> example, the number of rows in your RowMatrix is less than 1 million, you
> can transpose it and use columnSimilarities.
>
>
> On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <p...@occamsmachete.com>
> wrote:
>
>> BTW it looks like row and column similarities (cosine based) are coming
>> to MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in
>> the master yet. Does anyone know the status?
>>
>> See:
>> https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
>>
>> Also the method for computation reduction (make it less than O(n^2))
>> seems rooted in cosine. A different computation reduction method is used in
>> the Mahout code tied to LLR. Seems like we should get these together.
>>
>> On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.mussel...@gmail.com>
>> wrote:
>>
>> Excellent, thanks Pat.
>>
>> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> Mahout’s Spark implementation of row similarity is in the Scala
>> SimilarityAnalysis class. It actually does either row or column similarity
>> but only supports LLR at present. It computes [A’A] for columns or [AA’]
>> for rows first, then calculates the distance (LLR) for non-zero elements.
>> This is a major optimization for sparse matrices. As I recall, the old
>> Hadoop code only did this for half the matrix since it’s symmetric, but
>> that optimization isn’t in the current code because the downsampling is
>> done as LLR is calculated, so the entire similarity matrix is never
>> actually calculated unless you disable downsampling.
>>
>> The primary use is for recommenders but I’ve used it (in the test suite)
>> for row-wise text token similarity too.
>>
>> On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.mussel...@gmail.com>
>> wrote:
>>
>> Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259
>> and poking around to see how to do things.
>>
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
>>
>> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com>
>> wrote:
>>
>> Andrew, you would be better off using Mahout's RowSimilarityJob for what
>> you are trying to accomplish.
>>
>>  1.  It does give you pair-wise distances
>>  2.  You can specify the distance measure you want to use
>>  3.  There's the old MapReduce impl and the Spark DSL impl, per your
>> preference.
>>
>>   ------------------------------
>>  *From:* Andrew Musselman <andrew.mussel...@gmail.com>
>> *To:* Reza Zadeh <r...@databricks.com>
>> *Cc:* user <user@spark.apache.org>
>> *Sent:* Saturday, January 17, 2015 11:29 AM
>> *Subject:* Re: Row similarities
>>
>> Thanks Reza, interesting approach.  I think what I actually want is to
>> calculate pair-wise distance, on second thought.  Is there a pattern for
>> that?
>>
>>
>>
>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>> You can use K-means
>> <https://spark.apache.org/docs/latest/mllib-clustering.html> with a
>> suitably large k. Each cluster should correspond to rows that are similar
>> to one another.
>>
>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman <
>> andrew.mussel...@gmail.com> wrote:
>>
>> What's a good way to calculate similarities between all vector-rows in a
>> matrix or RDD[Vector]?
>>
>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
>> going down a good path to transpose a matrix in order to run that.
>>
>>
>>
>>
>>
>>
>>
>
>
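
For readers following the LLR discussion above, here is a minimal plain-Python
sketch of the idea: count co-occurrences between rows, then score only the
pairs that actually share a non-zero column. It uses the textbook
log-likelihood ratio (G-test) on a 2x2 contingency table and is not Mahout's
code. The function names, the set-of-column-indices input format, and the
n_cols parameter are assumptions made for this example, and no downsampling
is done.

```python
import math
from collections import defaultdict
from itertools import combinations


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G-test statistic) for a 2x2 table of counts."""
    n = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, r, c in cells:
        if k > 0:  # empty cells contribute nothing
            total += k * math.log(k * n / (row_sums[r] * col_sums[c]))
    return max(0.0, 2.0 * total)  # clamp tiny negative rounding errors


def row_llr_similarities(rows, n_cols):
    """LLR score for each pair of rows sharing at least one non-zero column.

    rows: list of sets, each holding the non-zero column indices of one row.
    n_cols: total number of columns in the matrix.
    """
    # Inverted index: column -> rows with a non-zero entry in that column.
    rows_by_col = defaultdict(list)
    for r, cols in enumerate(rows):
        for c in cols:
            rows_by_col[c].append(r)

    # Co-occurrence counts, built only for pairs that share a column.
    cooccurrence = defaultdict(int)
    for members in rows_by_col.values():
        for a, b in combinations(members, 2):
            cooccurrence[(a, b)] += 1

    scores = {}
    for (a, b), k11 in cooccurrence.items():
        k12 = len(rows[a]) - k11  # columns in a only
        k21 = len(rows[b]) - k11  # columns in b only
        k22 = n_cols - len(rows[a]) - len(rows[b]) + k11  # in neither
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


rows = [{0, 1}, {0, 1}, {5}]
print(row_llr_similarities(rows, n_cols=10))
```

Row pairs with no column in common never enter the co-occurrence table, which
is where the savings on sparse data come from.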
