Makes sense.
> On Jan 17, 2015, at 6:27 PM, Reza Zadeh <r...@databricks.com> wrote:
>
> We're focused on providing block matrices, which makes transposition simple:
> https://issues.apache.org/jira/browse/SPARK-3434
>
>> On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> In the Mahout Spark R-like DSL [A’A] and [AA’] doesn’t actually do a
>> transpose—it’s optimized out. Mahout has had a stand alone row matrix
>> transpose since day 1 and supports it in the Spark version. Can’t really do
>> matrix algebra without it even though it’s often possible to optimize it
>> away.
>>
>> Row similarity with LLR is much simpler than cosine since you only need
>> non-zero sums for column, row, and matrix elements so rowSimilarity is
>> implemented in Mahout for Spark. Full blown row similarity including all the
>> different similarity methods (long since implemented in hadoop mapreduce)
>> hasn’t been moved to spark yet.
>>
>> Yep, rows are not covered in the blog, my mistake. Too bad it has a lot of
>> uses and can at very least be optimized for output matrix symmetry.
>>
>> On Jan 17, 2015, at 11:44 AM, Andrew Musselman <andrew.mussel...@gmail.com>
>> wrote:
>>
>> Yeah okay, thanks.
>>
>> On Jan 17, 2015, at 11:15 AM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Pat, columnSimilarities is what that blog post is about, and is already
>>> part of Spark 1.2.
>>>
>>> rowSimilarities in a RowMatrix is a little more tricky because you can't
>>> transpose a RowMatrix easily, and is being tracked by this JIRA:
>>> https://issues.apache.org/jira/browse/SPARK-4823
>>>
>>> Andrew, sometimes (not always) it's OK to transpose a RowMatrix, if for
>>> example the number of rows in your RowMatrix is less than 1m, you can
>>> transpose it and use rowSimilarities.
>>>
>>>
>>>> On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <p...@occamsmachete.com>
>>>> wrote:
>>>> BTW it looks like row and column similarities (cosine based) are coming to
>>>> MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the
>>>> master yet. Does anyone know the status?
>>>>
>>>> See:
>>>> https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
>>>>
>>>> Also the method for computation reduction (make it less than O(n^2)) seems
>>>> rooted in cosine. A different computation reduction method is used in the
>>>> Mahout code tied to LLR. Seems like we should get these together.
>>>>
>>>> On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.mussel...@gmail.com>
>>>> wrote:
>>>>
>>>> Excellent, thanks Pat.
>>>>
>>>> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>
>>>>> Mahout’s Spark implementation of rowsimilarity is in the Scala
>>>>> SimilarityAnalysis class. It actually does either row or column
>>>>> similarity but only supports LLR at present. It does [AA’] for columns or
>>>>> [A’A] for rows first then calculates the distance (LLR) for non-zero
>>>>> elements. This is a major optimization for sparse matrices. As I recall
>>>>> the old hadoop code only did this for half the matrix since it’s
>>>>> symmetric but that optimization isn’t in the current code because the
>>>>> downsampling is done as LLR is calculated, so the entire similarity
>>>>> matrix is never actually calculated unless you disable downsampling.
>>>>>
>>>>> The primary use is for recommenders but I’ve used it (in the test suite)
>>>>> for row-wise text token similarity too.
>>>>>
>>>>> On Jan 17, 2015, at 9:00 AM, Andrew Musselman
>>>>> <andrew.mussel...@gmail.com> wrote:
>>>>>
>>>>> Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259
>>>>> and poking around to see how to do things.
>>>>>
>>>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
>>>>>
>>>>>> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com>
>>>>>> wrote:
>>>>>>
>>>>>> Andrew, u would be better off using Mahout's RowSimilarityJob for what u
>>>>>> r trying to accomplish.
>>>>>>
>>>>>> 1. It does give u pair-wise distances
>>>>>> 2. U can specify the Distance measure u r looking to use
>>>>>> 3. There's the old MapReduce impl and the Spark DSL impl per ur
>>>>>> preference.
>>>>>>
>>>>>> From: Andrew Musselman <andrew.mussel...@gmail.com>
>>>>>> To: Reza Zadeh <r...@databricks.com>
>>>>>> Cc: user <user@spark.apache.org>
>>>>>> Sent: Saturday, January 17, 2015 11:29 AM
>>>>>> Subject: Re: Row similarities
>>>>>>
>>>>>> Thanks Reza, interesting approach. I think what I actually want is to
>>>>>> calculate pair-wise distance, on second thought. Is there a pattern for
>>>>>> that?
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>
>>>>>>> You can use K-means with a suitably large k. Each cluster should
>>>>>>> correspond to rows that are similar to one another.
>>>>>>>
>>>>>>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman
>>>>>>> <andrew.mussel...@gmail.com> wrote:
>>>>>>> What's a good way to calculate similarities between all vector-rows in
>>>>>>> a matrix or RDD[Vector]?
>>>>>>>
>>>>>>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure
>>>>>>> I'm going down a good path to transpose a matrix in order to run that.
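
For reference, the question at the bottom of the thread (similarities between all rows when only a column-similarity routine is at hand) can be illustrated at toy scale. The sketch below is plain Python, not Spark or Mahout code: it computes brute-force cosine similarity for every pair of rows and shows that running a column-similarity routine on the transposed matrix gives the same result. The function names (cosine, row_similarities, transpose, column_similarities) and the sample matrix are hypothetical, chosen only for illustration. It is not DIMSUM, it does not use LLR, and it does no sampling or downsampling, so it is O(n^2) in the number of rows and only suitable for small in-memory data.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def row_similarities(rows):
    """Brute-force cosine similarity for every pair of rows (i, j) with i < j."""
    return {
        (i, j): cosine(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


def transpose(rows):
    """Transpose a small dense matrix held as a list of row lists."""
    return [list(col) for col in zip(*rows)]


def column_similarities(rows):
    """Column similarities, expressed as row similarities of the transpose."""
    return row_similarities(transpose(rows))


# Hypothetical sample data: rows 0 and 1 point in the same direction,
# row 2 is orthogonal to both.
matrix = [
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
]

sims = row_similarities(matrix)

# Row similarities of A equal column similarities of A transposed.
sims_via_transpose = column_similarities(transpose(matrix))
```

Here sims and sims_via_transpose hold the same pairs and values, which is the reason transposing and then calling a column-similarity method answers the row-similarity question whenever the transpose is affordable.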