We're focused on providing block matrices, which makes transposition simple: https://issues.apache.org/jira/browse/SPARK-3434
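
To illustrate why block storage helps, here is a minimal plain-Python sketch. It is not the MLlib API from SPARK-3434: the dict-of-blocks layout and the names transpose_block, transpose_block_matrix and to_dense are assumptions made up for this example. The idea is that each block only needs its (block row, block column) key swapped and a small local transpose, so no single row ever has to be rebuilt from pieces scattered across the cluster.

```python
# Hypothetical layout: {(block_row, block_col): dense block as a list of rows}.
# All names here are illustrative; none of this is Spark's BlockMatrix API.

def transpose_block(block):
    """Transpose one small dense block held as a list of rows."""
    return [list(col) for col in zip(*block)]


def transpose_block_matrix(blocks):
    """Swap each block's (i, j) key and transpose the block locally."""
    return {(j, i): transpose_block(b) for (i, j), b in blocks.items()}


def to_dense(blocks, rows_per_block, cols_per_block):
    """Assemble uniformly sized blocks into one dense list of rows.

    Blocks that are absent from the dict are treated as all zeros.
    """
    n_block_rows = max(i for i, _ in blocks) + 1
    n_block_cols = max(j for _, j in blocks) + 1
    dense = [[0] * (n_block_cols * cols_per_block)
             for _ in range(n_block_rows * rows_per_block)]
    for (i, j), block in blocks.items():
        for r, row in enumerate(block):
            for c, value in enumerate(row):
                dense[i * rows_per_block + r][j * cols_per_block + c] = value
    return dense


# A 2 x 4 matrix [[1, 2, 3, 4], [5, 6, 7, 8]] split into two 2 x 2 blocks.
blocks = {(0, 0): [[1, 2], [5, 6]], (0, 1): [[3, 4], [7, 8]]}
transposed = transpose_block_matrix(blocks)
```

In a distributed setting the per-block work is independent, so the transpose becomes a map over blocks rather than a shuffle of individual entries. A RowMatrix has no such cheap re-keying.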

On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> In the Mahout Spark R-like DSL [A’A] and [AA’] doesn’t actually do a
> transpose--it’s optimized out. Mahout has had a stand alone row matrix
> transpose since day 1 and supports it in the Spark version. Can’t really do
> matrix algebra without it even though it’s often possible to optimize it
> away.
>
> Row similarity with LLR is much simpler than cosine since you only need
> non-zero sums for column, row, and matrix elements so rowSimilarity is
> implemented in Mahout for Spark. Full blown row similarity including all
> the different similarity methods (long since implemented in hadoop
> mapreduce) hasn’t been moved to spark yet.
>
> Yep, rows are not covered in the blog, my mistake. Too bad it has a lot of
> uses and can at very least be optimized for output matrix symmetry.
>
> On Jan 17, 2015, at 11:44 AM, Andrew Musselman <andrew.mussel...@gmail.com>
> wrote:
>
> Yeah okay, thanks.
>
> On Jan 17, 2015, at 11:15 AM, Reza Zadeh <r...@databricks.com> wrote:
>
> Pat, columnSimilarities is what that blog post is about, and is already
> part of Spark 1.2.
>
> rowSimilarities in a RowMatrix is a little more tricky because you can't
> transpose a RowMatrix easily, and is being tracked by this JIRA:
> https://issues.apache.org/jira/browse/SPARK-4823
>
> Andrew, sometimes (not always) it's OK to transpose a RowMatrix, if for
> example the number of rows in your RowMatrix is less than 1m, you can
> transpose it and use rowSimilarities.
>
>
> On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <p...@occamsmachete.com>
> wrote:
>
>> BTW it looks like row and column similarities (cosine based) are coming
>> to MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in
>> the master yet. Does anyone know the status?
>>
>> See:
>> https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
>>
>> Also the method for computation reduction (make it less than O(n^2))
>> seems rooted in cosine. A different computation reduction method is used in
>> the Mahout code tied to LLR. Seems like we should get these together.
>>
>> On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.mussel...@gmail.com>
>> wrote:
>>
>> Excellent, thanks Pat.
>>
>> On Jan 17, 2015, at 9:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> Mahout’s Spark implementation of rowsimilarity is in the Scala
>> SimilarityAnalysis class. It actually does either row or column similarity
>> but only supports LLR at present. It does [AA’] for columns or [A’A] for
>> rows first then calculates the distance (LLR) for non-zero elements. This
>> is a major optimization for sparse matrices. As I recall the old hadoop
>> code only did this for half the matrix since it’s symmetric but that
>> optimization isn’t in the current code because the downsampling is done as
>> LLR is calculated, so the entire similarity matrix is never actually
>> calculated unless you disable downsampling.
>>
>> The primary use is for recommenders but I’ve used it (in the test suite)
>> for row-wise text token similarity too.
>>
>> On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.mussel...@gmail.com>
>> wrote:
>>
>> Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259
>> and poking around to see how to do things.
>>
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
>>
>> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com>
>> wrote:
>>
>> Andrew, u would be better off using Mahout's RowSimilarityJob for what u
>> r trying to accomplish.
>>
>> 1. It does give u pair-wise distances
>> 2. U can specify the Distance measure u r looking to use
>> 3. There's the old MapReduce impl and the Spark DSL impl per ur
>> preference.
>>
>> ------------------------------
>> *From:* Andrew Musselman <andrew.mussel...@gmail.com>
>> *To:* Reza Zadeh <r...@databricks.com>
>> *Cc:* user <user@spark.apache.org>
>> *Sent:* Saturday, January 17, 2015 11:29 AM
>> *Subject:* Re: Row similarities
>>
>> Thanks Reza, interesting approach. I think what I actually want is to
>> calculate pair-wise distance, on second thought. Is there a pattern for
>> that?
>>
>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>> You can use K-means
>> <https://spark.apache.org/docs/latest/mllib-clustering.html> with a
>> suitably large k. Each cluster should correspond to rows that are similar
>> to one another.
>>
>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman <
>> andrew.mussel...@gmail.com> wrote:
>>
>> What's a good way to calculate similarities between all vector-rows in a
>> matrix or RDD[Vector]?
>>
>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
>> going down a good path to transpose a matrix in order to run that.