Sab, not sure what you require for the similarity metric or your use case but you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise) here http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html <http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html>. These are optimized for LLR based “similarity” which is very simple to calculate since you don’t use either the item weight or the entire row or column vector values. Downsampling is done by number of values per column (or row) and by LLR strength. This keeps it to O(n)
They run pretty fast and only use memory if you use the version that attaches application IDs to the rows and columns. Using SimilarityAnalysis.cooccurrence may help. It’s in the Spark/Scala part of Mahout. On Mar 2, 2015, at 12:56 PM, Reza Zadeh <r...@databricks.com> wrote: Hi Sab, The current method is optimized for having many rows and few columns. In your case it is exactly the opposite. We are working on your case, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 <https://issues.apache.org/jira/browse/SPARK-4823> Your case is very common, so I will put some time into building it. In the meantime, if you're looking for groups of similar points, consider using K-means - it will get you clusters of similar rows with euclidean distance. Best, Reza On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com <mailto:sabarish.sasidha...@manthan.com>> wrote: Hi Reza I see that ((int, int), double) pairs are generated for any combination that meets the criteria controlled by the threshold. But assuming a simple 1x10K matrix that means I would need atleast 12GB memory per executor for the flat map just for these pairs excluding any other overhead. Is that correct? How can we make this scale for even larger n (when m stays small) like 100 x 5 million. One is by using higher thresholds. The other is that I use a SparseVector to begin with. Are there any other optimizations I can take advantage of? Thanks Sab