Hi Sab,

The current method is optimized for matrices with many rows and few columns. In your case it is exactly the opposite. We are working on your case, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823. Your case is very common, so I will put some time into building it.
In the meantime, if you're looking for groups of similar points, consider using K-means - it will get you clusters of similar rows under Euclidean distance.

Best,
Reza

On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:

> Hi Reza
>
> I see that ((int, int), double) pairs are generated for any combination
> that meets the criteria controlled by the threshold. But assuming a simple
> 1x10K matrix, that means I would need at least 12GB of memory per executor
> for the flat map just for these pairs, excluding any other overhead. Is
> that correct? How can we make this scale for even larger n (when m stays
> small), like 100 x 5 million? One option is to use higher thresholds.
> Another is that I use a SparseVector to begin with. Are there any other
> optimizations I can take advantage of?
>
> Thanks
> Sab
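As a footnote on the memory estimate in the quoted question: the worst-case output of the flat map is one ((Int, Int), Double) record per pair of columns that survives the threshold, so a quick back-of-envelope count is easy to do. This sketch is plain Python, and the ~64 bytes per emitted pair is an assumed JVM figure (tuple and boxing overhead; the real cost depends on serialization and object layout), not a measured one:

```python
def pair_count_and_gb(n_cols, bytes_per_pair=64):
    """Worst-case number of ((Int, Int), Double) pairs emitted by the
    flat map for an m x n matrix (no threshold pruning), and a rough
    memory estimate at an assumed bytes_per_pair on the JVM."""
    n_pairs = n_cols * (n_cols - 1) // 2   # unordered column pairs
    return n_pairs, n_pairs * bytes_per_pair / 2**30

# For the 1 x 10K example from the question:
pairs, gb = pair_count_and_gb(10_000)
print(pairs)            # 49995000 pairs in the worst case
print(round(gb, 2))     # ~2.98 GB at the assumed 64 B/pair
```

The count grows quadratically in the number of columns, which is why raising the threshold (pruning pairs early) helps so much more than shrinking the per-pair record.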