Hi Sab,
The current method is optimized for matrices with many rows and few columns;
in your case it is exactly the opposite. We are working on support for your
case, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823
Your case is very common, so I will put some time into building it.
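
For concreteness, here is a minimal sketch of what I assume the current call
looks like (RowMatrix.columnSimilarities, with and without a threshold), using
made-up sparse data and assuming a live SparkContext sc, e.g. in spark-shell:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Few rows, many columns -- the shape described below.
    val rows = sc.parallelize(Seq(
      Vectors.sparse(10000, Seq((0, 1.0), (42, 2.0), (9999, 0.5))),
      Vectors.sparse(10000, Seq((0, 3.0), (42, 1.0)))
    ))
    val mat = new RowMatrix(rows)

    // Exact all-pairs cosine similarities between the 10,000 columns.
    val exact = mat.columnSimilarities()

    // With a threshold, sampling is used: pairs whose similarity falls below
    // the threshold may be missed or estimated less accurately, in exchange
    // for much less computation and shuffle.
    val approx = mat.columnSimilarities(0.1)

    // Both return a CoordinateMatrix whose entries are MatrixEntry(i, j, value).
    approx.entries.take(5).foreach(println)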

In the meantime, if you're looking for groups of similar points, consider
using K-means: it will give you clusters of similar rows under Euclidean
distance.
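
A minimal sketch of that suggestion (made-up data, a hypothetical k, and again
assuming a live SparkContext sc):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Each row of the matrix becomes one point to cluster.
    val points = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 3.0),
      Vectors.dense(1.1, 0.1, 2.9),
      Vectors.dense(9.0, 8.0, 0.0)
    )).cache()

    val k = 2                // hypothetical number of clusters
    val maxIterations = 20
    val model = KMeans.train(points, k, maxIterations)

    // Rows assigned to the same cluster are similar under Euclidean distance.
    val assignments = points.map(p => (p, model.predict(p)))
    assignments.collect().foreach(println)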

Best,
Reza


On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Hi Reza
>
> I see that ((int, int), double) pairs are generated for any combination
> that meets the criterion controlled by the threshold. But assuming a simple
> 1 x 10K matrix, that means I would need at least 12GB of memory per executor
> for the flat map, just for these pairs, excluding any other overhead. Is that
> correct? How can we make this scale for even larger n (when m stays small),
> like 100 x 5 million? One option is to use higher thresholds. Another is to
> use a SparseVector to begin with. Are there any other optimizations I can
> take advantage of?
>
> Thanks
> Sab
>
>
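
To make the numbers quoted above concrete: the per-pair cost below is simply
the quoted 12GB divided by the number of candidate column pairs, and the
actual footprint of one pair depends on JVM object layout, so treat this as a
rough sanity check only.

    // Back-of-envelope arithmetic for the 1 x 10K example quoted above.
    val n = 10000L
    val pairs = n * (n - 1) / 2              // 49,995,000 candidate column pairs

    // Implied footprint of one boxed ((Int, Int), Double) pair plus overhead,
    // derived purely from the 12GB figure quoted above:
    val impliedBytesPerPair = 12L * 1024 * 1024 * 1024 / pairs   // ~257 bytes

For n = 5 million columns the same formula gives roughly 1.25e13 pairs, which
is why raising the threshold (so most pairs are never emitted) and keeping the
input sparse matter so much at that scale.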
