Column based similarities work well if the columns are mild (10K, 100K, we actually scaled it to 1.5M columns but it really stress tests the shuffle and it needs to tune the shuffle parameters)...You can either use dimsum sampling or come up with your own threshold based on your application that you can apply in reduceByKey (you have to change the code to use combineByKey and add your filters before shuffling the keys to reducer)...
The other variant that you are mentioning is row based similarity flow which is tracked in the following JIRA where I am interesting in doing no shuffle but use broadcast and mapPartitions. I will open up the PR soon but it is compute intensive and I am experimenting with BLAS optimizations... https://issues.apache.org/jira/browse/SPARK-4823 Your case of 100 x 5 million (tranpose of it) for example is very common in matrix factorization where you have user factors and product factors which will typically be 5 million x 100 dense matrix and you want to compute user->user and item->item similarities... You are right that sparsity helps but you can't apply sparsity (for example pick topK) before doing the dot products...so it is still a compute intensive operation... On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan < [email protected]> wrote: > Hi Reza > > I see that ((int, int), double) pairs are generated for any combination > that meets the criteria controlled by the threshold. But assuming a simple > 1x10K matrix that means I would need atleast 12GB memory per executor for > the flat map just for these pairs excluding any other overhead. Is that > correct? How can we make this scale for even larger n (when m stays small) > like 100 x 5 million. One is by using higher thresholds. The other is that > I use a SparseVector to begin with. Are there any other optimizations I can > take advantage of? > > Thanks > Sab > >
