Hi Dennis, sorry for the delayed reply and thanks for the article. I dug
into it and found that if you have a GPU, the CUBLAS library beats the
BLAS/ATLAS implementation in the Matrix package for 'large' problems. Here's
what I mean:

its = 2500
dim = 1750

X = matrix(rnorm(its*dim), its, dim)

# single thread - accumulate the sum of outer products row by row
system.time({C = matrix(0, dim, dim); for(i in 1:its) C = C + (X[i,] %o% X[i,])})

# single thread - BLAS matrix multiplication
system.time({C1 = t(X) %*% X})

# single thread - BLAS crossprod
system.time({C2 = crossprod(X)})

# multithreaded - CUBLAS cublasSgemm function via gputools
library(gputools)
system.time({C3 = gpuCrossprod(X, X)})

# all.equal() only compares its first two arguments, so check pairwise
print(isTRUE(all.equal(C, C1)) && isTRUE(all.equal(C1, C2)) &&
      isTRUE(all.equal(C2, C3)))
   user  system elapsed
 27.210   6.680  33.342    # outer-product loop
   user  system elapsed
  6.260   0.000   5.982    # t(X) %*% X
   user  system elapsed
  4.340   0.000   4.284    # crossprod(X)
   user  system elapsed
   1.49    0.00    1.48    # gpuCrossprod(X, X)
[1] TRUE
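
One caveat: gpuCrossprod goes through cublasSgemm, which works in single
precision, so the GPU result can drift from the double-precision BLAS
answers. If all.equal() ever complains, something like this lets you look
at the error directly (just a sketch; the tolerance value is a guess to
tune for your card, loosened from the 1.5e-8 default):

max(abs(C2 - C3))                      # worst-case absolute difference
all.equal(C2, C3, tolerance = 1e-4)    # looser than the double-precision default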

The last timing shows roughly a 3x speed-up (4.28s down to 1.48s elapsed)
on my dated 16-core graphics card, compared with my quad-core CPU. I should
be able to try this out on a 512-core card in the next few days, and will
post the result.
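
In the meantime, if you want to see where 'large' kicks in on your own
hardware, here is a rough sketch (the bench helper and the sizes are just
illustrative, assuming gputools is loaded as above). For small dims the
overhead of transferring the data to the card tends to dominate, so the
GPU typically only wins past some crossover size:

bench = function(dim, its = 2500) {
  X = matrix(rnorm(its * dim), its, dim)
  c(cpu = unname(system.time(crossprod(X))["elapsed"]),
    gpu = unname(system.time(gpuCrossprod(X, X))["elapsed"]))
}
sapply(c(250, 500, 1000, 1750), bench)   # rows: cpu, gpu; one column per dim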

All the best,

Aj 

