Hi Dennis, sorry for the delayed reply, and thanks for the article. I dug into it and found that if you have a GPU, the CUBLAS library beats the BLAS/ATLAS implementation used by R's matrix routines for 'large' problems. Here's what I mean:
its = 2500
dim = 1750
X = matrix(rnorm(its * dim), its, dim)

# single thread - explicit sum of outer products, one row at a time
system.time({C = matrix(0, dim, dim); for (i in 1:its) C = C + (X[i, ] %o% X[i, ])})

# single thread - BLAS matrix multiply
system.time({C1 = t(X) %*% X})

# single thread - BLAS crossprod (avoids forming t(X) explicitly)
system.time({C2 = crossprod(X)})

library(gputools)
# multithreaded - CUBLAS cublasSgemm function
system.time({C3 = gpuCrossprod(X, X)})

# compare pairwise; cublasSgemm works in single precision, so the GPU
# result is checked with a looser tolerance
print(all.equal(C, C1) && all.equal(C1, C2) &&
      isTRUE(all.equal(C2, C3, tolerance = 1e-4)))

   user  system elapsed
 27.210   6.680  33.342
   user  system elapsed
  6.260   0.000   5.982
   user  system elapsed
  4.340   0.000   4.284
   user  system elapsed
  1.49    0.00    1.48
[1] TRUE

The last timing shows a 3x speed-up over crossprod(), using my dated 16-core graphics card against my quad-core CPU. I should be able to try this out on a 512-core card in the next few days, and will post the result.

All the best,
Aj

--
View this message in context: http://r.789695.n4.nabble.com/Speed-up-sum-of-outer-products-tp3330160p3355139.html
Sent from the R help mailing list archive at Nabble.com.
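For anyone else following the thread: the reason all four results agree is the identity sum_i x_i x_i' = X'X, where x_i is the i-th row of X, so the loop of outer products, t(X) %*% X, and crossprod(X) compute the same matrix. A small sanity check on a toy matrix (sizes here are illustrative, not from the timings above):

```r
set.seed(1)
n <- 5; p <- 3
X <- matrix(rnorm(n * p), n, p)

# explicit sum of outer products over the rows of X
C_loop <- matrix(0, p, p)
for (i in 1:n) C_loop <- C_loop + X[i, ] %o% X[i, ]

# both BLAS formulations give the same p x p matrix
stopifnot(isTRUE(all.equal(C_loop, t(X) %*% X)))
stopifnot(isTRUE(all.equal(C_loop, crossprod(X))))
```

crossprod() skips materialising the transpose, which is why it edges out t(X) %*% X in the timings.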