Hi Stefan,

that's really interesting - I never thought of trying to benchmark Linux-64
against OSX (a friend who works on large databases says OSX performs better
than Linux in his work!). Thanks for posting your comparison, and your hints
:)

i) "I guess you have a very fast CPU (Core i7 or so, I guess?)" - only a
quad-core i5, but I'm trying to get access to a quad-core i7; it might make
a difference for OpenCL code?

ii) "a very poor BLAS implementation" - I installed the latest ATLAS package
for Ubuntu 10.04 LTS, which gives a 6x speed-up! I'm tempted and interested
in recompiling R-2.12.2 linked against the MKL (which I guess the vecLib
BLAS library uses?), but it seems a tricky thing to do. To be honest I'm not
sure how this new ATLAS library works, i.e. whether it is sequential or
multithreaded.

iii) "and a desktop graphics card" - I installed a GTX 570 today, which has
480 CUDA cores; my previous card had 16 cores and half the memory bandwidth.

The results with the new ATLAS library and the GTX 570 are a pleasant
improvement :).

   user  system elapsed    -- for loop, single thread
 29.790   7.400  37.243
   user  system elapsed    -- new ATLAS, t(X) %*% X
  1.480   0.000   1.479
   user  system elapsed    -- new ATLAS, crossprod(X)
  0.740   0.000   0.739
   user  system elapsed    -- new GPU, gputools::gpuCrossprod(X)
  0.190   0.040   0.228
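
In case it's useful, the timings above come from calls along these lines
(sketched here with a made-up matrix size as a stand-in for the real data;
adjust to taste):

```r
library(gputools)   # provides gpuCrossprod()

set.seed(1)
X <- matrix(rnorm(3000 * 1000), nrow = 3000)  # stand-in for the real matrix

system.time(t(X) %*% X)          # plain BLAS matrix multiply
system.time(crossprod(X))        # BLAS crossprod, avoids the explicit transpose
system.time(gpuCrossprod(X, X))  # CUBLAS-backed t(X) %*% X on the GPU
```

(gpuCrossprod(A, B) computes t(A) %*% B, so passing X twice matches
crossprod(X).)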

I would be really interested to find out what the results would be on an OSX
machine with a fancy GPU. I read that a 2x512-core card is going to be
released by Nvidia in the next couple of weeks, and CUDA 4.0 is due for
public release in a few months. So maybe you want to keep CUDA on your
radar?

I managed to write my first R function/package using CUDA code at the
weekend. It's a fairly simple but tedious process once you have some CUDA
code that compiles and all you want to do is port it to R (in the Unix case,
at least). For example, you can write a simple C wrapper along the lines of
the rinterface.c code in gputools, then modify the Makefile.in and
configure.ac files in that package as required, and you should be set to
configure, make and install into R.
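
On the R side the glue is just the usual .C() interface - something like the
sketch below (all the names here are hypothetical; the real entry point is
whatever symbol your C wrapper exports):

```r
# Load the shared library produced by the package's configure/make step
dyn.load("mypkg.so")  # hypothetical library name

# R wrapper around a C entry point "my_kernel_wrapper" that launches the
# CUDA kernel; copying data to and from the device happens inside the
# C wrapper, so R only sees ordinary numeric vectors.
gpuMyFun <- function(x) {
    n <- length(x)
    out <- .C("my_kernel_wrapper",
              as.double(x),
              as.integer(n),
              result = double(n))
    out$result
}
```

In a proper package you would put the .C() call in R/, the wrapper in src/,
and let configure.ac/Makefile.in handle the nvcc compilation, as gputools
does.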

I'm working on non-parametric regression and optimization at the moment, and
the speed-up using CUDA has been worth the effort :)

All the best,

Ajay






On 15 March 2011 11:22, Stefan Evert-3 [via R] <
ml-node+3356302-1299160144-215...@n4.nabble.com> wrote:

> Hi Ajay,
>
> thanks for this comparison, which prodded me to give CUDA another try on my
> now somewhat aging MacBook Pro.
>
> > Hi Dennis, sorry for the delayed reply and thanks for the article. I dug
> > into it and found that if you have a GPU, the CUBLAS library beats the
> > BLAS/ATLAS implementation in the Matrix package for 'large' problems.
>
> I guess you have a very fast CPU (Core i7 or so, I guess?), a very poor
> BLAS implementation and a desktop graphics card?
>
> >   user  system elapsed    -- for loop, single thread
> > 27.210   6.680  33.342
> >   user  system elapsed    -- BLAS mat mult
> >  6.260   0.000   5.982
> >   user  system elapsed    -- BLAS crossprod
> >  4.340   0.000   4.284
> >   user  system elapsed    -- CUDA gpuCrossprod
> >   1.49    0.00    1.48
>
> Just to put these numbers in perspective, here are my results for a MacBook
> Pro running Mac OS X 10.6.6 (Core 2 Duo, 2.5 GHz, 6 GB DDR2 RAM, Nvidia
> GeForce 8600M GT with 512 MB RAM -- I suppose it's the "M" that breaks my
> performance here).
>
> >    user  system elapsed    -- for loop, single thread
> > 141.034  35.299 153.783
> >    user  system elapsed    -- BLAS mat mult
> >   2.791   0.025   1.805
> >    user  system elapsed    -- BLAS crossprod
> >   1.419   0.039   0.863
> >    user  system elapsed    -- CUDA gpuCrossprod
> >   1.431   0.119   1.718
>
>
> As you can see, my CPU/RAM is about 5x slower than your machine, CUDA is
> slightly slower (my card has 32 cores, but may have lower memory bandwidth
> and/or clock rate if yours is a desktop card), but vecLib BLAS beats CUDA by
> a factor of 2.
>
>
> Kudos to the gputools developers: despite what the README says, the package
> compiles out of the box on Mac OS X 10.6, 64-bit R 2.12.1, with CUDA release
> 3.2.  Thanks for this convenient package!
>
>
> Best regards,
> Stefan Evert
>
> [ [hidden email] | http://purl.org/stefan.evert ]
>
>
>


--
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
