Hi everyone!

I just saw Jan de Leeuw's e-mail about his BLAS benchmarks, which is an 
interesting coincidence -- I've been wanting to ask the list about some 
inexplicable (and somewhat disappointing) matrix benchmark results on 64-bit 
Mac OS X with vecLib BLAS.

My problem seems to be different from Jan's, though: the CPU load on my 
computer shows that both cores are in use, so it can't be an issue with 
multi-threading.  That's why I'm starting a new thread.

For a little bit of background: I'm looking for the fastest method to calculate 
inner products between large sets of vectors and have been benchmarking various 
algorithms for this purpose, most of them performing a matrix multiplication M 
%*% t(M) in different ways.  Tests were run on an early 2008 MacBook Pro with 
Intel Core 2 Duo, 2.5 GHz and 6 GB RAM, using R 2.11.1 on Leopard (10.5.8) and 
Snow Leopard (10.6.4); I also tried today's R-devel with the same results.
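
For concreteness, the three contenders boil down to the following toy sketch 
(a dense random matrix stands in for my sparse data, and the mops() helper is 
just meant to show how I estimate throughput; the real benchmark script is on 
R-Forge, see the end of this mail):

```r
# Toy version of the benchmark: time three equivalent ways of computing
# all pairwise inner products between the rows of M.
set.seed(42)
M <- matrix(rnorm(1000 * 500), nrow = 1000)   # 1000 vectors of dim 500

t1 <- system.time(R1 <- M %*% t(M))
t2 <- system.time(R2 <- tcrossprod(M))
t3 <- system.time(R3 <- crossprod(t(M)))

# All three compute the same n x n matrix of inner products.
stopifnot(isTRUE(all.equal(R1, R2)), isTRUE(all.equal(R1, R3)))

# MOPS estimate: n * n * k multiply-accumulate operations, in millions
# per second of elapsed time.
mops <- function(tm, n = nrow(M), k = ncol(M)) n * n * k / tm["elapsed"] / 1e6
```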

To my big surprise, matrix operations are _much_ slower in 64-bit R than in 
32-bit R (controlled with the --arch option).  This was completely unexpected, 
as 64-bit code is usually a little faster than the equivalent 32-bit code 
(5%-10% in my experience).  Here are some benchmark results (MOPS is an 
estimate of the millions of multiply-accumulate operations per second):

> ------------------------------------------------------------------------
> MacBook Pro 4,1 (2008), Intel Core 2 Duo T9300 2.5 GHz, 6MB L2 Cache, 800 MHz 
> FSB, GeForce 8600M GT
> Mac OS X 10.5.8, R 2.11.1, 32-bit, vecLib BLAS (default)
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      2.213  5963.45   3.977
> inner tcrossprod D      1.216 10852.89   2.033
> inner crossprod t(M) D  1.230 10729.36   2.038
> 
> ------------------------------------------------------------------------
> MacBook Pro 4,1 (2008), Intel Core 2 Duo T9300 2.5 GHz, 6MB L2 Cache, 800 MHz 
> FSB, GeForce 8600M GT
> Mac OS X 10.5.8, R 2.11.1, 64-bit, vecLib BLAS (default)
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      3.087  4275.06   5.149
> inner tcrossprod D      2.743  4811.20   3.542
> inner crossprod t(M) D  2.012  6559.20   3.072

As you can see, the 64-bit code is much slower than the 32-bit code especially 
for the (t)crossprod operation, and my beautiful (and expensive :) MacBook is 
even outperformed by a high-end netbook running Linux:

> ------------------------------------------------------------------------
> Acer 1810TX, Intel Pentium Dual Core U4100 1.3 GHz, 2MB L2 Cache, 800 MHz 
> FSB, Intel GMA X4500
> Ubuntu Linux 10.04LTS "Lucid Lynx" 64-bit, R 2.11.1, 64-bit, reference BLAS 
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      1.674  7883.58    1.66
> inner tcrossprod D      0.966 13661.61    0.97
> inner crossprod t(M) D 18.672   706.79   18.65


After some experimentation, it seems that the culprit is Apple's vecLib.  If I 
switch to the reference BLAS shipped with R, I get the expected slight 
advantage for the 64-bit code, and overall performance increases considerably:
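
In case anyone wants to reproduce the comparison: with the CRAN binary, the 
active BLAS is selected by a symlink, as described in the R for Mac OS X FAQ 
(the paths below assume a standard framework install; adjust if your layout 
differs):

```shell
# Point the libRblas.dylib symlink at the reference BLAS shipped with R.
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf libRblas.0.dylib libRblas.dylib         # use reference BLAS
# ln -sf libRblas.vecLib.dylib libRblas.dylib  # switch back to vecLib
```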

> ------------------------------------------------------------------------
> MacBook Pro 4,1 (2008), Intel Core 2 Duo T9300 2.5 GHz, 6MB L2 Cache, 800 MHz 
> FSB, GeForce 8600M GT
> Mac OS X 10.5.8, R 2.11.1, 32-bit, reference BLAS
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      1.253 10532.41   1.213
> inner tcrossprod D      0.721 18303.90   0.692
> inner crossprod t(M) D 12.845  1027.41  12.651

> ------------------------------------------------------------------------
> MacBook Pro 4,1 (2008), Intel Core 2 Duo T9300 2.5 GHz, 6MB L2 Cache, 800 MHz 
> FSB, GeForce 8600M GT
> Mac OS X 10.5.8, R 2.11.1, 64-bit, reference BLAS
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      1.216 10852.89   1.187
> inner tcrossprod D      0.683 19322.27   0.658
> inner crossprod t(M) D 13.018  1013.76  12.525

I thought this might be a fluke of my particular hardware and software setup, 
but Jan's results lead me to believe that this may be a general problem with 
vecLib.  Has anybody else on the list observed similar behaviour?  If so, would 
it make sense to change the default to the reference BLAS?  In my benchmarks, 
it was consistently faster than vecLib (except for crossprod), but there may be 
other operations and situations in which vecLib performs better.


Some other remarks and observations:

 - The reference BLAS performs very poorly on crossprod() compared to 
tcrossprod(), while the two are equally fast with vecLib.  If one is aware of 
this, it's relatively easy to work around in most situations, since t() is 
comparatively cheap.
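
A minimal illustration of the workaround (variable names are just for this 
example):

```r
# With the reference BLAS, crossprod(X) -- i.e. t(X) %*% X -- is much
# slower than tcrossprod(X), so it pays to transpose once up front:
set.seed(1)
X <- matrix(rnorm(500 * 400), nrow = 500)

slow <- crossprod(X)        # t(X) %*% X, slow with reference BLAS
fast <- tcrossprod(t(X))    # same result; the extra t() itself is cheap
stopifnot(isTRUE(all.equal(slow, fast)))
```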

 - I've also tried the standard Ubuntu ATLAS instead of the reference BLAS, 
which performed very poorly at around 2000 MOPS.  Optimising BLAS libraries 
seems to be a tricky business ...

 - My vectors are very sparse (part of the task I'm benchmarking for).  This 
may have an influence on the result (if there are special optimisations for 0 
entries in the BLAS libraries), but I doubt this is the case.
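
For what it's worth, sparsity can be exploited explicitly with the Matrix 
package, though that is a different algorithm from what the dense BLAS 
routines do; a quick sketch:

```r
# Sketch: represent the data as a sparse dgCMatrix so that tcrossprod()
# can skip the zero entries instead of multiplying through them.
library(Matrix)
set.seed(7)
M <- matrix(rnorm(1000 * 500), nrow = 1000)
M[runif(length(M)) < 0.95] <- 0        # make the matrix ~95% sparse
S <- Matrix(M, sparse = TRUE)          # sparse column-compressed form

dense.res  <- tcrossprod(M)            # dense BLAS path
sparse.res <- tcrossprod(S)            # sparse algorithm, skips zeros
stopifnot(isTRUE(all.equal(dense.res, as.matrix(sparse.res),
                           check.attributes = FALSE)))
```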

 - I did some benchmarks for Euclidean distances between the vectors as well, 
finding that dist() is an extremely slow operation -- I had been aware of this, 
just not how bad the situation really is.  dist() runs at about 160 MOPS, while 
a (numerically unstable) approximation with matrix operations is almost 8x 
faster.
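
The approximation rests on the standard identity 
||x - y||^2 = ||x||^2 + ||y||^2 - 2<x,y>; here is a sketch (the function name 
is made up for this example), with clamping so that roundoff can't push 
squared distances below zero:

```r
# Matrix-based Euclidean distances.  Fast, but numerically unstable when
# distances are tiny relative to the vector norms (cancellation), which is
# why dist() does it the careful way.
fast.dist <- function(M) {
  G  <- tcrossprod(M)                 # Gram matrix of inner products
  sq <- diag(G)                       # squared norms of the row vectors
  D2 <- outer(sq, sq, "+") - 2 * G    # squared distances
  sqrt(pmax(D2, 0))                   # clamp tiny negatives from roundoff
}

M <- matrix(rnorm(100 * 20), nrow = 100)
stopifnot(isTRUE(all.equal(as.matrix(dist(M)), fast.dist(M),
                           check.attributes = FALSE)))
```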


If you want to try for yourself, you can check out the benchmark code and the 
sample data set I used from R-Forge:

        svn checkout svn://scm.r-forge.r-project.org/svnroot/wordspace/illustrations benchmark

Then run the script "matrix_benchmarks.R" in the new directory benchmark/.

I'd be interested to hear about substantially different results on other Mac 
computers / R versions.  Has anybody got a highly optimised BLAS on the Mac?


Best wishes,
Stefan

_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
