Thanks for all the responses. It was never my intention to write sophisticated code that can compete with BLAS; we have BLAS for that. I just wanted to see how much you lose with simple code. I guess
generic_matmatmul is at the level of simplicity I am still willing to consider, and it already significantly outperforms my own naive matrix-matrix multiplication. In the same setup, generic_matmatmul levels off at about 13 times slower than BLAS, without showing any effect of cache saturation (which is, of course, exactly what it was written to avoid).
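For reference, here is a minimal sketch of the kind of naive triple-loop multiplication I am comparing against (the name naive_matmul! is hypothetical, and this is not the exact routine I timed). It does no blocking at all, so once the matrices stop fitting in cache it falls further behind:

    # Naive matrix-matrix multiplication: C = A*B with three nested loops
    # and no blocking (hypothetical sketch, not the benchmarked code).
    function naive_matmul!(C, A, B)
        m, n = size(C)
        k = size(A, 2)
        fill!(C, zero(eltype(C)))
        for j in 1:n          # column of B and C
            for l in 1:k      # contraction index
                @inbounds b = B[l, j]
                for i in 1:m  # innermost loop walks down a column (column-major friendly)
                    @inbounds C[i, j] += A[i, l] * b
                end
            end
        end
        return C
    end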
  
