Thanks for all the responses. It was never my intention to write
sophisticated code that can compete with BLAS; we have BLAS for that. I
just wanted to see how much you lose with a simple implementation. I guess
generic_matmatmul is at the level of simplicity I am still willing to
consider, and it already significantly outperforms my own naive
matrix-matrix multiplication. In the same setup, generic_matmatmul
plateaus at roughly 13 times slower than BLAS, without showing any effect
of cache saturation (which is, of course, exactly what it was written to
avoid).
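For reference, by "naive matrix-matrix multiplication" I mean the plain triple-loop version, which does nothing to exploit the cache. A minimal sketch in Python/NumPy (the name `naive_matmul` is just for illustration, not the actual code I benchmarked):

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook triple-loop matrix multiplication: O(n^3) flops with no
    blocking or tiling, so large matrices thrash the cache."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for l in range(k):
                s += A[i, l] * B[l, j]
            C[i, j] = s
    return C

# Sanity check against the BLAS-backed product:
A = np.random.rand(4, 5)
B = np.random.rand(5, 3)
assert np.allclose(naive_matmul(A, B), A @ B)
```

The gap between this and a tiled implementation grows once the matrices no longer fit in cache, which is exactly the regime where the 13x-vs-BLAS comparison above was measured.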