Hello, I had a lot of fun optimizing some inner loops over the past few days. Generally, I was able to squeeze out a last little bit of performance by writing out the broadcast!()s that appeared in the inner loop as explicit loops.
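
For illustration, here is a minimal sketch of what writing out a broadcast!() can look like. The function names, the choice of subtraction and the array sizes are assumptions made up for this example; they are not taken from the gist.

```julia
# Hypothetical example: subtract a column vector b from every column of A,
# storing the result in a preallocated matrix C.

# Version 1: let broadcast!() do the work.
function sub_broadcast!(C::Matrix{Float64}, A::Matrix{Float64}, b::Vector{Float64})
    broadcast!(-, C, A, b)
    return C
end

# Version 2: the same operation written out as explicit loops.
function sub_loop!(C::Matrix{Float64}, A::Matrix{Float64}, b::Vector{Float64})
    nrows, ncols = size(A)
    for j in 1:ncols          # arrays are column-major, so the row index runs innermost
        for i in 1:nrows
            C[i, j] = A[i, j] - b[i]
        end
    end
    return C
end

A = rand(4, 3)
b = rand(4)
C1 = similar(A)
C2 = similar(A)
sub_broadcast!(C1, A, b)
sub_loop!(C2, A, b)
```

Both versions fill the output with the same values; whether the written-out loop is actually faster probably depends on the Julia version and the array sizes, so it is worth timing both.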
However, when I tried to replace a final inner-loop vector operation by a BLAS equivalent, or by one from NumericExtensions, execution time shot up enormously. I don't understand why this is. I have the feeling it might be related to cache behaviour in the CPU and/or differences in inlining.

I've tried to isolate the behaviour in this gist <https://gist.github.com/davidavdav/a2332f6620c3bb259d51>, where I have kept the structure and dimensioning of the original task in place but replaced some operations by rand!(). In the gist, the main focus is the difference between mydot(), which is just an implementation of sumsq(), and the NumericExtensions version sumsq(). Called on its own, sumsq() is a bit faster than mydot(), but inside the inner loop it is about 10x slower on my machine (a Mac laptop).

Does anyone know what might be going on here?

Thanks,
---david