Hello, 

I had a lot of fun optimizing some inner loops over the last couple of days. 
Generally, I was able to squeeze out a last little bit of performance by 
writing out the broadcast!() calls that appeared in the inner loop.  
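
To illustrate what "writing out" a broadcast!() means here, a minimal sketch follows. The function name sub_rows! and the subtraction itself are hypothetical, chosen only for illustration; they are not taken from the gist.

```julia
# Hypothetical example, not the gist's code: a hand-written loop that does
# what broadcast!(-, y, x, mu') would do, i.e. subtract mu[j] from column j.
function sub_rows!(y::Matrix{Float64}, x::Matrix{Float64}, mu::Vector{Float64})
    nrow, ncol = size(x)
    for j in 1:ncol              # columns in the outer loop: arrays are column-major
        m = mu[j]
        @inbounds for i in 1:nrow
            y[i, j] = x[i, j] - m
        end
    end
    return y
end
```

The explicit loop avoids the generic broadcasting machinery, which is presumably where the small gain comes from.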

However, when I tried to replace a final inner-loop vector operation with a 
BLAS equivalent, or one from NumericExtensions, execution time shot up 
enormously.  I don't understand why this is; I have the feeling it might be 
related to cache behaviour in the CPU and/or a difference in inlining.  

I've tried to isolate the behaviour in this gist 
<https://gist.github.com/davidavdav/a2332f6620c3bb259d51>, where I have 
kept the structure and dimensions of the original task in place but 
replaced some operations with rand!().  In the gist, the main focus is the 
difference between mydot()---which is just an implementation of 
sumsq()---and the NumericExtensions version sumsq(). 
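
For readers without the gist at hand, a hand-written sum of squares could look roughly like the sketch below. The name mydot_sketch is hypothetical and the actual mydot() in the gist may differ in its details.

```julia
# Hypothetical stand-in for mydot(); the gist's actual code may differ.
function mydot_sketch(x::AbstractVector{Float64})
    s = 0.0
    @inbounds for i in 1:length(x)
        s += x[i] * x[i]         # accumulate the square of each element
    end
    return s
end
```

Called on its own, for example as mydot_sketch(rand(1000)), this returns the same quantity that sumsq() computes, up to floating-point rounding.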

Plain usage of sumsq() is a bit faster than mydot(), but inside the inner 
loop it is about 10x slower on my machine (a Mac laptop).  Does anyone 
know what might be going on here?

Thanks, 

---david
