I followed the advice at
http://julia.readthedocs.org/en/latest/manual/performance-tips/
so mydot now looks like this:

    function mydot{T}(x::Array{T})
        s = zero(T)
        @simd for i = 1:length(x)
            @inbounds s += x[i]*x[i]
        end
        s
    end

This leads to the same timing on my machine. Is that what you're looking for?

On Saturday, 8 November 2014 10:20:39 UTC+1, David van Leeuwen wrote:
>
> Hello,
>
> I had a lot of fun optimizing some inner loops over the past few days.
> Generally, I was able to squeeze out a last little bit of performance by
> writing out the broadcast!()s that appeared in the inner loop.
>
> However, when I tried to replace a final inner-loop vector operation by a
> BLAS equivalent, or one from NumericExtensions, execution time shot up
> enormously. I don't understand why this is. I have the feeling it might be
> related to cache behaviour in the CPU and/or differences in inlining.
>
> I've tried to isolate the behaviour in this gist
> <https://gist.github.com/davidavdav/a2332f6620c3bb259d51>, where I have
> kept the structure and dimensioning of the original task in place but
> replaced some operations by rand!(). In the gist, the main focus is the
> difference between mydot(), which is just an implementation of
> sumsq(), and the NumericExtensions version sumsq().
>
> Plain usage of sumsq() is a bit faster than mydot(), but inside the inner
> loop it is about 10x slower on my machine (a Mac laptop). Does anyone
> know what might be going on here?
>
> Thanks,
>
> ---david
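
For anyone who wants to reproduce the comparison without the gist, here is a minimal sketch of the kind of measurement discussed above. It is not the code from the gist: `innerloop`, `sumsq_builtin`, the buffer size and the iteration count are all made up for illustration, and the built-in `sum(abs2, x)` stands in for NumericExtensions' sumsq(). It is written in current Julia syntax (`where {T}`), not the 2014-era syntax of the mydot above.

```julia
using Random  # provides rand!

# mydot from the reply above, rewritten in current Julia syntax (`where {T}`).
function mydot(x::Array{T}) where {T}
    s = zero(T)
    @simd for i = 1:length(x)
        @inbounds s += x[i] * x[i]
    end
    s
end

# Hypothetical stand-in for the gist's inner loop: refill a buffer with
# rand!() and accumulate f(buf) over n iterations.
function innerloop(f, buf::Vector{Float64}, n::Int)
    total = 0.0
    for _ = 1:n
        rand!(buf)
        total += f(buf)
    end
    total
end

# Built-in sum of squares, used here instead of NumericExtensions' sumsq()
# (an assumption made so the sketch has no package dependencies).
sumsq_builtin(x) = sum(abs2, x)

buf = zeros(1024)  # made-up size, not the dimensioning used in the gist

# Warm up both paths once so compilation is not included in the timings.
innerloop(mydot, buf, 1)
innerloop(sumsq_builtin, buf, 1)

t_mydot = @elapsed innerloop(mydot, buf, 10_000)
t_sumsq = @elapsed innerloop(sumsq_builtin, buf, 10_000)
println("mydot: ", t_mydot, " s, sum(abs2, x): ", t_sumsq, " s")
```

Passing the kernel as a function argument keeps the two runs structurally identical, so any remaining difference should come from the kernel itself and not from the surrounding loop. The timings will of course depend on the machine and the Julia version.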