I followed the advice from
http://julia.readthedocs.org/en/latest/manual/performance-tips/
so mydot now looks like this:
function mydot{T}(x::Array{T})
    s = zero(T)
    @simd for i = 1:length(x)
        @inbounds s += x[i]*x[i]
    end
    s
end

This leads to the same timing on my machine.
Is that what you're looking for?
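
For anyone who wants to reproduce the timing, here is a minimal sketch of how I would measure it. The name mydot_sketch, the eltype-based accumulator, and the vector size are my own assumptions for illustration, not what the gist uses, and this only times the plain call, not the call inside the inner loop:

```julia
# Sketch only: mydot_sketch is a hypothetical stand-in for mydot above.
function mydot_sketch(x)
    s = zero(eltype(x))           # accumulator with the element type of x
    @simd for i = 1:length(x)
        @inbounds s += x[i] * x[i]
    end
    s
end

x = rand(10^6)                    # assumed test size, not taken from the gist
mydot_sketch(x)                   # first call compiles; keep it out of the timing
@time mydot_sketch(x)
```

Calling the function once before @time keeps compilation out of the measurement, which matters when comparing against a library routine that is already compiled.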

On Saturday, 8 November 2014 10:20:39 UTC+1, David van Leeuwen wrote:
>
> Hello, 
>
> I had a lot of fun optimizing some inner loops over the past couple of days. 
>  Generally, I was able to squeeze out a last little bit of performance by 
> writing out the broadcast!() calls that appeared in the inner loop.  
>
> However, when I tried to replace a final inner-loop vector operation with a 
> BLAS equivalent, or one from NumericExtensions, execution time shot up 
> enormously.  I don't understand why this is; I have the feeling it might be 
> related to CPU cache behaviour and/or a difference in inlining.  
>
> I've tried to isolate the behaviour in this gist 
> <https://gist.github.com/davidavdav/a2332f6620c3bb259d51>, where I have 
> kept the structure and dimensioning of the original task in place but 
> replaced some operations by rand!().  In the gist, the main focus is the 
> difference between mydot()---which is just an implementation of 
> sumsq()---and the NumericExtensions version sumsq(). 
>
> Plain usage of sumsq() is a bit faster than mydot(), but inside the inner 
> loop it is about 10x slower on my machine (a Mac laptop).  Does anyone 
> know what might be going on here?
>
> Thanks, 
>
> ---david
>
