sumsq internally invokes BLAS.dot.

When x is short, the overhead of the BLAS call can be large relative to
the work done. Also, NumericExtensions is not recommended for now: its
important functions have been moved to Julia Base. You should use sumabs2
from Julia Base instead, which has a more sophisticated strategy: when x
is short it uses a naive for-loop, and when x is long it calls BLAS.
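
The length-based strategy described above can be sketched roughly as
follows. This is only an illustration, not the code in Base: the name
`sumsq_hybrid` and the cutoff of 128 are hypothetical, and it is written
for current Julia (1.x, where BLAS lives in the LinearAlgebra standard
library) rather than the 0.3-era syntax used elsewhere in this thread.

```julia
using LinearAlgebra  # provides BLAS.dot in Julia 1.x

# Hypothetical cutoff; the actual threshold used by Base is not stated here.
const BLAS_CUTOFF = 128

function sumsq_hybrid(x::Vector{Float64})
    n = length(x)
    if n < BLAS_CUTOFF
        # Short input: a plain loop avoids the library-call overhead.
        s = 0.0
        @inbounds @simd for i in 1:n
            s += x[i] * x[i]
        end
        return s
    else
        # Long input: hand the work to BLAS (dot of x with itself, stride 1).
        return BLAS.dot(n, x, 1, x, 1)
    end
end
```

Where the crossover really lies depends on the machine and the BLAS
build, so the cutoff would need to be measured rather than assumed.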

Best,
Dahua


On Sunday, November 9, 2014 12:34:46 PM UTC+8, Erik Schnetter wrote:
>
> How large is length(x)? 
> What BLAS implementation is providing sum_sq? 
>
> -erik 
>
> On Sat, Nov 8, 2014 at 6:22 PM, David van Leeuwen 
> <david.va...@gmail.com> wrote: 
> > No, the problem is not optimizing the inner loop---I understand that
> > @inbounds works a bit faster (which is probably why sumsq() works
> > faster outside the loop).
> >
> > The problem is that `sumsq()` is about 10 times as slow as `mydot()`
> > when it is used in the inner loop.  I don't understand why.  They
> > should be similar in performance, but maybe there is some overhead in
> > calling a function from a module that completely kills the inner
> > loop, which is not there when I use (my own) function living in the
> > same global namespace.
> > 
> > ---david 
> > 
> > On Saturday, November 8, 2014 11:45:07 AM UTC+1, Simon Danisch wrote: 
> >> 
> >> I used the advice from: 
> >> http://julia.readthedocs.org/en/latest/manual/performance-tips/ 
> >> Which means mydot looks like this now: 
> >> function mydot{T}(x::Array{T}) 
> >>     s = zero(T) 
> >>     @simd for i = 1:length(x) 
> >>         @inbounds s += x[i]*x[i] 
> >>     end 
> >>     s 
> >> end 
> >> 
> >> This leads to the same timing on my machine. 
> >> Is that what you're looking for? 
> >> 
> >> Am Samstag, 8. November 2014 10:20:39 UTC+1 schrieb David van Leeuwen: 
> >>> 
> >>> Hello, 
> >>> 
> >>> I had a lot of fun optimizing some inner loops in the past few days.
> >>> Generally, I was able to squeeze out a last little bit of performance
> >>> by writing out the broadcast!()s that appeared in the inner loop.
> >>> 
> >>> However, when I tried to replace a final inner-loop vector operation
> >>> with a BLAS equivalent, or one from NumericExtensions, execution time
> >>> shot up enormously.  I don't understand why this is; I have the
> >>> feeling it might be related to cache behaviour in the CPU and/or
> >>> differences in inlining.
> >>> 
> >>> I've tried to isolate the behaviour in this gist, where I have kept
> >>> the structure and dimensioning of the original task in place but
> >>> replaced some operations by rand!().  In the gist, the main focus is
> >>> the difference between mydot()---which is just an implementation of
> >>> sumsq()---and the NumericExtensions version sumsq().
> >>> 
> >>> Plain usage of sumsq() is a bit faster than mydot(), but inside the
> >>> inner loop it is about 10x as slow on my machine (a Mac laptop).  Does
> >>> anyone know what might be going on here?
> >>> 
> >>> Thanks, 
> >>> 
> >>> ---david 
>
>
>
> -- 
> Erik Schnetter <schn...@cct.lsu.edu> 
> http://www.perimeterinstitute.ca/personal/eschnetter/ 
>
