You said that size(x) is (10000, 60) and that sumsq needs to work over 60 values.

The problem is not that 60 is too small, but that you are doing the 
computation along rows instead of columns.

If you have an m x n matrix and you do the computation per column, like the 
following:
```julia
for j = 1:n
    r[j] = sumabs2(view(x,:,j))
end
```
It will be very well optimized.
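(For readers on a current Julia, where `sumabs2` and `view` have since been folded into Base as `sum(abs2, ...)` and `@view`, the same per-column loop can be written with Base alone. This is a sketch; `colwise_sumsq` is just an illustrative wrapper name, not anything from the original code.)

```julia
# Per-column sum of squares: each @view x[:, j] is a contiguous slice
# in Julia's column-major layout, so the reduction streams linearly
# through memory.
function colwise_sumsq(x::AbstractMatrix)
    m, n = size(x)
    r = Vector{eltype(x)}(undef, n)
    for j in 1:n
        r[j] = sum(abs2, @view x[:, j])
    end
    return r
end
```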

But if you are doing things per row, you may run into serious performance 
issues. First, the per-row access pattern is not cache-friendly, especially 
when you have a large number of rows (note that the distance between 
adjacent elements along a row is m, the number of rows). Also, you either 
have to copy each row to an intermediate vector or work with a 
non-contiguous vector. Neither way will give you fantastic performance.
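For contrast, the per-row version might look like the sketch below (`rowwise_sumsq` is just an illustrative name, again in current-Julia spelling). Each row view strides through memory with step m, so the inner reduction touches a new cache line for almost every element.

```julia
# Per-row sum of squares: @view x[i, :] is a strided, non-contiguous
# view (stride m between consecutive elements), which defeats the
# cache and vectorization.
function rowwise_sumsq(x::AbstractMatrix)
    m, n = size(x)
    r = Vector{eltype(x)}(undef, m)
    for i in 1:m
        r[i] = sum(abs2, @view x[i, :])
    end
    return r
end
```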

The rule of thumb is to organize your data so that you work in a 
per-column fashion (instead of per-row).
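You can see the layout concretely with `strides` (shown here in current-Julia spelling), which reports the memory step, in elements, taken when each index is incremented by one:

```julia
# Julia stores dense matrices column-major: the first dimension is
# contiguous, the second jumps by the number of rows.
x = zeros(1000, 60)
println(strides(x))  # stepping down a column moves 1 element;
                     # stepping along a row jumps 1000 elements
```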

- Dahua


On Sunday, November 9, 2014 7:22:56 PM UTC+8, David van Leeuwen wrote:
>
> Hi, 
>
> On Sunday, November 9, 2014 5:34:46 AM UTC+1, Erik Schnetter wrote:
>>
>> How large is length(x)? 
>>
>  
> in the example on the gist, size(x) = 10000, 60.  So the sumsq() needs to 
> be performed over 60 values, admittedly not a lot. 
>
> What are the overheads?  Is the external library call, or does BLAS.dot() 
> need to do a whole load of admin to figure out the correct approach?
>  
>
>> What BLAS implementation is providing sum_sq? 
>>
>> -erik 
>>
>> On Sat, Nov 8, 2014 at 6:22 PM, David van Leeuwen 
>> <david.va...@gmail.com> wrote: 
>> > No, the problem is not optimizing the inner loop---I understand that 
>> the 
>> > @inbounds works a bit faster (which is probably why sumsq() works 
>> faster 
>> > outside the loop). 
>> > 
>> > The problem is that `sumsq()` is about 10 times as slow as `mydot()` 
>> when it 
>> > is used in the inner loop.  I don't understand why.  They should be 
>> similar 
>> > in performance, but maybe there is some overhead in calling a function 
>> from 
>> > a module that completely kills the inner loop, which is not there when 
>> I use 
>> > (my own) function living in the same global name space. 
>> > 
>> > ---david 
>> > 
>> > On Saturday, November 8, 2014 11:45:07 AM UTC+1, Simon Danisch wrote: 
>> >> 
>> >> I used the advice from: 
>> >> http://julia.readthedocs.org/en/latest/manual/performance-tips/ 
>> >> Which means mydot looks like this now: 
>> >> function mydot{T}(x::Array{T}) 
>> >>     s = zero(T) 
>> >>     @simd for i =1:length(x) 
>> >>        @inbounds s += x[i]*x[i] 
>> >>     end 
>> >>     s 
>> >> end 
>> >> 
>> >> This leads to the same timing on my machine. 
>> >> Is that what you're looking for? 
>> >> 
>> >> Am Samstag, 8. November 2014 10:20:39 UTC+1 schrieb David van Leeuwen: 
>> >>> 
>> >>> Hello, 
>> >>> 
>> >>> I had a lot of fun optimizing some inner loops in the couple of few 
>> days. 
>> >>> Generally, I was able to churn out a last little bit of performance 
>> by 
>> >>> writing out broadcast!()s that appeared in the inner loop. 
>> >>> 
>> >>> However, when I tried to replace a final inner-loop vector operation 
>> by a 
>> >>> BLAS equivalent, or one from NumericExtensions, execution time shot 
>> up 
>> >>> enormously.  I don't understand why this is, I have the feeling it 
>> might be 
>> >>> related to cache-behaviour in the CPU and/or difference in inlining. 
>> >>> 
>> >>> I've tried to isolate the behaviour in this gist, where I have kept 
>> the 
>> >>> structure and dimensioning of the original task in place but replaced 
>> some 
>> >>> operations by rand!().  In the gist, the main focus is the difference 
>> >>> between mydot()---which is just an implementation of sumsq()---and 
>> the 
>> >>> NumericExtensions version sumsq(). 
>> >>> 
>> >>> Plain usage of sumsq() is a bit faster than mydot(), but inside the 
>> inner 
>> >>> loop it is about 10x as slow on my machine (a mac laptop).  Does 
>> anyone know 
>> >>> what might be going on here? 
>> >>> 
>> >>> Thanks, 
>> >>> 
>> >>> ---david 
>>
>>
>>
>> -- 
>> Erik Schnetter <schn...@cct.lsu.edu> 
>> http://www.perimeterinstitute.ca/personal/eschnetter/ 
>>
>
