On 2009-07-08, Charles R Harris <charlesr.har...@gmail.com> wrote:
[clip]
> How do the benchmarks compare with just making contiguous copies?
> Which is blocking of sort, I suppose.
I think that's slower than just walking over the discontiguous array:

1) The new code (on the Athlon machine):

    $ ./bench-reduce.py ../../../build/lib.linux-i686-2.6 8000,260 0,1 0
    <8000> {260} [ <2080> {8} ]
    -15 vs. 0.876923 : 1 2
    new 8000,260 0,1 0   27.7 29.5 28.9 30.2 30.3

2) Old code, with "x.transpose(1,0).copy().sum(axis=-1)":

    $ ./bench-reduce.py '' -o 8000,260 0,1 0
    old 8000,260 0,1 0   5.03 5.13 5.11 5.07 4.87

3) Old code, with "x.sum(axis=0)":

    $ ./bench-reduce.py '' 8000,260 0,1 0
    old 8000,260 0,1 0   18.6 17.3 16.6 18.9 17.2

The last five numbers on each row are repeats per second, over five
repeats.

(Above, the trailing dimension 260 was chosen to maximize the speed of
the old code: 260 is almost twice as fast as 256... This seems to be a
cache padding effect.)

I think this is expected: the main bottleneck is probably memory
bandwidth, so when you make a copy and then reduce it, you round-trip
the array twice through the CPU cache. This is likely to be more
expensive than a single pass, even if that pass is discontiguous.

Also, I'm not sure whether copying a discontiguous array in NumPy
actually has any cache optimization, or whether it just walks the array
in virtual C-order, copying the data. If so, it probably hits cache
problems similar to those of the non-optimized reduction operation.

-- 
Pauli Virtanen
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
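[The copy-then-reduce comparison discussed above can be reproduced with plain NumPy and the standard-library timeit module. This is a minimal sketch, not the bench-reduce.py script from the thread: the 8000x260 shape matches the benchmark, but the repeat and number counts are arbitrary choices.]

```python
import timeit
import numpy as np

# Shape from the benchmark above: 8000 rows, 260 columns (C order).
x = np.ones((8000, 260))

# Reducing along axis 0 walks the array discontiguously,
# but makes only a single pass over the data.
t_direct = min(timeit.repeat(lambda: x.sum(axis=0),
                             number=10, repeat=5))

# Making a contiguous copy first means two passes over the data:
# one to write the copy, one to reduce it.
t_copy = min(timeit.repeat(lambda: x.transpose(1, 0).copy().sum(axis=-1),
                           number=10, repeat=5))

print(f"direct reduce: {t_direct:.4f} s per 10 runs")
print(f"copy + reduce: {t_copy:.4f} s per 10 runs")

# Sanity check: both strategies compute the same reduction.
assert np.allclose(x.sum(axis=0),
                   x.transpose(1, 0).copy().sum(axis=-1))
```

Whether the direct walk or the copy wins depends on the machine's cache and memory bandwidth, which is exactly the point of the numbers above.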