On 2009-07-08, Charles R Harris <charlesr.har...@gmail.com> wrote:
[clip]
> How do the benchmarks compare with just making contiguous copies?
> Which is blocking of sort, I suppose.
I think that's slower than just walking over the discontiguous array:

1) The new code (on the Athlon machine):

    $ ./bench-reduce.py ../../../build/lib.linux-i686-2.6 8000,260 0,1 0
    <8000> {260} [ <2080> {8} ]
    -15 vs. 0.876923 : 1 2
    new 8000,260 0,1 0   27.7 29.5 28.9 30.2 30.3

2) Old code, with "x.transpose(1,0).copy().sum(axis=-1)":

    $ ./bench-reduce.py '' -o 8000,260 0,1 0
    old 8000,260 0,1 0   5.03 5.13 5.11 5.07 4.87

3) Old code, with "x.sum(axis=0)":

    $ ./bench-reduce.py '' 8000,260 0,1 0
    old 8000,260 0,1 0   18.6 17.3 16.6 18.9 17.2

The last five numbers on each row are repeats per second, over five
repeats.

(Above, the trailing dimension 260 was chosen to maximize the speed of
the old code: 260 is almost twice as fast as 256... This seems to be a
cache padding effect.)

I think this is expected: the main bottleneck is probably memory
bandwidth, so when you make a copy and then reduce it, you round-trip
the array twice through the CPU cache. This is likely to be more
expensive than a single pass, even if that pass is discontiguous.

Also, I'm not sure whether copying a discontiguous array in NumPy
actually has any cache optimization, or whether it just walks the array
in virtual C-order, copying the data. If so, it probably hits cache
problems similar to those of the non-optimized reduction operation.

-- 
Pauli Virtanen
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
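[The copy-then-reduce comparison discussed above can be reproduced with plain NumPy and the standard-library timeit module. This is a minimal sketch, not the bench-reduce.py script from the thread: the 8000x260 shape matches the benchmark, but the repeat and number counts are arbitrary choices.]

```python
import timeit
import numpy as np

# Shape from the benchmark above: 8000 rows, 260 columns (C order).
x = np.ones((8000, 260))

# Reducing along axis 0 walks the array discontiguously,
# but makes only a single pass over the data.
t_direct = min(timeit.repeat(lambda: x.sum(axis=0),
                             number=10, repeat=5))

# Making a contiguous copy first means two passes over the data:
# one to write the copy, one to reduce it.
t_copy = min(timeit.repeat(lambda: x.transpose(1, 0).copy().sum(axis=-1),
                           number=10, repeat=5))

print(f"direct reduce: {t_direct:.4f} s per 10 runs")
print(f"copy + reduce: {t_copy:.4f} s per 10 runs")

# Sanity check: both strategies compute the same reduction.
assert np.allclose(x.sum(axis=0),
                   x.transpose(1, 0).copy().sum(axis=-1))
```

Whether the direct walk or the copy wins depends on the machine's cache and memory bandwidth, which is exactly the point of the numbers above.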