Hi,

Ticket #1143 points out that Numpy's reduction operations are not always cache-friendly. I worked a bit on tuning them.
Just to tickle some interest, a "pathological" case before optimization:

    In [1]: import numpy as np
    In [2]: x = np.zeros((80000, 256))
    In [3]: %timeit x.sum(axis=0)
    10 loops, best of 3: 850 ms per loop

After optimization:

    In [1]: import numpy as np
    In [2]: x = np.zeros((80000, 256))
    In [3]: %timeit x.sum(axis=0)
    10 loops, best of 3: 78.5 ms per loop

For comparison, a reduction operation on a contiguous array of the same size:

    In [4]: x = np.zeros((256, 80000))
    In [5]: %timeit x.sum(axis=1)
    10 loops, best of 3: 88.9 ms per loop

Funnily enough, the contiguous case is now actually slower than the optimized reduction over the axis with the larger stride. The improvement factor depends on the CPU and its cache size.

The code is here:

    http://github.com/pv/numpy-work/ticket/1143-speedup-reduce

A benchmark script numpy/core/src/bench-reduce.py is included. Just do

    ./bench-reduce.py ../../../build/lib.linux-i686-2.6 | tee output.txt

to run it against the system Numpy, and

    ./bench-reduce.py plot < output.txt

to plot the results. Here's one result set:

    http://www.iki.fi/pav/tmp/xeon.png
    http://www.iki.fi/pav/tmp/xeon.txt

The y-axis is proportional to the number of elemental (addition) operations per second, and the x-axis is the total size of the array. Vertical lines show how the result changed with the cache optimization. A factor-of-2 improvement seems common in the test cases (typically, arrays that are "skinny" in some dimension). There are also some factors of 10, especially for arrays larger than the 6144 kB cache of this particular CPU. Also, for this CPU, there are no significant regressions.

On an older CPU (slower, smaller cache), the situation is slightly different:

    http://www.iki.fi/pav/tmp/athlon.png
    http://www.iki.fi/pav/tmp/athlon.txt

On average, it's still an improvement in many cases. However, now there are more regressions. The significant ones (factor of 1/2) are N-D arrays where the reduction runs over an axis with a small number of elements.

The optimization is in evaluating the reduction operation with a blocked loop (pseudocode)

    for block_idx in loop_over_blocks_except_axis(arr):
        result[block_idx] = arr[block_idx, 0]
        for j in xrange(1, N):
            result[block_idx] += arr[block_idx, j]    } ufunc loop

instead of the usual ufunc reduction loop

    for idx in loop_over_array_except_axis(arr):
        result[idx] = arr[idx, 0]            }
        for j in xrange(1, N):               } ufunc loop
            result[idx] += arr[idx, j]       }

The blocks are chosen to minimize the stride of the innermost ufunc loop, maximizing reuse of the elements that fit in a single cache line. This has to be balanced against avoiding inner loops over very small dimensions, which incur some overhead. For N-D arrays, I also attempt to lump several dimensions into the ufunc loop, if the data can be iterated over with a single stride.

The decision heuristics are in ufunc_object.c:_construct_reduce, and the new outer loop NOBUFFER_UFUNCREDUCE_TRANSPOSE is in ufunc_object.c:PyUFunc_Reduce.

Unfortunately, improving the performance using the above scheme comes at the cost of some slightly murky heuristics. I didn't manage to come up with an optimal decision rule, so they are partly empirical. There is one parameter tuning the cross-over between minimizing stride and avoiding small dimensions. (This is more or less straightforward.) Another empirical decision is required in choosing whether to use the usual reduction loop, which is better in some cases, or the blocked loop. How to make this latter choice is not so clear to me.
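To make the idea concrete, here is roughly the same scheme expressed in pure Python (only an illustration: the function name, block size, and the Python-level slicing are made up here; the real implementation lives in the C ufunc loops):

    import numpy as np

    def blocked_sum_axis0(arr, block=256):
        # Sum a 2-D array over axis 0, walking the kept axis in contiguous
        # blocks so that the innermost operation has unit stride and reuses
        # every element of a fetched cache line.
        n, m = arr.shape
        result = np.empty(m, dtype=arr.dtype)
        for start in range(0, m, block):       # loop over blocks of the kept axis
            stop = min(start + block, m)
            acc = arr[0, start:stop].copy()    # result[block_idx] = arr[block_idx, 0]
            for j in range(1, n):              # loop over the reduction axis
                acc += arr[j, start:stop]      # contiguous "ufunc loop" over the block
            result[start:stop] = acc
        return result

    x = np.zeros((80000, 256))
    assert np.allclose(blocked_sum_axis0(x), x.sum(axis=0))

The point is only the loop ordering: each pass over a row touches a short contiguous block that stays in cache, instead of striding down a whole column per output element.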
The present approach seems to mostly work: the benchmark shows improvements and no severe regressions. However, I think the regressions on the older CPU point out that something still needs to be tuned. The main suspect is the latter decision rule, but I'm not sure what the correct solution here is.

Comments? I'm not a cache optimization expert, so bright ideas about better decision rules and further optimizations would be appreciated.

Also, I think this addition could be nice to have included in Numpy. The main value IMHO is in boosting the most cache-inefficient corner cases. I think many users do not realize that np.zeros((256, 80000)).sum(axis=1) is currently more efficient than np.zeros((80000, 256)).sum(axis=0). And as shown above, there is no real reason to have this performance difference...

Cheers,
Pauli