Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-13 Thread Pauli Virtanen
Wed, 08 Jul 2009 22:16:22 +, Pauli Virtanen wrote: [clip] On an older CPU (slower, smaller cache), the situation is slightly different: http://www.iki.fi/pav/tmp/athlon.png http://www.iki.fi/pav/tmp/athlon.txt On average, it's still an improvement in many cases. However,

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-13 Thread Citi, Luca
Hi Pauli, on my PC I have tried this and some of the regressions disappear; maybe you can give it a try. In its present state it is compiler- and architecture-dependent, and therefore not the best choice, but it may be worth trying. Best, Luca /* My additions are unindented */ /*

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-09 Thread Matthieu Brucher
2009/7/9 Pauli Virtanen pav...@iki.fi: On 2009-07-08, Stéfan van der Walt ste...@sun.ac.za wrote: I know very little about cache optimality, so excuse the triviality of this question: Is it possible to design this loop optimally (taking into account certain build-time measurable parameters),

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-09 Thread David Cournapeau
Matthieu Brucher wrote: Unfortunately, this is not possible. We've been playing with blocking loops for a long time in finite difference schemes, and it is always compiler dependent. You mean CPU dependent, right? I can't see how a reasonable optimizing compiler could make a big difference on

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-09 Thread David Warde-Farley
On 8-Jul-09, at 6:16 PM, Pauli Virtanen wrote: Just to tickle some interest, a pathological case before optimization: In [1]: import numpy as np In [2]: x = np.zeros((8, 256)) In [3]: %timeit x.sum(axis=0) 10 loops, best of 3: 850 ms per loop After optimization: In

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-09 Thread Matthieu Brucher
2009/7/9 David Cournapeau da...@ar.media.kyoto-u.ac.jp: Matthieu Brucher wrote: Unfortunately, this is not possible. We've been playing with blocking loops for a long time in finite difference schemes, and it is always compiler dependent. You mean CPU dependent, right? I can't see how a

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-09 Thread Pauli Virtanen
Thu, 09 Jul 2009 09:54:26 +0200, Matthieu Brucher wrote: 2009/7/9 Pauli Virtanen pav...@iki.fi: [clip] I'm still kind of hoping that it's possible to make some minimal assumptions about CPU caches in general, and have a rule that decides a code path that is good enough, if not optimal.
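
The kind of rule hinted at above can be sketched in a few lines. This is only an illustration of the idea of sizing reduction blocks from an assumed cache budget; the function name, the 256 KiB figure, and the layout assumptions are illustrative, not values taken from the actual patch:

    def choose_outer_block(shape, itemsize, cache_bytes=256 * 1024):
        """Pick how many outer rows to reduce per pass so that one block of
        the operand, plus one output row, fits the assumed cache budget."""
        ncols = shape[-1]
        row_bytes = ncols * itemsize
        # reserve one row's worth of space for the accumulator, keep >= 1 row
        block = max(1, (cache_bytes - row_bytes) // row_bytes)
        return min(block, shape[0])

For a float64 array of shape (100000, 256) and a 256 KiB budget this yields 127 rows per block; a rule of this sort only needs a rough, conservative cache size to be "good enough" in the sense discussed above.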

[Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-08 Thread Pauli Virtanen
Hi, Ticket #1143 points out that Numpy's reduction operations are not always cache friendly. I worked a bit on tuning them. Just to tickle some interest, a pathological case before optimization: In [1]: import numpy as np In [2]: x = np.zeros((8, 256)) In [3]: %timeit
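
For readers who want to reproduce the effect, here is a minimal timing sketch along the lines of the IPython session quoted above; the (100000, 256) shape is an arbitrary stand-in, since the shape in the archived message is truncated:

    import numpy as np
    import timeit

    # a tall C-contiguous array: the axis-0 reduction strides across rows,
    # which is the cache-unfriendly pattern described in ticket #1143
    x = np.zeros((100000, 256))
    t = timeit.timeit(lambda: x.sum(axis=0), number=10) / 10
    print("x.sum(axis=0): %.1f ms per loop" % (t * 1e3))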

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-08 Thread Charles R Harris
On Wed, Jul 8, 2009 at 4:16 PM, Pauli Virtanen pav...@iki.fi wrote: Hi, Ticket #1143 points out that Numpy's reduction operations are not always cache friendly. I worked a bit on tuning them. Just to tickle some interest, a pathological case before optimization: In

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-08 Thread Stéfan van der Walt
Hi Pauli, 2009/7/9 Pauli Virtanen pav...@iki.fi: Unfortunately, improving the performance using the above scheme comes at the cost of some slightly murky heuristics. I didn't manage to come up with an optimal decision rule, so they are partly empirical. There is one parameter tuning the

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-08 Thread Pauli Virtanen
On 2009-07-08, Charles R Harris charlesr.har...@gmail.com wrote: [clip] How do the benchmarks compare with just making contiguous copies? Which is blocking of a sort, I suppose. I think that's slower than just walking over the discontiguous array: 1) The new code: (on the Athlon
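
The "contiguous copies" alternative asked about can be timed directly against walking the strided axis in place. A rough sketch, with an illustrative shape and no claim about which side wins on a given CPU:

    import numpy as np
    import timeit

    x = np.zeros((100000, 256))   # illustrative shape, not the one from the thread

    def direct():
        # reduce over the strided axis in place
        return x.sum(axis=0)

    def via_copy():
        # make the reduction axis contiguous first, then reduce; the transpose
        # copy is itself an extra full pass over roughly 200 MB of data
        return np.ascontiguousarray(x.T).sum(axis=1)

    for name, fn in [("direct", direct), ("contiguous copy", via_copy)]:
        t = timeit.timeit(fn, number=10) / 10
        print("%-16s %.1f ms per loop" % (name, t * 1e3))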

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-08 Thread Pauli Virtanen
On 2009-07-08, Stéfan van der Walt ste...@sun.ac.za wrote: I know very little about cache optimality, so excuse the triviality of this question: Is it possible to design this loop optimally (taking into account certain build-time measurable parameters), or is it the kind of thing that can only

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-08 Thread Robert Kern
On Wed, Jul 8, 2009 at 18:02, Pauli Virtanen pav...@iki.fi wrote: On 2009-07-08, Stéfan van der Walt ste...@sun.ac.za wrote: I know very little about cache optimality, so excuse the triviality of this question: Is it possible to design this loop optimally (taking into account certain build-time

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-08 Thread David Cournapeau
On Thu, Jul 9, 2009 at 8:02 AM, Pauli Virtanen pav...@iki.fi wrote: I don't think we want to go the ATNumPy route, or even have tunable parameters chosen at build or compile time. Detecting things like cache size at compile time should not be too difficult, at least for common platforms. Even
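
How such detection might look at runtime (rather than at compile time) can be sketched with a guarded sysconf probe; whether the SC_LEVEL*_CACHE_* keys are exposed at all is platform-dependent, and the 512 KiB fallback is an arbitrary assumption:

    import os

    def l2_cache_size(default=512 * 1024):
        """Return the L2 cache size in bytes if the platform reports it,
        otherwise fall back to a fixed guess."""
        key = "SC_LEVEL2_CACHE_SIZE"
        try:
            if key in os.sysconf_names:
                size = os.sysconf(key)
                if size > 0:
                    return size
        except (OSError, ValueError):
            pass
        return default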

Re: [Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

2009-07-08 Thread Charles R Harris
On Wed, Jul 8, 2009 at 5:02 PM, Pauli Virtanen pav...@iki.fi wrote: On 2009-07-08, Stéfan van der Walt ste...@sun.ac.za wrote: I know very little about cache optimality, so excuse the triviality of this question: Is it possible to design this loop optimally (taking into