Tue, 2010-06-15 at 11:35 -0400, Anne Archibald wrote:
[clip: Numpy ufunc inner loops are slow]
> This is a bit strange. I think I'd still vote for including this
> optimization, since one hopes the inner loop will get faster at some
> point. (If nothing else, the code generator can probably be made to
> generate specialized loops for common cases.)
Ok, apparently I couldn't read: Numpy already unrolls the ufunc loop
for C-contiguous arrays, which is why there was no difference. This
should be fairer:

    x = np.zeros((2,)*20).transpose([1,0]+range(2,20))
    y = x.flatten()

    %timeit x+x
    10 loops, best of 3: 122 ms per loop
    %timeit y+y
    10 loops, best of 3: 19.9 ms per loop

And with the loop order optimization:

    %timeit x+x
    10 loops, best of 3: 20 ms per loop

To get an idea how things look without collapsing the loops, one can
engineer nastier arrays:

    x = np.zeros([50*5,50*3,50*2])[:-23:5, :-29:3, 31::2]
    %timeit x+x
    100 loops, best of 3: 2.83 ms (3.59 ms) per loop

[clip: minimize sum of strides]
> I suspect that it isn't optimal. As I understand it, the key
> restriction imposed by the cache architecture is that an entire cache
> line - 64 bytes or so - must be loaded at once. For strides that are
> larger than 64 bytes I suspect that it simply doesn't matter how big
> they are. (There may be some subtle issues with cache associativity
> but this would be extremely architecture-dependent.) So I would say,
> first do any dimensions in which some or all strides are less than 64
> bytes, starting from the smallest. After that, any order you like is
> fine.

This all comes down to designing a suitable objective function.
Suppose `s_{i,j}` is the stride of array `i` in dimension `j`. The
order-by-sum-of-strides rule reads

    f_j = \sum_i |s_{i,j}|

and an axis permutation is chosen that makes the sequence `f_j`
decreasing. Other heuristics would use a different `f_j`.

One metric I used previously when optimizing the reductions was the
number of elements that fit in a cache line, as you suggested. The sum
of this over all arrays should be maximized for the innermost loops,
so one could take the objective function

    f_j = - C_1 C_2 \sum_i CACHELINE / (1 + C_2 |s_{i,j}|)

with some constants C_1 and C_2, which might give better results. Note
that the stride can be zero for broadcast arrays, so we need to add a
cutoff for that. With a bias toward C order, one could then have

    f_j = - C_3 j - C_1 C_2 \sum_i CACHELINE / (1 + C_2 |s_{i,j}|)

The question is then just choosing C_1, C_2, and C_3 suitably. (A
rough sketch of such an ordering is included further below.)

Another optimization could be flipping negative strides. This sounds a
bit dangerous, though, so one would need to think whether it could
break, e.g., `a += a[::-1]` and the like.

[clip]
> I'm more worried this may violate some users' assumptions. If a user
> knows they need an array to be in C order, really they should use
> ascontiguousarray. But as it stands it's enough to make sure that it's
> newly-allocated as the result of an arithmetic expression. Incautious
> users could suddenly start seeing copies happening, or even transposed
> arrays being passed to C and Fortran code.

Yes, it's a semantic change, and we have to make it consciously (and
conscientiously, to boot :).

I believe f2py already checks the contiguity of intent(inout)
arguments? The same should go for most C code that accepts Numpy
arrays: at least PyArray_IsContiguous should be checked before
assuming contiguity. But this would still mean that code that
previously worked could start raising exceptions.

Personally, I believe this change is worth making, with a suitable
mention in the release notes.
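Concretely, code on the receiving end can guard against layout changes
along these lines. This is only a minimal Python-level sketch;
`process_inplace` is a hypothetical consumer standing in for
f2py-wrapped or hand-written C code:

    import numpy as np

    def process_inplace(a):
        # Hypothetical consumer that requires a C-contiguous buffer.
        if not a.flags['C_CONTIGUOUS']:
            raise ValueError("expected a C-contiguous array")
        a *= 2  # operate directly on the buffer

    x = np.ones((100, 100))
    y = x + x.T   # do not assume anything about the result's layout

    # Check-and-copy explicitly; ascontiguousarray copies only when
    # the layout actually requires it.  (If a copy is made, in-place
    # changes of course do not propagate back to y.)
    process_inplace(np.ascontiguousarray(y))

    # Or state the requirement up front.
    process_inplace(np.require(y, requirements=['C']))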
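And to make the loop-ordering heuristics above concrete, here is a
rough Python sketch of choosing the axis permutation. The function
name `loop_order`, the default constants, and the example at the end
are only illustrative; they are not what the actual ufunc machinery
would use:

    import numpy as np

    CACHELINE = 64  # bytes; illustrative value

    def loop_order(arrays, C1=1.0, C2=1.0, C3=0.0):
        """Order axes so that the 'cheapest' one is iterated innermost.

        Evaluates  f_j = -C3*j - C1*C2 * sum_i CACHELINE/(1 + C2*|s_ij|)
        and sorts axes by decreasing f_j, so axes with small strides
        (many elements per cache line) end up last, i.e. innermost.
        """
        ndim = arrays[0].ndim
        f = []
        for j in range(ndim):
            # NOTE: broadcast axes with stride 0 would need the cutoff
            # mentioned above; not handled in this sketch.
            total = sum(CACHELINE / (1.0 + C2 * abs(a.strides[j]))
                        for a in arrays)
            f.append(-C3 * j - C1 * C2 * total)
        return sorted(range(ndim), key=lambda j: f[j], reverse=True)

    x = np.zeros((2,)*4).transpose([1, 0, 2, 3])
    print(loop_order([x, x, np.empty(x.shape)]))   # -> [1, 0, 2, 3]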
> It's also worth exploring common usage patterns to make sure that
> numpy still gravitates towards using C contiguous arrays for
> everything. I'm imagining a user who at some point adds a transposed
> array to a normal array, then does a long series of computations on
> the result. We want as many of those operations as possible to operate
> on contiguous arrays, but it's possible that an out-of-order array
> could propagate indefinitely, forcing all loops to be done with one
> array having large strides, and resulting in output that is still
> out-of-order.

I think, at present, non-C-contiguous arrays will propagate
indefinitely.

> Some preference for C contiguous output is worth adding.

Suggestions for better heuristics are welcome, just state them in the
form of an objective function :)

> It would also be valuable to build some kind of test harness to track
> the memory layout of the arrays generated in some "typical"
> calculations as well as the ordering of the loop used.
>
> More generally, this problem is exactly what ATLAS is for - finding
> cache-optimal orders for linear algebra. So we shouldn't expect it to
> be simple.

Yes, I do not think any of us is expecting it to be simple. I don't
think we can aim for the optimal solution, since it is ill-defined,
but only for one that is "good enough in practice".

-- 
Pauli Virtanen

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion