On 01/01/2008, Neal Becker <[EMAIL PROTECTED]> wrote: > This is a c-api question. > > I'm trying to get iterators that are both fast and reasonably general. I > did confirm that iterating using just the general PyArrayIterObject > protocol is not as fast as using c-style pointers for contiguous arrays.
I'd like to point out that "contiguous" can be misleading as it is used in numpy. An array is flagged contiguous if all the elements are contiguous *and* they are arranged as a C-ordered multidimensional array - i.e., the last index changes the fastest as you walk through the array elements, and the next-to-last chenges the next fastest, and so on. > Please confirm if my understanding is correct. There are 2 cases to > consider. There are plain old dense arrays, which will be contiguous, and > there are array 'views' or 'slices'. The views will not be contiguous. > However, in all cases, the strides between elements in one dimension are > constant. This last point is key. As long as the stride in the last > dimension is a constant, a fast iterator is possible. > > I put together a small test, which seems to work for the cases I tried. > The idea is to use the general iterator (via PyArray_IterAllButAxis), but > iteration over the last dimension is done with a c-style pointer. This is not an unreasonable approach, but it can suffer badly if the last dimension is "small", for example if you have an array of shape (1000,1000,2). Such arrays can arise for example from working with complex numbers as pairs of real numbers. Where is the acceleration coming from in this code? You can walk through a multidimensional array in pure C, just incrementing pointers as needed, without too much difficulty. If you want to save bookkeeping overhead, you can partially flatten the array: if your last dimension has stride 7 and length 11, and your second-to-last dimension has stride 77, you can flatten (using reshape, in python, to get a view) the last two dimensions of the array and use a nice quick iterator on the combined dimension. For a "C contiguous" array, this means you just walk through the whole array with a single layer of looping. I suspect that the place you will see big slowdowns is where data is accessed in an order that doesn't make sense from a cache point of view. If four-byte chunks of data are spaced every eight bytes, then the processor spends twice as much time waiting for cache loads as if they are spaced every four bytes. And most numpy tasks are almost certainly memory-bound - all the ufuncs I can think of almost certainly are (arctan of a double almost certainly takes less time than loading and storing a double). If you walk through an array in the wrong order - changing the big strides fastest and the small strides slowest - you'll almost certainly have to shuffle the whole array through cache eight times or more. This is even ignoring cache misses. (To cut down on cache misses you can use the GCC prefetch instructions, or god-knows-what equivalent in other compilers.) I seem to recall that numpy already reorders its walks through arrays within ufuncs - does it do so with cache in mind? For that matter, is information on the CPU's caches available to numpy? Anne P.S. Here is code to flatten an array as much as possible without changing the order of flatiter: def flatteniter(A): i=0 while i<len(A.shape)-1: if A.shape[i+1]*A.strides[i+1]==A.strides[i]: newshape = A.shape[:i]+(A.shape[i]*A.shape[i+1],)+A.shape[i+2:] A = A.reshape(newshape) else: i+=1 return A This is a quick hack, as you can see I was slack about docstrings and whatnot. But routinely running it over every array before applying flatiter wouldn't hurt anything (though for all I know flatiter may do this already). -A _______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion