[Numpy-discussion] fast iteration (I think I've got it)

2008-01-01 Thread Neal Becker
This is a c-api question.

I'm trying to get iterators that are both fast and reasonably general.  I
did confirm that iterating using just the general PyArrayIterObject
protocol is not as fast as using c-style pointers for contiguous arrays.

Please confirm if my understanding is correct.  There are 2 cases to
consider.  There are plain old dense arrays, which will be contiguous, and
there are array 'views' or 'slices'.  The views will not be contiguous. 
However, in all cases, the strides between elements in one dimension are
constant.  This last point is key.  As long as the stride in the last
dimension is a constant, a fast iterator is possible.

I put together a small test, which seems to work for the cases I tried.
The idea is to use the general iterator (via PyArray_IterAllButAxis), but
iteration over the last dimension is done with a c-style pointer.

I have tested it with these cases, and the results appear correct:

a = array ((xrange(4),xrange(4,8),xrange(8,12)), int)
b = a[:,::2]
c = a[:,1::2]
d = a[:,-1::-1]
sum3(a)
0 1 2 3 
4 5 6 7 
8 9 10 11 
sum3(b)
0 2 
4 6 
8 10 
sum3(c) 
1 3 
5 7 
9 11 
sum3 (d)
3 2 1 0 
7 6 5 4 
11 10 9 8 

templatetypename T
inline T sum3 (object const o) {
  PyArrayObject* arr = (PyArrayObject*)(o.ptr());
  int last_dim = arr-nd - 1;
  PyArrayIterObject* it = (PyArrayIterObject*)(PyArray_IterAllButAxis
(o.ptr(), last_dim));

  for (int i = 0 ; i  arr-nd; ++i)
std::cout  arr-dimensions[i]  ' ';
  std::cout  '\n';
  while (it-index  it-size) {
T * dptr = (T*)(it-dataptr);
const int stride = arr-strides[last_dim];
for (int i = 0; i  arr-dimensions[last_dim]; ++i) {
  std::cout  *dptr  ' ';
  dptr += stride/sizeof(T);
}
std::cout  '\n';
  
PyArray_ITER_NEXT(it);
  }
  return 0;
}


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fast iteration (I think I've got it)

2008-01-01 Thread Anne Archibald
On 01/01/2008, Neal Becker [EMAIL PROTECTED] wrote:
 This is a c-api question.

 I'm trying to get iterators that are both fast and reasonably general.  I
 did confirm that iterating using just the general PyArrayIterObject
 protocol is not as fast as using c-style pointers for contiguous arrays.

I'd like to point out that contiguous can be misleading as it is
used in numpy. An array is flagged contiguous if all the elements are
contiguous *and* they are arranged as a C-ordered multidimensional
array - i.e., the last index changes the fastest as you walk through
the array elements, and the next-to-last chenges the next fastest, and
so on.

 Please confirm if my understanding is correct.  There are 2 cases to
 consider.  There are plain old dense arrays, which will be contiguous, and
 there are array 'views' or 'slices'.  The views will not be contiguous.
 However, in all cases, the strides between elements in one dimension are
 constant.  This last point is key.  As long as the stride in the last
 dimension is a constant, a fast iterator is possible.

 I put together a small test, which seems to work for the cases I tried.
 The idea is to use the general iterator (via PyArray_IterAllButAxis), but
 iteration over the last dimension is done with a c-style pointer.

This is not an unreasonable approach, but it can suffer badly if the
last dimension is small, for example if you have an array of shape
(1000,1000,2). Such arrays can arise for example from working with
complex numbers as pairs of real numbers.

Where is the acceleration coming from in this code? You can walk
through a multidimensional array in pure C, just incrementing pointers
as needed, without too much difficulty.

If you want to save bookkeeping overhead, you can partially flatten
the array: if your last dimension has stride 7 and length 11, and your
second-to-last dimension has stride 77, you can flatten (using
reshape, in python, to get a view) the last two dimensions of the
array and use a nice quick iterator on the combined dimension. For a
C contiguous array, this means you just walk through the whole array
with a single layer of looping.

I suspect that the place you will see big slowdowns is where data is
accessed in an order that doesn't make sense from a cache point of
view. If four-byte chunks of data are spaced every eight bytes, then
the processor spends twice as much time waiting for cache loads as if
they are spaced every four bytes. And most numpy tasks are almost
certainly memory-bound - all the ufuncs I can think of almost
certainly are (arctan of a double almost certainly takes less time
than loading and storing a double). If you walk through an array in
the wrong order - changing the big strides fastest and the small
strides slowest - you'll almost certainly have to shuffle the whole
array through cache eight times or more. This is even ignoring cache
misses. (To cut down on cache misses you can use the GCC prefetch
instructions, or god-knows-what equivalent in other compilers.)

I seem to recall that numpy already reorders its walks through arrays
within ufuncs - does it do so with cache in mind? For that matter, is
information on the CPU's caches available to numpy?

Anne

P.S. Here is code to flatten an array as much as possible without
changing the order of flatiter:

def flatteniter(A):
i=0
while ilen(A.shape)-1:
if A.shape[i+1]*A.strides[i+1]==A.strides[i]:
newshape = A.shape[:i]+(A.shape[i]*A.shape[i+1],)+A.shape[i+2:]
A = A.reshape(newshape)
else:
i+=1
return A

This is a quick hack, as you can see I was slack about docstrings and
whatnot. But routinely running it over every array before applying
flatiter wouldn't hurt anything (though for all I know flatiter may do
this already). -A
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fast iteration (I think I've got it)

2008-01-01 Thread Neal Becker
Thank you for the response.  I'm afraid I haven't explained what I'm doing.

I have a lot of c++ code written in a generic c++ interface style.  (The
code is for signal processing, but that is irrelevant). 

The code is generic for data types as well as container types.

To accomplish this, the interface uses the boost::range interface concept. 
An example of using this interface:

templatetypename in_t
void F (in_t const in) {
  typename boost::range_const_iteratorin_t::type i = boost::begin (in);
  for (; i != boost::end (in); ++i)
do_something_with (*i);
}

In the above, in_t is a container type.  It could be std::vectorint, for
example.

The concept has:
  range_iteratorin_t::type and range_const_iteratorin_t::type are the
iterator types for the range
  begin(in) gives and iterator pointing to the beginning
  ++i increments the iterator
  *i derefs the iterator

This allows writing functions that work with different container types.  For
example, std::vector, boost::ublas::vector.

I'm trying to make this work with numpy.

To do this, I'm using boost::iterator_facade to create appropriate
iterators.  In the simplest case, this is just a wrapper around
PyArrayIterObject*.

Using this directly results in code equivalent to:
* uses PyArray_IterNew to create the iterator,
* PyArray_ITER_NEXT to increment the iterator,
* it-dataptr for deref

This will be slower than necessary.

So, I hope this explains the motivation behind the plan.  I'm hoping that I
can use PyArray_IterAllButAxis to iterate over arbitrary numpy arrays, but
with the inner loop using dptr += stride/sizeof(T) to speed the access.  I
believe that all numpy arrays have a constant stride over any one
dimension, which would allow this to work.

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion