Re: [Numpy-discussion] NumPy re-factoring project

2010-06-15 Thread Sturla Molden
A very radical solution would be to get rid of all C, and go for a pure 
Python solution. NumPy could build up a text string with OpenCL code on 
the fly, and use the OpenCL driver as a JIT compiler for fast array 
expressions. Most GPUs and CPUs will support OpenCL, and thus there will 
be no need for a compiled language like C or Fortran for fast 
computation in the near future.
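
For illustration, a minimal sketch of the idea, assuming the pyopencl binding
(any OpenCL binding would do); the kernel source is an ordinary Python string
assembled at runtime and JIT-compiled by the driver:

-- example code -
import numpy as np
import pyopencl as cl

expr = "sin(a[i]) * 2.0f"   # imagine this generated from an array expression
src = """
__kernel void compute(__global const float *a, __global float *out)
{
    int i = get_global_id(0);
    out[i] = %s;
}
""" % expr

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
a = np.linspace(0, 1, 1000000).astype(np.float32)
out = np.empty_like(a)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

prg = cl.Program(ctx, src).build()   # the OpenCL driver JIT-compiles the string
prg.compute(queue, a.shape, None, a_buf, out_buf)
cl.enqueue_copy(queue, out, out_buf)
-- end code -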

Sturla
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-15 Thread Sturla Molden
On 15.06.2010 18:30, Sturla Molden wrote:
 A very radical solution would be to get rid of all C, and go for a 
 pure Python solution. NumPy could build up a text string with OpenCL 
 code on the fly, and use the OpenCL driver as a JIT compiler for 
 fast array expressions. Most GPUs and CPUs will support OpenCL, and 
 thus there will be no need for a compiled language like C or Fortran 
 for fast computation in the near future.

Paradoxically, with OpenCL, Python is going to be better for fast 
numerical computation than C or Fortran, because Python is better at 
generating and manipulating dynamic text.

Sturla
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-13 Thread Pauli Virtanen
Sun, 13 Jun 2010 06:54:29 +0200, Sturla Molden wrote:
[clip: memory management only in the interface]

You forgot views: if memory management is done in the interface layer, it 
must also make sure that the memory pointed to by a view is never moved 
around, and not freed before all the views are freed.

This probably cannot be done with existing Python objects, since IIRC, 
they always own their own memory. So one should refactor the memory 
management out of ndarray on the interface side.

The core probably also needs to decide when to create views (indexing), 
so a callback to the interface is needed.

 Second: If we cannot figure out how much to allocate before starting to
 use C (very unlikely), 

The point is that while you in principle can figure it out, you don't 
*want to* do this work in the interface layer. Computing the output size 
is just code duplication.

 we can always use a callback to Python. Or we can

This essentially amounts to having a

NpyArray_resize

in the interface layer, which the core can call.

Here, however, one has the requirement that all arrays needed by core are 
always passed in to it, including temporary arrays. Sounds a bit like 
Fortran 77 -- it's not going to be bad, if the number of temporaries is 
not very large.

If the core needs to allocate arrays by itself, this will inevitably lead 
to memory management that cannot be avoided by having the allocation done in 
the interface layer, as the garbage collector must not mistake a 
temporary array referenced only by the core as unused.

An alternative here is to have an allocation routine

NpyArray_new_tmp
NpyArray_del_tmp

whose allocated memory is released automatically, if left dangling, after 
the call to the core is completed. This would be useful for temporary 
arrays, although then one would have to be careful not ever to return 
memory allocated by this back to the caller.
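
To make the convention concrete, here is a rough Python model of the
NpyArray_new_tmp / NpyArray_del_tmp idea (illustration only -- the real
thing would live in C):

-- example code -
class TmpArena(object):
    def __init__(self):
        self._live = []

    def new_tmp(self, nbytes):
        buf = bytearray(nbytes)     # workspace handed to the core
        self._live.append(buf)
        return buf

    def del_tmp(self, buf):         # the core may free a tmp explicitly...
        self._live.remove(buf)

    def release_dangling(self):     # ...anything left dangling dies here
        del self._live[:]

def call_core(core_func, *args):
    arena = TmpArena()
    try:
        # the core must never hand arena memory back to the caller
        return core_func(arena, *args)
    finally:
        arena.release_dangling()
-- end code -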

 have two C functions, the first determining the amount of allocation,
 the second doing the computation.

That sounds like a PITA, think about LAPACK.

-- 
Pauli Virtanen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Sebastian Walter
On Thu, Jun 10, 2010 at 6:48 PM, Sturla Molden stu...@molden.no wrote:

 I have a few radical suggestions:

 1. Use ctypes as glue to the core DLL, so we can completely forget about
 refcounts and similar mess. Why put manual reference counting and error
 handling in the core? It's stupid.

I totally agree, I thought that the refactoring was supposed to provide
simple data structures and simple algorithms to perform the C equivalents of
sin, cos, exp, +, -, *, /, dot, inv, ...

Let me explain with an example what I expected:

In the core C numpy library there would be a new numpy_array struct
with attributes

 numpy_array->buffer
 numpy_array->dtype
 numpy_array->ndim
 numpy_array->shape
 numpy_array->strides
 numpy_array->owndata
etc.

that replaces the current PyArrayObject, which contains Python C API stuff:

typedef struct PyArrayObject {
PyObject_HEAD
char *data; /* pointer to raw data buffer */
int nd; /* number of dimensions, also called ndim */
npy_intp *dimensions;   /* size in each dimension */
npy_intp *strides;  /* bytes to jump to get to the
   next element in each dimension */
PyObject *base; /* This object should be decref'd
   upon deletion of array */
/* For views it points to the original array */
/* For creation from buffer object it points
   to an object that should be decref'd on
   deletion */
/* For UPDATEIFCOPY flag this is an array
   to-be-updated upon deletion of this one */
PyArray_Descr *descr;   /* Pointer to type structure */
int flags;  /* Flags describing array -- see below*/
PyObject *weakreflist;  /* For weakreferences */
void *buffer_info;  /* Data used by the buffer interface */
} PyArrayObject;
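
For what it's worth, such a struct could be mirrored on the Python side with
ctypes, which is how it would plug into the ctypes-glue idea from point 1
(the field types below are assumptions):

-- example code -
import ctypes

class numpy_array(ctypes.Structure):
    _fields_ = [
        ("buffer",  ctypes.c_void_p),                   # raw data pointer
        ("dtype",   ctypes.c_int),                      # element-type code
        ("ndim",    ctypes.c_int),                      # number of dimensions
        ("shape",   ctypes.POINTER(ctypes.c_ssize_t)),  # size per dimension
        ("strides", ctypes.POINTER(ctypes.c_ssize_t)),  # byte step per dimension
        ("owndata", ctypes.c_int),                      # bookkeeping flag
    ]
-- end code -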



Example:
--

If one calls the following Python code
x = numpy.zeros((N,M,K), dtype=float)
the memory allocation would be done on the Python side.

Calling a ufunc like
y = numpy.sin(x)
would first allocate the memory for y on the Python side
and then call a C function a la
numpy_unary_ufunc( double (*fcn_ptr)(double), numpy_array *x, numpy_array *y)

If y is already allocated, one would call
y = numpy.sin(x, out = y)

Similarly z = x*y
would first allocate the memory for z and then call a C function a la
numpy_binary_ufunc( double (*fcn_ptr)(double, double), numpy_array *x,
numpy_array *y, numpy_array *z)


similarly other functions like dot:
z = dot(x,y, out = z)

would simply call a C function a la
numpy_dot( numpy_array *x, numpy_array *y, numpy_array *z)


If one wants to use numpy functions on the C side only, one would use
the numpy_array struct manually.
I.e. one has to do the memory management oneself in C, which is
perfectly OK since one is just interested in using
the algorithms.
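
A toy Python model of this split -- the wrapper allocates, and a stand-in
for the C routine numpy_unary_ufunc only computes into preallocated memory
(names are illustrative only):

-- example code -
import numpy as np

def core_unary_ufunc(fcn, x, out):
    for i in range(x.size):           # element-wise apply, no allocation
        out.flat[i] = fcn(x.flat[i])

def sin(x, out=None):
    if out is None:
        out = np.empty_like(x)        # allocation on the Python side
    core_unary_ufunc(np.sin, x, out)
    return out

y = sin(np.zeros((2, 3)))             # fresh output allocated by the wrapper
sin(y, out=y)                         # reuse an existing buffer
-- end code -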


 2. The core should be a plain DLL, loadable with ctypes. (I know David
 Cournapeau and Robert Kern are going to hate this.) But if Python can have a
 custom loader for .pyd files, so can NumPy for its core DLL. For ctypes we
 just need to specify a fully qualified path to the DLL, which can be read
 from a config file or whatever.

That would probably be the easiest way.


 3. ctypes will take care of all the mess regarding the GIL. Again there is
 no need to re-invent the wheel. ctypes.CDLL releases the GIL when calling
 into C, and re-acquires before making callbacks to Python. ctypes.PyDLL
 keeps the GIL for legacy library code that is not threadsafe.
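
For reference, a minimal sketch of the ctypes behaviour described here, with
libm standing in for the hypothetical core DLL (POSIX assumed):

-- example code -
import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))  # CDLL releases the GIL per call
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]
print(libm.cos(0.0))   # 1.0 -- computed with the GIL released
# ctypes.PyDLL would keep the GIL instead, for non-threadsafe code.
-- end code -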

I have not much experience with parallelizing codes.
If one implements the algorithms side-effect free, it should be quite
easy to parallelize them on the C level, no?


 ctypes will also make porting to other Python implementations (or
 even other languages: Ruby, JavaScript) easier. Not to mention that it will
 make NumPy impervious to changes in the Python C API.

 4. Write the core in C++ or Fortran (95/2003), not ANSI C (aka C89). Use
 std::vector instead of malloc/free for C++, and possibly templates for
 generics on various dtypes.

The only reason I see for C++ is the possibility to use template
metaprogramming, which is very ill-designed. I'd rather see some simple
code preprocessing on C code than C++ template metaprogramming. And it
should be possible to avoid mallocs in the C code, no?



 5. Allow OpenMP pragmas in the core. If arrays are above a certain size, it
 should switch to multi-threading.

 Sturla


Just my 2 cents,

Sebastian


 On 10.06.2010 15:26, Charles R Harris wrote:

 A few thoughts came to mind while reading the initial writeup.

 1) How is the GIL handled in the callbacks.
 2) What about error handling? That is tricky to get right, especially in C
 and with reference counting.
 3) Is there a 

Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread David Cournapeau
On Sat, Jun 12, 2010 at 10:27 PM, Sebastian Walter
sebastian.wal...@gmail.com wrote:
 On Thu, Jun 10, 2010 at 6:48 PM, Sturla Molden stu...@molden.no wrote:

 I have a few radical suggestions:

 1. Use ctypes as glue to the core DLL, so we can completely forget about
 refcounts and similar mess. Why put manual reference counting and error
 handling in the core? It's stupid.

 I totally agree, I thought that the refactoring was supposed to provide
 simple data structures and simple algorithms to perform the C equivalents of
 sin, cos, exp, +, -, *, /, dot, inv, ...

 Let me explain with an example what I expected:

 In the core C numpy library there would be a new numpy_array struct
 with attributes

  numpy_array->buffer
  numpy_array->dtype
  numpy_array->ndim
  numpy_array->shape
  numpy_array->strides
  numpy_array->owndata
 etc.

 that replaces the current PyArrayObject, which contains Python C API stuff:

 typedef struct PyArrayObject {
        PyObject_HEAD
        char *data;             /* pointer to raw data buffer */
        int nd;                 /* number of dimensions, also called ndim */
        npy_intp *dimensions;       /* size in each dimension */
        npy_intp *strides;          /* bytes to jump to get to the
                                   next element in each dimension */
        PyObject *base;         /* This object should be decref'd
                                   upon deletion of array */
                                /* For views it points to the original array */
                                /* For creation from buffer object it points
                                    to an object that should be decref'd on
                                   deletion */
                                /* For UPDATEIFCOPY flag this is an array
                                   to-be-updated upon deletion of this one */
        PyArray_Descr *descr;   /* Pointer to type structure */
        int flags;              /* Flags describing array -- see below*/
        PyObject *weakreflist;  /* For weakreferences */
        void *buffer_info;      /* Data used by the buffer interface */
 } PyArrayObject;



 Example:
 --

 If one calls the following Python code
 x = numpy.zeros((N,M,K), dtype=float)
 the memory allocation would be done on the Python side.

 Calling a ufunc like
 y = numpy.sin(x)
 would first allocate the memory for y on the Python side
 and then call a C function a la
 numpy_unary_ufunc( double (*fcn_ptr)(double), numpy_array *x, numpy_array *y)

 If y is already allocated, one would call
 y = numpy.sin(x, out = y)

 Similarly z = x*y
 would first allocate the memory for z and then call a C function a la
 numpy_binary_ufunc( double (*fcn_ptr)(double, double), numpy_array *x,
 numpy_array *y, numpy_array *z)


 similarly other functions like dot:
 z = dot(x,y, out = z)

 would simply call a C function a la
 numpy_dot( numpy_array *x, numpy_array *y, numpy_array *z)


 If one wants to use numpy functions on the C side only, one would use
 the numpy_array struct manually.
 I.e. one has to do the memory management oneself in C. Which is
 perfectly ok since one is just interested in using
 the algorithms.

Anything non trivial will require memory allocation and object
ownership conventions. If the goal is interoperation with other
languages and vm, you may want to use something else than plain
malloc, to interact better with the allocation strategies of the host
platform (reference counting, garbage collection).


 The only reason I see for C++ is the possibility to use meta programming which
 is very ill-designed. I'd rather like to see some simple code
 preprocessing on C code than
 C++ template meta programming.

I don't think anyone is seriously considering changing languages.
Especially if interoperation is desired, C is miles ahead of C++
anyway.

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Sturla Molden
On 12.06.2010 15:57, David Cournapeau wrote:
 Anything non trivial will require memory allocation and object
 ownership conventions. If the goal is interoperation with other
 languages and vm, you may want to use something else than plain
 malloc, to interact better with the allocation strategies of the host
 platform (reference counting, garbage collection).



We can allocate memory on the Python side e.g. using Python's bytearray, 
ctypes array, or a Python string.

We don't need the convention of ownership if we use a Python object 
to store the data buffer (e.g. bytearray). It owns itself, and commits 
suicide when garbage collected.
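
A minimal sketch of that idea, with a bytearray backing an ndarray view:

-- example code -
import numpy as np

buf = bytearray(8 * 100)                   # room for 100 float64 values
x = np.frombuffer(buf, dtype=np.float64)   # writable view into buf
x[:] = 1.0
del buf                                    # the view still keeps the data alive
print(x.sum())                             # 100.0; freed when the last view dies
-- end code -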

Sturla
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Christopher Barker
David Cournapeau wrote:
 In the core C numpy library there would be new  numpy_array struct
 with attributes

  numpy_array->buffer

 Anything non trivial will require memory allocation and object
 ownership conventions.

I totally agree -- I've been thinking for a while about a core array 
data structure that you could use from C/C++, and would interact well 
with numpy and Python -- it would be even better if it WAS numpy.

I was thinking that at the root of it would be a data_block object 
(the buffer in the above), that would have a reference counting system. 
It would be its own system, but hopefully be able to link to Python's 
easily when used with Python.

we're talking C here, so trying to have full-fledged memory management 
would be practically inventing another language, but if just the 
data blocks of memory could be managed, that would allow things like 
sub-arrays, and slices, and views to work well.

I was partly inspired by how useless C++ std::valarrays appeared to be -- 
a slice of a valarray is not the same as a valarray, because the 
original valarray is managing the memory (and I may have that wrong) -- 
what this said to me is that there needs to be a central system managing 
the memory blocks, while the user code can manage the higher level objects.

If the C core were left with absolutely no memory management, then I 
think we'd have a next to useless system in raw C.


-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Dag Sverre Seljebotn
Christopher Barker wrote:
 David Cournapeau wrote:
 In the core C numpy library there would be new  numpy_array struct
 with attributes

  numpy_array->buffer

 Anything non trivial will require memory allocation and object
 ownership conventions.

 I totally agree -- I've been thinking for a while about a core array
 data structure that you could use from C/C++, and would interact well
 with numpy and Python -- it would be even better if it WAS numpy.

 I was thinking that at the root of it would be a data_block object
 (the buffer in the above), that would have a reference counting system.
 It would be its own system, but hopefully be able to link to Python's
 easily when used with Python.

I think taking PEP 3118, stripping out the Python-specific things, and then
adding memory management conventions, would be a good starting point.

Just a simple header file/struct definition and specification, which
could in time become a de facto way of exporting multidimensional array
data between C libraries, between Fortran and C and so on (Kurt Smith's
fwrap could easily be adapted to support it). The C-NumPy would then be a
library on top of this spec (mainly ufuncs operating on such structs).

The memory management conventions need some thought, as you say, because
of slices -- but a central memory allocator is not good enough because one
would often be accessing memory that's allocated with other purposes in
mind (and should not be deallocated, or deallocated in a special way). So
refcounts + deallocator callback seems reasonable.

(Not that I'm involved in this, just my 2 cents.)

Dag Sverre

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Charles R Harris
On Sat, Jun 12, 2010 at 11:38 AM, Dag Sverre Seljebotn 
da...@student.matnat.uio.no wrote:

 Christopher Barker wrote:
  David Cournapeau wrote:
  In the core C numpy library there would be new  numpy_array struct
  with attributes
 
    numpy_array->buffer
 
  Anything non trivial will require memory allocation and object
  ownership conventions.
 
  I totally agree -- I've been thinking for a while about a core array
  data structure that you could use from C/C++, and would interact well
  with numpy and Python -- it would be even better if it WAS numpy.
 
  I was thinking that at the root of it would be a data_block object
  (the buffer in the above), that would have a reference counting system.
  It would be its own system, but hopefully be able to link to Python's
  easily when used with Python.

 I think taking PEP 3118, strip out the Python-specific things, and then
 add memory management conventions, would be a good starting point.

 Simply a simple header file/struct definition and specification, which
 could in time become a de facto way of exporting multidimensional array
 data between C libraries, between Fortran and C and so on (Kurt Smith's
 fwrap could easily be adapted to support it). The C-NumPy would then be a
 library on top of this spec (mainly ufuncs operating on such structs).

 The memory management conventions needs some thought, as you say, because
 of slices -- but a central memory allocator is not good enough because one
 would often be accessing memory that's allocated with other purposes in
 mind (and should not be deallocated, or deallocated in a special way). So
 refcounts + deallocator callback seems reasonable.

 (Not that I'm involved in this, just my 2 cents.)


This is more the way I see things, except I would divide the bottom layer
into two parts, views and memory. The memory can come from many places --
memmaps, user supplied buffers, etc. -- but we should provide a simple
reference counted allocator for the default. The views correspond more to
PEP 3118 and simply provide data types, dimensions, and strides, much as
arrays do now. However, I would confine the data types to those available in
C with a bit of extra information as to precision. Object arrays
would be a special case of pointer arrays (void pointer arrays?) and
structured arrays/Unicode might be a special case of char arrays. The more
complicated dtypes would then be built on top of those. Some things just
won't be portable, pointers in particular, but such is life.
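
A rough Python model of this memory/view split, folding in the refcount +
deallocator-callback convention from the previous message (all names are
hypothetical):

-- example code -
class DataBlock(object):
    def __init__(self, nbytes, free_cb=None):
        self.mem = bytearray(nbytes)  # default allocator; could be a memmap etc.
        self.free_cb = free_cb        # how to release borrowed memory, if at all
        self.refcount = 0

    def incref(self):
        self.refcount += 1

    def decref(self):
        self.refcount -= 1
        if self.refcount == 0 and self.free_cb is not None:
            self.free_cb(self.mem)

class View(object):
    def __init__(self, block, dtype, shape, strides, offset=0):
        block.incref()                # a view pins its memory block
        self.block = block
        self.dtype, self.shape = dtype, shape
        self.strides, self.offset = strides, offset

    def close(self):
        self.block.decref()
-- end code -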

As to languages, I think we should stay with C. C++ has much to offer for
this sort of thing but would be quite a big jump and maybe not as universal
as C. FORTRAN is harder to come by than C and older versions didn't have
such things as unsigned integers. I really haven't used FORTRAN since the 77
version, so haven't much idea what the modern version looks like, but I do
suspect we have more C programmers than FORTRAN programmers, and adding a
language translation on top of a design refactoring is just going to
complicate things.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Charles R Harris
On Sat, Jun 12, 2010 at 1:35 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Sat, Jun 12, 2010 at 11:38 AM, Dag Sverre Seljebotn 
 da...@student.matnat.uio.no wrote:

 Christopher Barker wrote:
  David Cournapeau wrote:
  In the core C numpy library there would be new  numpy_array struct
  with attributes
 
    numpy_array->buffer
 
  Anything non trivial will require memory allocation and object
  ownership conventions.
 
  I totally agree -- I've been thinking for a while about a core array
  data structure that you could use from C/C++, and would interact well
  with numpy and Python -- it would be even better if it WAS numpy.
 
  I was thinking that at the root of it would be a data_block object
  (the buffer in the above), that would have a reference counting system.
  It would be its own system, but hopefully be able to link to Python's
  easily when used with Python.

 I think taking PEP 3118, strip out the Python-specific things, and then
 add memory management conventions, would be a good starting point.

 Simply a simple header file/struct definition and specification, which
 could in time become a de facto way of exporting multidimensional array
 data between C libraries, between Fortran and C and so on (Kurt Smith's
 fwrap could easily be adapted to support it). The C-NumPy would then be a
 library on top of this spec (mainly ufuncs operating on such structs).

 The memory management conventions needs some thought, as you say, because
 of slices -- but a central memory allocator is not good enough because one
 would often be accessing memory that's allocated with other purposes in
 mind (and should not be deallocated, or deallocated in a special way). So
 refcounts + deallocator callback seems reasonable.

 (Not that I'm involved in this, just my 2 cents.)


 This is more the way I see things, except I would divide the bottom layer
 into two parts, views and memory. The memory can come from many places --
 memmaps, user supplied buffers, etc. -- but we should provide a simple
 reference counted allocator for the default. The views correspond more to
 PEP 3118 and simply provide data types, dimensions, and strides, much as
 arrays do now. However, I would confine the data types to those available in
 C with a bit of extra information as to precision. Object arrays
 would be a special case of pointer arrays (void pointer arrays?) and
 structured arrays/Unicode might be a special case of char arrays. The more
 complicated dtypes would then be built on top of those. Some things just
 won't be portable, pointers in particular, but such is life.

 As to languages, I think we should stay with C. C++ has much to offer for
 this sort of thing but would be quite a big jump and maybe not as universal
 as C. FORTRAN is harder to come by than C and older versions didn't have
 such things as unsigned integers. I really haven't used FORTRAN since the 77
 version, so haven't much idea what the modern version looks like, but I do
 suspect we have more C programmers than FORTRAN programmers, and adding a
 language translation on top of a design refactoring is just going to
 complicate things.


Oh, and we should have iterators for the views. So the base would be memory
+ views + iterators.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Francesc Alted
2010/6/12 Charles R Harris charlesr.har...@gmail.com


 This is more the way I see things, except I would divide the bottom layer
 into two parts, views and memory. The memory can come from many places --
 memmaps, user supplied buffers, etc. -- but we should provide a simple
 reference counted allocator for the default. The views correspond more to
 PEP 3118 and simply provide data types, dimensions, and strides, much as
 arrays do now. However, I would confine the data types to those available in
 C with a bit extra information as to precision, because.  Object arrays
 would be a special case of pointer arrays (void pointer arrays?) and
 structured arrays/Unicode might be a special case of char arrays. The more
 complicated dtypes would then be built on top of those. Some things just
 won't be portable, pointers in particular, but such is life.


Well, if the data block of ndarray is going to be refactored that much, it
might be an interesting idea for it to support transparent compression.
As I see things, it is clear that the gap between CPU power and memory
bandwidth is widening and that trend will continue for a long time.  This means
that having the possibility to deal with compressed data transparently
would result not only in less memory consumption, but also in improved
memory access.

Getting improved memory access by using compression may sound a bit crazy,
but it is not.  For example, the Blosc [1] project (which is undergoing the
latest testing steps before becoming stable) is showing signs that, by
using multimedia extensions in CPUs and multi-threading techniques, this is
becoming possible (as shown in [2]).

Of course, this optimization may not make much sense for accelerating
computations in NumPy if it cannot get rid of the temporary variables
problem (which I hope can be resolved some day), but still, it could be
useful for improving performance of other features (like copying, broadcasting
or performing selections more efficiently).

[1] http://blosc.pytables.org/
[2] http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks

Just another wish into the bag ;-)

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Benjamin Root
If I could, I would like to throw out another possible feature that might
need to be taken into consideration for designing the implementation of
numpy arrays.

One thing I found somewhat lacking -- if that is the right term -- is a way
to convolve a numpy array with an arbitrary windowing element.  I managed to
do some fancy indexing approaches, but they never did seem efficient.

So -- if it is at all possible -- I would like to see a design that not only
vectorizes well, but could also be used efficiently in arbitrary (or
fancy) indexing schemes as well.

Maybe what I am asking is technically infeasible... I don't know.  Just
something to consider.

I'll get back to doing documentation...
Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Sebastian Walter
On Sat, Jun 12, 2010 at 3:57 PM, David Cournapeau courn...@gmail.com wrote:
 On Sat, Jun 12, 2010 at 10:27 PM, Sebastian Walter
 sebastian.wal...@gmail.com wrote:
 On Thu, Jun 10, 2010 at 6:48 PM, Sturla Molden stu...@molden.no wrote:

 I have a few radical suggestions:

 1. Use ctypes as glue to the core DLL, so we can completely forget about
 refcounts and similar mess. Why put manual reference counting and error
 handling in the core? It's stupid.

 I totally agree, I thought that the refactoring was supposed to provide
 simple data structures and simple algorithms to perform the C equivalents of
 sin, cos, exp, +, -, *, /, dot, inv, ...

 Let me explain with an example what I expected:

 In the core C numpy library there would be new  numpy_array struct
 with attributes

  numpy_array->buffer
  numpy_array->dtype
  numpy_array->ndim
  numpy_array->shape
  numpy_array->strides
  numpy_array->owndata
 etc.

 that replaces the current  PyArrayObject which contains Python C API stuff:

 typedef struct PyArrayObject {
        PyObject_HEAD
        char *data;             /* pointer to raw data buffer */
        int nd;                 /* number of dimensions, also called ndim */
        npy_intp *dimensions;       /* size in each dimension */
        npy_intp *strides;          /* bytes to jump to get to the
                                   next element in each dimension */
        PyObject *base;         /* This object should be decref'd
                                   upon deletion of array */
                                /* For views it points to the original array 
 */
                                /* For creation from buffer object it points
                                    to an object that should be decref'd on
                                   deletion */
                                /* For UPDATEIFCOPY flag this is an array
                                   to-be-updated upon deletion of this one */
        PyArray_Descr *descr;   /* Pointer to type structure */
        int flags;              /* Flags describing array -- see below*/
        PyObject *weakreflist;  /* For weakreferences */
        void *buffer_info;      /* Data used by the buffer interface */
 } PyArrayObject;



 Example:
 --

 If one calls the following Python code
 x = numpy.zeros((N,M,K), dtype=float)
 the memory allocation would be done on the Python side.

 Calling a ufunc like
 y = numpy.sin(x)
 would first allocate the memory for y on the Python side
 and then call a C function a la
 numpy_unary_ufunc( double (*fcn_ptr)(double), numpy_array *x, numpy_array *y)

 If y is already allocated, one would call
 y = numpy.sin(x, out = y)

 Similarly z = x*y
 would first allocate the memory for z and then call a C function a la
 numpy_binary_ufunc( double (*fcn_ptr)(double, double), numpy_array *x,
 numpy_array *y, numpy_array *z)


 similarly other functions like dot:
 z = dot(x,y, out = z)

 would simply call a C function a la
 numpy_dot( numpy_array *x, numpy_array *y, numpy_array *z)


 If one wants to use numpy functions on the C side only, one would use
 the numpy_array struct manually.
 I.e. one has to do the memory management oneself in C. Which is
 perfectly ok since one is just interested in using
 the algorithms.

 Anything non trivial will require memory allocation and object
 ownership conventions. If the goal is interoperation with other
 languages and vm, you may want to use something else than plain
 malloc, to interact better with the allocation strategies of the host
 platform (reference counting, garbage collection).

I'm just saying that the host platform could do the memory
management and not libnumpy.
I.e. libnumpy could be just a collection of algorithms.
Reimplementing half of the Python C API somehow doesn't feel right to me.

Those users who like to use C++ could write a class with methods that
internally call the
libnumpy functions:

-- example code -
class Array {
    numpy_array *_array;

public:
    const Array operator+(const Array &rhs) const {
        Array retval( /* ... arguments for the right type and
                         dimensions of the output ... */ );
        numpy_add(this->_array, rhs._array, retval._array);
        return retval;
    }
};
-- end code -

I.e. let C++ do all the memory management and type inference but the
numpy core C API does the number crunching.
In other languages (Python, Ruby, R,  whatever) one would implement a
similar class.

I cannot speak for others, but something along these lines is what I'd
love to see, since it would make it
relatively easy to use numpy functionality even in existing C/C++/R/Ruby codes.


Sebastian





 The only reason I see for C++ is the possibility to use meta programming 
 which
 is very ill-designed. I'd rather like to see some simple code
 preprocessing on C code than
 C++ template meta programming.

 I don't think anyone is seriously considering changing languages.
 Especially if interoperation is desired, C is miles 

Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Dag Sverre Seljebotn
Charles Harris wrote:
 On Sat, Jun 12, 2010 at 11:38 AM, Dag Sverre Seljebotn 
 da...@student.matnat.uio.no wrote:

 Christopher Barker wrote:
  David Cournapeau wrote:
  In the core C numpy library there would be new  numpy_array struct
  with attributes
 
    numpy_array->buffer
 
  Anything non trivial will require memory allocation and object
  ownership conventions.
 
  I totally agree -- I've been thinking for a while about a core array
  data structure that you could use from C/C++, and would interact well
  with numpy and Python -- it would be even better if it WAS numpy.
 
  I was thinking that at the root of it would be a data_block object
  (the buffer in the above), that would have a reference counting
 system.
  It would be its own system, but hopefully be able to link to Python's
  easily when used with Python.

 I think taking PEP 3118, strip out the Python-specific things, and then
 add memory management conventions, would be a good starting point.

 Simply a simple header file/struct definition and specification, which
 could in time become a de facto way of exporting multidimensional array
 data between C libraries, between Fortran and C and so on (Kurt Smith's
 fwrap could easily be adapted to support it). The C-NumPy would then be
 a
 library on top of this spec (mainly ufuncs operating on such structs).

 The memory management conventions needs some thought, as you say,
 because
 of slices -- but a central memory allocator is not good enough because
 one
 would often be accessing memory that's allocated with other purposes in
 mind (and should not be deallocated, or deallocated in a special way).
 So
 refcounts + deallocator callback seems reasonable.

 (Not that I'm involved in this, just my 2 cents.)


 This is more the way I see things, except I would divide the bottom layer
 into two parts, views and memory. The memory can come from many places --
 memmaps, user supplied buffers, etc. -- but we should provide a simple
 reference counted allocator for the default. The views correspond more to
 PEP 3118 and simply provide data types, dimensions, and strides, much as
 arrays do now. However, I would confine the data types to those available
 in
  C with a bit of extra information as to precision. Object arrays
 would be a special case of pointer arrays (void pointer arrays?) and
 structured arrays/Unicode might be a special case of char arrays. The more
 complicated dtypes would then be built on top of those. Some things just
 won't be portable, pointers in particular, but such is life.

 As to languages, I think we should stay with C. C++ has much to offer for
 this sort of thing but would be quite a big jump and maybe not as
 universal
 as C. FORTRAN is harder to come by than C and older versions didn't have
 such things as unsigned integers. I really haven't used FORTRAN since the
 77
 version, so haven't much idea what the modern version looks like, but I do
 suspect we have more C programmers than FORTRAN programmers, and adding a
 language translation on top of a design refactoring is just going to
 complicate things.

I'm not sure how Fortran got into this, but if it was from what I wrote: I
certainly didn't suggest any part of NumPy is written in Fortran. Sorry
for causing confusion.

What I meant: an ndarray.h which basically just contains the
minimum C version of PEP 3118 for passing around and accessing arrays,
without any runtime dependencies -- of course, writing code using it will
be much easier when using the C-NumPy runtime library, but for simply
exporting data from an existing library one could do without the runtime
dependency.

(To be more precise about fwrap: In fwrap there's a need to pass
multidimensional, flexible-size array data back and forth between
(autogenerated) Fortran and (autogenerated) C. It currently defines its
own structs (or extra arguments, but boils down to the same thing...) for
this purpose, but if an ndarray.h could create a standard for passing
array data then that would be a natural choice instead.)

Dag Sverre

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Charles R Harris
On Sat, Jun 12, 2010 at 2:56 PM, Dag Sverre Seljebotn 
da...@student.matnat.uio.no wrote:

 Charles Harris wrote:
  On Sat, Jun 12, 2010 at 11:38 AM, Dag Sverre Seljebotn 
  da...@student.matnat.uio.no wrote:
 
  Christopher Barker wrote:
   David Cournapeau wrote:
   In the core C numpy library there would be new  numpy_array struct
   with attributes
  
    numpy_array->buffer
  
   Anything non trivial will require memory allocation and object
   ownership conventions.
  
   I totally agree -- I've been thinking for a while about a core array
   data structure that you could use from C/C++, and would interact well
   with numpy and Python -- it would be even better if it WAS numpy.
  
   I was thinking that at the root of it would be a data_block object
   (the buffer in the above), that would have a reference counting
  system.
   It would be its own system, but hopefully be able to link to Python's
   easily when used with Python.
 
  I think taking PEP 3118, strip out the Python-specific things, and then
  add memory management conventions, would be a good starting point.
 
  Simply a simple header file/struct definition and specification, which
  could in time become a de facto way of exporting multidimensional array
  data between C libraries, between Fortran and C and so on (Kurt Smith's
  fwrap could easily be adapted to support it). The C-NumPy would then be
  a
  library on top of this spec (mainly ufuncs operating on such structs).
 
  The memory management conventions needs some thought, as you say,
  because
  of slices -- but a central memory allocator is not good enough because
  one
  would often be accessing memory that's allocated with other purposes in
  mind (and should not be deallocated, or deallocated in a special way).
  So
  refcounts + deallocator callback seems reasonable.
 
  (Not that I'm involved in this, just my 2 cents.)
 
 
  This is more the way I see things, except I would divide the bottom layer
  into two parts, views and memory. The memory can come from many places --
  memmaps, user supplied buffers, etc. -- but we should provide a simple
  reference counted allocator for the default. The views correspond more to
  PEP 3118 and simply provide data types, dimensions, and strides, much as
  arrays do now. However, I would confine the data types to those available
  in
   C with a bit of extra information as to precision. Object arrays
  would be a special case of pointer arrays (void pointer arrays?) and
  structured arrays/Unicode might be a special case of char arrays. The
 more
  complicated dtypes would then be built on top of those. Some things just
  won't be portable, pointers in particular, but such is life.
 
  As to languages, I think we should stay with C. C++ has much to offer for
  this sort of thing but would be quite a big jump and maybe not as
  universal
  as C. FORTRAN is harder to come by than C and older versions didn't have
  such things as unsigned integers. I really haven't used FORTRAN since the
  77
  version, so haven't much idea what the modern version looks like, but I
 do
  suspect we have more C programmers than FORTRAN programmers, and adding a
  language translation on top of a design refactoring is just going to
  complicate things.

 I'm not sure how Fortran got into this, but if it was from what I wrote: I
 certainly didn't suggest any part of NumPy is written in Fortran. Sorry
 for causing confusion.


It was a suggestion by someone else way back in the beginning. I think
changing languages is a great way to get side tracked and spend the next
year just hacking about.


 What I meant: If there's an ndarray.h which basically just contains the
 minimum C version of PEP 3118 for passing around and accessing arrays.
 Without any runtime dependencies -- of course, writing code using it will
 be much easier when using the C-NumPy runtime library, but for simply
 exporting data from an existing library one could do without the runtime
 dependency.


Well, that would be just one bit. If we want to offer a *functioning* system
we have to implement quite a bit more.


 (To be more precise about fwrap: In fwrap there's a need to pass
 multidimensional, flexible-size array data back and forth between
 (autogenerated) Fortran and (autogenerated) C. It currently defines its
 own structs (or extra arguments, but boils down to the same thing...) for
 this purpose, but if an ndarray.h could create a standard for passing
 array data then that would be a natural choice instead.)


Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread David Cournapeau
On Sun, Jun 13, 2010 at 2:00 AM, Sturla Molden stu...@molden.no wrote:
 On 12.06.2010 15:57, David Cournapeau wrote:
 Anything non trivial will require memory allocation and object
 ownership conventions. If the goal is interoperation with other
 languages and vm, you may want to use something else than plain
 malloc, to interact better with the allocation strategies of the host
 platform (reference counting, garbage collection).



 We can allocate memory on the Python side e.g. using Python's bytearray,
 ctypes array, or a Python string.

 We don't need the convention of ownership if we use a Python object
 to store the data buffer (e.g. bytearray). It owns itself, and commits
 suicide when garbage collected.

But the point is to get rid of the python dependency, and if you don't
allow any api call to allocate memory, there is not much left to
implement in the core.

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Sturla Molden
On 13.06.2010 02:39, David Cournapeau wrote:

 But the point is to get rid of the python dependency, and if you don't
 allow any api call to allocate memory, there is not much left to
 implement in the core.



Memory allocation is platform dependent. A CPython version could use 
bytearray, other platforms could e.g. use arrays from a gc managed heap 
(IronPython, Jython). Because memory management is platform dependent, 
it does not naturally belong in the core.

Having ndarrays allocate buffers with malloc() and own buffers they've 
allocated complicates things terribly. The purpose of ownership is to 
know which ndarray is to call free() on the buffer. Why go through all 
that pain? It's just a duplication of Python's garbage collection. 
Re-inventing the wheel is stupid. Let buffers be Python objects that 
clean themselves up.

If NumPy does not allocate memory on its own, there will be no leaks 
due to errors in NumPy.

There is still work to do in the core, i.e. the computational loops in 
array operators, broadcasting, ufuncs, copying data between buffers, etc.

The C functions in the core would then be called with the output array 
already allocated.

Sturla
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread David Cournapeau
On Sun, Jun 13, 2010 at 11:39 AM, Sturla Molden stu...@molden.no wrote:

 If NumPy does not allocate memory on its own, there will be no leaks
 due to errors in NumPy.

 There is still work to do in the core, i.e. the computational loops in
 array operators, broadcasting, ufuncs, copying data between buffers, etc.

 The C functions in the core would then be called with the output array
 already allocated.

This only works in simple cases. What do you do when you don't know
the output size ? How do you deal with reallocation ? How do you deal
with ufunc and broadcasting buffering without allocating in the
library ? I don't see how that's possible. Before calling things
stupid, you better have a solution to those issues,

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion



Re: [Numpy-discussion] NumPy re-factoring project

2010-06-12 Thread Sturla Molden
On 13.06.2010 05:47, David Cournapeau wrote:

 This only works in simple cases. What do you do when you don't know
 the output size ?

First: If you don't know, you don't know. Then you're screwed and C is 
not going to help.

Second: If we cannot figure out how much to allocate before starting to 
use C (very unlikely), we can always use a callback to Python. Or we can 
have two C functions, the first determining the amount of allocation, 
the second doing the computation.
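
A sketch of that two-function convention with a toy "unique" operation
(LAPACK's lwork=-1 workspace query follows the same pattern -- and doing
the work twice is exactly the code duplication objected to earlier in the
thread):

-- example code -
import numpy as np

def core_unique_size(x):           # pass 1: only report the output size
    return np.unique(x).size

def core_unique(x, out):           # pass 2: fill the caller's buffer
    out[:] = np.unique(x)

x = np.array([3, 1, 3, 2])
out = np.empty(core_unique_size(x), dtype=x.dtype)  # caller allocates
core_unique(x, out)                # the "core" never allocates the result
-- end code -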


   How do you deal with reallocation ?

That is platform dependent. A buffer object could e.g. have a realloc 
method.


 How do you deal
 with ufunc and broadcasting buffering without allocating in the
 library ?

We allocate whatever we need on the Python side, possibly with a 
callback to Python if that's what we need.


Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Sebastien Binet
On Fri, 11 Jun 2010 00:25:17 +0200, Sturla Molden stu...@molden.no wrote:
 On 10.06.2010 22:07, Travis Oliphant wrote:
 
  2. The core should be a plain DLL, loadable with ctypes. (I know David 
  Cournapeau and Robert Kern are going to hate this.) But if Python can have 
  a custom loader for .pyd files, so can NumPy for its core DLL. For ctypes 
  we just need to specify a fully qualified path to the DLL, which can be 
  read from a config file or whatever.
   
  This approach does not build a new Python type in compiled code.   There 
  are speed disadvantages to this --- especially for the numpy scalars.
 
 There are at least four speed penalties in what I suggested:
 
 - the ctypes overhead is bigger than using Python's C API manually (or 
 Cython).
 - there is a speed penalty for scalars and small arrays. (Previously 
 seen in numarray vs. numeric.)
 - bytearray is a bit slower to create than to malloc a buffer.
 - arrays with dtype=object
 
 The advantages are:
 
 - the code is easier to read and maintain (clean separation of Python and C)
 - more portable code (no Python C API dependence)
 - no obscure memory leaks to track down (bytearray instead of 
 malloc/free and no manual ref counting).
 
 
 By the way: Is Cython mature enough to toss all C out the door? Since 
 Cython has syntax for PEP 3118 buffer objects, we should theoretically 
 be able to implement NumPy in Cython. Then we don't have the speed 
 penalty or difficult C to maintain. But if the idea is to make a 
 Python-independent C library from the core, it is probably a bit 
 counterproductive...

as a long time lurker, I really like the idea of having a .so as core
numpy, but I'd really hate to have to go through ctypes as a necessary (or
rather, default) layer to use numpy-core from Python.
It of course depends on the granularity at which you wrap and use
numpy-core, but tight loops calling ctypes ain't gonna be pretty
performance-wise.

I really think using Cython would be a better option.
One would get the python2 -> python3 transition for free.

but, again, I am just a lurker here on numpy-list :)

cheers,
sebastien.

-- 
#
# Dr. Sebastien Binet
# Laboratoire de l'Accelerateur Lineaire
# Universite Paris-Sud XI
# Batiment 200
# 91898 Orsay
#
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Francesc Alted
On Friday 11 June 2010 02:27:18, Sturla Molden wrote:
  Another thing I did when reimplementing lfilter was copy-in copy-out
  for strided arrays.
 
  What is copy-in copy out ? I am not familiar with this term ?
 
 Strided memory access is slow. So it often helps to make a temporary
 copy that is contiguous.

In my experience, this technique will only work well with strided arrays if 
you are going to re-use the data of these temporaries in cache, or your data 
is unaligned.  But if you are going to use the data only once (and this is 
very common in NumPy element-wise operations), this is rather counter-
productive for strided arrays.

For example, in numexpr, we made a lot of different tests comparing copy-in 
copy-out and direct access techniques for strided arrays.  The result was 
that operations with direct access showed significantly better performance 
with strided arrays.  On the contrary, for unaligned arrays the copy-in copy-
out technique gave better results.

Look at these times, where the arrays were one-dimensional with a length of 1
million elements each, but the results can be extrapolated to larger,
multidimensional arrays (the original benchmark file is bench/vml_timing.py):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Numexpr version:   1.3.2.dev169
NumPy version:     1.4.1rc2
Python version:    2.6.1 (r261:67515, Feb  3 2009, 17:34:37)
                   [GCC 4.3.2 [gcc-4_3-branch revision 141291]]
Platform:          linux2-x86_64
AMD/Intel CPU?     True
VML available?     True
VML/MKL version:   Intel(R) Math Kernel Library Version 10.1.0 Product Build
                   081809.14 for Intel(R) 64 architecture applications
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


To start with, times between numpy and numexpr are very similar for very 
simple expressions (except for unaligned arrays, where copy-in copy-out 
works pretty well for numexpr):

*** Expression: i2 > 0
numpy: 0.0016
numpy strided: 0.0037
  numpy unaligned: 0.0086
  numexpr: 0.0016 Speed-up of numexpr over numpy: 0.9512
  numexpr strided: 0.0039 Speed-up of numexpr over numpy: 0.964
numexpr unaligned: 0.0042 Speed-up of numexpr over numpy: 2.0598


When doing some basic operations (mind that there are no temporaries here, so 
numpy should not be at a great disadvantage), direct access to strided data goes 
between 2x and 3x faster than numpy:

*** Expression: f3+f4
numpy: 0.0060
numpy strided: 0.0176
  numpy unaligned: 0.0166
  numexpr: 0.0052 Speed-up of numexpr over numpy: 1.1609
  numexpr strided: 0.0086 Speed-up of numexpr over numpy: 2.0584
numexpr unaligned: 0.0099 Speed-up of numexpr over numpy: 1.6785

*** Expression: f3+i2
numpy: 0.0060
numpy strided: 0.0176
  numpy unaligned: 0.0176
  numexpr: 0.0031 Speed-up of numexpr over numpy: 1.9137
  numexpr strided: 0.0061 Speed-up of numexpr over numpy: 2.8789
numexpr unaligned: 0.0078 Speed-up of numexpr over numpy: 2.2411

Notice how, until now, absolute times in numexpr with strided arrays (using the
direct technique) are faster than in the unaligned case (copy-in copy-out).

Also, when evaluating transcendental expressions (numexpr uses Intel's Vector 
Math Library, VML, here), direct access is again faster than NumPy:

*** Expression: exp(f3)
numpy: 0.0150
numpy strided: 0.0155
  numpy unaligned: 0.0222
  numexpr: 0.0030 Speed-up of numexpr over numpy: 5.0268
  numexpr strided: 0.0081 Speed-up of numexpr over numpy: 1.9086
numexpr unaligned: 0.0066 

Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Pauli Virtanen
Thu, 10 Jun 2010 23:56:56 +0200, Sturla Molden wrote:
[clip]
 Also about array iterators in NumPy's C base (i.e. for doing something
 along an axis): we don't need those. There is a different way of coding
 which leads to faster code.
 
 1. Collect an array of pointers to each subarray (e.g. using
 std::vector<dtype*> or dtype**)
 2. Dispatch on the pointer array...

This is actually what the current ufunc code does.

The innermost dimension is handled via the ufunc loop, which is a simple 
for loop with constant-size step and is given a number of iterations. The 
array iterator objects are used only for stepping through the outer 
dimensions. That is, it essentially steps through your dtype** array, 
without explicitly constructing it.

An abstraction for iteration through an array is useful, also for 
building the dtype** array, so we should probably retain them.
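
A sketch of this structure, assuming C-contiguous arrays: a constant-stride
inner loop, plus an outer walk over the leading dimensions playing the
iterator's role:

-- example code -
import numpy as np

def inner_loop(src, dst, n):
    for i in range(n):                   # the constant-stride ufunc inner loop
        dst[i] = src[i] + 1.0

def apply_elementwise(x, out):
    n = x.shape[-1]
    # each row view is one entry of the implicit dtype** array
    for src, dst in zip(x.reshape(-1, n), out.reshape(-1, n)):
        inner_loop(src, dst, n)

x = np.arange(12.0).reshape(3, 4)
out = np.empty_like(x)
apply_elementwise(x, out)
-- end code -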

For multidimensional arrays, you run here into the optimization problem 
of choosing the order of axes so that the memory access pattern makes a 
maximally efficient use of the cache. Currently, this optimization 
heuristic is really simple, and makes the wrong choice even in the simple 
2D case.

We could in principle use OpenMP to parallelize the outer loop, as long 
as the inner ufunc part is implemented in C, and does not go back into 
Python. But if the problem is memory-bound, it is not clear that 
parallelization helps.

Another task for optimization would perhaps be to implement specialized 
loops for each operation; currently we do one function indirection per 
iteration, which probably costs something. But again, it may be that the 
operations are already bounded by memory speed, and this would not 
improve performance.

-- 
Pauli Virtanen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Hans Meine
On Thursday 10 June 2010 22:28:28 Pauli Virtanen wrote:
 Some places where OpenMP could probably help are in the inner ufunc
 loops. However, improving the memory efficiency of the data access
 pattern is another low-hanging fruit for multidimensional arrays.

I was about to mention this when the thread started... at least I want to raise 
awareness of the current problems again, so they can be taken into account 
while refactoring.  Theoretically, this should be really low-hanging fruit, 
but I have no idea how much work it is.

The problem is: NumPy uses C-order indexing by default.  While it supports
F-order arrays, too, using F-order arrays leads to unnecessary slow-downs, since
the dimensions are walked through from left to right, and not in memory order.

Ideally, algorithms would get wrapped in between two additional 
pre-/postprocessing steps:

1) Preprocessing: After broadcasting, transpose the input arrays such that 
they become C order.  More specifically, sort the strides of one (e.g. the 
first/the largest) input array decreasingly, and apply the same transposition 
to the other arrays (in + out if given).

2) Perform the actual algorithm, producing a C-order output array.

3) Apply the reverse transposition to the output array.

This would
- fix the bug that operations (e.g. addition) on F-order arrays are slower 
than on C-order arrays,
- fix the bug that the order of the output arrays differs from that of the 
input arrays (this is very annoying when dealing with 3rd party libraries that 
use and expect F-order everywhere), and
- be easy for elementwise operations, but
- need further thinking / special handling of operations where the number &
mapping of dimensions of the input & output arrays is more complex.
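
A NumPy-level sketch of steps 1)-3) above, for a same-shape binary operation:

-- example code -
import numpy as np

def add_order_aware(a, b):
    # 1) sort axes by the first input's strides, largest first -> C order
    perm = tuple(np.argsort(a.strides)[::-1])
    at, bt = a.transpose(perm), b.transpose(perm)
    # 2) compute into a C-order work array, walking memory contiguously
    out = np.empty(at.shape, dtype=np.result_type(a, b))
    np.add(at, bt, out=out)
    # 3) undo the transposition: the result is laid out like the inputs
    return out.transpose(np.argsort(perm))
-- end code -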

Last, but not least, here is a pointer to our previous discussion of the topic 
in the thread "Unnecessarily bad performance of elementwise operators with 
Fortran-arrays":
  http://www.mail-archive.com/numpy-discussion@scipy.org/msg05100.html

Have a nice day,
  Hans
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Pauli Virtanen
Fri, 11 Jun 2010 10:29:28 +0200, Hans Meine wrote:
[clip]
 Ideally, algorithms would get wrapped in between two additional
 pre-/postprocessing steps:
 
 1) Preprocessing: After broadcasting, transpose the input arrays such
 that they become C order.  More specifically, sort the strides of one
 (e.g. the first/the largest) input array decreasingly, and apply the
 same transposition to the other arrays (in + out if given).
 
 2) Perform the actual algorithm, producing a C-order output array.
 
 3) Apply the reverse transposition to the output array.
[clip]

The pre-/postprocessing steps are fairly easy to implement in the ufunc 
machinery.

The problem here was that this would result in a subtle semantic change, 
as output arrays from ufuncs would no longer necessarily be in C order.

We need to explicitly *decide* to make this change, before proceeding. 
I'd say that we should simply do this, as nobody should depend on this 
assumption.

I think there was some code ready to implement this shuffling. I'll try 
to dig it out and implement the shuffling.

-- 
Pauli Virtanen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Hans Meine
On Friday 11 June 2010 10:38:28 Pauli Virtanen wrote:
 Fri, 11 Jun 2010 10:29:28 +0200, Hans Meine wrote:
  Ideally, algorithms would get wrapped in between two additional
  pre-/postprocessing steps:
  
  1) Preprocessing: After broadcasting, transpose the input arrays such
  that they become C order.  More specifically, sort the strides of one
  (e.g. the first/the largest) input array decreasingly, and apply the
  same transposition to the other arrays (in + out if given).
  
  2) Perform the actual algorithm, producing a C-order output array.
  
  3) Apply the reverse transposition to the output array.
 
 [clip]
 
 The post/preprocessing steps are fairly easy to implement in the ufunc
 machinery.
 
 The problem here was that this would result in a subtle semantic change,
 as output arrays from ufuncs would no longer necessarily be in C order.

I don't see it as a problem, really.  People who rely on C order will also 
give C order input arrays to the algorithms.

 We need to explicitly *decide* to make this change, before proceeding.
 I'd say that we should simply do this, as nobody should depend on this
 assumption.

+1

I seem to recall that even Travis, or at least David C. gave a similar opinion 
in the mentioned thread in the past.

 I think there was some code ready to implement this shuffling. I'll try
 to dig it out and implement the shuffling.

That would be great!

Ullrich Köthe has implemented this for our VIGRA/numpy bindings:
  http://tinyurl.com/fast-ufunc
At the bottom you can see that he basically wraps all numpy.ufuncs he can find 
in the numpy top-level namespace automatically.

Looking forward to this,
  Hans
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Sturla Molden
On 11.06.2010 10:17, Pauli Virtanen wrote:
 1. Collect an array of pointers to each subarray (e.g. using
 std::vector<dtype*> or dtype**)
 2. Dispatch on the pointer array...
  
 This is actually what the current ufunc code does.

 The innermost dimension is handled via the ufunc loop, which is a simple
 for loop with constant-size step and is given a number of iterations. The
 array iterator objects are used only for stepping through the outer
 dimensions. That is, it essentially steps through your dtype** array,
 without explicitly constructing it.



Yes, exactly my point. And because the iterator does not explicitly 
construct the array, it sucks for parallel programming (e.g. with OpenMP):

- The iterator becomes a bottleneck to which access must be serialized 
with a mutex.
- We cannot do proper work scheduling (load balancing)



 We could in principle use OpenMP to parallelize the outer loop, as long
 as the inner ufunc part is implemented in C, and does not go back into
 Python.

Yes. And for that we need to explicitly construct a pointer array.


 But if the problem is memory-bound, it is not clear that
 parallelization helps.


That is often the case, yes.

Memory access patterns is one of the most important issues speed wise, 
particularly when working with strided array slices.

It would also make sense to evaluate expressions like y = b*x + a 
without a temporary array for b*x. I know roughly how to do it, but 
don't have time to look at it before next year. (Yes I know about 
numexpr, I am talking about plain Python code.)
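
To make the point concrete, here is the manual version of this with today's 
NumPy, using explicit ufunc output arguments -- i.e. what the suggestion 
would do automatically behind a plain y = b*x + a:

import numpy as np
a, b = 2.0, 3.0
x = np.linspace(0.0, 1.0, 1000000)
y = np.empty_like(x)
np.multiply(b, x, y)   # y = b*x, no temporary array
np.add(y, a, y)        # y = b*x + a, still no temporary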


Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Sturla Molden
On 11.06.2010 09:14, Sebastien Binet wrote:
 it of course depends on the granularity at which you wrap and use
 numpy-core but tight loops calling ctypes ain't gonna be pretty
 performance-wise.


Tight loops in Python are never pretty.

The purpose of vectorization with NumPy is to avoid tight Python loops.

 I really think using cython would be a better option.
 one'd get the python2 -> python3 transition for free.



Perhaps. Cython has nice syntax for PEP 3118 buffers. But there are 
downsides to this:
- Cython is not C. We want the core to be a C library independent of Python.
- Cython depends on CPython.
- Compiling Cython generated C takes time.
- Linkage can be a PITA (I use 64-bit Python now, and import libraries 
for gcc are missing.)

Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Benjamin Root
On Fri, Jun 11, 2010 at 8:31 AM, Sturla Molden stu...@molden.no wrote:



 It would also make sence to evaluate expressions like y = b*x + a
 without a temporary array for b*x. I know roughly how to do it, but
 don't have time to look at it before next year. (Yes I know about
 numexpr, I am talking about plain Python code.)


 Sturla



If I may chime in here with my own experience with NumPy code...

I typically use older, weaker computers for my work.  I am not doing
real-time modeling or some other really advanced, difficult computations.
For me, NumPy works fast enough, even on an EeePC.  My main issue is the
one given above by Sturla.  I find that NumPy's memory usage can go
out-of-control very easily in long mathematical expressions.  With a mix of
constants and large matrices, each step in the order of operations seems to
take up more memory.  Often, I would run into a major slow-down from
thrashing the swap.  This is fairly trivial to get around by operating over
slices of the matrices at a time, but -- to me -- all of this talk about
optimizing the speed of the operations without addressing the temporaries
issue is like trying to tune-up the engine of a car without bothering to
take the lead weights out of the trunk.

Just my 2 cents.

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Anne Archibald
On 11 June 2010 11:12, Benjamin Root ben.r...@ou.edu wrote:


 On Fri, Jun 11, 2010 at 8:31 AM, Sturla Molden stu...@molden.no wrote:


 It would also make sence to evaluate expressions like y = b*x + a
 without a temporary array for b*x. I know roughly how to do it, but
 don't have time to look at it before next year. (Yes I know about
 numexpr, I am talking about plain Python code.)


 Sturla



 If I may chime in here with my own experience with NumPy code...

 I typically use older, weaker computers for my work.  I am not doing
 real-time modeling or some other really advanced, difficult computations.
 For me, NumPy works fast enough, even on an EeePC.  My main issue is the
 one given above by Sturla.  I find that NumPy's memory usage can go
 out-of-control very easily in long mathematical expressions.  With a mix of
 constants and large matrices, each step in the order of operations seems to
 take up more memory.  Often, I would run into a major slow-down from
 thrashing the swap.  This is fairly trivial to get around by operating over
 slices of the matrices at a time, but -- to me -- all of this talk about
 optimizing the speed of the operations without addressing the temporaries
 issue is like trying to tune-up the engine of a car without bothering to
 take the lead weights out of the trunk.

I should say, though, that I've gone through the process of removing
all temporary allocation using ufunc output arguments (np.add(a,b,c))
only to discover that it didn't actually save any memory and it was
slower. The nice thing about temporaries is that they're, well,
temporary; they go away.

On the other hand, since memory reads are very slow, optimizations
that do more calculation per load/store could make a very big
difference, eliminating temporaries as a side effect.

Anne

 Just my 2 cents.

 Ben Root

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Dag Sverre Seljebotn
Sturla Molden wrote:
 On 11.06.2010 09:14, Sebastien Binet wrote:
 it of course depends on the granularity at which you wrap and use
 numpy-core but tight loops calling ctypes ain't gonna be pretty
 performance-wise.

 
 Tight loops in Python are never pretty.
 
 The purpose of vectorization with NumPy is to avoid tight Python loops.
 
 I really think using cython would be a better option.
 one'd get the python2-  python3 transition for free.


 
 Perhaps. Cython has nice syntax for PEP 3118 buffers. But there are 
 downsides to this:

IMO I think the syntax is not useful in this setting.

 - Cython is not C. We want the core to be a C library independent of Python.

Cython could fill the role you propose that ctypes takes. The advantage is

  a) (IMO) nicer syntax...well, at least can be less verbose than ctypes

  b) Avoids having to structure the whole API around what can be done in 
Python+ctypes. A pure ctypes+pure C solution would have a tendency to 
push stuff that logically belongs Python-module-side into pure C side, 
which could be a very bad thing.

  c) Possible to iteratively move more and more from C-extension/Cython 
over to C side, rather than a full rewrite in ctypes + C.

There's disadvantages re compilation, but see below.

I believe this is rather long term though. Stuff like this has to be 
done iteratively; one must not rewrite from scratch.

 - Cython depends on CPython.
 - Compiling Cython generated C takes time.
 - Linkage can be a PITA (I use 64-bit Python now, and import libraries 
 for gcc are missing.)

Well, if you can't link CPython extension modules you're going to have 
major problems with about every single scientific Python library out there.

Unless you actually tackle the root of the problem, rewriting NumPy with 
ctypes won't help -- you also need to rewrite matplotlib, mayavi, 
PyTables, SciPy, mpi4py etc. etc. etc. etc. in ctypes + pure C without 
the CPython API.

In other words, I don't think this is a very reasonable argument. (Well, 
perhaps NumPy is in some sense more basic than these others.)

Dag Sverre
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Sturla Molden
On 11.06.2010 17:17, Anne Archibald wrote:

 On the other hand, since memory reads are very slow, optimizations
 that do more calculation per load/store could make a very big
 difference, eliminating temporaries as a side effect.


Yes, that's the main issue, not the extra memory they use.

Sturla
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-11 Thread Pauli Virtanen
Fri, 11 Jun 2010 15:31:45 +0200, Sturla Molden wrote:
[clip]
 The innermost dimension is handled via the ufunc loop, which is a
 simple for loop with constant-size step and is given a number of
 iterations. The array iterator objects are used only for stepping
 through the outer dimensions. That is, it essentially steps through
 your dtype** array, without explicitly constructing it.

 Yes, exactly my point. And because the iterator does not explicitly
 construct the array, it sucks for parallel programming (e.g. with
 OpenMP):
 
 - The iterator becomes a bottleneck to which access must be serialized
 with a mutex.
 - We cannot do proper work scheduling (load balancing)

I don't necessarily agree: you can do

for parallelized outer loop {
    critical section {
        p = get iterator pointer
        ++iterator
    }
    inner loop in region `p`
}

This does allow load balancing etc., as a free processor can immediately 
grab the next available slice. Also, it would be easier to implement with 
OpenMP pragmas in the current code base.

Of course, the assumption here is that the outer iterator overhead is 
small compared to the duration of the inner loop. This must then be 
compared to the memory access overhead involved in the dtype** array.
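
A minimal self-contained C sketch of the scheme above -- the row-iterator 
type and the kernel are invented for illustration; the real outer loop would 
use the NumPy iterator instead:

#include <stddef.h>

typedef struct { double *base; size_t nrows, ncols, next; } rowiter;

static void inner_loop(double *row, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        row[i] *= 2.0;                /* stand-in for the ufunc inner loop */
}

void apply_parallel(rowiter *it)
{
    #pragma omp parallel
    {
        for (;;) {
            double *p = NULL;
            #pragma omp critical(iter)     /* serialized iterator access */
            {
                if (it->next < it->nrows)
                    p = it->base + it->next++ * it->ncols;
            }
            if (p == NULL)
                break;                     /* iterator exhausted */
            inner_loop(p, it->ncols);      /* inner part runs unlocked */
        }
    }
}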

-- 
Pauli Virtanen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Charles R Harris
On Wed, Jun 9, 2010 at 5:27 PM, Jason McCampbell
jmccampb...@enthought.com wrote:

 Hi everyone,

 This is a follow-up to Travis's message on the re-factoring project from
 May 25th and the subsequent discussion. For background, I am a developer at
 Enthought working on the NumPy re-factoring project with Travis and Scott.
 The immediate goal from our perspective is to re-factor the core of NumPy
 into two architectural layers: a true core that is CPython-independent and
 an interface layer that binds the core to CPython.

 A design proposal is now posted on the NumPy developer wiki:
 http://projects.scipy.org/numpy/wiki/NumPyRefactoring

 The write-up is a fairly high-level description of what we think the split
 will look like and how to deal with issues such as memory management.  There
 are also placeholders listed as 'TBD' where more investigation is still
 needed and will be filled in over time.  At the end of the page there is a
 section on the C API with a link to a function-by-function breakdown of the
 C API and whether the function belongs in the interface layer, the core, or
 need to be split between the two.  All functions listed as 'core' will
 continue to have an interface-level wrapper with the same name to ensure
 source-compatibility.

 All of this, particularly the interface/core function designations, is a
 first analysis and in flux. The goal is to get the information out and
 elicit discussion and feedback from the community.


A few thoughts came to mind while reading the initial writeup.

1) How is the GIL handled in the callbacks.
2) What about error handling? That is tricky to get right, especially in C
and with reference counting.
3) Is there a general policy as to how the reference counting should be
handled in specific functions? That is, who does the reference
incrementing/decrementing?
4) Boost has some reference counted pointers, have you looked at them? C++
is admittedly a very different animal for this sort of application.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Charles R Harris
On Thu, Jun 10, 2010 at 7:26 AM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Wed, Jun 9, 2010 at 5:27 PM, Jason McCampbell 
 jmccampb...@enthought.com wrote:

 Hi everyone,

 This is a follow-up to Travis's message on the re-factoring project from
 May 25th and the subsequent discussion. For background, I am a developer at
 Enthought working on the NumPy re-factoring project with Travis and Scott.
 The immediate goal from our perspective is to re-factor the core of NumPy
 into two architectural layers: a true core that is CPython-independent and
 an interface layer that binds the core to CPython.

 A design proposal is now posted on the NumPy developer wiki:
 http://projects.scipy.org/numpy/wiki/NumPyRefactoring

 The write-up is a fairly high-level description of what we think the split
 will look like and how to deal with issues such as memory management.  There
 are also placeholders listed as 'TBD' where more investigation is still
 needed and will be filled in over time.  At the end of the page there is a
 section on the C API with a link to a function-by-function breakdown of the
 C API and whether the function belongs in the interface layer, the core, or
 need to be split between the two.  All functions listed as 'core' will
 continue to have an interface-level wrapper with the same name to ensure
 source-compatibility.

 All of this, particularly the interface/core function designations, is a
 first analysis and in flux. The goal is to get the information out and
 elicit discussion and feedback from the community.


 A few thoughts came to mind while reading the initial writeup.

 1) How is the GIL handled in the callbacks.
 2) What about error handling? That is tricky to get right, especially in C
 and with reference counting.
 3) Is there a general policy as to how the reference counting should be
 handled in specific functions? That is, who does the reference
 incrementing/decrementing?
 4) Boost has some reference counted pointers, have you looked at them? C++
 is admittedly a very different animal for this sort of application.

 I've thought that PyArray_SetNumericOps, PyArray_GetNumericOps should have
getter/setter functions so that the structure doesn't have to be made public
in the split up files.
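
A hedged sketch of what such accessors could look like -- the struct layout 
and the function names here are invented, not the actual NumPy API:

/* private to the core; the struct need not appear in installed headers */
typedef struct { void *add, *subtract, *multiply; /* ... */ } NumericOps;

static NumericOps numeric_ops;

NumericOps *npy_get_numeric_ops(void)
{
    return &numeric_ops;
}

void npy_set_numeric_ops(const NumericOps *ops)
{
    numeric_ops = *ops;
}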

Can the casting functions be implemented as ufuncs? Or at least look like
ufuncs so they can handle strided memory. Likewise for some of the other
things in arraytypes.c.src.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Jason McCampbell
Hi Chuck,

Good questions.  Responses inline below...

Jason

On Thu, Jun 10, 2010 at 8:26 AM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Wed, Jun 9, 2010 at 5:27 PM, Jason McCampbell 
 jmccampb...@enthought.com wrote:

 Hi everyone,

 This is a follow-up to Travis's message on the re-factoring project from
 May 25th and the subsequent discussion. For background, I am a developer at
 Enthought working on the NumPy re-factoring project with Travis and Scott.
 The immediate goal from our perspective is to re-factor the core of NumPy
 into two architectural layers: a true core that is CPython-independent and
 an interface layer that binds the core to CPython.

 A design proposal is now posted on the NumPy developer wiki:
 http://projects.scipy.org/numpy/wiki/NumPyRefactoring

 The write-up is a fairly high-level description of what we think the split
 will look like and how to deal with issues such as memory management.  There
 are also placeholders listed as 'TBD' where more investigation is still
 needed and will be filled in over time.  At the end of the page there is a
 section on the C API with a link to a function-by-function breakdown of the
 C API and whether the function belongs in the interface layer, the core, or
 need to be split between the two.  All functions listed as 'core' will
 continue to have an interface-level wrapper with the same name to ensure
 source-compatibility.

 All of this, particularly the interface/core function designations, is a
 first analysis and in flux. The goal is to get the information out and
 elicit discussion and feedback from the community.


 A few thoughts came to mind while reading the initial writeup.

 1) How is the GIL handled in the callbacks.


How to handle the GIL still requires some thought.  The cleanest way, IMHO,
would be for the interface layer to release the lock prior to calling into
the core and then each callback function in the interface is responsible for
re-acquiring it.  That's straightforward to define as a rule and should work
well in general, but I'm worried about potential performance issues if/when
a callback is called in a loop.  A few optimization points are ok, but too
many and it will just be a source of heisenbugs.

One other option is to just use the existing release/acquire macros in NumPy
and redirect them to the interface layer.  Any app that isn't CPython would
just leave those callback pointers NULL.  It's less disruptive but leaves
some very CPython-specific behavior in the core.
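
For the first option, a hedged sketch of what an interface-side callback 
could look like -- the core entry point and the callback name are 
hypothetical:

#include <Python.h>

typedef struct NpyArray NpyArray;                  /* hypothetical core type */
void NpyArray_do_work(NpyArray *, void (*)(const char *)); /* hypothetical */

/* each callback re-acquires the GIL before touching CPython */
static void interface_error_cb(const char *msg)
{
    PyGILState_STATE st = PyGILState_Ensure();
    PyErr_SetString(PyExc_RuntimeError, msg);
    PyGILState_Release(st);
}

/* interface layer: drop the GIL around the call into the core */
static void call_core(NpyArray *arr)
{
    Py_BEGIN_ALLOW_THREADS
    NpyArray_do_work(arr, interface_error_cb);
    Py_END_ALLOW_THREADS
}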


 2) What about error handling? That is tricky to get right, especially in C
 and with reference counting.


The error reporting functions in the core will likely look a lot like the
CPython functions - they seem general enough.  The biggest change is the
CPython ones take a PyObject as the error type.  99% of the errors reported
in NumPy use one of a half-dozen pre-defined types that are easy to
translate.  There is at least one case where an object type (complex number)
is dynamically created and used as the type, but so far I believe it's only one
case.

The reference counting does get a little more complex because a core routine
will need to decref the core object on error and the interface layer will
need to similarly detect the error and potentially do its own decref.  Each
layer is still responsible for its own cleanup, but there are now two
opportunities to introduce leaks.


 3) Is there a general policy as to how the reference counting should be
 handled in specific functions? That is, who does the reference
 incrementing/decrementing?


Both layers should implement the existing policy for the objects that it
manages. Essentially a function can use its caller's reference but needs to
increment the count if it's going to store it.  A new instance is returned
with a refcnt of 1 and the caller needs to clean it up when it's no longer
needed.  But that means that if the core returns a new NpyArray instance to
the interface layer, the receiving function in the interface must allocate a
PyObject wrapper around it and set the wrapper's refcnt to 1 before
returning it.

Is that what you were asking?
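
As a hedged illustration of that last rule -- all Npy* names and the wrapper 
layout are placeholders for whatever the refactored API ends up being:

#include <Python.h>

typedef struct NpyArray NpyArray;           /* hypothetical core array type */
void NpyArray_decref(NpyArray *);           /* hypothetical */

typedef struct { PyObject_HEAD NpyArray *core; } PyArrayObject;  /* sketch */
extern PyTypeObject PyArray_Type;           /* the interface-layer type */

/* interface layer: wrap a new core instance returned by the core */
static PyObject *wrap_core_array(NpyArray *core_arr)
{
    PyArrayObject *wrap = PyObject_New(PyArrayObject, &PyArray_Type);
    if (wrap == NULL) {
        NpyArray_decref(core_arr);  /* we own this core ref; drop on error */
        return NULL;
    }
    wrap->core = core_arr;          /* wrapper takes over the core reference */
    return (PyObject *)wrap;        /* handed back with refcnt 1 */
}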

4) Boost has some reference counted pointers, have you looked at them? C++
 is admittedly a very different animal for this sort of application.


There is also a need to replace the usage of PyDict and other uses of CPython
for basic data structures that aren't present in C.  Having access to C++
for this and reference counting would be nice, but has the potential to
break builds for everyone who uses the C API.  I think it's worth discussing
for the future but a bigger (and possibly more contentious) change than we
are able to take on for this project.


 Chuck

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org

Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Sturla Molden


I have a few radical suggestions:

1. Use ctypes as glue to the core DLL, so we can completely forget about 
refcounts and similar mess. Why put manual reference counting and error 
handling in the core? It's stupid.


2. The core should be a plain DLL, loadable with ctypes. (I know David 
Cournapeau and Robert Kern are going to hate this.) But if Python can 
have a custom loader for .pyd files, so can NumPy for its core DLL. For 
ctypes we just need to specify a fully qualified path to the DLL, which 
can be read from a config file or whatever.


3. ctypes will take care of all the mess regarding the GIL. Again there 
is no need to re-invent the wheel. ctypes.CDLL releases the GIL when 
calling into C, and re-acquires before making callbacks to Python. 
ctypes.PyDLL keeps the GIL for legacy library code that is not threadsafe. 
(A small sketch of this follows after the list.)


ctypes will also make porting to other Python implementations (or even 
other languages: Ruby, JavaScript) easier. Not to mention that it 
will make NumPy impervious to changes in the Python C API.


4. Write the core in C++ or Fortran (95/2003), not ANSI C (aka C89). 
Use std::vector instead of malloc/free for C++, and possibly templates 
for generics on various dtypes.


5. Allow OpenMP pragmas in the core. If arrays are above a certain size, 
it should switch to multi-threading.
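
A small sketch of points 2 and 3 together -- the DLL paths and the exported 
function are made up for illustration:

import ctypes

# fully qualified paths, e.g. read from a config file
core = ctypes.CDLL("/opt/numpy/libndarray_core.so")  # releases GIL per call
legacy = ctypes.PyDLL("/opt/numpy/liblegacy.so")     # keeps the GIL held

core.npy_sum_double.restype = ctypes.c_double
core.npy_sum_double.argtypes = [ctypes.POINTER(ctypes.c_double),
                                ctypes.c_size_t]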


Sturla

On 10.06.2010 15:26, Charles R Harris wrote:


A few thoughts came to mind while reading the initial writeup.

1) How is the GIL handled in the callbacks.
2) What about error handling? That is tricky to get right, especially 
in C and with reference counting.
3) Is there a general policy as to how the reference counting should 
be handled in specific functions? That is, who does the reference 
incrementing/decrementing?
4) Boost has some reference counted pointers, have you looked at them? 
C++ is admittedly a very different animal for this sort of application.


Chuck


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
   


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Sturla Molden
On 10.06.2010 18:48, Sturla Molden wrote:
 ctypes will also make porting to other Python implementations (or even 
 other languages: Ruby, JavaScript) easier. Not to mention 
 that it will make NumPy impervious to changes in the Python C API.

Linking is also easier with ctypes. I started getting a lot of linker 
errors when switching to 64-bit Python on Windows. Just throwing out all 
of Cython et al. in favour of plain C keeps the problem away.

Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Sturla Molden


Another suggestion I'd like to make is bytearray as memory buffer for 
the ndarray. An ndarray could just store or extend a bytearray, instead 
of having to deal with malloc/free and the mess that comes with it. 
Python takes care of the reference counts for bytearrays, and ctypes 
calls the NumPy core DLL. Yes it will not work with older versions of 
Python. But for a complete refactoring (NumPy 3000), backwards 
compatibility should not matter much. (It's easier to backport bytearray 
than track down memory leaks in NumPy's core.)
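
To illustrate: a bytearray can already back a buffer that a ctypes-loaded 
core would write through (nothing here is NumPy-specific):

import ctypes

buf = bytearray(1000 * 8)          # refcounted by Python, no free() needed
cbuf = (ctypes.c_char * len(buf)).from_buffer(buf)
data = ctypes.cast(cbuf, ctypes.POINTER(ctypes.c_double))
data[0] = 3.14                     # the core DLL would write through this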


Sturla


On 10.06.2010 18:48, Sturla Molden wrote:


I have a few radical suggestions:

1. Use ctypes as glue to the core DLL, so we can completely forget 
about refcounts and similar mess. Why put manual reference counting 
and error handling in the core? It's stupid.


2. The core should be a plain DLL, loadable with ctypes. (I know David 
 Cournapeau and Robert Kern are going to hate this.) But if Python can 
 have a custom loader for .pyd files, so can NumPy for its core DLL. 
For ctypes we just need to specify a fully qualified path to the DLL, 
which can be read from a config file or whatever.


3. ctypes will take care of all the mess regarding the GIL. Again 
there is no need to re-invent the wheel. ctypes.CDLL releases the GIL 
when calling into C, and re-acquires before making callbacks to 
 Python. ctypes.PyDLL keeps the GIL for legacy library code that is 
 not threadsafe.


 ctypes will also make porting to other Python implementations (or even 
 other languages: Ruby, JavaScript) easier. Not to mention 
 that it will make NumPy impervious to changes in the Python C API.


 4. Write the core in C++ or Fortran (95/2003), not ANSI C (aka 
C89). Use std::vector instead of malloc/free for C++, and possibly 
templates for generics on various dtypes.


5. Allow OpenMP pragmas in the core. If arrays are above a certain 
size, it should switch to multi-threading.


Sturla

On 10.06.2010 15:26, Charles R Harris wrote:


A few thoughts came to mind while reading the initial writeup.

1) How is the GIL handled in the callbacks.
2) What about error handling? That is tricky to get right, especially 
in C and with reference counting.
3) Is there a general policy as to how the reference counting should 
be handled in specific functions? That is, who does the reference 
incrementing/decrementing?
4) Boost has some reference counted pointers, have you looked at 
them? C++ is admittedly a very different animal for this sort of 
application.


Chuck


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
   



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
   


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread David Cournapeau
On Fri, Jun 11, 2010 at 6:56 AM, Sturla Molden stu...@molden.no wrote:


 Also about array iterators in NumPy's C base (i.e. for doing something
 along an axis): we don't need those. There is a different way of coding
 which leads to faster code.

 1. Collect an array of pointers to each subarray (e.g. using
 std::vector<dtype*> or dtype**)
 2. Dispatch on the pointer array...

Do you have the code for this ? That's something I wanted to do, but
never took the time to do. Faster generic iterator would be nice, but
very hard to do in general.

 Another thing I did when reimplementing lfilter was copy-in copy-out
 for strided arrays.

What is copy-in copy-out ? I am not familiar with this term ?

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Travis Oliphant

On Jun 10, 2010, at 11:48 AM, Sturla Molden wrote:

 
 I have a few radical suggestions:

There are some good ideas there.   I suspect we can't address all of them in 
the course of this re-factoring effort, but I really appreciate you putting 
them out there, because they are useful things to consider. 

 
 1. Use ctypes as glue to the core DLL, so we can completely forget about 
 refcounts and similar mess. Why put manual reference counting and error 
 handling in the core? It's stupid. 

If we could get to a single core DLL, then perhaps this would work.   It may be 
very difficult to actually get to that point though from where we are because 
there is a lot to the current CPython interface that would have to be 
re-thought.   Right now, it looks like there is still a need for an interface 
layer.  

 
 2. The core should be a plain DLL, loadable with ctypes. (I know David 
 Cournapeau and Robert Kern is going to hate this.) But if Python can have a 
 custom loader for .pyd files, so can NumPy for it's core DLL. For ctypes we 
 just need to specify a fully qualified path to the DLL, which can be read 
 from a config file or whatever.

This approach does not build a new Python type in compiled code.   There are 
speed disadvantages to this --- especially for the numpy scalars. 


 
 5. Allow OpenMP pragmas in the core. If arrays are above a certain size, it 
 should switch to multi-threading.

This is an interesting idea. 

-Travis

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Sturla Molden
On 10.06.2010 22:28, Pauli Virtanen wrote:

 Some places where Openmp could probably help are in the inner ufunc
 loops. However, improving the memory efficiency of the data access
 pattern is another low-hanging fruit for multidimensional arrays.



Getting the intermediate array out of expressions like a + b would be 
very cool. But it is not as trivial as it seems.

Possibility of controlling aliasing (cf. Fortran and C99) can help. For 
example, for binary array operators we know that the output array (a 
temporary intermediate) is not aliased with the two arguments.

ufuncs involving transcendental functions would benefit from OpenMP. 
O(N**2) ufuncs like var and std would also benefit.

We can also use copy-in copy-out for strided array access (see below).

 I'm not sure how Openmp plays together with thread-safety; especially if
 used in outer constructs. Whether this is possible to do easily is not so
 clear, as Numpy uses e.g. iterator and loop objects, which cannot
 probably be shared between different openmp threads. So one would have to
 do some extra work to parallelize these parts manually.



OpenMP has pragmas for serializing access to code:

#pragma omp critical
{
  // done by all threads
  // serialized access (we can also name the lock)
}

#pragma omp atomic
a += 1 // atomic operation (e.g. nice for refcounts, instead of a big lock)

#pragma omp single
{
  // done by one thread (random or named)
}

#pragma omp master
{
  // done by the master thread only
  // e.g. used when calling back to the Python C API
}

#pragma omp barrier
// wait for all threads here

That's about all you need to know about thread synchronization in OpenMP...


 But probably if multithreading is desired, openmp sounds like the way to
 go...



Anyone proficient in C or Fortran can learn OpenMP in 10 minutes, and it 
is close to infinitely more easy than manual multithreading. And now it 
is supported by the majority of compilers.

Also about array iterators in NumPy's C base (i.e. for doing something 
along an axis): we don't need those. There is a different way of coding 
which leads to faster code.

1. Collect an array of pointers to each subarray (e.g. using 
std::vector<dtype*> or dtype**)
2. Dispatch on the pointer array...

I used this to reimplement lfilter in scipy. It is just a lot faster 
than array iterators. It also allows us to use OpenMP in a very natural 
way. For multidimensional arrays I have a recursive function for 
collecting the array of subarray pointers, but for 2D and 3D we can 
use nested for-loops.
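
The dispatch step then parallelizes trivially -- get_axis_pointers is the 
routine posted in full later in this thread, and process_vector is a 
stand-in for whatever kernel runs on each vector:

void process_vector(double *v, npy_intp n, npy_intp stride); /* stand-in */

npy_intp count, k;
double **vecs = get_axis_pointers(a, axis, &count);
if (vecs != NULL) {
    #pragma omp parallel for schedule(dynamic)  /* OpenMP load-balances */
    for (k = 0; k < count; k++)
        process_vector(vecs[k], a->dimensions[axis], a->strides[axis]);
    free(vecs);
}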

Another thing I did when reimplementing lfilter was copy-in copy-out 
for strided arrays. This also helps a lot. It might seem paradoxical as 
we need to iterate over the strided array(s) to make the copies. But in 
practice it is faster by a factor of 2 or 3. Fortran compilers also tend 
to do this optimization for strided arrays. It is even mentioned in the 
Fortran standard as explicitly allowed, as it has consequences for 
array pointers (they can be invalidated by function calls if a 
contiguous copy is made).

Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Pauli Virtanen
Thu, 10 Jun 2010 18:48:04 +0200, Sturla Molden wrote:
[clip]
 5. Allow OpenMP pragmas in the core. If arrays are above a certain size,
 it should switch to multi-threading.

Some places where Openmp could probably help are in the inner ufunc 
loops. However, improving the memory efficiency of the data access 
pattern is another low-hanging fruit for multidimensional arrays.

I'm not sure how Openmp plays together with thread-safety; especially if 
used in outer constructs. Whether this is possible to do easily is not so 
clear, as Numpy uses e.g. iterator and loop objects, which cannot 
probably be shared between different openmp threads. So one would have to 
do some extra work to parallelize these parts manually.

But probably if multithreading is desired, openmp sounds like the way to 
go...

-- 
Pauli Virtanen

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread David Cournapeau
On Fri, Jun 11, 2010 at 7:25 AM, Sturla Molden stu...@molden.no wrote:
 On 10.06.2010 22:07, Travis Oliphant wrote:

 2. The core should be a plain DLL, loadable with ctypes. (I know David 
 Cournapeau and Robert Kern are going to hate this.) But if Python can have a 
 custom loader for .pyd files, so can NumPy for its core DLL. For ctypes we 
 just need to specify a fully qualified path to the DLL, which can be read 
 from a config file or whatever.

I am not sure why you say I would hate it - that's exactly what I
would like numpy core to be (a library). Ctypes is an implementation
detail. For reference counting, I am not sure we can get away with it
unless we use something like the Lua stack technique for the core. The big
issue with reference counting is how to deal with non-C VMs - there have
to be known good practices to make libraries reusable from, say, the CLI
or the JVM, but I don't know them.



 By the way: Is Cython mature enough to toss all C out the door?

Maybe not all, but I am convinced most of it could. Today, I would not
have rewritten lfilter in C, for example, I would have just used
cython. One issue with cython compared to C is that it makes
compilation much slower altogether, at least with gcc. I have never
been able to pin-point the exact issue (it does not seem like gcc
should be that slow compiling the generated C).

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Sturla Molden
On 10.06.2010 22:07, Travis Oliphant wrote:

 2. The core should be a plain DLL, loadable with ctypes. (I know David 
 Cournapeau and Robert Kern is going to hate this.) But if Python can have a 
 custom loader for .pyd files, so can NumPy for it's core DLL. For ctypes we 
 just need to specify a fully qualified path to the DLL, which can be read 
 from a config file or whatever.
  
 This approach does not build a new Python type in compiled code.   There are 
 speed disadvantages to this --- especially for the numpy scalars.

There are at least four speed penalties in what I suggested:

- the ctypes overhead is bigger than using Python's C API manually (or 
Cython).
- there is a speed penalty for scalars and small arrays. (Previously 
seen in numarray vs. numeric.)
- bytearray is a bit slower to create than to malloc a buffer.
- arrays with dtype=object

The advantages are:

- the code is easier to read and maintain (clean separation of Python and C)
- more portable code (no Python C API dependence)
- no obscure memory leaks to track down (bytearray instead of 
malloc/free and no manual ref counting).


By the way: Is Cython mature enough to toss all C out the door? Since 
Cython has syntax for PEP 3118 buffer objects, we should theoretically 
be able to implement NumPy in Cython. Then we don't have the speed 
penalty and no difficult C to maintain. But if the idea is to make a 
Python-independent C library from the core, it is probably a bit 
counterproductive...


Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Sturla Molden

On 11.06.2010 00:57, David Cournapeau wrote:

Do you have the code for this ? That's something I wanted to do, but
never took the time to do. Faster generic iterator would be nice, but
very hard to do in general.

   



/* this computes the start address for every vector along a dimension
(axis) of an ndarray */

typedef PyArrayObject ndarray;

inline char *datapointer(ndarray *a, npy_intp *indices)
{
    char *data = a->data;
    int i;
    npy_intp j;
    for (i=0; i < a->nd; i++) {
        j = indices[i];
        data += j * a->strides[i];
    }
    return data;
}

static double ** _get_axis_pointer(npy_intp *indices, int axis,
                   ndarray *a, double **axis_ptr_array, int dim)
{
    /* recursive axis pointer search for 4 dimensions or more */
    npy_intp i;
    double *axis_ptr;
    if (dim == a->nd) {
        /* done recursing dimensions,
           compute address of this axis */
        axis_ptr = (double *) datapointer(a, indices);
        *axis_ptr_array = axis_ptr;
        return (axis_ptr_array + 1);
    } else if (dim == axis) {
        /* use first element only */
        indices[dim] = 0;
        axis_ptr_array = _get_axis_pointer(indices, axis, a,
            axis_ptr_array, dim+1);
        return axis_ptr_array;
    } else {
        /* iterate range of indices */
        npy_intp length = a->dimensions[dim];
        for (i=0; i < length; i++) {
            indices[dim] = i;
            axis_ptr_array = _get_axis_pointer(indices, axis, a,
                axis_ptr_array, dim+1);
        }
        return axis_ptr_array;
    }
}

static double **get_axis_pointers(ndarray *a, int axis, npy_intp *count)
{
    npy_intp indices[NPY_MAXDIMS];
    double **out, **tmp;
    npy_intp i, j;
    j = 1;
    for (i=0; i < a->nd; i++) {
        if (i != axis)
            j *= a->dimensions[i];
    }
    *count = j;

    out = (double **) malloc(*count * sizeof(double*));
    if (out == NULL) {
        *count = 0;
        return NULL;
    }
    tmp = out;
    switch (a->nd) {
    /* for one dimension we just return the data pointer */
    case 1:
        *tmp = (double *)a->data;
        break;
    /* specialized for two dimensions */
    case 2:
        switch (axis) {
        case 0:
            for (i=0; i < a->dimensions[1]; i++)
                *tmp++ = (double *)(a->data + i*a->strides[1]);
            break;
        case 1:
            for (i=0; i < a->dimensions[0]; i++)
                *tmp++ = (double *)(a->data + i*a->strides[0]);
            break;
        }
        break;
    /* specialized for three dimensions */
    case 3:
        switch (axis) {
        case 0:
            for (i=0; i < a->dimensions[1]; i++)
                for (j=0; j < a->dimensions[2]; j++)
                    *tmp++ = (double *)(a->data +
                        i*a->strides[1] + j*a->strides[2]);
            break;
        case 1:
            for (i=0; i < a->dimensions[0]; i++)
                for (j=0; j < a->dimensions[2]; j++)
                    *tmp++ = (double *)(a->data +
                        i*a->strides[0] + j*a->strides[2]);
            break;
        case 2:
            for (i=0; i < a->dimensions[0]; i++)
                for (j=0; j < a->dimensions[1]; j++)
                    *tmp++ = (double *)(a->data +
                        i*a->strides[0] + j*a->strides[1]);
        }
        break;
    /* four or more dimensions: use recursion */
    default:
        for (i=0; i < a->nd; i++) indices[i] = 0;
        _get_axis_pointer(indices, axis, a, out, 0);
    }
    return out;
}



 Another thing I did when reimplementing lfilter was copy-in copy-out
 for strided arrays.

 What is copy-in copy-out? I am not familiar with this term?


Strided memory access is slow. So it often helps to make a temporary 
copy that is contiguous.


static void copy_in(double *restrict y, double *restrict x, npy_intp n,
    npy_intp s)
{
    npy_intp i;
    char *tmp;
    if (s == sizeof(double))
        memcpy(y, x, n*sizeof(double));
    else {
        tmp = (char *)x;
        for (i=0; i < n; i++) {
            *y++ = *x;
            tmp += s;
            x = (double *)tmp;
        }
    }
}

static void copy_out(double *restrict y, double *restrict x, npy_intp n,
    npy_intp s)
{
    npy_intp i;
    char *tmp;
    if (s == sizeof(double))
        memcpy(y, x, n*sizeof(double));
    else {
        tmp = (char *)y;
        for (i=0; i < n; i++) {
            *y = *x++;
            tmp += s;
            y = (double *)tmp;
        }
    }
}
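
Usage sketch: process one strided vector (row k, stepping along axis 1) 
through a contiguous scratch buffer; process_contiguous is a stand-in for 
the actual contiguous kernel:

void process_contiguous(double *v, npy_intp n);  /* stand-in kernel */

double scratch[1024];                        /* assume n <= 1024 here */
double *v = (double *)(a->data + k * a->strides[0]);
copy_in(scratch, v, n, a->strides[1]);       /* gather: strided -> dense */
process_contiguous(scratch, n);              /* fast contiguous kernel */
copy_out(v, scratch, n, a->strides[1]);      /* scatter back in place */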



Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Charles R Harris
On Thu, Jun 10, 2010 at 6:27 PM, Sturla Molden stu...@molden.no wrote:

  On 11.06.2010 00:57, David Cournapeau wrote:

  Do you have the code for this ? That's something I wanted to do, but
 never took the time to do. Faster generic iterator would be nice, but
 very hard to do in general.

  snip: Sturla's axis-pointer code, quoted in full in his message above

   Another thing I did when reimplementing lfilter was copy-in copy-out
  for strided arrays.

   What is copy-in copy-out? I am not familiar with this term?

  Strided memory access is slow. So it often helps to make a temporary copy
  that is contiguous.


But for an initial refactoring it probably falls in the category of
premature optimization. Another thing to avoid on the first go around is
micro-optimization, as it tends to complicate the code and often doesn't do
much for performance.

snip

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread David
On 06/11/2010 10:02 AM, Charles R Harris wrote:



 But for an initial refactoring it probably falls in the category of
 premature optimization. Another thing to avoid on the first go around is
 micro-optimization, as it tends to complicate the code and often doesn't
 do much for performance.

I agree it may be difficult to add this in the initial refactoring, 
but I don't think it falls into the micro optimization category.

The whole idea of converting strided buffers into temporary contiguous 
areas is already in ufunc, though. Maybe a core API to do just that 
would be useful. I am not familiar with the ufunc API at all, so I am 
not sure how it would go.

cheers,

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread David
On 06/11/2010 09:27 AM, Sturla Molden wrote:


 Strided memory access is slow. So it often helps to make a temporary
 copy that is contiguous.

Ah, ok, I did not know this was called copy-in/copy-out, thanks for the 
explanation. I agree this would be a good direction to pursue, but maybe 
out of scope for the first refactoring,

cheers,

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion



Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Sturla Molden
On 11.06.2010 04:19, David wrote:

 Ah, ok, I did not know this was called copy-in/copy-out, thanks for the
 explanation. I agree this would be a good direction to pursue, but maybe
 out of scope for the first refactoring,



Copy-in copy-out is actually an implementation detail in Fortran 
compilers. It has more to do with how subroutines are called. But the 
purpose is to turn a strided array into a contiguous one.

Sturla


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Sturla Molden
On 11.06.2010 03:02, Charles R Harris wrote:

 But for an initial refactoring it probably falls in the category of 
 premature optimization. Another thing to avoid on the first go around 
 is micro-optimization, as it tends to complicate the code and often 
 doesn't do much for performance.




First, to refactor NumPy we must get rid of array iterators as Python 
objects. Retaining them as Python objects defies the whole purpose. 
Second, creating an array of pointers has the advantage of being more 
friendly to OpenMP and multi-core CPUs. We don't need to serialize 
access to an iterator, and we allow OpenMP to do load balancing. For 
things like FFT on vectors along an axis in a multidimensional array 
(e.g. in multi-dimensional FFTs), this is not a micro-optimization. Not 
to mention that it is easier to iterate over an array than use a 
complicated iterator object in C.

Sturla
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NumPy re-factoring project

2010-06-10 Thread Charles R Harris
On Thu, Jun 10, 2010 at 8:40 PM, Sturla Molden stu...@molden.no wrote:

 On 11.06.2010 03:02, Charles R Harris wrote:
 
  But for an initial refactoring it probably falls in the category of
  premature optimization. Another thing to avoid on the first go around
  is micro-optimization, as it tends to complicate the code and often
  doesn't do much for performance.




 First, to refactor NumPy we must get rid of array iterators as Python
 objects. Retaining them as Python objects defies the whole purpose.
 Second, creating an array of pointers has the advantage of being more
 friendly to OpenMP and multi-core CPUs. We don't need to serialize
 access to an iterator, and we allow OpenMP to do load balancing. For
 things like FFT on vectors along an axis in a multidimensional array
 (e.g. in multi-dimensional FFTs), this is not a micro-optimization. Not
 to mention that it is easier to iterate over an array than use a
 complicated iterator object in C.


How does that work doing ffts along all axes? What about slicing and views,
axis rolling and transposes? Furthermore, if you are just going to add two
discontiguous arrays, or take the sine, making copies is wasted effort. I
think those things definitely fall in the category of optimizations for many
things and will just be a distraction from the real problems of the
refactoring.

When I say micro-optimization, I am not talking about such things, however.
Micro-optimization is such things as loop unrolling, which while it can
greatly speed up some things adds a lot of obfuscating code when the first
priority is to get things right. Another example would be trying to get a
tad more speed using pointers instead of indexes, an optimization which also
tends to be architecture dependent. Or using a ton of macros instead of
subroutines or inline functions. So on and so forth. I think in these things
the first priority is to have something readable and debuggable with clear
lines of control flow. This is easier said than done, of course.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion