Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-04-15 Thread Travis Oliphant
Greg Ewing wrote:

 But since the NumPy object has to know about the provider,
 it can simply pass the release call on to it if appropriate.
 I don't see how this case necessitates making the release call
 on a different object.

 I'm -1 on involving any other objects or returning object
 references from the buffer interface, unless someone can
 come up with a use case which actually *requires* this
 (as opposed to it just being something which might be
 nice to have). The buffer interface should be Blazingly
 Fast(tm), and messing with PyObject*s is not the way to
 get that.

The current proposal would still be fast, but it would also be more flexible: 
objects whose memory representation cannot be shared directly could create 
their own sharing object, one that perhaps copies the data into a contiguous 
chunk first.   Objects whose memory can be shared perfectly through the 
interface would simply pass themselves as the return value (after 
incrementing their count of extant buffers by one).  
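
As a rough sketch of that simple case (the struct fields and the simplified 
signature below are hypothetical, purely to illustrate an exporter acting as 
its own provider):

#include <Python.h>

/* Hypothetical exporter whose memory can be shared as-is; not a real API. */
typedef struct {
    PyObject_HEAD
    void *data;
    Py_ssize_t nbytes;
    int extant_buffers;          /* number of unreleased views */
} MyArrayObject;

static PyObject *
myarray_getbuffer(PyObject *obj, void **buf, Py_ssize_t *len, int *writeable)
{
    MyArrayObject *self = (MyArrayObject *)obj;
    *buf = self->data;
    *len = self->nbytes;
    *writeable = 1;
    self->extant_buffers++;      /* don't resize/realloc while this is > 0 */
    Py_INCREF(self);
    return (PyObject *)self;     /* the exporter is its own provider */
}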


 Seems to me the lock should apply to *everything* returned
 by getbuffer. If the requestor is going to iterate over the
 data, and there are multiple dimensions, surely it's going to
 want to refer to the shape and stride info at regular intervals
 while it's doing that. Requiring it to make its own copy
 would be a burden.


There are two use cases that seem to be under discussion.

1) When you want to apply an algorithm to an arbitrary object that 
exposes the buffer interface

2) When you want to create an object that shares memory with another 
object exposing the buffer interface.

These two use cases have slightly different needs.  What I want to avoid 
is forcing the exporting object to keep its shape and strides frozen just 
because another object is using the memory for use case #2. 

I think the solution that states the shape and strides information is 
only guaranteed valid until the GIL is released is sufficient.  

Alternatively, one could release the shape and strides and format 
separately from the memory with a flag as a second argument to 
releasebuffer.
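
A minimal sketch of that alternative (the flag names and values are made up 
purely to illustrate releasing the pieces separately; they are not part of 
the proposal):

/* Hypothetical two-argument releasebuffer. */
#define RELEASE_DATA   0x1   /* the memory itself                    */
#define RELEASE_SHAPE  0x2   /* the shape/strides/format information */
#define RELEASE_ALL    (RELEASE_DATA | RELEASE_SHAPE)

static void
example_releasebuffer(PyObject *exporter, int flags)
{
    (void)exporter;
    if (flags & RELEASE_SHAPE) {
        /* exporter may now re-assign or free its shape/strides/format */
    }
    if (flags & RELEASE_DATA) {
        /* exporter may now resize or reallocate the underlying memory */
    }
}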

-Travis







 -- 
 Greg





Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-04-15 Thread Travis Oliphant
Carl Banks wrote:

 ISTM that we are using the word "view" very differently.  Consider 
 this example:

 A = zeros((100,100))
 B = A.transpose()


You are thinking of NumPy's particular use case.  I'm thinking of a 
generic use case.  So, yes I'm using the word view in two different 
contexts.

In this scenario, NumPy does not even use the buffer interface.  It 
knows how to transpose its own objects and does so by creating a new 
NumPy object (with its own shape and strides space) whose data buffer 
points to A's data.

Yes, I use the word view for this NumPy usage, but only in the context 
of NumPy.   In the PEP, I've been using the word view quite a bit more 
generically.

So, I don't think this is a good example because A.transpose() will 
never call getbuffer of the A object (it will instead use the known 
structure of NumPy directly).  So, let's talk about the generic 
situation instead of the NumPy specific one.


 I'd suggest the object returned by A.getbuffer should be called the 
 buffer provider or something like that.

I don't care what we call it.  I've been using the word "view" because 
of the obvious analogy to my use of "view" in NumPy.  When I had 
envisioned returning an actual object very similar to a NumPy array from 
the buffer interface, it made a lot of sense to call it a view.  Now, I'm 
fine calling it the "buffer provider".


 For the sake of discussion, I'm going to avoid the word view 
 altogether.  I'll call A the exporter, as before.  B I'll refer to as 
 the requestor.  The object returned by A.getbuffer is the provider.

Fine.  Let's use that terminology since it is new and not cluttered by 
other uses in other contexts.

 Having thought quite a bit about it, and having written several 
 abortive replies, I now understand it and see the importance of it.  
 getbuffer returns the object that you are to call releasebuffer on.  
 It may or may not be the same object as exporter.  Makes sense, is 
 easy to explain.

Yes, that's exactly all I had considered it to be.   Only now, I'm 
wondering if we need to explicitly release a lock on the shape, strides, 
and format information as well as the buffer location information.


 It's easy to see possible use cases for returning a different object.  
 A  hypothetical future incarnation of NumPy might shift the 
 responsibility of managing buffers from NumPy array object to a hidden 
 raw buffer object.  In this scenario, the NumPy object is the 
 exporter, but the raw buffer object the provider.

 Considering this use case, it's clear that getbuffer should return the 
 shape and stride data independently of the provider.  The raw buffer 
 object wouldn't have that information; all it does is store a pointer 
 and keep a reference count.  Shape and stride is defined by the exporter.


So, who manages the memory for the shape, strides, and isptr arrays?   
When a provider is created, do these need to be created as well, so that 
the shape and strides arrays are never deallocated while in use? 

The situation I'm considering is a NumPy array of shape (2,3,3) from 
which you obtain a provider (presumably from another package) that 
retains a lock on the memory for a while.  Should it also retain a lock 
on the shape and strides arrays?   Can the NumPy array re-assign its 
shape and strides while the provider has still not been released?

I would like to say yes, which means that the provider must supply its 
own copy of the shape and strides arrays.  This could be the policy.  
Namely, that the provider must supply the memory for the shape, strides, 
and format arrays, which is guaranteed for as long as a lock is held.  In 
the case of NumPy, the provider could create its own copy of the shape 
and strides arrays (or do so when the shape and strides arrays are 
re-assigned).
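
A sketch of what that policy could look like on the provider side (the 
exporter struct and its fields are assumptions for illustration, not 
NumPy's actual internals):

typedef struct {
    int ndim;
    Py_ssize_t *shape;
    Py_ssize_t *strides;
    /* ... data pointer, lock counts, etc. ... */
} ArrayExporter;   /* hypothetical exporter internals */

/* The provider hands out private copies of shape and strides that stay
   valid for as long as the lock is held; they are freed in releasebuffer. */
static int
export_shape_strides(ArrayExporter *self,
                     Py_ssize_t **shape, Py_ssize_t **strides)
{
    int i;
    Py_ssize_t *copy = PyMem_Malloc(2 * self->ndim * sizeof(Py_ssize_t));
    if (copy == NULL)
        return -1;
    for (i = 0; i < self->ndim; i++) {
        copy[i]              = self->shape[i];
        copy[self->ndim + i] = self->strides[i];
    }
    *shape   = copy;
    *strides = copy + self->ndim;
    return 0;
}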


 Second question: what happens if a view wants to re-export the 
 buffer? Do the views of the buffer ever change?  Example, say you 
 create a transposed view of a Numpy array.  Now you want a slice of 
 the transposed array.  What does the transposed view's getbuffer 
 export?


 Basically, you could not alter the internal representation of the 
 object while views which depended on those values were around.

 In NumPy, a transposed array actually creates a new NumPy object that 
 refers to the same data but has its own shape and strides arrays.

 With the new buffer protocol, the NumPy array would not be able to 
 alter its shape/strides or reallocate its data areas while views 
 were being held by other objects.


 But requestors could alter their own copies of the data, no?  Back to 
 the transpose example: B itself obviously can't use the same strides 
 array as A uses.  It would have to create its own strides, right?


I don't like this example because B does have its own strides: it is a 
complete NumPy array.   I think we are talking about the same thing, 
namely, who manages the memory for the shape and strides 
(and format). 

I think the 

Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-26 Thread Carl Banks
(cc'ing back to Python-dev; the original reply was intended for it but I 
had an email malfunction.)

Travis Oliphant wrote:
 Carl Banks wrote:
 3. Allow getbuffer to return an array of derefence offsets, one for 
 each dimension.  For a given dimension i, if derefoff[i] is 
 nonnegative, it's assumed that the current position (base pointer + 
 indexing so far) is a pointer to a subarray, and derefoff[i] is the 
 offest in that array where the current position goes for the next 
 dimension.  If derefoff[i] is negative, there is no dereferencing.  
 Here is an example of how it'd work:
 
 
 This sounds interesting, but I'm not sure I totally see it.  I probably 
 need a picture to figure out what you are proposing. 

I'll get on it sometime.  For now I hope an example will do.


 The derefoff 
 sounds like some-kind of offset.   Is that enough?  Why not just make 
 derefoff[i] == 0 instead of negative?

I may have misunderstood something.  I had thought the values exported 
by getbuffer could change as the view narrowed, but I'm not sure if it's 
the case now.  I'll assume it isn't for now, because it simplifies 
things and demonstrates the concept better.

Let's start from the beginning.  First, change the prototype to this:

 typedef PyObject *(*getbufferproc)(PyObject *obj, void **buf,
                                    Py_ssize_t *len, int *writeable,
                                    char **format, int *ndims,
                                    Py_ssize_t **shape,
                                    Py_ssize_t **strides,
                                    int **isptr)

isptr is a flag indicating whether, for a certain dimension, the 
position we've strided to so far is a pointer that should be followed 
before proceeding with the rest of the strides.

Now here's what a general get_item_pointer function would look like, 
given a set of indices:

void* get_item_pointer(int ndim, void* buf, Py_ssize_t* strides,
                       int* isptr, Py_ssize_t* indices) {
    char* pointer = (char*)buf;
    int i;
    for (i = 0; i < ndim; i++) {
        pointer += strides[i]*indices[i];
        if (isptr[i]) {
            /* the position we have strided to holds a pointer to the
               subarray containing the data for the remaining dimensions;
               follow it before continuing */
            pointer = *(char**)pointer;
        }
    }
    return (void*)pointer;
}


 I don't fully understand the PIL example you gave.

Yeah.  How about more details?  Here is a hypothetical image data object 
structure:

struct rgba {
 unsigned char r, g, b, a;
};

struct ImageObject {
 PyObject_HEAD;
 ...
 struct rgba** lines;
 Py_ssize_t height;
 Py_ssize_t width;
 Py_ssize_t shape_array[2];
 Py_ssize_t stride_array[2];
 Py_ssize_t view_count;
};

lines points to a malloced 1-D array of (struct rgba*).  Each pointer in 
THAT block points to a separately malloced array of (struct rgba).  Got 
that?
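
For concreteness, that layout might be built roughly like this (a sketch 
only; error cleanup is omitted):

/* Allocate a block of row pointers, each pointing at its own row of pixels. */
static int
alloc_lines(struct ImageObject *img)
{
    Py_ssize_t y;
    img->lines = malloc(img->height * sizeof(struct rgba *));
    if (img->lines == NULL)
        return -1;
    for (y = 0; y < img->height; y++) {
        img->lines[y] = malloc(img->width * sizeof(struct rgba));
        if (img->lines[y] == NULL)
            return -1;    /* a real implementation would free what it got */
    }
    return 0;
}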

In order to access, say, the red value of the pixel at x=30, y=50, you'd 
use lines[50][30].r.

So what does ImageObject's getbuffer do?  Leaving error checking out:

PyObject* getbuffer(PyObject *obj, void **buf, Py_ssize_t *len,
                    int *writeable, char **format, int *ndims,
                    Py_ssize_t **shape, Py_ssize_t **strides,
                    int **isptr) {

    struct ImageObject *self = (struct ImageObject *)obj;
    static int _isptr[2] = { 1, 0 };

    *buf = self->lines;
    *len = self->height*self->width;
    *writeable = 1;
    *ndims = 2;
    self->shape_array[0] = self->height;
    self->shape_array[1] = self->width;
    *shape = self->shape_array;
    self->stride_array[0] = sizeof(struct rgba*);  /* yep */
    self->stride_array[1] = sizeof(struct rgba);
    *strides = self->stride_array;
    *isptr = _isptr;
    /* (*format would describe one struct rgba element; omitted here) */

    self->view_count++;
    /* create and return view object here, but for what? */
}


There are three essential differences from a regular, contiguous array.

1. buf is set to point at the array of pointers, not directly to the data.

2. The isptr thing.  isptr[0] is true to indicate that the first 
dimension is an array of pointers, not the actual data.

3. stride[0] is sizeof(struct rgba*), not self->width*sizeof(struct 
rgba) like it would be for a contiguous array.  This is because your 
first stride is through an array of pointers, not the data itself.


So let's examine what get_item_pointer above will do given these 
values.  Once again, we're looking for the pixel at x=30, y=50.

First, we set pointer to buf, that is, self->lines.

Then we take the first stride: we add indices[0]*strides[0], that is, 
50*4=200, to pointer.  pointer now points at self->lines[50].

Now, we check isptr[0].  We see that it is true.  Thus, the position 
we've strided to is, in fact, a pointer to a subarray where the actual 
data is.  So we follow it: pointer = *pointer.  pointer now equals 
self->lines[50], which is the address of self->lines[50][0].

Next dimension.  We take the second stride: we add indices[1]*strides[1], 
that is, 30*4=120, to pointer.  pointer now points at self->lines[50][30].

Now, we check isptr[1].  It's false.  No dereferencing this step.

We're done.  Return pointer.
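
Putting the walk above back into code, the whole lookup is one call to the 
get_item_pointer function defined earlier (assuming buf, strides, and isptr 
hold whatever getbuffer filled in):

/* x = 30, y = 50: indices are given slowest-varying dimension first. */
Py_ssize_t indices[2] = { 50, 30 };
struct rgba *pixel =
    (struct rgba *)get_item_pointer(2, buf, strides, isptr, indices);
unsigned char red = pixel->r;    /* same element as self->lines[50][30].r */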

Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-26 Thread Travis Oliphant
Carl Banks wrote:
 We're done.  Return pointer.

Thank you for this detailed example.  I will have to parse it in more 
depth but I think I can see what you are suggesting.

 
 First, I'm not sure why getbuffer needs to return a view object. 

The view object in your case would just be the ImageObject.  The reason 
I was thinking the function should return something is to provide more 
flexibility in what a view object actually is.

I've also been going back and forth between explicitly passing all this 
information around or placing it in an actual view-object.  In some 
sense, a view object is a NumPy array in my mind.  But, with the 
addition of isptr we are actually expanding the memory abstraction of 
the view object beyond an explicit NumPy array.

In the most common case, I envisioned the view object would just be the 
object itself in which case it doesn't actually have to be returned. 
While returning the view object would allow unspecified flexibility in 
the future, it really adds nothing to the current vision.

We could add a view object separately as an abstract API on top of the 
buffer interface.

 
 
 Second question: what happens if a view wants to re-export the buffer? 
 Do the views of the buffer ever change?  Example, say you create a 
 transposed view of a Numpy array.  Now you want a slice of the 
 transposed array.  What does the transposed view's getbuffer export?

Basically, you could not alter the internal representation of the object 
while views which depended on those values were around.

In NumPy, a transposed array actually creates a new NumPy object that 
refers to the same data but has its own shape and strides arrays.

With the new buffer protocol, the NumPy array would not be able to alter 
its shape/strides or reallocate its data areas while views were being 
held by other objects.

With the shape and strides information, the format information, and the 
data buffer itself, there are actually several pieces of memory that may 
need to be protected because they may be shared with other objects. 
This makes me wonder if releasebuffer should contain an argument which 
states whether or not to release the memory, the shape and strides 
information, the format information, or all of it.

Having such a thing as a view object would actually be nice because it 
could hold on to a particular view of data with a given set of shape and 
strides (whose memory it owns itself) and then the exporting object 
would be free to change its shape/strides information as long as the 
data did not change.

 
 The reason I ask is: if things like buf and strides and shape 
 could change when a buffer is re-exported, then it can complicate things 
 for PIL-like buffers.  (How would you account for offsets in a dimension 
 that's in a subarray?)

I'm not sure what you mean, offsets are handled by changing the starting 
location of the pointer to the buffer.

-Travis


Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-26 Thread Carl Banks
Travis Oliphant wrote:
 Carl Banks wrote:
 We're done.  Return pointer.
 
 Thank you for this detailed example.  I will have to parse it in more 
 depth but I think I can see what you are suggesting.
 
 First, I'm not sure why getbuffer needs to return a view object. 
 
 The view object in your case would just be the ImageObject.  

ISTM that we are using the word "view" very differently.  Consider this 
example:

A = zeros((100,100))
B = A.transpose()

In this scenario, A would be the exporter object, I think we both would 
call it that.  When I use the word view, I'm referring to B.  However, 
you seem to be referring to the object returned by A.getbuffer, right? 
What term have you been using to refer to B?  Obviously, it would help 
the discussion if we could get our terminology straight.

(Frankly, I don't agree with your usage; it doesn't agree with other 
uses of the word view.  For example, consider the proposed Python 3000 
dictionary views:

D = dict()
V = D.items()

Here, V is the view, and it's analogous to B in the above example.)

I'd suggest the object returned by A.getbuffer should be called the 
buffer provider or something like that.

For the sake of discussion, I'm going to avoid the word view 
altogether.  I'll call A the exporter, as before.  B I'll refer to as 
the requestor.  The object returned by A.getbuffer is the provider.


  The reason
  I was thinking the function should return something is to provide more
  flexibility in what a view object actually is.
 
 I've also been going back and forth between explicitly passing all this 
 information around or placing it in an actual view-object.  In some 
 sense, a view object is a NumPy array in my mind.  But, with the 
 addition of isptr we are actually expanding the memory abstraction of 
 the view object beyond an explicit NumPy array.

 In the most common case, I envisioned the view object would just be the 
 object itself in which case it doesn't actually have to be returned. 
 While returning the view object would allow unspecified flexibility in 
 the future, it really adds nothing to the current vision.
 
 We could add a view object separately as an abstract API on top of the 
 buffer interface.

Having thought quite a bit about it, and having written several abortive 
replies, I now understand it and see the importance of it.  getbuffer 
returns the object that you are to call releasebuffer on.  It may or may 
not be the same object as exporter.  Makes sense, is easy to explain.

It's easy to see possible use cases for returning a different object.  A 
  hypothetical future incarnation of NumPy might shift the 
responsibility of managing buffers from NumPy array object to a hidden 
raw buffer object.  In this scenario, the NumPy object is the exporter, 
but the raw buffer object the provider.

Considering this use case, it's clear that getbuffer should return the 
shape and stride data independently of the provider.  The raw buffer 
object wouldn't have that information; all it does is store a pointer 
and keep a reference count.  Shape and stride is defined by the exporter.


 Second question: what happens if a view wants to re-export the buffer? 
 Do the views of the buffer ever change?  Example, say you create a 
 transposed view of a Numpy array.  Now you want a slice of the 
 transposed array.  What does the transposed view's getbuffer export?
 
 Basically, you could not alter the internal representation of the object 
 while views which depended on those values were around.

 In NumPy, a transposed array actually creates a new NumPy object that 
 refers to the same data but has its own shape and strides arrays.
 
 With the new buffer protocol, the NumPy array would not be able to alter 
 its shape/strides or reallocate its data areas while views were being 
 held by other objects.

But requestors could alter their own copies of the data, no?  Back to 
the transpose example: B itself obviously can't use the same strides 
array as A uses.  It would have to create its own strides, right?

So, what if someone takes a slice out of B?  When calling B.getbuffer,
does it get B's strides, or A's?

I think it should get B's.  After all, if you're taking a slice of B, 
don't you care about the slicing relative to B's axes?  I can't really 
think of a use case for exporting A's stride data when you take a slice 
of B, and it doesn't seem to simplify memory management, because B has 
to make its own copies anyway.


 With the shape and strides information, the format information, and the 
 data buffer itself, there are actually several pieces of memory that may 
 need to be protected because they may be shared with other objects. 
 This makes me wonder if releasebuffer should contain an argument which 
 states whether or not to release the memory, the shape and strides 
 information, the format information, or all of it.

Here's what I think: the lock should only apply to the buffer itself, 
and not to shape and stride data at all.  If the requestor wants to keep 
its own copies of the data, it would have to malloc its own storage for 
it.  I expect that this would be very rare.

Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-26 Thread Greg Ewing
Carl Banks wrote:

 It's easy to see possible use cases for returning a different object.  A 
   hypothetical future incarnation of NumPy might shift the 
 responsibility of managing buffers from NumPy array object to a hidden 
 raw buffer object.  In this scenario, the NumPy object is the exporter, 
 but the raw buffer object the provider.

But since the NumPy object has to know about the provider,
it can simply pass the release call on to it if appropriate.
I don't see how this case necessitates making the release call
on a different object.

I'm -1 on involving any other objects or returning object
references from the buffer interface, unless someone can
come up with a use case which actually *requires* this
(as opposed to it just being something which might be
nice to have). The buffer interface should be Blazingly
Fast(tm), and messing with PyObject*s is not the way to
get that.

  This makes me wonder if releasebuffer should contain an argument which 
  states whether or not to release the memory, the shape and strides 
  information, the format information, or all of it.

 Here's what I think: the lock should only apply to the buffer itself, 
 and not to shape and stride data at all.  If the requestor wants to keep
 its own copies of the data, it would have to malloc its own storage for 
 it.  I expect that this would be very rare.

Seems to me the lock should apply to *everything* returned
by getbuffer. If the requestor is going to iterate over the
data, and there are multiple dimensions, surely it's going to
want to refer to the shape and stride info at regular intervals
while it's doing that. Requiring it to make its own copy
would be a burden.

--
Greg


Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-26 Thread Carl Banks
Travis Oliphant wrote:
 Carl Banks wrote:
 
 ITSM that we are using the word view very differently.  Consider 
 this example:

 A = zeros((100,100))
 B = A.transpose()
 
 
 You are thinking of NumPy's particular use case.  I'm thinking of a 
 generic use case.  So, yes I'm using the word view in two different 
 contexts.
 
 In this scenario, NumPy does not even use the buffer interface.  It 
 knows how to transpose its own objects and does so by creating a new 
 NumPy object (with its own shape and strides space) whose data buffer 
 points to A's data.

I realized that as soon as I tried a simple Python demonstration of it. 
  So it's a poor example.  But I hope it's obvious how it would 
generalize to a different type.


 Having such a thing as a view object would actually be nice because 
 it could hold on to a particular view of data with a given set of 
 shape and strides (whose memory it owns itself) and then the 
 exporting object would be free to change its shape/strides 
 information as long as the data did not change.


 What I don't understand is why it's important for the provider to 
 retain this data.  The requestor only needs the information when it's 
 created; it will calculate its own versions of the data, and will not 
 need the originals again, so there is no need for the provider to keep them around.
 
 That is certainly a policy we could enforce (and pretty much what I've 
 been thinking).  We just need to make it explicit that the shape and 
 strides provided are only guaranteed up until a GIL release (i.e. 
 arbitrary Python code could change both the content and location of these 
 memory areas), and so if you need them later, make your own copies.
 
 If this were the policy, then NumPy could simply pass pointers to its 
 stored shape and strides arrays when the buffer interface is called but 
 then not worry about re-allocating these arrays before the buffer lock 
 is released.   Another object could hold on to the memory area of the 
 NumPy array but would have to store shape and strides information if it 
 wanted to keep it.
 NumPy could also just pass a pointer to the char * representation of the 
 format (which in NumPy would be stored in the dtype object) and would 
 not have to worry about the dtype being re-assigned later.

Bingo!  This is my preference.
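
A consumer following that policy would copy the pointers' contents right 
away, something like this sketch (the helper name and error handling are 
illustrative only):

/* Copy the exporter's shape/strides so they remain valid after the GIL
   is released; ndims, shape and strides came from the getbuffer call. */
static int
copy_view_info(int ndims, const Py_ssize_t *shape, const Py_ssize_t *strides,
               Py_ssize_t **my_shape, Py_ssize_t **my_strides)
{
    *my_shape   = PyMem_Malloc(ndims * sizeof(Py_ssize_t));
    *my_strides = PyMem_Malloc(ndims * sizeof(Py_ssize_t));
    if (*my_shape == NULL || *my_strides == NULL)
        return -1;
    memcpy(*my_shape,   shape,   ndims * sizeof(Py_ssize_t));
    memcpy(*my_strides, strides, ndims * sizeof(Py_ssize_t));
    return 0;
}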


 The reason I ask is: if things like buf and strides and shape 
 could change when a buffer is re-exported, then it can complicate 
 things for PIL-like buffers.  (How would you account for offsets in 
 a dimension that's in a subarray?)


 I'm not sure what you mean, offsets are handled by changing the 
 starting location of the pointer to the buffer.



 But to answer your question: you can't just change the starting 
 location because there's hundreds of buffers.  You'd either have to 
 change the starting location of each one of them, which is highly 
 undesirable, or to factor in an offset somehow.  (This was, in fact, 
 the point of the derefoff term in my original suggestion.)
 
 
 I understand better what your derefoff was doing now.  I was missing the 
 de-referencing that was going on.   Couldn't you still just store a 
 pointer to the start of the array?  In other words, isn't your **isptr  
 suggestion sufficient?   It seems to be.

No.  The problem arises when slicing.  In a single buffer, you would 
adjust the base pointer to point at the element [0,0] of the slice.  But 
you can't do that with multiple buffers.  Instead, you have to add an 
offset after dereferencing the pointer to the subarray.

Hence my derefoff proposal.  It dereferenced the pointer, then added an 
offset to get you to the 0 position in that dimension.
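
For illustration, a derefoff-style lookup would differ from the isptr 
version roughly like this (a sketch of the idea, not an agreed interface):

/* derefoff[i] < 0: no dereference in dimension i.
   derefoff[i] >= 0: follow the pointer found at the current position,
   then add derefoff[i] -- which is what lets a slice of a PIL-style
   buffer start partway into each subarray. */
void *get_item_pointer_derefoff(int ndim, void *buf, Py_ssize_t *strides,
                                Py_ssize_t *derefoff, Py_ssize_t *indices)
{
    char *pointer = (char *)buf;
    int i;
    for (i = 0; i < ndim; i++) {
        pointer += strides[i] * indices[i];
        if (derefoff[i] >= 0) {
            pointer = *(char **)pointer;
            pointer += derefoff[i];
        }
    }
    return (void *)pointer;
}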


 Anyways, despite the miscommunications so far, I now have a very good 
 idea of what's going on.  We definitely need to get terms straight.  I 
 agree that getbuffer should return an object.  I think we need to 
 think harder about the case when requestors re-export the buffer.  
 (Maybe it's time to whip up some experimental objects?)
 
 I'm still not clear what you are concerned about.   If an object 
 consumes the buffer interface and then wants to be able to later export 
 it to another, then from our discussion about the shape/strides and 
 format information, it would have to maintain its own copies of these 
 things, because it could not rely on the original provider (or exporter) 
 to keep them around once the GIL is released.

Right.  So, if someone calls getbuffer, it would send its own copies of 
the buffer information, and not the original exporter's.  The values 
returned by getbuffer can vary for a given buffer, depending on the 
exporter.  Which means the data returned by getbuffer could reflect 
slicing.  Which means the isptr array is not sufficient for the 
PIL-style multiple buffers.


 This is the reason we would have to be very clear about the guaranteed 
 persistence of the shape/strides and format memory whose pointers are 
 returned through the proposed buffer interface.
 
 

Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-23 Thread Neil Hodgson
   I have developed a split vector type that implements the buffer protocol at
http://scintilla.sourceforge.net/splitvector-1.0.zip

   It acts as a mutable string implementing most of the sequence
protocol as well as the buffer protocol. splitvector.SplitVector('c')
creates a vector containing 8 bit characters and
splitvector.SplitVector('u') is for Unicode.

   A writable attribute bufferAppearence can be set to 0 (default) to
respond to buffer protocol calls by moving the gap to the end and
returning the address of all of the data. Setting bufferAppearence to
1 responds as a two segment buffer. I haven't found any code that
understands responding with two segments. sre and file.write handle
SplitVector fine when it responds as a single segment:

import re, splitvector
x = splitvector.SplitVector("c")
x[:] = "The life of brian"
r = re.compile("l[a-z]*", re.M)
print x
y = r.search(x)
print y.group(0)
x.bufferAppearence = 1
y = r.search(x)
print y.group(0)

   produces

The life of brian
life
Traceback (most recent call last):
  File "qt.py", line 9, in <module>
    y = r.search(x)
TypeError: expected string or buffer

   It is likely that adding multi-segment ability to sre would
complexify and slow it down. OTOH multi-segment buffers may be
well-suited to scatter/gather I/O calls like writev.

   Neil


Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-22 Thread Travis Oliphant
Greg Ewing wrote:
 Travis Oliphant wrote:
 
 
I'm talking about arrays of pointers to other arrays:

i.e. if somebody defined in C

float B[10][20]

then B would be an array of pointers to arrays of floats.
 
 
 No, it wouldn't, it would be a contiguously stored
 2-dimensional array of floats. An array of pointers
 would be
 
float *B[10];
 
 followed by code to allocate 10 arrays of 20 floats
 each and initialise B to point to them.
 

You are right, of course, that example was not correct.  I think the 
point is still valid, though.   One could still use the shape to 
indicate how many levels of pointers-to-pointers there are (i.e. how 
many pointer dereferences are needed to select out an element).  Further 
dimensionality could then be reported in the format string.

This would not be hard to allow.  It also would not be hard to write a 
utility function to copy such shared memory into a contiguous segment to 
provide a C-API that allows casual users to avoid the details of memory 
layout when they are writing an algorithm that just uses the memory.

 I can imagine cases like that coming up in practice.
 For example, an image object might store its data
 as four blocks of memory for R, G, B and A planes,
 each of which is a contiguous 2d array with shape
 and stride -- but you want to view it as a 3d
 array byte[plane][x][y].

All we can do is have the interface actually be able to describe its 
data.  Users would have to take that information and write code 
accordingly.

In this case, for example, one possibility is that the object would 
raise an error if strides were requested.  It would also raise an error 
if contiguous data was requested (or I guess it could report the R 
channel only if it wanted to).   Only if segments were requested could 
it return an array of pointers to the four memory blocks.  It could then 
report itself as a 2-d array of shape (4, H), where H is the height. 
Each element of the array would be reported as "%sB" % W, where W is the 
width of the image (i.e. each element of the 2-d array would be a 1-d 
array of length W).

Alternatively, it could report itself as a 1-d array of shape (4,) with 
elements "(H,W)B".

A user would have to write the algorithm correctly in order to access 
the memory correctly.

Alternatively, a utility function that copies into a contiguous buffer 
would allow the consumer to not care about exactly how the memory is 
laid out.
to figure it out and do the right thing for each exporter.  This 
flexibility would not be available if we don't allow for segmented 
memory in the buffer interface.

So, I don't think it's that hard to at least allow the multiple-segment 
idea into the buffer interface (as long as all the segments are the same 
size, mind you).  It's only one more argument to the getbuffer call.


-Travis



[Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Travis Oliphant

I'm soliciting feedback on my extended buffer protocol that I am 
proposing for inclusion in Python 3000.  As soon as the Python 3.0 
implementation is complete, I plan to back-port the result to Python 
2.6, therefore, I think there may be some interest on this list.

Basically, the extended buffer protocol seeks to allow memory sharing with

1) information about what is in the memory (float, int, C-structure, etc.)
2) information about the shape of the memory (if any)

3) information about discontiguous memory segments


Number 3 is where I could use feedback --- especially from PIL users and 
developers.   Strides are a common way to think about a possibly 
discontiguous chunk of memory (which appear in NumPy when you select a 
sub-region from a larger array). The strides vector tells you how many 
bytes to skip in each dimension to get to the next memory location for 
that dimension.

Because NumPy uses this memory model as do several compute libraries 
(like BLAS and LAPACK), it makes sense to allow this memory model to be 
shared between objects in Python.

Currently, the proposed buffer interface eliminates the multi-segment 
option (for Python 3.0) which I think was originally put in place 
because of the PIL.   However, I don't know if it is actually used by 
any extension types.  This is a big reason why Guido wants to drop the 
multi-segment interface option.

The question is should we eliminate the possibility of sharing memory 
for objects that store data basically as arrays of arrays (i.e. true 
C-style arrays).  That is what I'm currently proposing, but I could also 
see an argument that states that if we are going to support strided 
memory access, we should also support array of array memory access.

If this is added, then it would be another function call that gets an 
array-of-array-style memory from the object.  What do others think of 
these ideas?


One possible C-API call that Python could grow with the current buffer 
interface is to allow contiguous-memory mirroring of discontiguous 
memory, or an iterator object that iterates through every element of any 
object that exposes the buffer protocol.


Thanks for any feedback,

-Travis Oliphant





Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Travis Oliphant

Attached is the PEP.

:PEP: XXX
:Title: Revising the buffer protocol
:Version: $Revision: $
:Last-Modified: $Date:  $
:Author: Travis Oliphant [EMAIL PROTECTED]
:Status: Draft
:Type: Standards Track
:Content-Type: text/x-rst
:Created: 28-Aug-2006
:Python-Version: 3000

Abstract


This PEP proposes re-designing the buffer API (PyBufferProcs
function pointers) to improve the way Python allows memory sharing
in Python 3.0.

In particular, it is proposed that the multiple-segment and
character buffer portions of the buffer API be eliminated and
additional function pointers be provided to allow sharing any
multi-dimensional nature of the memory and what data-format the
memory contains.

Rationale
=

The buffer protocol allows different Python types to exchange a
pointer to a sequence of internal buffers.  This functionality is
*extremely* useful for sharing large segments of memory between
different high-level objects, but it is too limited and has issues.

1. There is the little (never?) used sequence-of-segments option
   (bf_getsegcount)

2. There is the apparently redundant character-buffer option
   (bf_getcharbuffer)

3. There is no way for a consumer to tell the buffer-API-exporting
   object it is finished with its view of the memory and
   therefore no way for the exporting object to be sure that it is
   safe to reallocate the pointer to the memory that it owns (for
   example, the array object reallocating its memory after sharing
   it with the buffer object which held the original pointer led
   to the infamous buffer-object problem).

4. Memory is just a pointer with a length. There is no way to
   describe what is in the memory (float, int, C-structure, etc.)

5. There is no shape information provided for the memory.  But,
   several array-like Python types could make use of a standard
   way to describe the shape-interpretation of the memory
   (wxPython, GTK, pyQT, CVXOPT, PyVox, Audio and Video
   Libraries, ctypes, NumPy, data-base interfaces, etc.)

6. There is no way to share discontiguous memory (except through
   the sequence of segments notion).  

   There are two widely used libraries that use the concept of
   discontiguous memory: PIL and NumPy.  Their view of discontiguous
   arrays is different, though.  This buffer interface allows
   sharing of either memory model.  Exporters will only use one 
   approach and consumers may choose to support discontiguous 
   arrays of each type however they choose. 

   NumPy uses the notion of constant striding in each dimension as its
   basic concept of an array. With this concept, a simple sub-region
   of a larger array can be described without copying the data.
   Thus, stride information is the additional information that must be
   shared. 

   The PIL uses a more opaque memory representation. Sometimes an
   image is contained in a contiguous segment of memory, but sometimes
   it is contained in an array of pointers to the contiguous segments
   (usually lines) of the image.  The PIL is where the idea of multiple
   buffer segments in the original buffer interface came from. 
  

   NumPy's strided memory model is used more often in computational
   libraries and because it is so simple it makes sense to support
   memory sharing using this model.  The PIL memory model is used often
   in C-code where a 2-d array can be then accessed using double
   pointer indirection:  e.g. image[i][j].  

   The buffer interface should allow the object to export either of these
   memory models.  Consumers are free to either require contiguous memory
   or write code to handle either memory model.  

Proposal Overview
=

* Eliminate the char-buffer and multiple-segment sections of the
  buffer-protocol.

* Unify the read/write versions of getting the buffer.

* Add a new function to the interface that should be called when
  the consumer object is done with the view.

* Add a new variable to allow the interface to describe what is in
  memory (unifying what is currently done now in struct and
  array)

* Add a new variable to allow the protocol to share shape information

* Add a new variable for sharing stride information

* Add a new mechanism for sharing array of arrays. 

* Fix all objects in the core and the standard library to conform
  to the new interface

* Extend the struct module to handle more format specifiers

Specification
=

Change the PyBufferProcs structure to

::

 typedef struct {
      getbufferproc bf_getbuffer;
      releasebufferproc bf_releasebuffer;
 }

::

typedef PyObject *(*getbufferproc)(PyObject *obj, void **buf,
   Py_ssize_t *len, int *writeable,
   char **format, int *ndims,
   Py_ssize_t **shape,
   Py_ssize_t **strides,
   void **segments)

All variables except the 

Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Neil Hodgson
Travis Oliphant:

 3) information about discontiguous memory segments


 Number 3 is where I could use feedback --- especially from PIL users and
 developers.   Strides are a common way to think about a possibly
 discontiguous chunk of memory (which appear in NumPy when you select a
 sub-region from a larger array). The strides vector tells you how many
 bytes to skip in each dimension to get to the next memory location for
 that dimension.

   I think one of the motivations for discontiguous segments was for
split buffers which are commonly used in text editors. A split buffer
has a gap in the middle where insertions and deletions can often occur
without moving much memory. When an insertion or deletion is required
elsewhere then the gap is first moved to that position. I have long
intended to implement a good split buffer extension for Python but the
best I have currently is an extension written using Boost.Python which
doesn't implement the buffer interface. Here is a description of split
buffers:

http://www.cs.cmu.edu/~wjh/papers/byte.html
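
   For reference, the core of a split (gap) buffer is a single allocation
with an unused gap in the middle; a minimal sketch:

/* Logical content is body[0, gap_start) followed by body[gap_end, size).
   Inserting at gap_start is cheap; editing elsewhere first moves the gap. */
typedef struct {
    char  *body;        /* one malloced block        */
    size_t size;        /* total allocated size      */
    size_t gap_start;   /* first unused byte         */
    size_t gap_end;     /* first used byte after gap */
} SplitBuffer;

/* Exposed through the buffer interface, this is naturally two segments:
   body[0 .. gap_start) and body[gap_end .. size). */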

   Neil


Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Guido van Rossum
On 3/21/07, Neil Hodgson [EMAIL PROTECTED] wrote:
 Travis Oliphant:

  3) information about discontiguous memory segments
 
 
  Number 3 is where I could use feedback --- especially from PIL users and
  developers.   Strides are a common way to think about a possibly
  discontiguous chunk of memory (which appear in NumPy when you select a
  sub-region from a larger array). The strides vector tells you how many
  bytes to skip in each dimension to get to the next memory location for
  that dimension.

I think one of the motivations for discontiguous segments was for
 split buffers which are commonly used in text editors. A split buffer
 has a gap in the middle where insertions and deletions can often occur
 without moving much memory. When an insertion or deletion is required
 elsewhere then the gap is first moved to that position. I have long
 intended to implement a good split buffer extension for Python but the
 best I have currently is an extension written using Boost.Python which
 doesn't implement the buffer interface. Here is a description of split
 buffers:

 http://www.cs.cmu.edu/~wjh/papers/byte.html

But there's always a call to remove the gap (or move it to the end).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Greg Ewing
Neil Hodgson wrote:

I think one of the motivations for discontiguous segments was for
 split buffers which are commonly used in text editors.

Note that this is different from the case of an array
of pointers to arrays, which is a multi-dimensional
array structure, whereas a split buffer is a concatenation
of two (possibly different sized) one-dimensional arrays.

So an array-of-pointers interface wouldn't be a direct
substitute for the existing multi-segment buffer
interface.

--
Greg


Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Greg Ewing
Travis Oliphant wrote:

 The question is should we eliminate the possibility of sharing memory 
 for objects that store data basically as arrays of arrays (i.e. true 
 C-style arrays).

Can you clarify what you mean by this? Are you talking
about an array of pointers to other arrays? (This is
not what I would call an array of arrays, even in C.)

Supporting this kind of thing could be a slippery slope,
since there can be arbitrary levels of complexity to
such a structure. E.g. do you support a 1d array of
pointers to 3d arrays of pointers to 2d arrays? Etc.

The more different kinds of format you support, the less
likely it becomes that the thing consuming the data
will be willing to go to the trouble required to
understand it.

 One possible C-API call that Python could grow with the current buffer 
 interface is to allow contiguous-memory mirroring of discontiguous 
 memory,

I don't think the buffer protocol itself should incorporate
anything that requires implicitly copying the data, since
the whole purpose of it is to provide direct access to the
data without need for copying.

It would be okay to supply some utility functions for
re-packing data, though.

 or an iterator object that iterates through every element of any 
 object that exposes the buffer protocol.

Again, for efficiency reasons I wouldn't like to involve
Python objects and iteration mechanisms in this. The
buffer interface is meant to give you raw access to the
data at raw C speeds. Anything else is outside its scope,
IMO.

--
Greg


Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Neil Hodgson
Greg Ewing:

 So an array-of-pointers interface wouldn't be a direct
 substitute for the existing multi-segment buffer
 interface.

   Providing an array of (pointer,length) wouldn't be too much extra
work for a split vector implementation.

Guido van Rossum:

 But there's always a call to remove the gap (or move it to the end).

   Yes, although it's something you try to avoid.

   I'm not saying that this is an important use-case since no one
seems to have produced a split vector implementation that provides the
buffer protocol. Numeric-style array handling is much more common so
deserves priority.

   Neil


Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Travis Oliphant
Greg Ewing wrote:
 Travis Oliphant wrote:
 
 
The question is should we eliminate the possibility of sharing memory 
for objects that store data basically as arrays of arrays (i.e. true 
C-style arrays).
 
 
 Can you clarify what you mean by this? Are you talking
 about an array of pointers to other arrays? (This is
 not what I would call an array of arrays, even in C.)

I'm talking about arrays of pointers to other arrays:

i.e. if somebody defined in C

float B[10][20]

then B would be an array of pointers to arrays of floats.

 
 Supporting this kind of thing could be a slippery slope,
 since there can be arbitrary levels of complexity to
 such a structure. E.g do you support a 1d array of
 pointers to 3d arrays of pointers to 2d arrays? Etc.
 

Yes, I saw that.  But, it could actually be supported, in general.
The shape information is available.  If a 3-d array is meant then ndims
is 3 and you would re-cast the returned pointer appropriately.

In other words, suppose that instead of strides you can request a 
variable through the buffer interface with type void **segments.

Then, by passing the address of a void * variable to the routine you 
would receive the array.  Then you could handle the 1-d, 2-d, and 3-d cases 
using something like this:

This is pseudocode:

void *segments;
int ndims;
Py_ssize_t *shape;
char *format;


(ndims, shape, format, and segments) are passed to the buffer 
interface.

if strcmp(format, "f") != 0
     raise an error.

if (ndims == 1)

    var = (float *)segments
    for (i=0; i < shape[0]; i++)
        # process var[i]

else if (ndims == 2)

    var = (float **)segments
    for (i=0; i < shape[0]; i++)
        for (j=0; j < shape[1]; j++)
            # process var[i][j]

else if (ndims == 3)

    var = (float ***)segments
    for (i=0; i < shape[0]; i++)
        for (j=0; j < shape[1]; j++)
            for (k=0; k < shape[2]; k++)
                # process var[i][j][k]

else

    raise an Error.



 The more different kinds of format you support, the less
 likely it becomes that the thing consuming the data
 will be willing to go to the trouble required to
 understand it.

That is certainly true.   I'm really only going to this trouble 
because the multiple-segment option already exists and the PIL has this 
memory model (although I have not heard PIL developers clamoring for 
support --- I'm just being sensitive to that extension type).

 
 
One possible C-API call that Python could grow with the current buffer 
interface is to allow contiguous-memory mirroring of discontiguous 
memory,
 
 
 I don't think the buffer protocol itself should incorporate
 anything that requires implicitly copying the data, since
 the whole purpose of it is to provide direct access to the
 data without need for copying.

No, this would not be the buffer protocol, but merely a C-API that would 
use the buffer protocol - i.e. it is just a utility function as you mention.

 
 It would be okay to supply some utility functions for
 re-packing data, though.
 
 
or an iterator object that iterates through every element of any 
object that exposes the buffer protocol.
 
 
 Again, for efficiency reasons I wouldn't like to involve
 Python objects and iteration mechanisms in this. 

I was thinking more of a C-iterator, like NumPy provides.  This can be 
very efficient (as long as the loop is not in Python).

It sure provides a nice abstraction that lets you deal with 
discontiguous arrays as if they were contiguous, though.

The
 buffer interface is meant to give you raw access to the
 data at raw C speeds. Anything else is outside its scope,

Sure.  These things are just ideas about *future* utility functions that 
might make use of the buffer interface and motivate its design.

Thanks for your comments.


-Travis



Re: [Python-Dev] Extended Buffer Interface/Protocol

2007-03-21 Thread Greg Ewing
Travis Oliphant wrote:

 I'm talking about arrays of pointers to other arrays:
 
 i.e. if somebody defined in C
 
 float B[10][20]
 
 then B would be an array of pointers to arrays of floats.

No, it wouldn't, it would be a contiguously stored
2-dimensional array of floats. An array of pointers
would be

   float *B[10];

followed by code to allocate 10 arrays of 20 floats
each and initialise B to point to them.
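
In code, the array-of-pointers case would look something like this (a 
plain C sketch):

#include <stdlib.h>

float *B[10];   /* 10 pointers, not a contiguous 10x20 block of floats */

static void init_B(void)
{
    int i;
    for (i = 0; i < 10; i++)
        B[i] = malloc(20 * sizeof(float));  /* each row allocated separately */
    /* B[i][j] now involves one pointer indirection per access. */
}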

 Yes, I saw that.  But, it could actually be supported, in general.

Certainly it *can* be supported, but the question is
how many different format variations it's reasonable
to expect the consumer of the data to be able to deal
with. Because each variation requires the consumer to
use different code to access the data, if it wants to
avoid making a copy.

 else if (ndims == 3)
 
     var = (float ***)segments
     for (i=0; i < shape[0]; i++)
         for (j=0; j < shape[1]; j++)
             for (k=0; k < shape[2]; k++)
                 # process var[i][j][k]

This assumes that the 3-dimensional case is using
the array-of-pointers implementation at all levels.
But there are other possibilities, e.g. a 1d array
of pointers to contiguous 2d arrays, or a contiguous
2d array of pointers to 1d arrays. It's hard to
deal with all of those using a common piece of code.

I can imagine cases like that coming up in practice.
For example, an image object might store its data
as four blocks of memory for R, G, B and A planes,
each of which is a contiguous 2d array with shape
and stride -- but you want to view it as a 3d
array byte[plane][x][y].

(Actually you'd probably *prefer* to view it as
byte[x][y][plane], which would make things even
more difficult...)
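
Concretely, the layout described above might be something like this 
(a hypothetical struct, just to make the shapes explicit):

/* Four separately stored planes, each a contiguous 2-d byte array.
   Viewing this as byte[plane][x][y] -- let alone byte[x][y][plane] --
   cannot be described by a single base pointer plus strides. */
struct PlanarImage {
    unsigned char *plane[4];   /* R, G, B, A blocks               */
    Py_ssize_t width, height;
    Py_ssize_t row_stride;     /* bytes per row within one plane  */
};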

 I was thinking more of a C-iterator, like NumPy provides.  This can be 
 very efficient (as long as the loop is not in Python).
 
 It sure provides a nice abstraction that lets you deal with 
 discontiguous arrays as if they were contiguous, though.

Something like that might be useful.

-- 
Greg Ewing, Computer Science Dept,  | Carpe post meridiem!
University of Canterbury,           | (I'm not a morning person.)
Christchurch, New Zealand
[EMAIL PROTECTED]