[Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

2011-03-11 Thread Dag Sverre Seljebotn
There's a few libraries out there that needs to know whether or not an 
array changed since the last time it was used: joblib and pymc comes to 
mind. I believe joblib computes a SHA1 or md5 hash of array contents, 
while pymc simply assume you never change an array and uses the id().

The pymc approach is fragile, while in my case the joblib approach is 
too expensive since I'll call the function again many times in a row 
with the same large array (yes, I can code around it, but the code gets 
less streamlined).

So, would it be possible to very quickly detect whether a NumPy array is 
guaranteed to not have changed? Here's a revision counter approach:

  1) Introduce a new 64-bit int field modification_count in the array 
object struct.

  2) modification_count is incremented any time it is possible that an 
array changes. In particular, PyArray_DATA would increment the counter.

  3) A new PyArray_READONLYDATA is introduced that does not increment 
the counter, which can be used in strategic spots. However, the point is 
simply to rule out *most* sources of having to recompute a checksum for 
the array -- a non-matching modification_count is not a guarantee the 
array has changed, but an unmatched modification_count is a guarantee of 
an unchanged array

  4) The counter can be ignored for readonly (base) arrays.

  5a) A method is introduced Python-side,  
arr.checksum(algorithm=md5|sha1), that uses this machinery to cache 
checksum computation and that can be plugged into joblib.

  5b) Alternatively, the modification count is exposed directly to 
Python-side, and it is up to users to store the modification count (e.g. 
in a WeakKeyDictionary indexed by the array's base array).

Another solution to the problem would be to allow registering event 
handlers. Main reason I'm not proposing that is because I don't want to 
spend the time to implement it (sounds a lot more difficult), it appears 
to be considerably less backwards-compatible, and so on.

Why not a simple dirty flag? Because you'd need one for every possible 
application of this (e.g, md5 and sha1 would need seperate dirty flags, 
and other uses than hashing would need yet more flags, and so on).

Dag Sverre
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

2011-03-11 Thread Charles R Harris
On Fri, Mar 11, 2011 at 11:41 AM, Dag Sverre Seljebotn 
d.s.seljeb...@astro.uio.no wrote:

 There's a few libraries out there that needs to know whether or not an
 array changed since the last time it was used: joblib and pymc comes to
 mind. I believe joblib computes a SHA1 or md5 hash of array contents,
 while pymc simply assume you never change an array and uses the id().

 The pymc approach is fragile, while in my case the joblib approach is
 too expensive since I'll call the function again many times in a row
 with the same large array (yes, I can code around it, but the code gets
 less streamlined).

 So, would it be possible to very quickly detect whether a NumPy array is
 guaranteed to not have changed? Here's a revision counter approach:

  1) Introduce a new 64-bit int field modification_count in the array
 object struct.

  2) modification_count is incremented any time it is possible that an
 array changes. In particular, PyArray_DATA would increment the counter.

  3) A new PyArray_READONLYDATA is introduced that does not increment
 the counter, which can be used in strategic spots. However, the point is
 simply to rule out *most* sources of having to recompute a checksum for
 the array -- a non-matching modification_count is not a guarantee the
 array has changed, but an unmatched modification_count is a guarantee of
 an unchanged array

  4) The counter can be ignored for readonly (base) arrays.

  5a) A method is introduced Python-side,
 arr.checksum(algorithm=md5|sha1), that uses this machinery to cache
 checksum computation and that can be plugged into joblib.

  5b) Alternatively, the modification count is exposed directly to
 Python-side, and it is up to users to store the modification count (e.g.
 in a WeakKeyDictionary indexed by the array's base array).

 Another solution to the problem would be to allow registering event
 handlers. Main reason I'm not proposing that is because I don't want to
 spend the time to implement it (sounds a lot more difficult), it appears
 to be considerably less backwards-compatible, and so on.

 Why not a simple dirty flag? Because you'd need one for every possible
 application of this (e.g, md5 and sha1 would need seperate dirty flags,
 and other uses than hashing would need yet more flags, and so on).


What about views? Wouldn't it be easier to write another object wrapping an
ndarray?

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

2011-03-11 Thread Pauli Virtanen
On Fri, 11 Mar 2011 11:47:58 -0700, Charles R Harris wrote:
[clip]
 What about views? Wouldn't it be easier to write another object wrapping
 an ndarray?

I think the buffer interfaces and all other various ways Numpy provides 
exports for arrays make keeping tabs on modification impossible to do 
completely reliably.

Pauli

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

2011-03-11 Thread Dag Sverre Seljebotn
 On Fri, 11 Mar 2011 19:37:42 + (UTC), Pauli Virtanen p...@iki.fi 
 wrote:
 On Fri, 11 Mar 2011 11:47:58 -0700, Charles R Harris wrote:
 [clip]
 What about views? Wouldn't it be easier to write another object 
 wrapping
 an ndarray?

 I think the buffer interfaces and all other various ways Numpy 
 provides
 exports for arrays make keeping tabs on modification impossible to do
 completely reliably.

 Not to mention all the pain of making sure the arrays are wrapped and 
 stay wrapped in the first place. In particular in combination with other 
 array wrappers.

 I wasn't saying this is absolutely needed, just that it'd be a really 
 convenient feature helpful for caching. Sometimes, introducing fast 
 caching this way can remove a lot of logic from the code. Introducing a 
 Python-space visible wrapper object kind of defeats the purpose for me.

 Dag Sverre
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

2011-03-11 Thread Charles R Harris
On Fri, Mar 11, 2011 at 1:06 PM, Dag Sverre Seljebotn 
d.s.seljeb...@astro.uio.no wrote:

  On Fri, 11 Mar 2011 19:37:42 + (UTC), Pauli Virtanen p...@iki.fi
  wrote:
  On Fri, 11 Mar 2011 11:47:58 -0700, Charles R Harris wrote:
  [clip]
  What about views? Wouldn't it be easier to write another object
  wrapping
  an ndarray?
 
  I think the buffer interfaces and all other various ways Numpy
  provides
  exports for arrays make keeping tabs on modification impossible to do
  completely reliably.

  Not to mention all the pain of making sure the arrays are wrapped and
  stay wrapped in the first place. In particular in combination with other
  array wrappers.

  I wasn't saying this is absolutely needed, just that it'd be a really
  convenient feature helpful for caching. Sometimes, introducing fast
  caching this way can remove a lot of logic from the code. Introducing a
  Python-space visible wrapper object kind of defeats the purpose for me.


Well, starting with a wrapped object would allow you to experiment and
discover what it is you really need. A smallish specialized object is
probably a better starting point for development than a big solution.
Operating systems do this sort of thing with the VM, but they have hardware
assistance down at the lowest level and rather extensive structures to track
status. Furthermore, the memory is organized into blocks and that makes it a
lot easier to monitor than strided memory. In fact, I think you might want
to set up your own memory subsystem and have the arrays sit on top of that.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

2011-03-11 Thread Anne Archibald
On 11 March 2011 15:34, Charles R Harris charlesr.har...@gmail.com wrote:

 On Fri, Mar 11, 2011 at 1:06 PM, Dag Sverre Seljebotn
 d.s.seljeb...@astro.uio.no wrote:

  On Fri, 11 Mar 2011 19:37:42 + (UTC), Pauli Virtanen p...@iki.fi
  wrote:
  On Fri, 11 Mar 2011 11:47:58 -0700, Charles R Harris wrote:
  [clip]
  What about views? Wouldn't it be easier to write another object
  wrapping
  an ndarray?
 
  I think the buffer interfaces and all other various ways Numpy
  provides
  exports for arrays make keeping tabs on modification impossible to do
  completely reliably.

  Not to mention all the pain of making sure the arrays are wrapped and
  stay wrapped in the first place. In particular in combination with other
  array wrappers.

  I wasn't saying this is absolutely needed, just that it'd be a really
  convenient feature helpful for caching. Sometimes, introducing fast
  caching this way can remove a lot of logic from the code. Introducing a
  Python-space visible wrapper object kind of defeats the purpose for me.


 Well, starting with a wrapped object would allow you to experiment and
 discover what it is you really need. A smallish specialized object is
 probably a better starting point for development than a big solution.
 Operating systems do this sort of thing with the VM, but they have hardware
 assistance down at the lowest level and rather extensive structures to track
 status. Furthermore, the memory is organized into blocks and that makes it a
 lot easier to monitor than strided memory. In fact, I think you might want
 to set up your own memory subsystem and have the arrays sit on top of that.

In fact, on many systems, using malloc on large contiguous blocks of
memory returns a freshly-mmaped region. It's possible that with a
little deviousness (and, sadly, some system-specific code) one could
arrange to allocate some arrays in a way that would trigger
modification-count updating by the VM system. If you're serious about
detecting modifications, this sort of thing may be the only way to go
- a modification-detection system that misses some modifications might
be worse than none at all.

An internal numpy setup is going to be a nightmare even if all you
have to worry about is views and you're willing to allow
non-overlapping views to count as modifying each other - you'd have to
add a modification count to the ultimate base array (the one whose
deletion triggers disposal of the memory arena), and then every
modification to a view would have to walk the linked list of views all
the way up to the top to increment the modification counter. You'll
also be triggering increments of the modification counter on all sorts
of non-modifications that occur in C code. Doable, but a huge job for
dubious benefit.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

2011-03-11 Thread Dag Sverre Seljebotn
On 03/11/2011 10:04 PM, Anne Archibald wrote:
 On 11 March 2011 15:34, Charles R Harrischarlesr.har...@gmail.com  wrote:
 On Fri, Mar 11, 2011 at 1:06 PM, Dag Sverre Seljebotn
 d.s.seljeb...@astro.uio.no  wrote:
   On Fri, 11 Mar 2011 19:37:42 + (UTC), Pauli Virtanenp...@iki.fi
   wrote:
 On Fri, 11 Mar 2011 11:47:58 -0700, Charles R Harris wrote:
 [clip]
 What about views? Wouldn't it be easier to write another object
 wrapping
 an ndarray?
 I think the buffer interfaces and all other various ways Numpy
 provides
 exports for arrays make keeping tabs on modification impossible to do
 completely reliably.
   Not to mention all the pain of making sure the arrays are wrapped and
   stay wrapped in the first place. In particular in combination with other
   array wrappers.

   I wasn't saying this is absolutely needed, just that it'd be a really
   convenient feature helpful for caching. Sometimes, introducing fast
   caching this way can remove a lot of logic from the code. Introducing a
   Python-space visible wrapper object kind of defeats the purpose for me.

 Well, starting with a wrapped object would allow you to experiment and
 discover what it is you really need. A smallish specialized object is
 probably a better starting point for development than a big solution.
 Operating systems do this sort of thing with the VM, but they have hardware
 assistance down at the lowest level and rather extensive structures to track
 status. Furthermore, the memory is organized into blocks and that makes it a
 lot easier to monitor than strided memory. In fact, I think you might want
 to set up your own memory subsystem and have the arrays sit on top of that.
 In fact, on many systems, using malloc on large contiguous blocks of
 memory returns a freshly-mmaped region. It's possible that with a
 little deviousness (and, sadly, some system-specific code) one could
 arrange to allocate some arrays in a way that would trigger
 modification-count updating by the VM system. If you're serious about
 detecting modifications, this sort of thing may be the only way to go
 - a modification-detection system that misses some modifications might
 be worse than none at all.

 An internal numpy setup is going to be a nightmare even if all you
 have to worry about is views and you're willing to allow
 non-overlapping views to count as modifying each other - you'd have to
 add a modification count to the ultimate base array (the one whose
 deletion triggers disposal of the memory arena), and then every
 modification to a view would have to walk the linked list of views all
 the way up to the top to increment the modification counter. You'll
 also be triggering increments of the modification counter on all sorts
 of non-modifications that occur in C code. Doable, but a huge job for
 dubious benefit.

Yes, you are right. For instance PEP-3118 makes it rather natural to 
hold only the data pointer and object refcount for some time and only 
modify the data later, and things like that can't be coded around in 
NumPy no matter the effort.

Thanks for your sobering comments. I'll just keep using explicit 
mechanisms in my program.

(I didn't know about the VM modification counting, but wasn't able to 
find much on Google either. At any rate that is definitely overkill here.)

Dag Sverre
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

2011-03-11 Thread Charles R Harris
On Fri, Mar 11, 2011 at 12:37 PM, Pauli Virtanen p...@iki.fi wrote:

 On Fri, 11 Mar 2011 11:47:58 -0700, Charles R Harris wrote:
 [clip]
  What about views? Wouldn't it be easier to write another object wrapping
  an ndarray?

 I think the buffer interfaces and all other various ways Numpy provides
 exports for arrays make keeping tabs on modification impossible to do
 completely reliably.


I didn't mean wrap, as in subclass, I meant wrap in the sense of having
ndarray as a private component in a class. One could maybe make a class
factory that made a wholly new class for every freshly instanced monitor
object and access info could be kept in class variables. Every instance of
the generated class would essentially be a view of a single ndarray
instance.

I don't think all possible accesses could be caught, but for the purposes of
mc something less perfect would probably serve.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion