Re: [Numpy-discussion] Masking through generator arrays

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn <
d.s.seljeb...@astro.uio.no> wrote:

> Sorry everyone for being so dense and contaminating that other thread.
> Here's a new thread where I can respond to Nathaniel's response.
>
> On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
>  > Hi Dag,
>  >
>  > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
>  >   wrote:
>  >> I'm a heavy user of masks, which are used to make data NA in the
>  >> statistical sense. The setting is that we have to mask out the
>  >> radiation coming from the Milky Way in full-sky images of the Cosmic
>  >> Microwave Background. There's data, but we know we can't trust it, so
>  >> we make it NA. But we also do play around with different masks.
>  >
>  > Oh, this is great -- that means you're one of the users that I wasn't
>  > sure existed or not :-). Now I know!
>  >
>  >> Today we keep the mask in a separate array, and to zero-mask we do
>  >>
>  >> masked_data = data * mask
>  >>
>  >> or
>  >>
>  >> masked_data = data.copy()
>  >> masked_data[mask == 0] = np.nan # soon np.NA
>  >>
>  >> depending on the circumstances.
>  >>
>  >> Honestly, API-wise, this is as good as it gets for us. Nice and
>  >> transparent, no new semantics to learn in the special case of masks.
>  >>
>  >> Now, this has performance issues: Lots of memory use, extra transfers
>  >> over the memory bus.
>  >
>  > Right -- this is a case where (in the NA-overview terminology) masked
>  > storage+NA semantics would be useful.
>  >
>  >> BUT, NumPy has that problem all over the place, even for "x + y + z"!
>  >> Solving it in the special case of masks, by making a new API, seems a
>  >> bit myopic to me.
>  >>
>  >> IMO, that's much better solved at the fundamental level. As an
>  >> *illustration*:
>  >>
>  >> with np.lazy:
>  >>  masked_data1 = data * mask1
>  >>  masked_data2 = data * (mask1 | mask2)
>  >>  masked_data3 = (x + y + z) * (mask1 & mask3)
>  >>
>  >> This would create three "generator arrays" that would zero-mask the
>  >> arrays (and perform the three-term addition...) upon request. You could
>  >> slice the generator arrays as you wish, and by that slice the data and
>  >> the mask in one operation. Obviously this could handle NA-masking too.
>  >>
>  >> You can probably do this today with Theano and numexpr, and I think
>  >> Travis mentioned that "generator arrays" are on his radar for core
>  >> NumPy.
>  >
>  > Implementing this today would require some black magic hacks, because
>  > on entry/exit to the context manager you'd have to "reach up" into the
>  > calling scope and replace all the ndarray's with LazyArrays and then
>  > vice-versa. This is actually totally possible:
>  > https://gist.github.com/2347382
>  > but I'm not sure I'd call it *wise*. (You could probably avoid the
>  > truly horrible set_globals_dict part of that gist, though.) Might be
>  > fun to prototype, though...
>
> 1) My main point was just that I believe masked arrays are something
> that feels immature to me, and that it is the kind of thing that should
> be constructed from simpler primitives. And that NumPy should focus on
> simple primitives. You could make it
>

I can't disagree, as I suggested the same as a possibility myself ;) There
is a lot of infrastructure now in numpy, but given the use cases I'm
tending towards the view that masked arrays should be left to others, at
least for the time being. The question is how to generalize the
infrastructure and what hooks to provide. I think just spending a month or
two pulling stuff out is counterproductive, but evolving the code is
definitely needed. If you could familiarize yourself with what is in there,
something that seems largely neglected by the critics, and make
suggestions, that would be helpful.

I'd also like to hear from Mark. It has been about 9 months since he did the
work, and I'd be surprised if he didn't have ideas for doing some things
differently. OTOH, I can understand his reluctance to get involved in a
topic where I thought he was poorly treated last time around.


>
> np.gen.generating_multiply(data, mask)
>
> 2) About the with construct in particular, I intended "__enter__" and
> "__exit__" to only toggle a thread-local flag, and when that flag is in
> effect, "__mul__" would do a "generating_multiply" and return an
> ndarraygenerator rather than an ndarray.
>
> But of course, the amount of work is massive.
>
>


Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 06:18 AM, Charles R Harris wrote:
> I can't disagree, as I suggested the same as a possibility myself ;)
> There is a lot of infrastructure now in numpy, but given the use cases
> I'm tending towards the view that masked arrays should be left to
> others, at least for the time being. The question is how to generalize
> the infrastructure and what hooks to provide. I think just spending a
> month or two pulling stuff out is counterproductive, but evolving the
> code is definitely needed. If you could familiarize yourself with what
> is in there, something that seems largely neglected by the critics, and
> make suggestions, that would be helpful.

But how on earth can I make constructive criticisms about code when I 
don't know what the purpose of that code is supposed to be?

Are you saying you agree that the masking aspect should be banned (or at 
least not "core"), and asking me to look at code from that perspective 
and comment on how to get there while keeping as much as possible of the 
rest? Would that really be helpful?

Dag


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Charles R Harris
On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn <
d.s.seljeb...@astro.uio.no> wrote:

> But how on earth can I make constructive criticisms about code when I
> don't know what the purpose of that code is supposed to be?
>

What do you mean? I thought the purpose was quite clearly laid out in the
NEP. But the implementation of that purpose required some infrastructure.
The point, I suppose, is for you to suggest what would serve your use case.


>
> Are you saying you agree that the masking aspect should be banned (or at
> l

Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 10:40 AM, Charles R Harris wrote:
>
>
> On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn
> mailto:d.s.seljeb...@astro.uio.no>> wrote:
>
> On 05/10/2012 06:18 AM, Charles R Harris wrote:
>  > I can't disagree, as I suggested the same as a possibility myself ;)
>  > There is a lot of infrastructure now in numpy, but given the use
>  > cases I'm tending towards the view that masked arrays should be left
>  > to others, at least for the time being. The question is how to
>  > generalize the infrastructure and what hooks to provide. I think just
>  > spending a month or two pulling stuff out is counterproductive, but
>  > evolving the code is definitely needed. If you could familiarize
>  > yourself with what is in there, something that seems largely
>  > neglected by the critics, and

Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 11:38 AM, Dag Sverre Seljebotn wrote:
> On 05/10/2012 10:40 AM, Charles R Harris wrote:
>>
>>
>> On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn
>> mailto:d.s.seljeb...@astro.uio.no>>  wrote:
>>
>>  On 05/10/2012 06:18 AM, Charles R Harris wrote:
>>   >  I can't disagree, as I suggested the same as a possibility myself ;)
>>   >  There is a lot of infrastructure now in numpy, but given the use
>>   >  cases I'm tending towards the view that masked arrays should be left to

Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Chris Barker
On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
 wrote:
> What would serve me? I use NumPy as a glorified "double*".

> all I want is my glorified
> "double*". I'm probably not a representative user.)

Actually, I think you are representative of a LOT of users -- it turns
out, whether Jim Hugunin originally was thinking this way or not, that
numpy arrays are really powerful because they provide BOTH a nifty,
full-featured array object in Python, AND a wrapper around a generic
"double*" (actually char*, that could be any type).

This is a really widely used feature, and has become even more so
with Cython's numpy support.

That is one of my concerns about the "bit pattern" idea -- we've then
created a new binary type that no other standard software understands
-- that looks like a lot of work to me to deal with, or even worse,
ripe for weird, non-obvious errors in code that accesses that good-old
char*.

So I'm happier with a mask implementation -- more memory, yes, but it
seems more robust and easy to deal with from outside code.

But either way, Dag's key point is right on -- in Cython (or any other
code) -- we need to make sure it's easy to get a regular old pointer
to a regular old C array, and not get something else by accident.
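
Editorial sketch of the "regular old pointer" point, using only calls that exist in NumPy and the standard library (`np.ascontiguousarray`, `ndarray.ctypes.data_as`). The helper name `as_double_pointer` is hypothetical, and this is one possible way to do it from pure Python, not how Cython or the C-API does it.

```python
import ctypes

import numpy as np


def as_double_pointer(arr):
    """Return (pointer, owner) for a C-contiguous float64 version of arr."""
    # ascontiguousarray copies only if dtype or layout does not already match.
    owner = np.ascontiguousarray(arr, dtype=np.float64)
    pointer = owner.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
    return pointer, owner  # keep owner alive while the pointer is in use


data = np.arange(12, dtype=np.float64).reshape(3, 4)

# A strided view is not one contiguous block, so a copy is made here.
ptr, owner = as_double_pointer(data[:, ::2])
values = [ptr[i] for i in range(owner.size)]

# An array that is already contiguous float64 is passed through without a copy.
ptr_same, owner_same = as_double_pointer(data)
```

Returning the owning array alongside the pointer is a deliberate choice: the pointer alone does not keep the buffer alive.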

-Chris







-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Scott Ransom
On 05/10/2012 02:23 PM, Chris Barker wrote:
> But either way, Dag's key point is right on -- in Cython (or any other
> code) -- we need to make sure it's easy to get a regular old pointer
> to a regular old C array, and not get something else by accident.
>
> -Chris

Agreed.  (As someone who has been heavily using Numpy since the early 
days of Numeric, and who wrote and maintains a suite of scientific 
software that uses Numpy and its C-API in exactly this way.)

Note that I wasn't aware that the proposed mask implementation might (or 
would?) change this behavior...  (and hopefully I haven't just 
misinterpreted these last few emails.  If so, I apologize.).

Cheers,

Scott

-- 
Scott M. Ransom            Address:  NRAO
Phone:  (434) 296-0320               520 Edgemont Rd.
email:  sran...@nrao.edu             Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Inati, Souheil (NIH/NIMH) [E]

On May 10, 2012, at 2:23 PM, Chris Barker wrote:

> But either way, Dag's key point is right on -- in Cython (or any other
> code) -- we need to make sure ti's easy to get a regular old pointer
> to a regular old C array, and get something else by accident.
> 
> -Chris
> 
> 

+1

As a physicist who uses numpy to develop MRI image reconstruction and data 
analysis methods, I really do think of numpy as a glorified double with a nice 
way to call useful numerical methods.  I also use external methods all the time 
and it's of the utmost importance to have a pointer to a block of data that I 
can say is N complex doubles or something.  Using a separate array for a mask 
is not a big deal.  At worst it's a factor of 2 in memory.  It forces me to pay 
attention to what I'm doing, and if I want to do an SVD on my data, I better 
keep track of what I'm doing myself.
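As a small illustration of the "block of N complex doubles plus a separate mask" workflow described above, here is a minimal sketch. The array size and variable names are made up for the example, and the boolean mask dtype is an assumption; a mask stored with the same item size as the data is what gives the factor of 2 worst case.

```python
import numpy as np

n = 1024
# The data block: n complex doubles, contiguous, 16 bytes each.
data = np.zeros(n, dtype=np.complex128)
# A separate mask kept alongside the data (True = trusted sample).
mask = np.ones(n, dtype=bool)

# The raw address an external C or Fortran routine would receive.
address = data.ctypes.data

print(data.flags["C_CONTIGUOUS"], data.nbytes, mask.nbytes)
```

With a one-byte boolean mask the overhead is well under a factor of 2; the data buffer itself is untouched and can be handed to external code as-is.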

I am not that old, but I'm old enough to remember when MATLAB was really just 
this - a glorified double with a nice slicing/view interface and a thin wrapper 
around EISPACK and LINPACK (here is a great article by Cleve Moler from 2000: 
http://www.mathworks.com/company/newsletters/news_notes/clevescorner/winter2000.cleve.html).
You used to read in some ints from a data file and they converted them to 
double and you knew that if you got numerical precision errors it was because 
your algorithm was wrong or you were inverting some nearly singular matrix or 
something, not because of overflow.  And they made a copy of the data every 
time you called a function.  It had serious limitations, but what it did just 
worked.  And then they started to get fancy and it took them a REALLY long time 
and a lot of versions and man hours to get that all sorted out, with lazy 
evaluations and classes and sparse arrays and all that.

I'm not saying what the developers of numpy should do about the masked array 
thing and I really can't comment on how other people use numpy.  I also don't 
really have much of a say about the technical implementations of the guts of 
numpy, but it's worth asking really simple questions like:  I want to do an SVD 
on a 2D array with some missing or masked data.  What should happen?  This 
seems like such a simple question, but really it is incredibly complicated, or 
rather, it's very hard for numpy which is a foundation framework type of code 
to guess what the user means.
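To make the SVD question concrete, here is a minimal sketch of why there is no single obvious answer. The 3x3 matrix, the mask, and the two fill policies are invented for the illustration; neither policy is something numpy does on its own.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 10.0]])
mask = np.zeros(data.shape, dtype=bool)
mask[1, 1] = True  # True marks the entry we do not trust

# Policy 1: replace masked entries with zero.
zero_filled = np.where(mask, 0.0, data)

# Policy 2: replace masked entries with the mean of the trusted
# entries in the same column.
col_means = np.nanmean(np.where(mask, np.nan, data), axis=0)
mean_filled = np.where(mask, col_means, data)

s_zero = np.linalg.svd(zero_filled, compute_uv=False)
s_mean = np.linalg.svd(mean_filled, compute_uv=False)
print(s_zero)
print(s_mean)
```

The two policies give different singular values, so a foundation library would have to guess which one (if either) the user meant.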

Anyway, that's my point of view.  I'm really happy numpy exists and works as 
well as it does and I'm thankful that there are developers out there that can 
build something so useful.

Cheers,
Souheil

--
Souheil Inati, PhD
Staff Scientist
Functional MRI Facility
NIMH/NIH


> 
> 
> 
> 
> 
> -- 
> 
> Christopher Barker, Ph.D.
> Oceanographer
> 
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
> 
> chris.bar...@noaa.gov


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Charles R Harris
On Thu, May 10, 2012 at 12:52 PM, Scott Ransom  wrote:

> On 05/10/2012 02:23 PM, Chris Barker wrote:
> > On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
> >   wrote:
> >> What would serve me? I use NumPy as a glorified "double*".
> >
> >> all I want is my glorified
> >> "double*". I'm probably not a representative user.)
> >
> > Actually, I think you are representative of a LOT of users -- it
> > turns out, whether Jim Hugunin originally was thinking this way or
> > not, that numpy arrays are really powerful because they provide BOTH a
> > nifty, full-featured array object in Python, AND a wrapper around a
> > generic "double*" (actually char*, that could be any type).
> >
> > This is a really widely used feature, and has become even more so
> > with Cython's numpy support.
> >
> > That is one of my concerns about the "bit pattern" idea -- we've then
> > created a new binary type that no other standard software understands
> > -- that looks like a lot of work to me to deal with, or even worse,
> > ripe for weird, non-obvious errors in code that accesses that good-old
> > char*.
> >
> > So I'm happier with a mask implementation -- more memory, yes, but it
> > seems more robust and easy to deal with from outside code.
> >
> > But either way, Dag's key point is right on -- in Cython (or any other
> > code) -- we need to make sure it's easy to get a regular old pointer
> > to a regular old C array, and not get something else by accident.
> >
> > -Chris
>
> Agreed.  (As someone who has been heavily using Numpy since the early
> days of numeric, and who wrote and maintains a suite of scientific
> software that uses Numpy and its C-API in exactly this way.)
>
> Note that I wasn't aware that the proposed mask implementation might (or
> would?) change this behavior...  (and hopefully I haven't just
> misinterpreted these last few emails.  If so, I apologize.).
>
>
I haven't seen a change in this behavior, otherwise most of current numpy
would break.

Chuck


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Charles R Harris
On Thu, May 10, 2012 at 1:14 PM, Charles R Harris  wrote:

>
>
> On Thu, May 10, 2012 at 12:52 PM, Scott Ransom  wrote:
>
>> On 05/10/2012 02:23 PM, Chris Barker wrote:
>> > On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
>> >   wrote:
>> >> What would serve me? I use NumPy as a glorified "double*".
>> >
>> >> all I want is my glorified
>> >> "double*". I'm probably not a representative user.)
>> >
>> > Actually, I think you are representative of a LOT of users -- it
>> > turns out, whether Jim Hugunin originally was thinking this way or
>> > not, that numpy arrays are really powerful because they provide BOTH a
>> > nifty, full-featured array object in Python, AND a wrapper around a
>> > generic "double*" (actually char*, that could be any type).
>> >
>> > This is a really widely used feature, and has become even more so
>> > with Cython's numpy support.
>> >
>> > That is one of my concerns about the "bit pattern" idea -- we've then
>> > created a new binary type that no other standard software understands
>> > -- that looks like a lot of work to me to deal with, or even worse,
>> > ripe for weird, non-obvious errors in code that accesses that good-old
>> > char*.
>> >
>> > So I'm happier with a mask implementation -- more memory, yes, but it
>> > seems more robust and easy to deal with from outside code.
>> >
>> > But either way, Dag's key point is right on -- in Cython (or any other
>> > code) -- we need to make sure it's easy to get a regular old pointer
>> > to a regular old C array, and not get something else by accident.
>> >
>> > -Chris
>>
>> Agreed.  (As someone who has been heavily using Numpy since the early
>> days of numeric, and who wrote and maintains a suite of scientific
>> software that uses Numpy and its C-API in exactly this way.)
>>
>> Note that I wasn't aware that the proposed mask implementation might (or
>> would?) change this behavior...  (and hopefully I haven't just
>> misinterpreted these last few emails.  If so, I apologize.).
>>
>>
> I haven't seen a change in this behavior, otherwise most of current numpy
> would break.
>
>
I suspect this rumour comes from some ideas for generator arrays (not
mine), but I would strongly oppose anything that changes things that much.

Chuck


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Travis Oliphant

On May 10, 2012, at 1:23 PM, Chris Barker wrote:

> On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
>  wrote:
>> What would serve me? I use NumPy as a glorified "double*".
> 
>> all I want is my glorified
>> "double*". I'm probably not a representative user.)
> 
> Actually, I think you are representative of a LOT of users -- it
> turns out, whether Jim Hugunin originally was thinking this way or
> not, that numpy arrays are really powerful because they provide BOTH a
> nifty, full-featured array object in Python, AND a wrapper around a
> generic "double*" (actually char*, that could be any type).
> 
> This is a really widely used feature, and has become even more so
> with Cython's numpy support.
> 
> That is one of my concerns about the "bit pattern" idea -- we've then
> created a new binary type that no other standard software understands
> -- that looks like a lot of work to me to deal with, or even worse,
> ripe for weird, non-obvious errors in code that accesses that good-old
> char*.

This needs to be clarified: the point of the "bit pattern" idea is that the 
downstream user would have to actually *request* data in that format or they 
would get an error. You would not get it by "accident".   If you asked for 
an array of floats you would get an array of floats (not an array of 
NA-floats).  

R has *already* created this binary type and we are just including the ability 
to understand it in NumPy. 

This is why it is an easy thing to do without changing the structure of what a 
NumPy array *is*.   Adding the concept of a mask to *every* NumPy array (even 
NumPy arrays that are currently being used in the wild to represent masks) is 
the big change that I don't think should happen. 
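For readers unfamiliar with the bit-pattern approach, here is a minimal sketch of the idea in plain NumPy. It assumes that R marks a missing double as a NaN whose low 32 bits hold the value 1954; the constant and the is_na helper below are illustrative only and are not part of any NumPy API.

```python
import numpy as np

# Assumption: R encodes NA_real_ as a NaN whose low 32 bits hold 1954.
R_NA_BITS = np.uint64(0x7FF00000000007A2)

def is_na(values):
    """Return True where a float64 element carries the assumed NA payload."""
    values = np.ascontiguousarray(values, dtype=np.float64)
    low_word = values.view(np.uint64) & np.uint64(0xFFFFFFFF)
    return np.isnan(values) & (low_word == np.uint64(1954))

a = np.array([1.0, 0.0, np.nan])
# Write the pattern through an integer view, so no float arithmetic
# gets a chance to alter the NaN payload.
a.view(np.uint64)[1] = R_NA_BITS

print(np.isnan(a))  # NA looks like an ordinary NaN to unaware code
print(is_na(a))     # only payload-aware code tells NA and NaN apart
```

The array is still an ordinary block of doubles; only code that explicitly inspects the payload treats NA differently from NaN.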

-Travis



Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 08:23 PM, Chris Barker wrote:
> On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
>   wrote:
>> What would serve me? I use NumPy as a glorified "double*".
>
>> all I want is my glorified
>> "double*". I'm probably not a representative user.)
>
> Actually, I think you are representative of a LOT of users -- it
> turns out, whether Jim Hugunin originally was thinking this way or
> not, that numpy arrays are really powerful because they provide BOTH a
> nifty, full-featured array object in Python, AND a wrapper around a
> generic "double*" (actually char*, that could be any type).
>
> This is a really widely used feature, and has become even more so
> with Cython's numpy support.
>
> That is one of my concerns about the "bit pattern" idea -- we've then
> created a new binary type that no other standard software understands
> -- that looks like a lot of work to me to deal with, or even worse,
> ripe for weird, non-obvious errors in code that accesses that good-old
> char*.
>
> So I'm happier with a mask implementation -- more memory, yes, but it
> seems more robust and easy to deal with from outside code.

It's very interesting that you consider masks easier to integrate with 
C/C++ code than bitpatterns. I guess everybody's experience (and every 
C/C++/Fortran code base) is different.

>
> But either way, Dag's key point is right on -- in Cython (or any other
> code) -- we need to make sure it's easy to get a regular old pointer
> to a regular old C array, and not get something else by accident.

I'm sorry if I caused any confusion -- I didn't mean to suggest that 
anybody would ever remove the ability of getting a pointer to an 
unmasked array.

There is a problem that's being discussed of the opposite nature:

With masked arrays, the current situation in NumPy trunk is that if 
you're presented with a masked array, and do not explicitly check for a 
mask (i.e., all existing code), you'll transparently and without warning 
"unmask" it -- that is, an element has the last value before NA was 
assigned. This is the case whether you use PEP 3118 (np.ndarray[double] 
or double[:]), or PyArray_DATA.
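The same kind of silent unmasking can be reproduced today with the existing numpy.ma module, which is used here only as an analogy; it is not the NA-mask implementation in trunk that the paragraph above describes.

```python
import numpy as np

m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# Mask-aware reduction skips the masked element.
aware_sum = m.sum()

# Code that only asks for "an ndarray" gets the underlying buffer;
# the value stored before masking is visible again.
raw = np.asarray(m)
unaware_sum = raw.sum()

print(aware_sum, unaware_sum)
```

Code that never checks for a mask computes with the hidden value and gives no warning that it did so.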

According to the NEP, you should really get an exception when accessing 
through PEP 3118, but this seems to not be implemented. I don't know 
whether this was a conscious change or a lack of implementation (?).

PyArray_DATA will continue to transparently unmask data. However, with 
Travis' proposal of making a new 'ndmasked' type, old code will be 
protected; it will raise an exception for masked arrays instead of 
transparently unmasking, giving the user a chance to work around it (or 
update the code to work with masks).

Regarding new code that you write to be mask-aware, fear not -- you can 
use PyArray_DATA and PyArray_MASKNA_DATA to get the pointers. You can't 
really access the mask using np.ndarray[uint8] or uint8[:], but it 
wouldn't be a problem for NumPy to provide such access for Cython users.

Regarding native Cython support for masks, bitpatterns would be a quick 
job and an uncontroversial feature; we just need to agree on an 
extension to the PEP 3118 format string with NumPy and then it takes a 
few hours to implement it. Masks would require quite some hashing out on 
the Cython email list to figure out whether and how we would want to 
support them, and would be quite a bit more development work as well. How 
we'd even do that is much more vague to me.
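For context, this is what the PEP 3118 format string looks like for an ordinary double array today. The sketch only inspects the existing buffer interface; any format code for an NA-aware double would be a new extension that is not shown here, because none had been agreed on.

```python
import numpy as np

a = np.arange(3, dtype=np.float64)

# memoryview goes through the PEP 3118 buffer protocol.
view = memoryview(a)
print(view.format, view.itemsize, view.shape)
```

A consumer such as Cython matches on the format string, so an unrecognized format code would be rejected by existing code instead of being silently reinterpreted as plain doubles.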

Dag


Re: [Numpy-discussion] Masking through generator arrays

2012-05-10 Thread Mark Wiebe
On Thu, May 10, 2012 at 5:27 PM, Dag Sverre Seljebotn <
d.s.seljeb...@astro.uio.no> wrote:

> On 05/10/2012 08:23 PM, Chris Barker wrote:
> > On Thu, May 10, 2012 at 2:38 AM, Dag Sverre Seljebotn
> >   wrote:
> >> What would serve me? I use NumPy as a glorified "double*".
> >
> >> all I want is my glorified
> >> "double*". I'm probably not a representative user.)
> >
> > Actually, I think you are representative of a LOT of users -- it
> > turns out, whether Jim Hugunin originally was thinking this way or
> > not, that numpy arrays are really powerful because they provide BOTH a
> > nifty, full-featured array object in Python, AND a wrapper around a
> > generic "double*" (actually char*, that could be any type).
> >
> > This is a really widely used feature, and has become even more so
> > with Cython's numpy support.
> >
> > That is one of my concerns about the "bit pattern" idea -- we've then
> > created a new binary type that no other standard software understands
> > -- that looks like a lot of work to me to deal with, or even worse,
> > ripe for weird, non-obvious errors in code that accesses that good-old
> > char*.
> >
> > So I'm happier with a mask implementation -- more memory, yes, but it
> > seems more robust and easy to deal with from outside code.
>
> It's very interesting that you consider masks easier to integrate with
> C/C++ code than bitpatterns. I guess everybody's experience (and every
> C/C++/Fortran code base) is different.
>
> >
> > But either way, Dag's key point is right on -- in Cython (or any other
> > code) -- we need to make sure it's easy to get a regular old pointer
> > to a regular old C array, and not get something else by accident.
>
> I'm sorry if I caused any confusion -- I didn't mean to suggest that
> anybody would ever remove the ability of getting a pointer to an
> unmasked array.
>
> There is a problem that's being discussed of the opposite nature:
>
> With masked arrays, the current situation in NumPy trunk is that if
> you're presented with a masked array, and do not explicitly check for a
> mask (i.e., all existing code), you'll transparently and without warning
> "unmask" it -- that is, an element has the last value before NA was
> assigned. This is the case whether you use PEP 3118 (np.ndarray[double]
> or double[:]), or PyArray_DATA.
>
> According to the NEP, you should really get an exception when accessing
> through PEP 3118, but this seems to not be implemented. I don't know
> whether this was a conscious change or a lack of implementation (?).
>

This was an error; I've made a pull request to fix it.


> PyArray_DATA will continue to transparently unmask data. However, with
> Travis' proposal of making a new 'ndmasked' type, old code will be
> protected; it will raise an exception for masked arrays instead of
> transparently unmasking, giving the user a chance to work around it (or
> update the code to work with masks).
>

In searching for example code, the examples I found and the numpy
documentation recommend using the PyArray_FromAny or related functions to
sanitize the array before use. This provides a place to stop NA-masked
arrays and raise an exception. Is there a lot of code out there which isn't
following this practice?
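The "sanitize before use" pattern can be sketched at the Python level as follows. The function name is hypothetical, and numpy.ma stands in for NA-masked arrays, since the C-level check in PyArray_FromAny itself is not reproduced here.

```python
import numpy as np

def as_plain_double_array(obj):
    """Hypothetical gatekeeper: refuse masked input, else return a C-contiguous float64 array."""
    if isinstance(obj, np.ma.MaskedArray) and np.ma.is_masked(obj):
        raise TypeError("masked elements present; this routine is not mask-aware")
    return np.ascontiguousarray(obj, dtype=np.float64)

plain = as_plain_double_array([1, 2, 3])
print(plain.dtype, plain.flags["C_CONTIGUOUS"])
```

Code that funnels every input through one such entry point has a single place where masked arrays can be stopped with an exception.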

Cheers,
Mark


> Regarding new code that you write to be mask-aware, fear not -- you can
> use PyArray_DATA and PyArray_MASKNA_DATA to get the pointers. You can't
> really access the mask using np.ndarray[uint8] or uint8[:], but it
> wouldn't be a problem for NumPy to provide such access for Cython users.
>
> Regarding native Cython support for masks, bitpatterns would be a quick
> job and an uncontroversial feature, we just need to agree on an
> extension to the PEP 3118 format string with NumPy and then it takes a
> few hours to implement it. Masks would require quite some hashing out on
> the Cython email list to figure out whether and how we would want to
> support it, and is quite some more development work as well. How we'd
> even do that is much more vague to me.
>
> Dag


Re: [Numpy-discussion] Masking through generator arrays

2012-05-11 Thread Nathaniel Smith
On Thu, May 10, 2012 at 7:23 PM, Chris Barker  wrote:
> That is one of my concerns about the "bit pattern" idea -- we've then
> created a new binary type that no other standard software understands
> -- that looks like a lot of work to me to deal with, or even worse,
> ripe for weird, non-obvious errors in code that accesses that good-old
> char*.

Numpy supports a number of unusual binary data types, e.g. halfs and
datetimes, that aren't well supported by other standard software. As
Travis points out, no-one forces you to use them :-).
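As a quick illustration of what "unusual binary data types" means in practice, this sketch looks at the raw bits behind a half-precision float and a datetime64 value. The variable names are made up for the example.

```python
import numpy as np

half = np.array([1.0], dtype=np.float16)
day = np.array(["1970-01-02"], dtype="datetime64[D]")

# Reinterpret the same memory as plain integers.
half_bits = half.view(np.uint16)[0]
day_count = day.view(np.int64)[0]
print(hex(half_bits), day_count)
```

External code handed these buffers sees only 16-bit and 64-bit integers; the meaning comes from the dtype, which the caller chose explicitly.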

> So I'm happier with a mask implementation -- more memory, yes, but it
> seems more robust and easy to deal with from outside code.

Let's say we have a no-frills C function that we want to call, and
it's defined to use a mask:

  void do_calcs(double * data, char * mask, int size);

To call this function from Cython, then in the mask NAs world we do
something like:

  a = np.ascontiguousarray(a)
  do_calcs(PyArray_DATA(a), PyArray_MASK(a), a.size)

OTOH in the bitpattern NA world, we do something like:

  a = np.ascontiguousarray(a)
  mask = np.isNA(a)
  do_calcs(PyArray_DATA(a), PyArray_DATA(mask), a.size)

Of course there are various extra complexities that can come in here,
depending on what you want to do if there are no NAs possible, whether
do_calcs can take a NULL mask pointer, whether you're writing in C instead
of Cython (and so need the C equivalent functions), etc. But
IMHO there's no fundamental reason why bitpatterns have to be much
more complex to deal with in outside code than masks, assuming a
properly helpful API. What can't be papered over at the API level are
the questions like, do you want to be able to "un-assign" NA to reveal
what used to be there before? That needs masks, for better or worse.
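The two calling conventions can be mimicked in pure Python as a rough sketch. Here do_calcs is a stand-in for the C routine, NaN stands in for the NA bit pattern, and the convention that a nonzero mask byte means "skip this element" is an assumption made for the example.

```python
import numpy as np

def do_calcs(data, mask, size):
    """Stand-in for the C routine: sum the entries whose mask byte is 0."""
    total = 0.0
    for i in range(size):
        if not mask[i]:
            total += data[i]
    return total

# Mask world: values and mask are stored side by side.
values = np.array([1.0, 2.0, 3.0])
stored_mask = np.array([0, 1, 0], dtype=np.uint8)
from_mask = do_calcs(values, stored_mask, values.size)

# Bitpattern world, with NaN standing in for the NA pattern: the mask
# is derived on demand, and the value that used to be there is gone.
encoded = np.array([1.0, np.nan, 3.0])
derived_mask = np.isnan(encoded).astype(np.uint8)
from_bits = do_calcs(encoded, derived_mask, encoded.size)

print(from_mask, from_bits, values[1])
```

Both routes hand the routine the same pair of buffers and get the same result. The difference is that values[1] still holds 2.0 in the mask world, so it can be revealed again by clearing the mask byte.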

But I may well be missing something... does that address your concern,
or is there more to it?

-- Nathaniel