Re: [Numpy-discussion] Missing Data

2014-03-26 Thread Charles R Harris
On Wed, Mar 26, 2014 at 5:43 PM, alex  wrote:

> On Wed, Mar 26, 2014 at 7:22 PM, T J  wrote:
> > What is the status of:
> >
> >https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
>
> For what it's worth this NEP was written in 2011 by mwiebe who made
> 258 numpy commits in 2011, 1 in 2012, and 3 in 2014.  According to
> github, in the last few hours alone mwiebe has made several commits to
> 'blaze' and 'dynd-python'.  Here's the blog post explaining the vision
> for Continuum's 'blaze' project http://continuum.io/blog/blaze.
> Continuum seems to have been started in early 2012.
>

It looks like blaze will have bit pattern missing values à la R. I don't
know if there is going to be a masked array implementation. The NA code was
taken out of Numpy because it was not possible to reach agreement that it
did the right thing.

Numpy.ma remains the only solution for bad data at this time. The code
could probably use more love than it has gotten ;)
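
To make that concrete, here is a minimal numpy.ma sketch (the -999.0 sentinel
below is just an example value, nothing special to NumPy):

    import numpy as np
    import numpy.ma as ma

    data = np.array([1.0, 2.0, -999.0, 4.0])   # -999.0 marks a bad reading
    arr = ma.masked_values(data, -999.0)        # mask the sentinel value

    print(arr.mean())          # 2.333..., the masked element is ignored
    print(arr.filled(np.nan))  # [ 1.  2. nan  4.], export with NaN in the gaps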

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing Data

2014-03-26 Thread alex
On Wed, Mar 26, 2014 at 7:22 PM, T J  wrote:
> What is the status of:
>
>https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

For what it's worth this NEP was written in 2011 by mwiebe who made
258 numpy commits in 2011, 1 in 2012, and 3 in 2014.  According to
github, in the last few hours alone mwiebe has made several commits to
'blaze' and 'dynd-python'.  Here's the blog post explaining the vision
for Continuum's 'blaze' project http://continuum.io/blog/blaze.
Continuum seems to have been started in early 2012.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Missing Data

2014-03-26 Thread T J
What is the status of:

   https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

and of missing data in Numpy, more generally?

Is np.ma.array still the "state-of-the-art" way to handle missing data? Or
has something better and more comprehensive been put together?
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-14 Thread Richard Hattersley
For what it's worth, I'd prefer ndmasked.

As has been mentioned elsewhere, some algorithms can't really cope with
missing data. I'd very much rather they fail than silently give incorrect
results. Working in the climate prediction business (as with many other
domains I'm sure), even the *potential* for incorrect results can be
damaging.


On 11 May 2012 06:14, Travis Oliphant  wrote:

>
> On May 10, 2012, at 12:21 AM, Charles R Harris wrote:
>
>
>
> On Wed, May 9, 2012 at 11:05 PM, Benjamin Root  wrote:
>
>>
>>
>> On Wednesday, May 9, 2012, Nathaniel Smith wrote:
>>
>>>
>>>
>>> My only objection to this proposal is that committing to this approach
>>> seems premature. The existing masked array objects act quite
>>> differently from numpy.ma, so why do you believe that they're a good
>>> foundation for numpy.ma, and why will users want to switch to their
>>> semantics over numpy.ma's semantics? These aren't rhetorical
>>> questions, it seems like they must have concrete answers, but I don't
>>> know what they are.
>>>
>>
>> Based on the design decisions made in the original NEP, a re-made
>> numpy.ma would have to lose _some_ features, particularly the ability to
>> share masks. Save for that and some very obscure behaviors that are
>> undocumented, it is possible to remake numpy.ma as a compatibility layer.
>>
>> That being said, I think that there are some fundamental questions that
>> have me concerned. If I recall, there were unresolved questions about behaviors
>> surrounding assignments to elements of a view.
>>
>> I see the project as broken down like this:
>> 1.) internal architecture (largely abi issues)
>> 2.) external architecture (hooks throughout numpy to utilize the new
>> features where possible such as where= argument)
>> 3.) getter/setter semantics
>> 4.) mathematical semantics
>>
>> At this moment, I think we have pieces of 2 and they are fairly
>> non-controversial. It is 1 that I see as being the immediate hold-up here.
>> 3 & 4 are non-trivial, but because they are mostly about interfaces, I
>> think we can be willing to accept some very basic, fundamental, barebones
>> components here in order to lay the groundwork for a more complete API
>> later.
>>
>> To talk of Travis's proposal, doing nothing is no-go. Not moving forward
>> would dishearten the community. Making an ndmasked type is very intriguing.
>> I see it as a step towards eventually deprecating ndarray? Also, how would
>> it behave with np.asarray() and np.asanyarray()? My other concern is a
>> possible violation of DRY. How difficult would it be to maintain two
>> ndarrays in parallel?
>>
>> As for the flag approach, this still doesn't solve the problem of legacy
>> code (or did I misunderstand?)
>>
>
> My understanding of the flag is to allow the code to stay in and get
> reworked and experimented with while keeping it from contaminating
> conventional use.
>
> The whole point of putting the code in was to experiment and adjust. The
> rather bizarre idea that it needs to be perfect from the get go is
> disheartening, and is seldom how new things get developed. Sure, there is a
> plan up front, but there needs to be feedback and change. And in fact, I
> haven't seen much feedback about the actual code, I don't even know that
> the people complaining have tried using it to see where it hurts. I'd like
> that sort of feedback.
>
>
> I don't think anyone is saying it needs to be perfect from the get go.
>  What I am saying is that this is fundamental enough to downstream users
> that this kind of thing is best done as a separate object.  The flag could
> still be used to make all Python-level array constructors build ndmasked
> objects.
>
> But, this doesn't address the C-level story where there is quite a bit of
> downstream use where people have used the NumPy array as just a pointer to
> memory without considering that there might be a mask attached that should
> be inspected as well.
>
> The NEP addresses this a little bit for those C or C++ consumers of the
> ndarray in C who always use PyArray_FromAny which can fail if the array has
> non-NULL mask contents.   However, it is *not* true that all downstream
> users use PyArray_FromAny.
>
> A large number of users just use something like PyArray_Check and then
> PyArray_DATA to get the pointer to the data buffer and then go from there
> thinking of their data as a strided memory chunk only (no extra mask).
>  The NEP fundamentally changes this simple invariant that has been in NumPy
> and Numeric before it for a long, long time.
>
> I really don't see how we can do this in a 1.7 release.  It has too many
> unknowns and I think unknowable downstream effects.  But, I think we could
> introduce another arrayobject that is the masked_array with a Python-level
> flag that makes it the default array in Python.
>
> There are a few more subtleties,  PyArray_Check by default will pass
> sub-classes so if the new ndmask array were a sub-class then it would be
> passed (just like current numpy.ma arrays and matrices would pass that check
> today).  However, there is a PyArray_CheckExact macro which could be used to
> ensure the object was actually of PyArray_Type.  There is also the
> PyArg_ParseTuple command with "O!" that I have seen used many times to ensure
> an exact NumPy array.

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Mark Wiebe
On Thu, May 10, 2012 at 10:28 PM, Matthew Brett wrote:

> Hi,
>
> On Thu, May 10, 2012 at 2:43 AM, Nathaniel Smith  wrote:
> > Hi Matthew,
> >
> > On Thu, May 10, 2012 at 12:01 AM, Matthew Brett 
> wrote:
> >>> The third proposal is certainly the best one from Cython's perspective;
> >>> and I imagine for those writing C extensions against the C API too.
> >>> Having PyType_Check fail for ndmasked is a very good way of having code
> >>> fail that is not written to take masks into account.
> >>
> >> Mark, Nathaniel - can you comment how your chosen approaches would
> >> interact with extension code?
> >>
> >> I'm guessing the bitpattern dtypes would be expected to cause
> >> extension code to choke if the type is not supported?
> >
> > That's pretty much how I'm imagining it, yes. Right now if you have,
> > say, a Cython function like
> >
> > cdef f(np.ndarray[double] a):
> >...
> >
> > and you do f(np.zeros(10, dtype=int)), then it will error out, because
> > that function doesn't know how to handle ints, only doubles. The same
> > would apply for, say, a NA-enabled integer. In general there are
> > almost arbitrarily many dtypes that could get passed into any function
> > (including user-defined ones, etc.), so C code already has to check
> > dtypes for correctness.
> >
> > Second order issues:
> > - There is certainly C code out there that just assumes that it will
> > only be passed an array with certain dtype (and ndim, memory layout,
> > etc...). If you write such C code then it's your job to make sure that
> > you only pass it the kinds of arrays that it expects, just like now
> > :-).
> >
> > - We may want to do some sort of special-casing of handling for
> > floating point NA dtypes that use an NaN as the "magic" bitpattern,
> > since many algorithms *will* work with these unchanged, and it might
> > be frustrating to have to wait for every extension module to be
> > updated just to allow for this case explicitly before using them. OTOH
> > you can easily work around this. Like say my_qr is a legacy C function
> > that will in fact propagate NaNs correctly, so float NA dtypes would
> > Just Work -- except, it errors out at the start because it doesn't
> > recognize the dtype. How annoying. We *could* have some special hack
> > you can use to force it to work anyway (by like making the "is this
> > the dtype I expect?" routine lie.) But you can also just do:
> >
> >  def my_qr_wrapper(arr):
> >if arr.dtype is a NA float dtype with NaN magic value:
> >  result = my_qr(arr.view(arr.dtype.base_dtype))
> >  return result.view(arr.dtype)
> >else:
> >  return my_qr(arr)
> >
> > and hey presto, now it will correctly pass through NAs. So perhaps
> > it's not worth bothering with special hacks.
> >
> > - Of course if  your extension function does want to handle NAs
> > generically, then there will be a simple C api for checking for them,
> > setting them, etc. Numpy needs such an API internally anyway!
>
> Thanks for this.
>
> Mark - in view of the discussions about Cython and extension code -
> could you say what you see as disadvantages to the ndmasked subclass
> proposal?
>

The biggest difficulty looks to me like how to work with both of them
reasonably from the C API. The idea of ndarray and ndmasked having
different independent TypeObjects, but still working through the same API
calls feels a little disconcerting. Maybe this is a reasonable compromise,
though, it would be nice to see the idea fleshed out a bit more with some
examples of how the code would work from the C level.

Cheers,
Mark


>
> Cheers,
>
> Matthew
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Travis Oliphant

On May 11, 2012, at 2:13 AM, Fernando Perez wrote:

> On Thu, May 10, 2012 at 11:44 PM, Scott Sinclair
>  wrote:
>> That's pretty much how things already work. The documentation is in
>> the main source tree and built docs end up at http://docs.scipy.org.
>> NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but
>> don't get published outside of the source tree and there's no
>> "preferred" place for discussion documents.
> 
> No, b/c that means that for someone to be able to push to a NEP,
> they'd have to get commit rights to the main numpy source code repo.
> The whole point of what I'm suggesting is to isolate the NEP repo so
> that commit rights can be given for it with minimal thought, whenever
> pretty much anyone says they're going to work on a NEP.
> 
> Obviously today anyone can do that and submit a PR against the main
> repo, but that raises the PR review burden for said repo.  And that
> burden is something that we should strive to keep as low as possible,
> so those key people (the team with commit rights to the main repo) can
> focus their limited resources on reviewing code PRs.
> 
> I'm simply suggesting a way to spread the load as much as possible, so
> that the team with commit rights on the main repo isn't a bottleneck
> on other tasks.

This is a good idea, I think.  I like the thought of a separate NEP and docs
repo.

-Travis


> 
> Cheers,
> 
> f
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-11 Thread Fernando Perez
On Thu, May 10, 2012 at 11:44 PM, Scott Sinclair
 wrote:
> That's pretty much how things already work. The documentation is in
> the main source tree and built docs end up at http://docs.scipy.org.
> NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but
> don't get published outside of the source tree and there's no
> "preferred" place for discussion documents.

No, b/c that means that for someone to be able to push to a NEP,
they'd have to get commit rights to the main numpy source code repo.
The whole point of what I'm suggesting is to isolate the NEP repo so
that commit rights can be given for it with minimal thought, whenever
pretty much anyone says they're going to work on a NEP.

Obviously today anyone can do that and submit a PR against the main
repo, but that raises the PR review burden for said repo.  And that
burden is something that we should strive to keep as low as possible,
so those key people (the team with commit rights to the main repo) can
focus their limited resources on reviewing code PRs.

I'm simply suggesting a way to spread the load as much as possible, so
that the team with commit rights on the main repo isn't a bottleneck
on other tasks.

Cheers,

f
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Scott Sinclair
On 11 May 2012 08:12, Fernando Perez  wrote:
> On Thu, May 10, 2012 at 11:03 PM, Scott Sinclair
>  wrote:
>> Having thought about it, a page on the website isn't a bad idea. I've
>> added a note pointing to this discussion. The document now appears at
>> http://numpy.scipy.org/NA-overview.html
>
> Why not have a separate repo for neps/discussion docs?  That way,
> people can be added to the team as they need to edit them and removed
> when done, and it's separate from the main site itself.  The site can
> simply have a link to this set of documents, which can be built,
> tracked, separately and cleanly.  We have more or less that setup with
> ipython for the site and docs:
>
> - main site page that points to the doc builds:
> http://ipython.org/documentation.html
> - doc builds on a secondary site:
> http://ipython.org/ipython-doc/stable/index.html

That's pretty much how things already work. The documentation is in
the main source tree and built docs end up at http://docs.scipy.org.
NEPs live at https://github.com/numpy/numpy/tree/master/doc/neps, but
don't get published outside of the source tree and there's no
"preferred" place for discussion documents.

> (assuming we'll have a nice website for numpy one day)

Ha ha ha ;-) Thanks for the thoughts and prodding.

Cheers,
Scott
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Fernando Perez
On Thu, May 10, 2012 at 11:03 PM, Scott Sinclair
 wrote:
> Having thought about it, a page on the website isn't a bad idea. I've
> added a note pointing to this discussion. The document now appears at
> http://numpy.scipy.org/NA-overview.html

Why not have a separate repo for neps/discussion docs?  That way,
people can be added to the team as they need to edit them and removed
when done, and it's separate from the main site itself.  The site can
simply have a link to this set of documents, which can be built,
tracked, separately and cleanly.  We have more or less that setup with
ipython for the site and docs:

- main site page that points to the doc builds:
http://ipython.org/documentation.html
- doc builds on a secondary site:
http://ipython.org/ipython-doc/stable/index.html

This seems to me like the best way to separate the main web team
(assuming we'll have a nice website for numpy one day) from the team
that will edit documents of nep/discussion type.  I imagine the web
team will be fairly stable, where as the team for these docs will have
people coming and going.

Just a thought...  As usual, crib anything you find useful from our setup.

Cheers,

f
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Scott Sinclair
On 11 May 2012 06:57, Travis Oliphant  wrote:
>
> On May 10, 2012, at 3:40 AM, Scott Sinclair wrote:
>
>> On 9 May 2012 18:46, Travis Oliphant  wrote:
>>> The document is available here:
>>>    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>>
>> This is orthogonal to the discussion, but I'm curious as to why this
>> discussion document has landed in the website repo?
>>
>> I suppose it's not a really big deal, but future uploads of the
>> website will now include a page at
>> http://numpy.scipy.org/NA-overview.html with the content of this
>> document. If that's desirable, I'll add a note at the top of the
>> overview referencing this discussion thread. If not it can be
>> relocated somewhere more desirable after this thread's discussion
>> deadline expires..
>
> Yes, it can be relocated.   Can you suggest where it should go?  It was added 
> there so that Nathaniel and Mark could both edit it together with Nathaniel 
> added to the web-team.
>
> It may not be a bad place for it, though.   At least for a while.

Having thought about it, a page on the website isn't a bad idea. I've
added a note pointing to this discussion. The document now appears at
http://numpy.scipy.org/NA-overview.html

Cheers,
Scott
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Travis Oliphant

On May 10, 2012, at 12:21 AM, Charles R Harris wrote:

> 
> 
> On Wed, May 9, 2012 at 11:05 PM, Benjamin Root  wrote:
> 
> 
> On Wednesday, May 9, 2012, Nathaniel Smith wrote:
> 
> 
> My only objection to this proposal is that committing to this approach
> seems premature. The existing masked array objects act quite
> differently from numpy.ma, so why do you believe that they're a good
> foundation for numpy.ma, and why will users want to switch to their
> semantics over numpy.ma's semantics? These aren't rhetorical
> questions, it seems like they must have concrete answers, but I don't
> know what they are.
> 
> Based on the design decisions made in the original NEP, a re-made numpy.ma 
> would have to lose _some_ features, particularly the ability to share masks. 
> Save for that and some very obscure behaviors that are undocumented, it is 
> possible to remake numpy.ma as a compatibility layer.
> 
> That being said, I think that there are some fundamental questions that have 
> me concerned. If I recall, there were unresolved questions about behaviors 
> surrounding assignments to elements of a view.
> 
> I see the project as broken down like this:
> 1.) internal architecture (largely abi issues)
> 2.) external architecture (hooks throughout numpy to utilize the new features 
> where possible such as where= argument)
> 3.) getter/setter semantics
> 4.) mathematical semantics
> 
> At this moment, I think we have pieces of 2 and they are fairly 
> non-controversial. It is 1 that I see as being the immediate hold-up here. 3 
> & 4 are non-trivial, but because they are mostly about interfaces, I think we 
> can be willing to accept some very basic, fundamental, barebones components 
> here in order to lay the groundwork for a more complete API later.
> 
> To talk of Travis's proposal, doing nothing is no-go. Not moving forward 
> would dishearten the community. Making an ndmasked type is very intriguing. I 
> see it as a step towards eventually deprecating ndarray? Also, how would it 
> behave with np.asarray() and np.asanyarray()? My other concern is a possible 
> violation of DRY. How difficult would it be to maintain two ndarrays in 
> parallel?  
> 
> As for the flag approach, this still doesn't solve the problem of legacy code 
> (or did I misunderstand?)
> 
> My understanding of the flag is to allow the code to stay in and get reworked 
> and experimented with while keeping it from contaminating conventional use.
> 
> The whole point of putting the code in was to experiment and adjust. The 
> rather bizarre idea that it needs to be perfect from the get go is 
> disheartening, and is seldom how new things get developed. Sure, there is a 
> plan up front, but there needs to be feedback and change. And in fact, I 
> haven't seen much feedback about the actual code, I don't even know that the 
> people complaining have tried using it to see where it hurts. I'd like that 
> sort of feedback.
> 

I don't think anyone is saying it needs to be perfect from the get go.  What 
I am saying is that this is fundamental enough to downstream users that this 
kind of thing is best done as a separate object.  The flag could still be used 
to make all Python-level array constructors build ndmasked objects.  

But, this doesn't address the C-level story where there is quite a bit of 
downstream use where people have used the NumPy array as just a pointer to 
memory without considering that there might be a mask attached that should be 
inspected as well. 

The NEP addresses this a little bit for those C or C++ consumers of the ndarray 
in C who always use PyArray_FromAny which can fail if the array has non-NULL 
mask contents.   However, it is *not* true that all downstream users use 
PyArray_FromAny. 

A large number of users just use something like PyArray_Check and then 
PyArray_DATA to get the pointer to the data buffer and then go from there 
thinking of their data as a strided memory chunk only (no extra mask).  The 
NEP fundamentally changes this simple invariant that has been in NumPy and 
Numeric before it for a long, long time. 

I really don't see how we can do this in a 1.7 release.  It has too many 
unknowns and I think unknowable downstream effects.  But, I think we could 
introduce another arrayobject that is the masked_array with a Python-level flag 
that makes it the default array in Python. 

There are a few more subtleties,  PyArray_Check by default will pass 
sub-classes so if the new ndmask array were a sub-class then it would be passed 
(just like current numpy.ma arrays and matrices would pass that check today).   
 However, there is a PyArray_CheckExact macro which could be used to ensure the 
object was actually of PyArray_Type.   There is also the PyArg_ParseTuple 
command with "O!" that I have seen used many times to ensure an exact NumPy 
array.  
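
The Python-level analogue of that distinction may help readers who don't work
in the C API: isinstance() plays the role of PyArray_Check and accepts
subclasses, while an exact type test plays the role of PyArray_CheckExact. A
small sketch:

    import numpy as np
    import numpy.ma as ma

    a = np.zeros(3)
    m = ma.zeros(3)              # MaskedArray, a subclass of ndarray

    isinstance(a, np.ndarray)    # True  (PyArray_Check-style: subclasses pass)
    isinstance(m, np.ndarray)    # True

    type(a) is np.ndarray        # True  (PyArray_CheckExact-style: exact type only)
    type(m) is np.ndarray        # False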

-Travis






> Chuck
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Travis Oliphant

On May 10, 2012, at 3:40 AM, Scott Sinclair wrote:

> On 9 May 2012 18:46, Travis Oliphant  wrote:
>> The document is available here:
>>https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
> 
> This is orthogonal to the discussion, but I'm curious as to why this
> discussion document has landed in the website repo?
> 
> I suppose it's not a really big deal, but future uploads of the
> website will now include a page at
> http://numpy.scipy.org/NA-overview.html with the content of this
> document. If that's desirable, I'll add a note at the top of the
> overview referencing this discussion thread. If not it can be
> relocated somewhere more desirable after this thread's discussion
> deadline expires..

Yes, it can be relocated.   Can you suggest where it should go?  It was added 
there so that Nathaniel and Mark could both edit it together with Nathaniel 
added to the web-team. 

It may not be a bad place for it, though.   At least for a while. 

-Travis


> 
> Cheers,
> Scott
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Matthew Brett
Hi,

On Thu, May 10, 2012 at 2:43 AM, Nathaniel Smith  wrote:
> Hi Matthew,
>
> On Thu, May 10, 2012 at 12:01 AM, Matthew Brett  
> wrote:
>>> The third proposal is certainly the best one from Cython's perspective;
>>> and I imagine for those writing C extensions against the C API too.
>>> Having PyType_Check fail for ndmasked is a very good way of having code
>>> fail that is not written to take masks into account.
>>
>> Mark, Nathaniel - can you comment how your chosen approaches would
>> interact with extension code?
>>
>> I'm guessing the bitpattern dtypes would be expected to cause
>> extension code to choke if the type is not supported?
>
> That's pretty much how I'm imagining it, yes. Right now if you have,
> say, a Cython function like
>
> cdef f(np.ndarray[double] a):
>    ...
>
> and you do f(np.zeros(10, dtype=int)), then it will error out, because
> that function doesn't know how to handle ints, only doubles. The same
> would apply for, say, a NA-enabled integer. In general there are
> almost arbitrarily many dtypes that could get passed into any function
> (including user-defined ones, etc.), so C code already has to check
> dtypes for correctness.
>
> Second order issues:
> - There is certainly C code out there that just assumes that it will
> only be passed an array with certain dtype (and ndim, memory layout,
> etc...). If you write such C code then it's your job to make sure that
> you only pass it the kinds of arrays that it expects, just like now
> :-).
>
> - We may want to do some sort of special-casing of handling for
> floating point NA dtypes that use an NaN as the "magic" bitpattern,
> since many algorithms *will* work with these unchanged, and it might
> be frustrating to have to wait for every extension module to be
> updated just to allow for this case explicitly before using them. OTOH
> you can easily work around this. Like say my_qr is a legacy C function
> that will in fact propagate NaNs correctly, so float NA dtypes would
> Just Work -- except, it errors out at the start because it doesn't
> recognize the dtype. How annoying. We *could* have some special hack
> you can use to force it to work anyway (by like making the "is this
> the dtype I expect?" routine lie.) But you can also just do:
>
>  def my_qr_wrapper(arr):
>    if arr.dtype is a NA float dtype with NaN magic value:
>      result = my_qr(arr.view(arr.dtype.base_dtype))
>      return result.view(arr.dtype)
>    else:
>      return my_qr(arr)
>
> and hey presto, now it will correctly pass through NAs. So perhaps
> it's not worth bothering with special hacks.
>
> - Of course if  your extension function does want to handle NAs
> generically, then there will be a simple C api for checking for them,
> setting them, etc. Numpy needs such an API internally anyway!

Thanks for this.

Mark - in view of the discussions about Cython and extension code -
could you say what you see as disadvantages to the ndmasked subclass
proposal?

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Nathaniel Smith
Hi Matthew,

On Thu, May 10, 2012 at 12:01 AM, Matthew Brett  wrote:
>> The third proposal is certainly the best one from Cython's perspective;
>> and I imagine for those writing C extensions against the C API too.
>> Having PyType_Check fail for ndmasked is a very good way of having code
>> fail that is not written to take masks into account.
>
> Mark, Nathaniel - can you comment how your chosen approaches would
> interact with extension code?
>
> I'm guessing the bitpattern dtypes would be expected to cause
> extension code to choke if the type is not supported?

That's pretty much how I'm imagining it, yes. Right now if you have,
say, a Cython function like

cdef f(np.ndarray[double] a):
...

and you do f(np.zeros(10, dtype=int)), then it will error out, because
that function doesn't know how to handle ints, only doubles. The same
would apply for, say, a NA-enabled integer. In general there are
almost arbitrarily many dtypes that could get passed into any function
(including user-defined ones, etc.), so C code already has to check
dtypes for correctness.

Second order issues:
- There is certainly C code out there that just assumes that it will
only be passed an array with certain dtype (and ndim, memory layout,
etc...). If you write such C code then it's your job to make sure that
you only pass it the kinds of arrays that it expects, just like now
:-).

- We may want to do some sort of special-casing of handling for
floating point NA dtypes that use an NaN as the "magic" bitpattern,
since many algorithms *will* work with these unchanged, and it might
be frustrating to have to wait for every extension module to be
updated just to allow for this case explicitly before using them. OTOH
you can easily work around this. Like say my_qr is a legacy C function
that will in fact propagate NaNs correctly, so float NA dtypes would
Just Work -- except, it errors out at the start because it doesn't
recognize the dtype. How annoying. We *could* have some special hack
you can use to force it to work anyway (by like making the "is this
the dtype I expect?" routine lie.) But you can also just do:

  def my_qr_wrapper(arr):
if arr.dtype is a NA float dtype with NaN magic value:
  result = my_qr(arr.view(arr.dtype.base_dtype))
  return result.view(arr.dtype)
else:
  return my_qr(arr)

and hey presto, now it will correctly pass through NAs. So perhaps
it's not worth bothering with special hacks.

- Of course if  your extension function does want to handle NAs
generically, then there will be a simple C api for checking for them,
setting them, etc. Numpy needs such an API internally anyway!
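
A rough, runnable analogue of that wrapper idea with today's tools, using
numpy.ma plus NaN propagation in place of the proposed NA float dtypes
(my_legacy_qr below is a made-up stand-in for any routine that happens to
propagate NaNs):

    import numpy as np
    import numpy.ma as ma

    def my_legacy_qr(a):
        # stand-in for legacy code that knows nothing about masks but
        # propagates NaNs element-wise
        return a * 2.0

    def my_qr_wrapper(arr):
        if isinstance(arr, ma.MaskedArray):
            # strip the mask, let NaNs carry the missing values through,
            # then re-attach a mask to the result
            result = my_legacy_qr(arr.filled(np.nan))
            return ma.masked_invalid(result)
        return my_legacy_qr(arr)

    x = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
    print(my_qr_wrapper(x))   # [2.0 -- 6.0]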

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Dag Sverre Seljebotn
On 05/10/2012 06:05 AM, Dag Sverre Seljebotn wrote:
> On 05/10/2012 01:01 AM, Matthew Brett wrote:
>> Hi,
>>
>> On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
>>wrote:
>>> On 05/09/2012 06:46 PM, Travis Oliphant wrote:
 Hey all,

 Nathaniel and Mark have worked very hard on a joint document to try and
 explain the current status of the missing-data debate. I think they've
 done an amazing job at providing some context, articulating their views
 and suggesting ways forward in a mutually respectful manner. This is an
 exemplary collaboration and is at the core of why open source is valuable.

 The document is available here:
 https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

 After reading that document, it appears to me that there are some
 fundamentally different views on how things should move forward. I'm
 also reading the document incorporating my understanding of the history,
 of NumPy as well as all of the users I've met and interacted with which
 means I have my own perspective that is not necessarily incorporated
 into that document but informs my recommendations. I'm not sure we can
 reach full consensus on this. We are also well past time for moving
 forward with a resolution on this (perhaps we can all agree on that).

 I would like one more discussion thread where the technical discussion
 can take place. I will make a plea that we keep this discussion as free
 from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
 we can. I can't guarantee that I personally will succeed at that, but I
 can tell you that I will try. That's all I'm asking of anyone else. I
 recognize that there are a lot of other issues at play here besides
 *just* the technical questions, but we are not going to resolve every
 community issue in this technical thread.

 We need concrete proposals and so I will start with three. Please feel
 free to comment on these proposals or add your own during the
 discussion. I will stop paying attention to this thread next Wednesday
 (May 16th) (or earlier if the thread dies) and hope that by that time we
 can agree on a way forward. If we don't have agreement, then I will move
 forward with what I think is the right approach. I will either write the
 code myself or convince someone else to write it.

 In all cases, we have agreement that bit-pattern dtypes should be added
 to NumPy. We should work on these (int32, float64, complex64, str, bool)
 to start. So, the three proposals are independent of this way forward.
 The proposals are all about the extra mask part:

 My three proposals:

 * do nothing and leave things as is

 * add a global flag that turns off masked array support by default but
 otherwise leaves things unchanged (I'm still unclear how this would work
 exactly)

 * move Mark's "masked ndarray objects" into a new fundamental type
 (ndmasked), leaving the actual ndarray type unchanged. The
 array_interface keeps the masked array notions and the ufuncs keep the
 ability to handle arrays like ndmasked. Ideally, numpy.ma
    would be changed to use ndmasked objects as their core.

 For the record, I'm currently in favor of the third proposal. Feel free
 to comment on these proposals (or provide your own).

>>>
>>> Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!
>>
>> Yes, it is very well written, my compliments to the chefs.
>>
>>> The third proposal is certainly the best one from Cython's perspective;
>>> and I imagine for those writing C extensions against the C API too.
>>> Having PyType_Check fail for ndmasked is a very good way of having code
>>> fail that is not written to take masks into account.
>
> I want to make something more clear: There are two Cython cases; in the
> case of "cdef np.ndarray[double]" there is no problem as PEP 3118 access
> will raise an exception for masked arrays.
>
> But, there's the case where you do "cdef np.ndarray", and then proceed
> to use PyArray_DATA. Myself I do this more than PEP 3118 access; usually
> because I pass the data pointer to some C or C++ code.
>
> It'd be great to have such code be forward-compatible in the sense that
> it raises an exception when it meets a masked array. Having PyType_Check
> fail seems like the only way? Am I wrong?

I'm very sorry; I always meant PyObject_TypeCheck, not PyType_Check.

Dag

>
>
>> Mark, Nathaniel - can you comment how your chosen approaches would
>> interact with extension code?
>>
>> I'm guessing the bitpattern dtypes would be expected to cause
>> extension code to choke if the type is not supported?
>
> The proposal, as I understand it, is to use that with new dtypes (?). So
> things will often be fine for that reason:
>
> if arr.dtype == np.float32:
>     c_function_32bit(np.PyArray_DATA(arr), ...)
> else:
>     raise ValueError("need 32-bit float array")

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Scott Sinclair
On 9 May 2012 18:46, Travis Oliphant  wrote:
> The document is available here:
>    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

This is orthogonal to the discussion, but I'm curious as to why this
discussion document has landed in the website repo?

I suppose it's not a really big deal, but future uploads of the
website will now include a page at
http://numpy.scipy.org/NA-overview.html with the content of this
document. If that's desirable, I'll add a note at the top of the
overview referencing this discussion thread. If not it can be
relocated somewhere more desirable after this thread's discussion
deadline expires..

Cheers,
Scott
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-10 Thread Gael Varoquaux
On Wed, May 09, 2012 at 02:35:26PM -0500, Travis Oliphant wrote:
>  Basically it buys not forcing *all* NumPy users (on the C-API level) to
> now deal with a masked array.    I know this push is a feature that is
>part of Mark's intention (as it pushes downstream libraries to think about
>missing data at a fundamental level). 

I think that this is a bad policy because:

 1. An array is not always data. I realize that there is a big push for
data-related computing lately, but I still believe that the notion of
missing data makes no sense for the majority of numpy arrays
instantiated.

 2. Not every algorithm can be made to work with missing data. I would
even say that most of the advanced algorithms do not work with missing
data.

Don't try to force upon people a problem that they do not have :).

Gael

PS: This message does not claim to take any position in the debate on
which solution for missing data is the best, because I don't think that I
have a good technical vision to back any position.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 11:05 PM, Benjamin Root  wrote:

>
>
> On Wednesday, May 9, 2012, Nathaniel Smith wrote:
>
>>
>>
>> My only objection to this proposal is that committing to this approach
>> seems premature. The existing masked array objects act quite
>> differently from numpy.ma, so why do you believe that they're a good
>> foundation for numpy.ma, and why will users want to switch to their
>> semantics over numpy.ma's semantics? These aren't rhetorical
>> questions, it seems like they must have concrete answers, but I don't
>> know what they are.
>>
>
> Based on the design decisions made in the original NEP, a re-made 
> numpy.ma would have to lose _some_ features, particularly the ability to share
> masks. Save for that and some very obscure behaviors that are undocumented,
> it is possible to remake numpy.ma as a compatibility layer.
>
> That being said, I think that there are some fundamental questions that
> have me concerned. If I recall, there were unresolved questions about behaviors
> surrounding assignments to elements of a view.
>
> I see the project as broken down like this:
> 1.) internal architecture (largely abi issues)
> 2.) external architecture (hooks throughout numpy to utilize the new
> features where possible such as where= argument)
> 3.) getter/setter semantics
> 4.) mathematical semantics
>
> At this moment, I think we have pieces of 2 and they are fairly
> non-controversial. It is 1 that I see as being the immediate hold-up here.
> 3 & 4 are non-trivial, but because they are mostly about interfaces, I
> think we can be willing to accept some very basic, fundamental, barebones
> components here in order to lay the groundwork for a more complete API
> later.
>
> To talk of Travis's proposal, doing nothing is no-go. Not moving forward
> would dishearten the community. Making an ndmasked type is very intriguing.
> I see it as a step towards eventually deprecating ndarray? Also, how would
> it behave with np.asarray() and np.asanyarray()? My other concern is a
> possible violation of DRY. How difficult would it be to maintain two
> ndarrays in parallel?
>
> As for the flag approach, this still doesn't solve the problem of legacy
> code (or did I misunderstand?)
>

My understanding of the flag is to allow the code to stay in and get
reworked and experimented with while keeping it from contaminating
conventional use.

The whole point of putting the code in was to experiment and adjust. The
rather bizarre idea that it needs to be perfect from the get go is
disheartening, and is seldom how new things get developed. Sure, there is a
plan up front, but there needs to be feedback and change. And in fact, I
haven't seen much feedback about the actual code, I don't even know that
the people complaining have tried using it to see where it hurts. I'd like
that sort of feedback.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Benjamin Root
On Wednesday, May 9, 2012, Nathaniel Smith wrote:

>
>
> My only objection to this proposal is that committing to this approach
> seems premature. The existing masked array objects act quite
> differently from numpy.ma, so why do you believe that they're a good
> foundation for numpy.ma, and why will users want to switch to their
> semantics over numpy.ma's semantics? These aren't rhetorical
> questions, it seems like they must have concrete answers, but I don't
> know what they are.
>

Based on the design decisions made in the original NEP, a re-made
numpy.ma would have to lose _some_ features, particularly the ability to share
masks. Save for that and some very obscure behaviors that are undocumented,
it is possible to remake numpy.ma as a compatibility layer.

That being said, I think that there are some fundamental questions that have
me concerned. If I recall, there were unresolved questions about behaviors
surrounding assignments to elements of a view.

I see the project as broken down like this:
1.) internal architecture (largely abi issues)
2.) external architecture (hooks throughout numpy to utilize the new
features where possible such as where= argument)
3.) getter/setter semantics
4.) mathematical semantics

At this moment, I think we have pieces of 2 and they are fairly
non-controversial. It is 1 that I see as being the immediate hold-up here.
3 & 4 are non-trivial, but because they are mostly about interfaces, I
think we can be willing to accept some very basic, fundamental, barebones
components here in order to lay the groundwork for a more complete API
later.

To talk of Travis's proposal, doing nothing is no-go. Not moving forward
would dishearten the community. Making an ndmasked type is very intriguing.
I see it as a step towards eventually deprecating ndarray? Also, how would
it behave with np.asarray() and np.asanyarray()? My other concern is a
possible violation of DRY. How difficult would it be to maintain two
ndarrays in parallel?

As for the flag approach, this still doesn't solve the problem of legacy
code (or did I misunderstand?)

Cheers!
Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Dag Sverre Seljebotn
On 05/10/2012 01:01 AM, Matthew Brett wrote:
> Hi,
>
> On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
>   wrote:
>> On 05/09/2012 06:46 PM, Travis Oliphant wrote:
>>> Hey all,
>>>
>>> Nathaniel and Mark have worked very hard on a joint document to try and
>>> explain the current status of the missing-data debate. I think they've
>>> done an amazing job at providing some context, articulating their views
>>> and suggesting ways forward in a mutually respectful manner. This is an
>>> exemplary collaboration and is at the core of why open source is valuable.
>>>
>>> The document is available here:
>>> https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>>>
>>> After reading that document, it appears to me that there are some
>>> fundamentally different views on how things should move forward. I'm
>>> also reading the document incorporating my understanding of the history,
>>> of NumPy as well as all of the users I've met and interacted with which
>>> means I have my own perspective that is not necessarily incorporated
>>> into that document but informs my recommendations. I'm not sure we can
>>> reach full consensus on this. We are also well past time for moving
>>> forward with a resolution on this (perhaps we can all agree on that).
>>>
>>> I would like one more discussion thread where the technical discussion
>>> can take place. I will make a plea that we keep this discussion as free
>>> from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
>>> we can. I can't guarantee that I personally will succeed at that, but I
>>> can tell you that I will try. That's all I'm asking of anyone else. I
>>> recognize that there are a lot of other issues at play here besides
>>> *just* the technical questions, but we are not going to resolve every
>>> community issue in this technical thread.
>>>
>>> We need concrete proposals and so I will start with three. Please feel
>>> free to comment on these proposals or add your own during the
>>> discussion. I will stop paying attention to this thread next Wednesday
>>> (May 16th) (or earlier if the thread dies) and hope that by that time we
>>> can agree on a way forward. If we don't have agreement, then I will move
>>> forward with what I think is the right approach. I will either write the
>>> code myself or convince someone else to write it.
>>>
>>> In all cases, we have agreement that bit-pattern dtypes should be added
>>> to NumPy. We should work on these (int32, float64, complex64, str, bool)
>>> to start. So, the three proposals are independent of this way forward.
>>> The proposals are all about the extra mask part:
>>>
>>> My three proposals:
>>>
>>> * do nothing and leave things as is
>>>
>>> * add a global flag that turns off masked array support by default but
>>> otherwise leaves things unchanged (I'm still unclear how this would work
>>> exactly)
>>>
>>> * move Mark's "masked ndarray objects" into a new fundamental type
>>> (ndmasked), leaving the actual ndarray type unchanged. The
>>> array_interface keeps the masked array notions and the ufuncs keep the
>>> ability to handle arrays like ndmasked. Ideally, numpy.ma
>>>   would be changed to use ndmasked objects as their core.
>>>
>>> For the record, I'm currently in favor of the third proposal. Feel free
>>> to comment on these proposals (or provide your own).
>>>
>>
>> Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!
>
> Yes, it is very well written, my compliments to the chefs.
>
>> The third proposal is certainly the best one from Cython's perspective;
>> and I imagine for those writing C extensions against the C API too.
>> Having PyType_Check fail for ndmasked is a very good way of having code
>> fail that is not written to take masks into account.

I want to make something more clear: There are two Cython cases; in the 
case of "cdef np.ndarray[double]" there is no problem as PEP 3118 access 
will raise an exception for masked arrays.

But, there's the case where you do "cdef np.ndarray", and then proceed 
to use PyArray_DATA. Myself I do this more than PEP 3118 access; usually 
because I pass the data pointer to some C or C++ code.

It'd be great to have such code be forward-compatible in the sense that 
it raises an exception when it meets a masked array. Having PyType_Check 
fail seems like the only way? Am I wrong?


> Mark, Nathaniel - can you comment how your chosen approaches would
> interact with extension code?
>
> I'm guessing the bitpattern dtypes would be expected to cause
> extension code to choke if the type is not supported?

The proposal, as I understand it, is to use that with new dtypes (?). So 
things will often be fine for that reason:

if arr.dtype == np.float32:
 c_function_32bit(np.PyArray_DATA(arr), ...)
else:
 raise ValueError("need 32-bit float array")


>
> Mark - in :
>
> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython
>
> - do I understand correctly that you think

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 6:13 PM, Paul Ivanov  wrote:

>
>
> On Wed, May 9, 2012 at 3:12 PM, Travis Oliphant wrote:
>
>> On re-reading, I want to make a couple of things clear:
>>
>> 1) This "wrap-up" discussion is *only* for what to do for NumPy 1.7 in
>> such a way that we don't tie our hands in the future.I do not believe
>> we can figure out what to do for masked arrays in one short week.   What
>> happens beyond NumPy 1.7 should be still discussed and explored.My
>> urgency is entirely about moving forward from where we are in master right
>> now in a direction that we can all accept.  The tight timeline is so
>> that we do *something* and move forward.
>>
>> 2) I missed another possible proposal for NumPy 1.7 which is in the
>> write-up that Mark and Nathaniel made:  remove the masked array additions
>> entirely possibly moving them to another module like numpy-dtypes.
>>
>> Again, these are only for NumPy 1.7.   What happens in any future NumPy
>> and beyond will depend on who comes to the table for both discussion and
>> code-development.
>>
>
> I'm glad that this sentence made it into the write-up: "A project like
> numpy requires developers to write code for advancement to occur, and
> obstacles that impede the writing of code discourage existing developers
> from contributing more, and potentially scare away developers who are
> thinking about joining in." I agree, which is why I'm a little surprised
> after reading the write-up that there's no deference to the alterNEP
> (admittedly kludgy) implementation? One of the arguments made for the NEP
> "preliminary NA-mask implementation" is that "has been extensively tested
> against scipy and other third-party packages, and has been in master in a
> stable state for a significant amount of time." It is my understanding that
> the manner in which this implementation found its way into master was a
> source of concern and contention. To me (and I don't know the level to
> which this is a technically feasible) that's precisely the reason that BOTH
> approaches be allowed to make their way into numpy with experimental
> status. Otherwise, it seems that there is a sort of "scaring away" of
> developers - seeing (from the sidelines) how much of a struggle it's been
> for the alterNEP to find a nurturing environment as an experimental
> alternative inside numpy. In my reading, the process and consensus threads
> that have generated so many responses stem precisely from trying to have an
> atmosphere where everyone is encouraged to join in. The alternatives
> proposed so far (though I do understand it's only for 1.7) do not suggest
> an appreciation for the gravity of the fallout from the neglect the
> alterNEP and the issues which sprang forth from that.
>
> Importantly, I find a problem with how personal this document (and
> discussion) is - I'd much prefer if we talk about technical things by a
> descriptive name, not the person who thought of it. You'll note how I've
> been referring to NEP and alterNEP above. One advantage of this is that
> down the line, if either Mark or Nathaniel change their minds about their
> current preferred way forward, it doesn't take the wind out of it with
> something like "Even Paul changed his mind and now withdraws his support of
> Paul's proposal." We should only focus on the technical merits of a given
> approach, not how many commits have been made by the person proposing them
> or what else they've done in their life: a good idea has value regardless
> of who expresses it. In my fantasy world, with both approaches clearly
> existing in an experimental sandbox inside numpy, folks who feel primary
> attachments to either NEP or alterNEP would be willing to cross party lines
> and pitch in toward making progress in both camps. That's the way we'll
> find better solutions, by working together, instead of working in
> opposition.
>
>
We are certainly open to code submissions and alternate implementations.
The experimental tag would help there. But someone, as you mention, needs
to write the code.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Paul Ivanov
On Wed, May 9, 2012 at 3:12 PM, Travis Oliphant  wrote:

> On re-reading, I want to make a couple of things clear:
>
> 1) This "wrap-up" discussion is *only* for what to do for NumPy 1.7 in
> such a way that we don't tie our hands in the future.I do not believe
> we can figure out what to do for masked arrays in one short week.   What
> happens beyond NumPy 1.7 should be still discussed and explored.My
> urgency is entirely about moving forward from where we are in master right
> now in a direction that we can all accept.  The tight timeline is so
> that we do *something* and move forward.
>
> 2) I missed another possible proposal for NumPy 1.7 which is in the
> write-up that Mark and Nathaniel made:  remove the masked array additions
> entirely possibly moving them to another module like numpy-dtypes.
>
> Again, these are only for NumPy 1.7.   What happens in any future NumPy
> and beyond will depend on who comes to the table for both discussion and
> code-development.
>

I'm glad that this sentence made it into the write-up: "A project like
numpy requires developers to write code for advancement to occur, and
obstacles that impede the writing of code discourage existing developers
from contributing more, and potentially scare away developers who are
thinking about joining in." I agree, which is why I'm a little surprised
after reading the write-up that there's no deference to the alterNEP
(admittedly kludgy) implementation? One of the arguments made for the NEP
"preliminary NA-mask implementation" is that "has been extensively tested
against scipy and other third-party packages, and has been in master in a
stable state for a significant amount of time." It is my understanding that
the manner in which this implementation found its way into master was a
source of concern and contention. To me (and I don't know the level to
which this is a technically feasible) that's precisely the reason that BOTH
approaches be allowed to make their way into numpy with experimental
status. Otherwise, it seems that there is a sort of "scaring away" of
developers - seeing (from the sidelines) how much of a struggle it's been
for the alterNEP to find a nurturing environment as an experimental
alternative inside numpy. In my reading, the process and consensus threads
that have generated so many responses stem precisely from trying to have an
atmosphere where everyone is encouraged to join in. The alternatives
proposed so far (though I do understand it's only for 1.7) do not suggest
an appreciation for the gravity of the fallout from the neglect the
alterNEP and the issues which sprang forth from that.

Importantly, I find a problem with how personal this document (and
discussion) is - I'd much prefer if we talk about technical things by a
descriptive name, not the person who thought of it. You'll note how I've
been referring to NEP and alterNEP above. One advantage of this is that
down the line, if either Mark or Nathaniel change their minds about their
current preferred way forward, it doesn't take the wind out of it with
something like "Even Paul changed his mind and now withdraws his support of
Paul's proposal." We should only focus on the technical merits of a given
approach, not how many commits have been made by the person proposing them
or what else they've done in their life: a good idea has value regardless
of who expresses it. In my fantasy world, with both approaches clearly
existing in an experimental sandbox inside numpy, folks who feel primary
attachments to either NEP or alterNEP would be willing to cross party lines
and pitch in toward making progress in both camps. That's the way we'll
find better solutions, by working together, instead of working in
opposition.

best,
-- 
Paul Ivanov
314 address only used for lists,  off-list direct email at:
http://pirsquared.org | GPG/PGP key id: 0x0F3E28F7


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Nathaniel Smith
Hi Dag,

On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
 wrote:
> I'm a heavy user of masks, which are used to make data NA in the
> statistical sense. The setting is that we have to mask out the radiation
> coming from the Milky Way in full-sky images of the Cosmic Microwave
> Background. There's data, but we know we can't trust it, so we make it
> NA. But we also do play around with different masks.

Oh, this is great -- that means you're one of the users that I wasn't
sure existed or not :-). Now I know!

> Today we keep the mask in a separate array, and to zero-mask we do
>
> masked_data = data * mask
>
> or
>
> masked_data = data.copy()
> masked_data[mask == 0] = np.nan # soon np.NA
>
> depending on the circumstances.
>
> Honestly, API-wise, this is as good as it gets for us. Nice and
> transparent, no new semantics to learn in the special case of masks.
>
> Now, this has performance issues: Lots of memory use, extra transfers
> over the memory bus.

Right -- this is a case where (in the NA-overview terminology) masked
storage+NA semantics would be useful.

> BUT, NumPy has that problem all over the place, even for "x + y + z"!
> Solving it in the special case of masks, by making a new API, seems a
> bit myopic to me.
>
> IMO, that's much better solved at the fundamental level. As an
> *illustration*:
>
> with np.lazy:
>     masked_data1 = data * mask1
>     masked_data2 = data * (mask1 | mask2)
>     masked_data3 = (x + y + z) * (mask1 & mask3)
>
> This would create three "generator arrays" that would zero-mask the
> arrays (and perform the three-term addition...) upon request. You could
> slice the generator arrays as you wish, and by that slice the data and
> the mask in one operation. Obviously this could handle NA-masking too.
>
> You can probably do this today with Theano and numexpr, and I think
> Travis mentioned that "generator arrays" are on his radar for core NumPy.

Implementing this today would require some black magic hacks, because
on entry/exit to the context manager you'd have to "reach up" into the
calling scope and replace all the ndarray's with LazyArrays and then
vice-versa. This is actually totally possible:
  https://gist.github.com/2347382
but I'm not sure I'd call it *wise*. (You could probably avoid the
truly horrible set_globals_dict part of that gist, though.) Might be
fun to prototype, though...
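
For illustration, here is a minimal sketch of what such a deferred-evaluation
wrapper could look like without any frame hacking -- the LazyArray name and
its methods are hypothetical, not an existing NumPy or gist API; it only
records elementwise expressions and materializes them on request:

import numpy as np

class LazyArray:
    # Toy deferred-evaluation wrapper: records an expression tree of
    # elementwise operations and only evaluates when .eval() is called.
    def __init__(self, func, *args):
        self._func = func      # callable producing an ndarray
        self._args = args      # LazyArray or ndarray operands

    @classmethod
    def wrap(cls, arr):
        return cls(lambda a: a, arr)

    def __mul__(self, other):
        return LazyArray(np.multiply, self, other)

    def __add__(self, other):
        return LazyArray(np.add, self, other)

    def eval(self):
        # Recursively evaluate operands, then apply the recorded ufunc.
        args = [a.eval() if isinstance(a, LazyArray) else a
                for a in self._args]
        return self._func(*args)

# Nothing is computed until .eval() is called.
data = np.arange(10.0)
mask1 = (np.arange(10) % 3 != 0).astype(float)
masked_data1 = LazyArray.wrap(data) * mask1
print(masked_data1.eval())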

> Point is, as a user, I'm with Travis in having masks support go hide in
> ndmasked; they solve too much of a special case in a way that is too
> particular.

Right, that's the concern.

Hypothetical question: are you actually saying that if you had both
bitpattern NAs and Travis' "ndmasked" object, you would still go ahead
and use the bitpattern NAs for this case, because of the conceptual
simplicity, easy of Cython/C compatibility, etc.?

-- Nathaniel


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Matthew Brett
Hi,

On Wed, May 9, 2012 at 12:44 PM, Dag Sverre Seljebotn
 wrote:
> On 05/09/2012 06:46 PM, Travis Oliphant wrote:
>> Hey all,
>>
>> Nathaniel and Mark have worked very hard on a joint document to try and
>> explain the current status of the missing-data debate. I think they've
>> done an amazing job at providing some context, articulating their views
>> and suggesting ways forward in a mutually respectful manner. This is an
>> exemplary collaboration and is at the core of why open source is valuable.
>>
>> The document is available here:
>> https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>>
>> After reading that document, it appears to me that there are some
>> fundamentally different views on how things should move forward. I'm
>> also reading the document incorporating my understanding of the history,
>> of NumPy as well as all of the users I've met and interacted with which
>> means I have my own perspective that is not necessarily incorporated
>> into that document but informs my recommendations. I'm not sure we can
>> reach full consensus on this. We are also well past time for moving
>> forward with a resolution on this (perhaps we can all agree on that).
>>
>> I would like one more discussion thread where the technical discussion
>> can take place. I will make a plea that we keep this discussion as free
>> from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
>> we can. I can't guarantee that I personally will succeed at that, but I
>> can tell you that I will try. That's all I'm asking of anyone else. I
>> recognize that there are a lot of other issues at play here besides
>> *just* the technical questions, but we are not going to resolve every
>> community issue in this technical thread.
>>
>> We need concrete proposals and so I will start with three. Please feel
>> free to comment on these proposals or add your own during the
>> discussion. I will stop paying attention to this thread next Wednesday
>> (May 16th) (or earlier if the thread dies) and hope that by that time we
>> can agree on a way forward. If we don't have agreement, then I will move
>> forward with what I think is the right approach. I will either write the
>> code myself or convince someone else to write it.
>>
>> In all cases, we have agreement that bit-pattern dtypes should be added
>> to NumPy. We should work on these (int32, float64, complex64, str, bool)
>> to start. So, the three proposals are independent of this way forward.
>> The proposals are all about the extra mask part:
>>
>> My three proposals:
>>
>> * do nothing and leave things as is
>>
>> * add a global flag that turns off masked array support by default but
>> otherwise leaves things unchanged (I'm still unclear how this would work
>> exactly)
>>
>> * move Mark's "masked ndarray objects" into a new fundamental type
>> (ndmasked), leaving the actual ndarray type unchanged. The
>> array_interface keeps the masked array notions and the ufuncs keep the
>> ability to handle arrays like ndmasked. Ideally, numpy.ma
>>  would be changed to use ndmasked objects as their core.
>>
>> For the record, I'm currently in favor of the third proposal. Feel free
>> to comment on these proposals (or provide your own).
>>
>
> Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!

Yes, it is very well written, my compliments to the chefs.

> The third proposal is certainly the best one from Cython's perspective;
> and I imagine for those writing C extensions against the C API too.
> Having PyType_Check fail for ndmasked is a very good way of having code
> fail that is not written to take masks into account.

Mark, Nathaniel - can you comment how your chosen approaches would
interact with extension code?

I'm guessing the bitpattern dtypes would be expected to cause
extension code to choke if the type is not supported?

Mark - in :

https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#cython

- do I understand correctly that you think that Cython and other
extension writers should use the numpy API to access the data rather
than accessing it directly via the data pointer and strides?
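
(For context on what "accessing it directly via the data pointer and
strides" means, here is a tiny pure-Python mock-up of the access pattern
C/Cython extensions typically use on the raw buffer; the array and indices
are made up purely for illustration:)

import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

# A C or Cython extension typically receives the raw data pointer plus
# a.shape, a.strides and a.itemsize, and computes byte offsets itself:
#     element(i, j) = *(double *)(data + i*strides[0] + j*strides[1])
# The pure-Python equivalent, reading through the raw bytes:
buf = a.tobytes()                      # stand-in for the data pointer
i, j = 2, 1
offset = i * a.strides[0] + j * a.strides[1]
value = np.frombuffer(buf, dtype=a.dtype, count=1, offset=offset)[0]
assert value == a[i, j]

# Nothing in this access path knows about a separately stored NA mask,
# which is why mask support is a real question for extension writers.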

Best,

Matthew


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Nathaniel Smith
On Wed, May 9, 2012 at 5:46 PM, Travis Oliphant  wrote:
> Hey all,
>
> Nathaniel and Mark have worked very hard on a joint document to try and
> explain the current status of the missing-data debate.   I think they've
> done an amazing job at providing some context, articulating their views and
> suggesting ways forward in a mutually respectful manner.   This is an
> exemplary collaboration and is at the core of why open source is valuable.
>
> The document is available here:
>    https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>
> After reading that document, it appears to me that there are some
> fundamentally different views on how things should move forward.   I'm also
> reading the document incorporating my understanding of the history, of NumPy
> as well as all of the users I've met and interacted with which means I have
> my own perspective that is not necessarily incorporated into that document
> but informs my recommendations.    I'm not sure we can reach full consensus
> on this.     We are also well past time for moving forward with a resolution
> on this (perhaps we can all agree on that).

If we're talking about deciding what to do for the 1.7 release branch,
then I agree. Otherwise, I definitely don't. We really just don't
*know* what our users need with regards to mask-based storage versions
of missing data, so committing to something within a short time period
will just guarantee we have to re-do it all again later.

[Edit: I see that you've clarified this in a follow-up email -- great!]

> We need concrete proposals and so I will start with three.   Please feel
> free to comment on these proposals or add your own during the discussion.
>  I will stop paying attention to this thread next Wednesday (May 16th) (or
> earlier if the thread dies) and hope that by that time we can agree on a way
> forward.  If we don't have agreement, then I will move forward with what I
> think is the right approach.   I will either write the code myself or
> convince someone else to write it.

Again, I'm assuming that what you mean here is that we can't and
shouldn't delay 1.7 indefinitely for this discussion to play out, so
you're proposing that we give ourselves a deadline of 1 week to decide
how to at least get the release unblocked. Let me know if I'm
misreading, though...

> In all cases, we have agreement that bit-pattern dtypes should be added to
> NumPy.      We should work on these (int32, float64, complex64, str, bool)
> to start.    So, the three proposals are independent of this way forward.
> The proposals are all about the extra mask part:
>
> My three proposals:
>
> * do nothing and leave things as is

In the context of 1.7, this seems like a non-starter at this point, at
least if we're going to move in the direction of making decisions by
consensus. It might well be that we'll decide that the current
NEP-like API is what we want (or that some compatible super-set is).
But (as described in more detail in the NA-overview document), I think
there are still serious questions to work out about how and whether a
masked-storage/NA-semantics API is something we want as part of the
ndarray object at all. And Ralf with his release-manager hat says that
he doesn't want to release the current API unless we can guarantee
that some version of it will continue to be supported. To me that
suggests that this is off the table for 1.7.

> * add a global flag that turns off masked array support by default but
> otherwise leaves things unchanged (I'm still unclear how this would work
> exactly)

I've been assuming something like a global variable, and some guards
added to all the top-level functions that take "maskna=" arguments, so
that it's impossible to construct an ndarray that has its "maskna"
flag set to True unless the flag has been toggled.
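
As a rough Python-level illustration of such a guard (the flag, the enable
function and the decorator below are hypothetical, not an actual NumPy API;
the real switch would have to live in the C-level constructors):

import functools
import numpy as np

_MASKNA_ENABLED = False        # hypothetical module-level switch

def enable_maskna():
    global _MASKNA_ENABLED
    _MASKNA_ENABLED = True

def guard_maskna(func):
    # Wrap a top-level constructor that takes a maskna= argument and
    # refuse to build NA-masked arrays unless the flag has been toggled.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if kwargs.get("maskna") and not _MASKNA_ENABLED:
            raise RuntimeError("maskna support is disabled; "
                               "call enable_maskna() first")
        return func(*args, **kwargs)
    return wrapper

@guard_maskna
def array(obj, maskna=False, **kwargs):
    # Stand-in for np.array; real NA-masked construction would go here.
    return np.array(obj, **kwargs)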

As I said in NA-overview, I'd be fine with this in principle, but only
if we're certain we're okay with the ABI consequences. And we should
be clear on the goal -- if we just want to let people play with the
API, then there are other options, such as my little experiment:
  https://github.com/njsmith/numpyNEP
(This is certainly less robust, but it works, and is probably a much
easier base for modifications to test alternative APIs.) If the goal
is just to keep the code in master, then that's fine too, though it
has both costs and benefits. (An example of a cost is that its
presence may complicate adding bitpattern NA support.)

> * move Mark's "masked ndarray objects" into a new fundamental type
> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
> keeps the masked array notions and the ufuncs keep the ability to handle
> arrays like ndmasked.    Ideally, numpy.ma would be changed to use ndmasked
> objects as their core.

If we're talking about 1.7, then what kind of status do you propose
these new objects would have in 1.7? Regular feature, totally
experimental, something else?

My only objection to this proposal is that co

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 4:12 PM, Travis Oliphant  wrote:

> On re-reading, I want to make a couple of things clear:
>
> 1) This "wrap-up" discussion is *only* for what to do for NumPy 1.7 in
> such a way that we don't tie our hands in the future.I do not believe
> we can figure out what to do for masked arrays in one short week.   What
> happens beyond NumPy 1.7 should be still discussed and explored.My
> urgency is entirely about moving forward from where we are in master right
> now in a direction that we can all accept.  The tight timeline is so
> that we do *something* and move forward.
>
> 2) I missed another possible proposal for NumPy 1.7 which is in the
> write-up that Mark and Nathaniel made:  remove the masked array additions
> entirely possibly moving them to another module like numpy-dtypes.
>
> Again, these are only for NumPy 1.7.   What happens in any future NumPy
> and beyond will depend on who comes to the table for both discussion and
> code-development.
>
>
Why don't we go with 2) then? Mark implies that it takes the least work and
it kicks the decision down the road. It may well be that a better approach
turns up after more discussion, or that we decide to just pull it out, but
the first takes time to arrive at and the second takes effort that could be
better spent (IMHO) on other things at the moment.

My sense is that the API is actually the major point of contention,
although I may just be speaking for myself. And perhaps we should look for
ways of adding support for masked array implementations rather than masked
arrays themselves. It could be that easy-to-use infrastructure that
enhanced others' efforts might be a better way forward.



Chuck


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant
On re-reading, I want to make a couple of things clear:   

1) This "wrap-up" discussion is *only* for what to do for NumPy 1.7 in 
such a way that we don't tie our hands in the future. I do not believe we 
can figure out what to do for masked arrays in one short week.   What happens 
beyond NumPy 1.7 should be still discussed and explored. My urgency is 
entirely about moving forward from where we are in master right now in a 
direction that we can all accept.  The tight timeline is so that we do 
*something* and move forward.

2) I missed another possible proposal for NumPy 1.7 which is in the 
write-up that Mark and Nathaniel made:  remove the masked array additions 
entirely possibly moving them to another module like numpy-dtypes.

Again, these are only for NumPy 1.7.   What happens in any future NumPy and 
beyond will depend on who comes to the table for both discussion and 
code-development. 

Best regards,

-Travis



On May 9, 2012, at 11:46 AM, Travis Oliphant wrote:

> Hey all, 
> 
> Nathaniel and Mark have worked very hard on a joint document to try and 
> explain the current status of the missing-data debate.   I think they've done 
> an amazing job at providing some context, articulating their views and 
> suggesting ways forward in a mutually respectful manner.   This is an 
> exemplary collaboration and is at the core of why open source is valuable. 
> 
> The document is available here: 
>https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
> 
> After reading that document, it appears to me that there are some 
> fundamentally different views on how things should move forward.   I'm also 
> reading the document incorporating my understanding of the history, of NumPy 
> as well as all of the users I've met and interacted with which means I have 
> my own perspective that is not necessarily incorporated into that document 
> but informs my recommendations.I'm not sure we can reach full consensus 
> on this. We are also well past time for moving forward with a resolution 
> on this (perhaps we can all agree on that). 
> 
> I would like one more discussion thread where the technical discussion can 
> take place.I will make a plea that we keep this discussion as free from 
> logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can.   I 
> can't guarantee that I personally will succeed at that, but I can tell you 
> that I will try.   That's all I'm asking of anyone else.I recognize that 
> there are a lot of other issues at play here besides *just* the technical 
> questions, but we are not going to resolve every community issue in this 
> technical thread. 
> 
> We need concrete proposals and so I will start with three.   Please feel free 
> to comment on these proposals or add your own during the discussion.I 
> will stop paying attention to this thread next Wednesday (May 16th) (or 
> earlier if the thread dies) and hope that by that time we can agree on a way 
> forward.  If we don't have agreement, then I will move forward with what I 
> think is the right approach.   I will either write the code myself or 
> convince someone else to write it. 
> 
> In all cases, we have agreement that bit-pattern dtypes should be added to 
> NumPy.  We should work on these (int32, float64, complex64, str, bool) to 
> start.So, the three proposals are independent of this way forward.   The 
> proposals are all about the extra mask part:  
> 
> My three proposals: 
> 
>   * do nothing and leave things as is 
> 
>   * add a global flag that turns off masked array support by default but 
> otherwise leaves things unchanged (I'm still unclear how this would work 
> exactly)
> 
>   * move Mark's "masked ndarray objects" into a new fundamental type 
> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface 
> keeps the masked array notions and the ufuncs keep the ability to handle 
> arrays like ndmasked.Ideally, numpy.ma would be changed to use ndmasked 
> objects as their core. 
> 
> For the record, I'm currently in favor of the third proposal.   Feel free to 
> comment on these proposals (or provide your own). 
> 
> Best regards,
> 
> -Travis
> 



Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 1:35 PM, Travis Oliphant  wrote:

> My three proposals:
>>
>> * do nothing and leave things as is
>>
>> * add a global flag that turns off masked array support by default but
>> otherwise leaves things unchanged (I'm still unclear how this would work
>> exactly)
>>
>> * move Mark's "masked ndarray objects" into a new fundamental type
>> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
>> keeps the masked array notions and the ufuncs keep the ability to handle
>> arrays like ndmasked.Ideally, numpy.ma would be changed to use
>> ndmasked objects as their core.
>>
>>
> The numpy.ma is unmaintained and I don't see that changing anytime soon.
> As you know, I would prefer 1), but 2) is a good compromise and the infra
> structure for such a flag could be useful for other things, although like
> yourself I'm not sure how it would be implemented. I don't understand your
> proposal for 3), but from the description I don't see that it buys anything.
>
>
> That is a bit strong to call numpy.ma unmaintained. I don't consider
> it that way. Are there a lot of tickets for it that are unaddressed?
> Is it broken?   I know it gets a lot of use in the wild and so I don't
> think NumPy users would be happy to hear it is considered unmaintained by
> NumPy developers.
>
> I'm looking forward to more details of Mark's proposal for #2.
>
> The proposal for #3 is quite simple and I think it is also a good
> compromise between removing the masked array entirely from the core NumPy
> object and leaving things as is in master.  It keeps the functionality (but
> in a separate object) much like numpy.ma is a separate object.
>   Basically it buys not forcing *all* NumPy users (on the C-API level) to
> now deal with a masked array.
>

To me, it looks like we will get stuck with a more complicated
implementation without changing the API, something that 2) achieves more
easily while providing a feature likely to be useful as we head towards 2.0.


> I know this push is a feature that is part of Mark's intention (as it
> pushes downstream libraries to think about missing data at a fundamental
> level).But, I think this is too big of a change to put in a 1.X
> release.   The internal array-model used by NumPy is used quite extensively
> in downstream libraries as a *concept*.  Many people have enhanced this
> model with a separate mask array for various reasons, and Mark's current
> use of mask does not satisfy all those use-cases.   I don't see how we can
> justify changing the NumPy 1.X memory model under these circumstances.
>
>
You keep referring to these ghostly people and their unspecified uses, no
doubt to protect the guilty. You don't have to name names, but a little
detail on what they have done and how they use things would be *very*
helpful.


> This is the sort of change that in my mind is a NumPy 2.0 kind of change
> where downstream users will be looking for possible array-model changes.
>
>
We tried the flag day approach to 2.0 already and it failed. I think it
better to have a long term release and a series of releases thereafter
moving step by step with incremental changes towards a 2.0. Mark's 2) would
support that approach.



Chuck


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Dag Sverre Seljebotn
On 05/09/2012 06:46 PM, Travis Oliphant wrote:
> Hey all,
>
> Nathaniel and Mark have worked very hard on a joint document to try and
> explain the current status of the missing-data debate. I think they've
> done an amazing job at providing some context, articulating their views
> and suggesting ways forward in a mutually respectful manner. This is an
> exemplary collaboration and is at the core of why open source is valuable.
>
> The document is available here:
> https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>
> After reading that document, it appears to me that there are some
> fundamentally different views on how things should move forward. I'm
> also reading the document incorporating my understanding of the history,
> of NumPy as well as all of the users I've met and interacted with which
> means I have my own perspective that is not necessarily incorporated
> into that document but informs my recommendations. I'm not sure we can
> reach full consensus on this. We are also well past time for moving
> forward with a resolution on this (perhaps we can all agree on that).
>
> I would like one more discussion thread where the technical discussion
> can take place. I will make a plea that we keep this discussion as free
> from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
> we can. I can't guarantee that I personally will succeed at that, but I
> can tell you that I will try. That's all I'm asking of anyone else. I
> recognize that there are a lot of other issues at play here besides
> *just* the technical questions, but we are not going to resolve every
> community issue in this technical thread.
>
> We need concrete proposals and so I will start with three. Please feel
> free to comment on these proposals or add your own during the
> discussion. I will stop paying attention to this thread next Wednesday
> (May 16th) (or earlier if the thread dies) and hope that by that time we
> can agree on a way forward. If we don't have agreement, then I will move
> forward with what I think is the right approach. I will either write the
> code myself or convince someone else to write it.
>
> In all cases, we have agreement that bit-pattern dtypes should be added
> to NumPy. We should work on these (int32, float64, complex64, str, bool)
> to start. So, the three proposals are independent of this way forward.
> The proposals are all about the extra mask part:
>
> My three proposals:
>
> * do nothing and leave things as is
>
> * add a global flag that turns off masked array support by default but
> otherwise leaves things unchanged (I'm still unclear how this would work
> exactly)
>
> * move Mark's "masked ndarray objects" into a new fundamental type
> (ndmasked), leaving the actual ndarray type unchanged. The
> array_interface keeps the masked array notions and the ufuncs keep the
> ability to handle arrays like ndmasked. Ideally, numpy.ma
>  would be changed to use ndmasked objects as their core.
>
> For the record, I'm currently in favor of the third proposal. Feel free
> to comment on these proposals (or provide your own).
>

Bravo!, NA-overview.rst was an excellent read. Thanks Nathaniel and Mark!

The third proposal is certainly the best one from Cython's perspective; 
and I imagine for those writing C extensions against the C API too. 
Having PyType_Check fail for ndmasked is a very good way of having code 
fail that is not written to take masks into account.
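
(Roughly, the point is that type checks in existing extension code would
reject the new object instead of silently mis-reading its buffer. A
Python-level analogue of that C-level type-check guard, with ndmasked as a
purely hypothetical stand-in class:)

import numpy as np

class ndmasked(object):        # hypothetical stand-in, not a real NumPy type
    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask, dtype=bool)

def legacy_entry_point(arr):
    # Existing extensions typically guard their entry points with the
    # C-level ndarray type check; isinstance is the Python analogue.
    if not isinstance(arr, np.ndarray):
        raise TypeError("expected a plain ndarray")
    return arr.sum()

print(legacy_entry_point(np.arange(3)))              # works as before
try:
    legacy_entry_point(ndmasked([1.0, 2.0], [False, True]))
except TypeError as exc:
    print("rejected:", exc)                          # fails loudly, as intended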

If it is in ndarray we would also have some pressure to add support in 
Cython, with ndmasked we avoid that too. Likely outcome is we won't ever 
support it either way, but then we need some big warning in the docs, 
and it's better to avoid that. (I guess I'd be +0 on Mark Florisson 
implementing it if it ends up in core ndarray; I'd almost certainly not 
do it myself.)

That covers Cython. My view as a NumPy user follows.

I'm a heavy user of masks, which are used to make data NA in the 
statistical sense. The setting is that we have to mask out the radiation 
coming from the Milky Way in full-sky images of the Cosmic Microwave 
Background. There's data, but we know we can't trust it, so we make it 
NA. But we also do play around with different masks.

Today we keep the mask in a separate array, and to zero-mask we do

masked_data = data * mask

or

masked_data = data.copy()
masked_data[mask == 0] = np.nan # soon np.NA

depending on the circumstances.

Honestly, API-wise, this is as good as it gets for us. Nice and 
transparent, no new semantics to learn in the special case of masks.

Now, this has performance issues: Lots of memory use, extra transfers 
over the memory bus.
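
(A self-contained version of the two idioms above, for reference; the array
sizes and the synthetic mask are made up purely for illustration:)

import numpy as np

data = np.random.rand(1000000)          # stand-in for the sky map
mask = np.random.rand(1000000) > 0.2    # True where the data is trusted

# Idiom 1: multiplicative zero-masking (allocates a full new array).
masked_data = data * mask

# Idiom 2: copy, then overwrite untrusted samples with NaN
# (again a full copy, plus a boolean-indexing pass).
masked_nan = data.copy()
masked_nan[~mask] = np.nan

# Downstream code then has to skip the bad samples explicitly,
# e.g. by reducing only over the trusted ones.
print(masked_nan[mask].mean())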

BUT, NumPy has that problem all over the place, even for "x + y + z"! 
Solving it in the special case of masks, by making a new API, seems a 
bit myopic to me.

IMO, that's much better solved at the fundamental level. As an 
*illustration*:

with np.lazy:
 masked_data1 = data * mask1
 masked_data2 = data * (mask1 | mask2)

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant
> Mark, will you give more details about this proposal? How would the flag 
> work, what would it modify?
> 
> The idea is inspired in part by the Chrome release cycle, which has a 
> presentation here:
> 
> https://docs.google.com/present/view?id=dg63dpc6_4d7vkk6ch&pli=1
> 
> Some quotes:
> Features should be engineered so that they can be disabled easily (1 patch)
> and
> Would large feature development still be possible?
> 
> "Yes, engineers would have to work behind flags, however they can work for as 
> many releases as they need to and can remove the flag when they are done."
> 
> The current numpy codebase isn't designed for this kind of workflow, but I 
> think we can productively emulate the idea for a big feature like NA support.
> 
> One way to do this flag would be to have a "numpy.experimental" namespace 
> which is not imported by default. To enable the NA-mask feature, you could do:
> 
> >>> import numpy.experimental.maskna
> 
> This would trigger an ExperimentalWarning to message that an experimental 
> feature has been enabled, and would add any NA-specific symbols to the numpy 
> namespace (NA, NAType, etc). Without this import, any operation which would 
> create an NA or NA-masked array raises an ExperimentalError instead of 
> succeeding. After this import, things would behave as they do now.

How would this flag work at the C-API level? 

-Travis




Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant
> My three proposals: 
> 
>   * do nothing and leave things as is 
> 
>   * add a global flag that turns off masked array support by default but 
> otherwise leaves things unchanged (I'm still unclear how this would work 
> exactly)
> 
>   * move Mark's "masked ndarray objects" into a new fundamental type 
> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface 
> keeps the masked array notions and the ufuncs keep the ability to handle 
> arrays like ndmasked.Ideally, numpy.ma would be changed to use ndmasked 
> objects as their core. 
> 
> 
> The numpy.ma is unmaintained and I don't see that changing anytime soon. As 
> you know, I would prefer 1), but 2) is a good compromise and the infra 
> structure for such a flag could be useful for other things, although like 
> yourself I'm not sure how it would be implemented. I don't understand your 
> proposal for 3), but from the description I don't see that it buys anything.

That is a bit strong to call numpy.ma unmaintained. I don't consider it that 
way. Are there a lot of tickets for it that are unaddressed? Is it broken? 
I know it gets a lot of use in the wild and so I don't think NumPy users 
would be happy to hear it is considered unmaintained by NumPy developers. 

I'm looking forward to more details of Mark's proposal for #2. 

The proposal for #3 is quite simple and I think it is also a good compromise 
between removing the masked array entirely from the core NumPy object and 
leaving things as is in master.  It keeps the functionality (but in a separate 
object) much like numpy.ma is a separate object.   Basically it buys not 
forcing *all* NumPy users (on the C-API level) to now deal with a masked array. 
   I know this push is a feature that is part of Mark's intention (as it pushes 
downstream libraries to think about missing data at a fundamental level).
But, I think this is too big of a change to put in a 1.X release.   The 
internal array-model used by NumPy is used quite extensively in downstream 
libraries as a *concept*.  Many people have enhanced this model with a separate 
mask array for various reasons, and Mark's current use of mask does not satisfy 
all those use-cases.   I don't see how we can justify changing the NumPy 1.X 
memory model under these circumstances. 

This is the sort of change that in my mind is a NumPy 2.0 kind of change where 
downstream users will be looking for possible array-model changes.  

-Travis





>  
> For the record, I'm currently in favor of the third proposal.   Feel free to 
> comment on these proposals (or provide your own). 
> 
> 
> Chuck 


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Mark Wiebe
On Wed, May 9, 2012 at 2:15 PM, Travis Oliphant  wrote:

>
> On May 9, 2012, at 2:07 PM, Mark Wiebe wrote:
>
> On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant wrote:
>
>> Hey all,
>>
>> Nathaniel and Mark have worked very hard on a joint document to try and
>> explain the current status of the missing-data debate.   I think they've
>> done an amazing job at providing some context, articulating their views and
>> suggesting ways forward in a mutually respectful manner.   This is an
>> exemplary collaboration and is at the core of why open source is valuable.
>>
>> The document is available here:
>>https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>>
>> After reading that document, it appears to me that there are some
>> fundamentally different views on how things should move forward.   I'm also
>> reading the document incorporating my understanding of the history, of
>> NumPy as well as all of the users I've met and interacted with which means
>> I have my own perspective that is not necessarily incorporated into that
>> document but informs my recommendations.I'm not sure we can reach full
>> consensus on this. We are also well past time for moving forward with a
>> resolution on this (perhaps we can all agree on that).
>>
>> I would like one more discussion thread where the technical discussion
>> can take place.I will make a plea that we keep this discussion as free
>> from logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as
>> we can.   I can't guarantee that I personally will succeed at that, but I
>> can tell you that I will try.   That's all I'm asking of anyone else.I
>> recognize that there are a lot of other issues at play here besides *just*
>> the technical questions, but we are not going to resolve every community
>> issue in this technical thread.
>>
>> We need concrete proposals and so I will start with three.   Please feel
>> free to comment on these proposals or add your own during the discussion.
>>  I will stop paying attention to this thread next Wednesday (May 16th) (or
>> earlier if the thread dies) and hope that by that time we can agree on a
>> way forward.  If we don't have agreement, then I will move forward with
>> what I think is the right approach.   I will either write the code myself
>> or convince someone else to write it.
>>
>> In all cases, we have agreement that bit-pattern dtypes should be added
>> to NumPy.  We should work on these (int32, float64, complex64, str,
>> bool) to start.So, the three proposals are independent of this way
>> forward.   The proposals are all about the extra mask part:
>>
>> My three proposals:
>>
>> * do nothing and leave things as is
>>
>> * add a global flag that turns off masked array support by default but
>> otherwise leaves things unchanged (I'm still unclear how this would work
>> exactly)
>>
>> * move Mark's "masked ndarray objects" into a new fundamental type
>> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
>> keeps the masked array notions and the ufuncs keep the ability to handle
>> arrays like ndmasked.Ideally, numpy.ma would be changed to use
>> ndmasked objects as their core.
>>
>> For the record, I'm currently in favor of the third proposal.   Feel free
>> to comment on these proposals (or provide your own).
>>
>
> I'm most in favour of the second proposal. It won't take very much effort,
> and more clearly marks off this code as experimental than just
> documentation notes.
>
>
> Mark, will you give more details about this proposal? How would the flag
> work, what would it modify?
>

The idea is inspired in part by the Chrome release cycle, which has a
presentation here:

https://docs.google.com/present/view?id=dg63dpc6_4d7vkk6ch&pli=1

Some quotes:

Features should be engineered so that they can be disabled easily (1 patch)

and

Would large feature development still be possible?

"Yes, engineers would have to work behind flags, however they can work for
as many releases as they need to and can remove the flag when they are
done."


The current numpy codebase isn't designed for this kind of workflow, but I
think we can productively emulate the idea for a big feature like NA
support.

One way to do this flag would be to have a "numpy.experimental" namespace
which is not imported by default. To enable the NA-mask feature, you could
do:

>>> import numpy.experimental.maskna

This would trigger an ExperimentalWarning to signal that an experimental
feature has been enabled, and would add any NA-specific symbols to the
numpy namespace (NA, NAType, etc). Without this import, any operation which
would create an NA or NA-masked array raises an ExperimentalError instead
of succeeding. After this import, things would behave as they do now.
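
A rough Python-level sketch of that opt-in import (module and symbol names
are hypothetical, and a real implementation would also have to flip the
corresponding switch in the C core):

# numpy/experimental/maskna.py -- hypothetical module, sketch only
import warnings
import numpy as np

class ExperimentalWarning(UserWarning):
    pass

warnings.warn("NA-mask support is experimental and may change or be removed",
              ExperimentalWarning, stacklevel=2)

# Toggle a (hypothetical) module-level flag and export the NA symbols
# into the numpy namespace.
np._maskna_enabled = True

class NAType(object):
    def __repr__(self):
        return "NA"

np.NAType = NAType
np.NA = NAType()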

Cheers,
Mark

> The proposal to create a ndmasked object that is separate from ndarray
> objects also won't take much effort and also marks off the object so those
> who want to use it can and those who don't are not pushed into using it anyway.

Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant

On May 9, 2012, at 2:07 PM, Mark Wiebe wrote:

> On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant  wrote:
> Hey all, 
> 
> Nathaniel and Mark have worked very hard on a joint document to try and 
> explain the current status of the missing-data debate.   I think they've done 
> an amazing job at providing some context, articulating their views and 
> suggesting ways forward in a mutually respectful manner.   This is an 
> exemplary collaboration and is at the core of why open source is valuable. 
> 
> The document is available here: 
>https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
> 
> After reading that document, it appears to me that there are some 
> fundamentally different views on how things should move forward.   I'm also 
> reading the document incorporating my understanding of the history, of NumPy 
> as well as all of the users I've met and interacted with which means I have 
> my own perspective that is not necessarily incorporated into that document 
> but informs my recommendations.I'm not sure we can reach full consensus 
> on this. We are also well past time for moving forward with a resolution 
> on this (perhaps we can all agree on that). 
> 
> I would like one more discussion thread where the technical discussion can 
> take place.I will make a plea that we keep this discussion as free from 
> logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can.   I 
> can't guarantee that I personally will succeed at that, but I can tell you 
> that I will try.   That's all I'm asking of anyone else.I recognize that 
> there are a lot of other issues at play here besides *just* the technical 
> questions, but we are not going to resolve every community issue in this 
> technical thread. 
> 
> We need concrete proposals and so I will start with three.   Please feel free 
> to comment on these proposals or add your own during the discussion.I 
> will stop paying attention to this thread next Wednesday (May 16th) (or 
> earlier if the thread dies) and hope that by that time we can agree on a way 
> forward.  If we don't have agreement, then I will move forward with what I 
> think is the right approach.   I will either write the code myself or 
> convince someone else to write it. 
> 
> In all cases, we have agreement that bit-pattern dtypes should be added to 
> NumPy.  We should work on these (int32, float64, complex64, str, bool) to 
> start.So, the three proposals are independent of this way forward.   The 
> proposals are all about the extra mask part:  
> 
> My three proposals: 
> 
>   * do nothing and leave things as is 
> 
>   * add a global flag that turns off masked array support by default but 
> otherwise leaves things unchanged (I'm still unclear how this would work 
> exactly)
> 
>   * move Mark's "masked ndarray objects" into a new fundamental type 
> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface 
> keeps the masked array notions and the ufuncs keep the ability to handle 
> arrays like ndmasked.Ideally, numpy.ma would be changed to use ndmasked 
> objects as their core. 
> 
> For the record, I'm currently in favor of the third proposal.   Feel free to 
> comment on these proposals (or provide your own).
> 
> I'm most in favour of the second proposal. It won't take very much effort, 
> and more clearly marks off this code as experimental than just documentation 
> notes.
> 

Mark, will you give more details about this proposal? How would the flag 
work, what would it modify? 

The proposal to create a ndmasked object that is separate from ndarray objects 
also won't take much effort and also marks off the object so those who want to 
use it can and those who don't are not pushed into using it anyway. 

-Travis


> Thanks,
> -Mark
>  
> 
> Best regards,
> 
> -Travis
> 
> 


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Mark Wiebe
On Wed, May 9, 2012 at 11:46 AM, Travis Oliphant wrote:

> Hey all,
>
> Nathaniel and Mark have worked very hard on a joint document to try and
> explain the current status of the missing-data debate.   I think they've
> done an amazing job at providing some context, articulating their views and
> suggesting ways forward in a mutually respectful manner.   This is an
> exemplary collaboration and is at the core of why open source is valuable.
>
> The document is available here:
>https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>
> After reading that document, it appears to me that there are some
> fundamentally different views on how things should move forward.   I'm also
> reading the document incorporating my understanding of the history, of
> NumPy as well as all of the users I've met and interacted with which means
> I have my own perspective that is not necessarily incorporated into that
> document but informs my recommendations.I'm not sure we can reach full
> consensus on this. We are also well past time for moving forward with a
> resolution on this (perhaps we can all agree on that).
>
> I would like one more discussion thread where the technical discussion can
> take place.I will make a plea that we keep this discussion as free from
> logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can.
>   I can't guarantee that I personally will succeed at that, but I can tell
> you that I will try.   That's all I'm asking of anyone else.I recognize
> that there are a lot of other issues at play here besides *just* the
> technical questions, but we are not going to resolve every community issue
> in this technical thread.
>
> We need concrete proposals and so I will start with three.   Please feel
> free to comment on these proposals or add your own during the discussion.
>  I will stop paying attention to this thread next Wednesday (May 16th) (or
> earlier if the thread dies) and hope that by that time we can agree on a
> way forward.  If we don't have agreement, then I will move forward with
> what I think is the right approach.   I will either write the code myself
> or convince someone else to write it.
>
> In all cases, we have agreement that bit-pattern dtypes should be added to
> NumPy.  We should work on these (int32, float64, complex64, str, bool)
> to start.So, the three proposals are independent of this way forward.
> The proposals are all about the extra mask part:
>
> My three proposals:
>
> * do nothing and leave things as is
>
> * add a global flag that turns off masked array support by default but
> otherwise leaves things unchanged (I'm still unclear how this would work
> exactly)
>
> * move Mark's "masked ndarray objects" into a new fundamental type
> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
> keeps the masked array notions and the ufuncs keep the ability to handle
> arrays like ndmasked.Ideally, numpy.ma would be changed to use
> ndmasked objects as their core.
>
> For the record, I'm currently in favor of the third proposal.   Feel free
> to comment on these proposals (or provide your own).
>

I'm most in favour of the second proposal. It won't take very much effort,
and more clearly marks off this code as experimental than just
documentation notes.

Thanks,
-Mark


>
> Best regards,
>
> -Travis
>
>


Re: [Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Charles R Harris
On Wed, May 9, 2012 at 10:46 AM, Travis Oliphant wrote:

> Hey all,
>
> Nathaniel and Mark have worked very hard on a joint document to try and
> explain the current status of the missing-data debate.   I think they've
> done an amazing job at providing some context, articulating their views and
> suggesting ways forward in a mutually respectful manner.   This is an
> exemplary collaboration and is at the core of why open source is valuable.
>
> The document is available here:
>https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst
>
> After reading that document, it appears to me that there are some
> fundamentally different views on how things should move forward.   I'm also
> reading the document incorporating my understanding of the history, of
> NumPy as well as all of the users I've met and interacted with which means
> I have my own perspective that is not necessarily incorporated into that
> document but informs my recommendations.I'm not sure we can reach full
> consensus on this. We are also well past time for moving forward with a
> resolution on this (perhaps we can all agree on that).
>
> I would like one more discussion thread where the technical discussion can
> take place.I will make a plea that we keep this discussion as free from
> logical fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can.
>   I can't guarantee that I personally will succeed at that, but I can tell
> you that I will try.   That's all I'm asking of anyone else.I recognize
> that there are a lot of other issues at play here besides *just* the
> technical questions, but we are not going to resolve every community issue
> in this technical thread.
>
> We need concrete proposals and so I will start with three.   Please feel
> free to comment on these proposals or add your own during the discussion.
>  I will stop paying attention to this thread next Wednesday (May 16th) (or
> earlier if the thread dies) and hope that by that time we can agree on a
> way forward.  If we don't have agreement, then I will move forward with
> what I think is the right approach.   I will either write the code myself
> or convince someone else to write it.
>
> In all cases, we have agreement that bit-pattern dtypes should be added to
> NumPy.  We should work on these (int32, float64, complex64, str, bool)
> to start.So, the three proposals are independent of this way forward.
> The proposals are all about the extra mask part:
>
> My three proposals:
>
> * do nothing and leave things as is
>
> * add a global flag that turns off masked array support by default but
> otherwise leaves things unchanged (I'm still unclear how this would work
> exactly)
>
> * move Mark's "masked ndarray objects" into a new fundamental type
> (ndmasked), leaving the actual ndarray type unchanged.  The array_interface
> keeps the masked array notions and the ufuncs keep the ability to handle
> arrays like ndmasked.Ideally, numpy.ma would be changed to use
> ndmasked objects as their core.
>
>
The numpy.ma is unmaintained and I don't see that changing anytime soon. As
you know, I would prefer 1), but 2) is a good compromise and the
infrastructure for such a flag could be useful for other things, although like
yourself I'm not sure how it would be implemented. I don't understand your
proposal for 3), but from the description I don't see that it buys anything.


> For the record, I'm currently in favor of the third proposal.   Feel free
> to comment on these proposals (or provide your own).
>
>
Chuck


[Numpy-discussion] Missing data wrap-up and request for comments

2012-05-09 Thread Travis Oliphant
Hey all, 

Nathaniel and Mark have worked very hard on a joint document to try and explain 
the current status of the missing-data debate.   I think they've done an 
amazing job at providing some context, articulating their views and suggesting 
ways forward in a mutually respectful manner.   This is an exemplary 
collaboration and is at the core of why open source is valuable. 

The document is available here: 
   https://github.com/numpy/numpy.scipy.org/blob/master/NA-overview.rst

After reading that document, it appears to me that there are some fundamentally 
different views on how things should move forward.   I'm also reading the 
document incorporating my understanding of the history, of NumPy as well as all 
of the users I've met and interacted with which means I have my own perspective 
that is not necessarily incorporated into that document but informs my 
recommendations. I'm not sure we can reach full consensus on this. We 
are also well past time for moving forward with a resolution on this (perhaps 
we can all agree on that).

I would like one more discussion thread where the technical discussion can take 
place. I will make a plea that we keep this discussion as free from logical 
fallacies http://en.wikipedia.org/wiki/Logical_fallacy as we can.   I can't 
guarantee that I personally will succeed at that, but I can tell you that I 
will try. That's all I'm asking of anyone else. I recognize that there are 
a lot of other issues at play here besides *just* the technical questions, but 
we are not going to resolve every community issue in this technical thread. 

We need concrete proposals and so I will start with three.   Please feel free 
to comment on these proposals or add your own during the discussion. I will 
stop paying attention to this thread next Wednesday (May 16th) (or earlier if 
the thread dies) and hope that by that time we can agree on a way forward.  If 
we don't have agreement, then I will move forward with what I think is the 
right approach.   I will either write the code myself or convince someone else 
to write it. 

In all cases, we have agreement that bit-pattern dtypes should be added to 
NumPy.  We should work on these (int32, float64, complex64, str, bool) to 
start. So, the three proposals are independent of this way forward. The 
proposals are all about the extra mask part:  

My three proposals: 

* do nothing and leave things as is 

* add a global flag that turns off masked array support by default but 
otherwise leaves things unchanged (I'm still unclear how this would work 
exactly)

* move Mark's "masked ndarray objects" into a new fundamental type 
(ndmasked), leaving the actual ndarray type unchanged.  The array_interface 
keeps the masked array notions and the ufuncs keep the ability to handle arrays 
like ndmasked. Ideally, numpy.ma would be changed to use ndmasked objects as 
their core. 

For the record, I'm currently in favor of the third proposal.   Feel free to 
comment on these proposals (or provide your own). 

Best regards,

-Travis



Re: [Numpy-discussion] Missing data again

2012-03-15 Thread Nathaniel Smith
Hi Chuck,

I think I let my frustration get the better of me, and the message
below is too confrontational. I apologize.

I truly would like to understand where you're coming from on this,
though, so I'll try to make this more productive. My summary of points
that no-one has disagreed with yet is here:
  https://github.com/njsmith/numpy/wiki/NA-discussion-status
Of course, this means that there's lots that's left out. Instead of
getting into all those contentious details, I'll stick to just a few
basic questions that might let us get at least of bit of common
ground:
1) Do you disagree with anything that is stated there?
2) Do you feel like that document accurately summarises your basic
idea of what this feature is supposed to do (I assume under the
IGNORED heading)?

Thanks,
-- Nathaniel

On Wed, Mar 7, 2012 at 11:10 PM, Nathaniel Smith  wrote:
> On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris
>  wrote:
>>
>>
>> On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith  wrote:
>>> When it comes to "missing data", bitpatterns can do everything that
>>> masks can do, are no more complicated to implement, and have better
>>> performance characteristics.
>>>
>>
>> Maybe for float, for other things, no. And we have lots of other things.
>
> It would be easier to discuss this if you'd, like, discuss :-(. If you
> know of some advantage that masks have over bitpatterns when it comes
> to missing data, can you please share it, instead of just asserting
> it?
>
> Not that I'm immune... I perhaps should have been more explicit
> myself, when I said "performance characteristics", let me clarify that
> I was thinking of both speed (for floats) and memory (for
> most-but-not-all things).
>
>> The
>> performance is a strawman,
>
> How many users need to speak up to say that this is a serious problem
> they have with the current implementation before you stop calling it a
> strawman? Because when Wes says that it's not going to fly for his
> stats/econometics cases, and the neuroimaging folk like Gary and Matt
> say it's not going to fly for their use cases... surely just waving
> that away is a bit dismissive?
>
> I'm not saying that we *have* to implement bitpatterns because
> performance is *the most important feature* -- I'm just saying, well,
> what I said. For *missing data use* cases, bitpatterns have better
> performance characteristics than masks. If we decide that these use
> cases are important, then we should take this into account and weigh
> it against other considerations. Maybe what you think is that these
> use cases shouldn't be the focus of this feature and it should focus
> on the "ignored" use cases instead? That would be a legitimate
> argument... but if that's what you want to say, say it, don't just
> dismiss your users!
>
>> and it *isn't* easier to implement.
>
> If I thought bitpatterns would be easier to implement, I would have
> said so... What I said was that they're not harder. You have some
> extra complexity, mostly in casting, and some reduced complexity -- no
> need to allocate and manipulate the mask. (E.g., simple same-type
> assignments and slicing require special casing for masks, but not for
> bitpatterns.) In many places the complexity is identical -- printing
> routines need to check for either special bitpatterns or masked
> values, whatever. Ufunc loops need to either find the appropriate part
> of the mask, or create a temporary mask buffer by calling a dtype
> func, whatever. On net they seem about equivalent, complexity-wise.
>
> ...I assume you disagree with this analysis, since I've said it
> before, wrote up a sketch for how the implementation would work at the
> C level, etc., and you continue to claim that simplicity is a
> compelling advantage for the masked approach. But I still don't know
> why you think that :-(.
>
>>> > Also, different folks adopt different values
>>> > for 'missing' data, and distributing one or several masks along with the
>>> > data is another common practice.
>>>
>>> True, but not really relevant to the current debate, because you have
>>> to handle such issues as part of your general data import workflow
>>> anyway, and none of these is any more complicated no matter which
>>> implementations are available.
>>>
>>> > One inconvenience I have run into with the current API is that is should
>>> > be
>>> > easier to clear the mask from an "ignored" value without taking a new
>>> > view
>>> > or assigning known data. So maybe two types of masks (different
>>> > payloads),
>>> > or an additional flag could be helpful. The process of assigning masks
>>> > could
>>> > also be made a bit easier than using fancy indexing.
>>>
>>> So this, uh... this was actually the whole goal of the "alterNEP"
>>> design for masks -- making all this stuff easy for people (like you,
>>> apparently?) that want support for ignored values, separately from
>>> missing data, and want a nice clean API for it. Basically having a
>>> separate .mask attribute which was an ordinary, ass

Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 7:39 PM, Benjamin Root  wrote:
> On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith  wrote:
>> When it comes to "missing data", bitpatterns can do everything that
>> masks can do, are no more complicated to implement, and have better
>> performance characteristics.
>>
>
> Not true.  bitpatterns inherently destroy the data, while masks do not.

Yes, that's why I only wrote that this is true for "missing data", not
in general :-). If you have data that is being destroyed, then that's
not missing data, by definition. We don't have consensus yet on
whether that's the use case we are aiming for, but it's the one that
Pierre was worrying about.

> For matplotlib, we cannot use bitpatterns because they could overwrite user
> data (or we have to copy the data).  I would imagine other extension writers
> would have similar issues when they need to play around with input data in a
> safe manner.

Right. You clearly need some sort of masking, either an explicit mask
array that you keep somewhere, or one that gets attached to the
underlying ndarray in some non-destructive way.
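
For concreteness, here is a minimal sketch of the "explicit mask kept
somewhere" approach using today's numpy.ma, which attaches the mask without
copying or modifying the user's data (copy=False is the default):

>>> import numpy as np
>>> data = np.array([1.0, 2.0, 3.0, 4.0])
>>> bad = np.array([False, True, False, False])   # mask maintained separately
>>> m = np.ma.MaskedArray(data, mask=bad)         # wraps data, no copy
>>> m.sum()                                       # 1.0 + 3.0 + 4.0
8.0
>>> data                                          # original values untouched
array([ 1.,  2.,  3.,  4.])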

> Also, I doubt that the performance characteristics for strings and integers
> are the same as they are for masks.

Not sure what you mean by this, but I'd be happy to hear more.

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris
 wrote:
>
>
> On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith  wrote:
>> When it comes to "missing data", bitpatterns can do everything that
>> masks can do, are no more complicated to implement, and have better
>> performance characteristics.
>>
>
> Maybe for float, for other things, no. And we have lots of other things.

It would be easier to discuss this if you'd, like, discuss :-(. If you
know of some advantage that masks have over bitpatterns when it comes
to missing data, can you please share it, instead of just asserting
it?

Not that I'm immune... I perhaps should have been more explicit
myself, when I said "performance characteristics", let me clarify that
I was thinking of both speed (for floats) and memory (for
most-but-not-all things).

> The
> performance is a strawman,

How many users need to speak up to say that this is a serious problem
they have with the current implementation before you stop calling it a
strawman? Because when Wes says that it's not going to fly for his
stats/econometrics cases, and the neuroimaging folk like Gary and Matt
say it's not going to fly for their use cases... surely just waving
that away is a bit dismissive?

I'm not saying that we *have* to implement bitpatterns because
performance is *the most important feature* -- I'm just saying, well,
what I said. For *missing data use* cases, bitpatterns have better
performance characteristics than masks. If we decide that these use
cases are important, then we should take this into account and weigh
it against other considerations. Maybe what you think is that these
use cases shouldn't be the focus of this feature and it should focus
on the "ignored" use cases instead? That would be a legitimate
argument... but if that's what you want to say, say it, don't just
dismiss your users!

> and it *isn't* easier to implement.

If I thought bitpatterns would be easier to implement, I would have
said so... What I said was that they're not harder. You have some
extra complexity, mostly in casting, and some reduced complexity -- no
need to allocate and manipulate the mask. (E.g., simple same-type
assignments and slicing require special casing for masks, but not for
bitpatterns.) In many places the complexity is identical -- printing
routines need to check for either special bitpatterns or masked
values, whatever. Ufunc loops need to either find the appropriate part
of the mask, or create a temporary mask buffer by calling a dtype
func, whatever. On net they seem about equivalent, complexity-wise.

...I assume you disagree with this analysis, since I've said it
before, wrote up a sketch for how the implementation would work at the
C level, etc., and you continue to claim that simplicity is a
compelling advantage for the masked approach. But I still don't know
why you think that :-(.

>> > Also, different folks adopt different values
>> > for 'missing' data, and distributing one or several masks along with the
>> > data is another common practice.
>>
>> True, but not really relevant to the current debate, because you have
>> to handle such issues as part of your general data import workflow
>> anyway, and none of these is any more complicated no matter which
>> implementations are available.
>>
>> > One inconvenience I have run into with the current API is that it should
>> > be
>> > easier to clear the mask from an "ignored" value without taking a new
>> > view
>> > or assigning known data. So maybe two types of masks (different
>> > payloads),
>> > or an additional flag could be helpful. The process of assigning masks
>> > could
>> > also be made a bit easier than using fancy indexing.
>>
>> So this, uh... this was actually the whole goal of the "alterNEP"
>> design for masks -- making all this stuff easy for people (like you,
>> apparently?) that want support for ignored values, separately from
>> missing data, and want a nice clean API for it. Basically having a
>> separate .mask attribute which was an ordinary, assignable array
>> broadcastable to the attached array's shape. Nobody seemed interested
>> in talking about it much then but maybe there's interest now?
>>
>
> Come off it, Nathaniel, the problem is minor and fixable. The intent of the
> initial implementation was to discover such things.

Implementation can be wonderful, I absolutely agree. But you
understand that I'd be more impressed by this example if your
discovery weren't something I had been arguing for since before the
implementation began :-).

> These things are less
> accessible with the current API *precisely* because of the feedback from R
> users. It didn't start that way.
>
> We now have something to evolve into what we want. That is a heck of a lot
> more useful than endless discussion.

No, you are still missing the point completely! There is no "what *we*
want", because what you want is different than what I want. The
masking stuff in the alterNEP was an attempt to give people like you
who wanted "ignore

Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Eric Firing
On 03/07/2012 11:15 AM, Pierre Haessig wrote:
> Hi,
> Le 07/03/2012 20:57, Eric Firing a écrit :
>> In other words, good low-level support for numpy.ma functionality?
> Coming back to *existing* ma support, I was just wondering whether it
> was now possible to "np.save" a masked array.
> (I'm using numpy 1.5)

No, not with the mask preserved.  This is one of the improvements I am 
hoping for with the upcoming missing data work.
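
In the meantime, one workaround (just a sketch, with an arbitrary file name)
is to store the data and the mask side by side with np.savez and rebuild the
masked array on load:

>>> import numpy as np
>>> a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
>>> np.savez('a_with_mask.npz', data=a.data, mask=a.mask)   # save both pieces
>>> f = np.load('a_with_mask.npz')
>>> b = np.ma.MaskedArray(f['data'], mask=f['mask'])        # mask survives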

Eric

> In the end, this is the most annoying problem I have with the existing
> ma module which otherwise is pretty useful to me. I'm happy not to need
> to process 100% of my data though.
>
> Best,
> Pierre
>
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Pierre Haessig
Hi,
Le 07/03/2012 20:57, Eric Firing a écrit :
> In other words, good low-level support for numpy.ma functionality?
Coming back to *existing* ma support, I was just wondering whether it
was now possible to "np.save" a masked array.
(I'm using numpy 1.5)
In the end, this is the most annoying problem I have with the existing
ma module which otherwise is pretty useful to me. I'm happy not to need
to process 100% of my data though.

Best,
Pierre



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Eric Firing
On 03/07/2012 09:26 AM, Nathaniel Smith wrote:
> On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
>   wrote:
>> On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig
>>> Coming back to Travis proposition "bit-pattern approaches to missing
>>> data (*at least* for float64 and int32) need to be implemented.", I
>>> wonder what is the amount of extra work to go from nafloat64 to
>>> nafloat32/16 ? Is there hardware support for NaN payloads with these
>>> smaller floats ? If not, or if it is too complicated, I feel it is
>>> acceptable to say "it's too complicated" and fall back to mask. One may
>>> have to choose between fancy types and fancy NAs...
>>
>> I'm in agreement here, and that was a major consideration in making a
>> 'masked' implementation first.
>
> When it comes to "missing data", bitpatterns can do everything that
> masks can do, are no more complicated to implement, and have better
> performance characteristics.
>
>> Also, different folks adopt different values
>> for 'missing' data, and distributing one or several masks along with the
>> data is another common practice.
>
> True, but not really relevant to the current debate, because you have
> to handle such issues as part of your general data import workflow
> anyway, and none of these is any more complicated no matter which
> implementations are available.
>
>> One inconvenience I have run into with the current API is that it should be
>> easier to clear the mask from an "ignored" value without taking a new view
>> or assigning known data. So maybe two types of masks (different payloads),
>> or an additional flag could be helpful. The process of assigning masks could
>> also be made a bit easier than using fancy indexing.
>
> So this, uh... this was actually the whole goal of the "alterNEP"
> design for masks -- making all this stuff easy for people (like you,
> apparently?) that want support for ignored values, separately from
> missing data, and want a nice clean API for it. Basically having a
> separate .mask attribute which was an ordinary, assignable array
> broadcastable to the attached array's shape. Nobody seemed interested
> in talking about it much then but maybe there's interest now?

In other words, good low-level support for numpy.ma functionality?  With 
a migration path so that a separate numpy.ma might wither away?  Yes, 
there is interest; this is exactly what I think is needed for my own 
style of applications (which I think are common at least in geoscience), 
and for matplotlib.  The question is how to achieve it as simply and 
cleanly as possible while also satisfying the needs of the R users, and 
while making it easy for matplotlib, for example, to handle *any* 
reasonable input: ma, other masking, nan, or NA-bitpattern.

It may be that a rather pragmatic approach to implementation will prove 
better than a highly idealized set of data models.  Or, it may be that a 
dual approach is best, in which the flag value missing data 
implementation is tightly bound to the R model and the mask 
implementation is explicitly designed for the numpy.ma model. In any 
case, a reasonable level of agreement on the goals is needed.  I presume 
Travis's involvement will facilitate a clarification of the goals and of 
the implementation; and I expect that much of Mark's work will end up 
serving well, even if much needs to be added and the API evolves 
considerably.

Eric

>
> -- Nathaniel
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Matthew Brett
Hi,

On Wed, Mar 7, 2012 at 11:37 AM, Charles R Harris
 wrote:
>
>
> On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith  wrote:
>>
>> On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
>>  wrote:
>> > On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig
>> > 
>> >> Coming back to Travis proposition "bit-pattern approaches to missing
>> >> data (*at least* for float64 and int32) need to be implemented.", I
>> >> wonder what is the amount of extra work to go from nafloat64 to
>> >> nafloat32/16 ? Is there hardware support for NaN payloads with these
>> >> smaller floats ? If not, or if it is too complicated, I feel it is
>> >> acceptable to say "it's too complicated" and fall back to mask. One may
>> >> have to choose between fancy types and fancy NAs...
>> >
>> > I'm in agreement here, and that was a major consideration in making a
>> > 'masked' implementation first.
>>
>> When it comes to "missing data", bitpatterns can do everything that
>> masks can do, are no more complicated to implement, and have better
>> performance characteristics.
>>
>
> Maybe for float, for other things, no. And we have lots of other things. The
> performance is a strawman, and it *isn't* easier to implement.
>
>>
>> > Also, different folks adopt different values
>> > for 'missing' data, and distributing one or several masks along with the
>> > data is another common practice.
>>
>> True, but not really relevant to the current debate, because you have
>> to handle such issues as part of your general data import workflow
>> anyway, and none of these is any more complicated no matter which
>> implementations are available.
>>
>> > One inconvenience I have run into with the current API is that it should
>> > be
>> > easier to clear the mask from an "ignored" value without taking a new
>> > view
>> > or assigning known data. So maybe two types of masks (different
>> > payloads),
>> > or an additional flag could be helpful. The process of assigning masks
>> > could
>> > also be made a bit easier than using fancy indexing.
>>
>> So this, uh... this was actually the whole goal of the "alterNEP"
>> design for masks -- making all this stuff easy for people (like you,
>> apparently?) that want support for ignored values, separately from
>> missing data, and want a nice clean API for it. Basically having a
>> separate .mask attribute which was an ordinary, assignable array
>> broadcastable to the attached array's shape. Nobody seemed interested
>> in talking about it much then but maybe there's interest now?
>>
>
> Come off it, Nathaniel, the problem is minor and fixable. The intent of the
> initial implementation was to discover such things. These things are less
> accessible with the current API *precisely* because of the feedback from R
> users. It didn't start that way.
>
> We now have something to evolve into what we want. That is a heck of a lot
> more useful than endless discussion.

The endless discussion is for the following reason:

- The discussion was never adequately resolved.

The discussion was never adequately resolved because there was not
enough work done to understand the various arguments.   In particular,
you've several times said things that indicate to me, as to Nathaniel,
that you either have not read or have not understood the points that
Nathaniel was making.

Travis' recent email - to me - also indicates that there is still a
genuine problem here that has not been adequately explored.

There is no future in trying to stop discussion, and trying to do so
will only prolong it and make it less useful.  It will make the
discussion - endless.

If you want to help - read the alterNEP, respond to it directly, and
further the discussion by engaged debate.

Best,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Benjamin Root
On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith  wrote:

> On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
>  wrote:
> > On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig  >
> >> Coming back to Travis proposition "bit-pattern approaches to missing
> >> data (*at least* for float64 and int32) need to be implemented.", I
> >> wonder what is the amount of extra work to go from nafloat64 to
> >> nafloat32/16 ? Is there hardware support for NaN payloads with these
> >> smaller floats ? If not, or if it is too complicated, I feel it is
> >> acceptable to say "it's too complicated" and fall back to mask. One may
> >> have to choose between fancy types and fancy NAs...
> >
> > I'm in agreement here, and that was a major consideration in making a
> > 'masked' implementation first.
>
> When it comes to "missing data", bitpatterns can do everything that
> masks can do, are no more complicated to implement, and have better
> performance characteristics.
>
>
Not true.  bitpatterns inherently destroy the data, while masks do not.
For matplotlib, we cannot use bitpatterns because they could overwrite user
data (or we have to copy the data).  I would imagine other extension
writers would have similar issues when they need to play around with input
data in a safe manner.

Also, I doubt that the performance characteristics for strings and integers
are the same as they are for masks.

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith  wrote:

> On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
>  wrote:
> > On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig  >
> >> Coming back to Travis proposition "bit-pattern approaches to missing
> >> data (*at least* for float64 and int32) need to be implemented.", I
> >> wonder what is the amount of extra work to go from nafloat64 to
> >> nafloat32/16 ? Is there an hardware support NaN payloads with these
> >> smaller floats ? If not, or if it is too complicated, I feel it is
> >> acceptable to say "it's too complicated" and fall back to mask. One may
> >> have to choose between fancy types and fancy NAs...
> >
> > I'm in agreement here, and that was a major consideration in making a
> > 'masked' implementation first.
>
> When it comes to "missing data", bitpatterns can do everything that
> masks can do, are no more complicated to implement, and have better
> performance characteristics.
>
>
Maybe for float, for other things, no. And we have lots of other things. The
performance is a strawman, and it *isn't* easier to implement.


> > Also, different folks adopt different values
> > for 'missing' data, and distributing one or several masks along with the
> > data is another common practice.
>
> True, but not really relevant to the current debate, because you have
> to handle such issues as part of your general data import workflow
> anyway, and none of these is any more complicated no matter which
> implementations are available.
>
> > One inconvenience I have run into with the current API is that it should
> be
> > easier to clear the mask from an "ignored" value without taking a new
> view
> > or assigning known data. So maybe two types of masks (different
> payloads),
> > or an additional flag could be helpful. The process of assigning masks
> could
> > also be made a bit easier than using fancy indexing.
>
> So this, uh... this was actually the whole goal of the "alterNEP"
> design for masks -- making all this stuff easy for people (like you,
> apparently?) that want support for ignored values, separately from
> missing data, and want a nice clean API for it. Basically having a
> separate .mask attribute which was an ordinary, assignable array
> broadcastable to the attached array's shape. Nobody seemed interested
> in talking about it much then but maybe there's interest now?
>
>
Come off it, Nathaniel, the problem is minor and fixable. The intent of the
initial implementation was to discover such things. These things are less
accessible with the current API *precisely* because of the feedback from R
users. It didn't start that way.

We now have something to evolve into what we want. That is a heck of a lot
more useful than endless discussion.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 wrote:
> On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig 
>> Coming back to Travis proposition "bit-pattern approaches to missing
>> data (*at least* for float64 and int32) need to be implemented.", I
>> wonder what is the amount of extra work to go from nafloat64 to
>> nafloat32/16 ? Is there hardware support for NaN payloads with these
>> smaller floats ? If not, or if it is too complicated, I feel it is
>> acceptable to say "it's too complicated" and fall back to mask. One may
>> have to choose between fancy types and fancy NAs...
>
> I'm in agreement here, and that was a major consideration in making a
> 'masked' implementation first.

When it comes to "missing data", bitpatterns can do everything that
masks can do, are no more complicated to implement, and have better
performance characteristics.

> Also, different folks adopt different values
> for 'missing' data, and distributing one or several masks along with the
> data is another common practice.

True, but not really relevant to the current debate, because you have
to handle such issues as part of your general data import workflow
anyway, and none of these is any more complicated no matter which
implementations are available.

> One inconvenience I have run into with the current API is that it should be
> easier to clear the mask from an "ignored" value without taking a new view
> or assigning known data. So maybe two types of masks (different payloads),
> or an additional flag could be helpful. The process of assigning masks could
> also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the "alterNEP"
design for masks -- making all this stuff easy for people (like you,
apparently?) that want support for ignored values, separately from
missing data, and want a nice clean API for it. Basically having a
separate .mask attribute which was an ordinary, assignable array
broadcastable to the attached array's shape. Nobody seemed interested
in talking about it much then but maybe there's interest now?

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 11:21 AM, Lluís  wrote:

> Charles R Harris writes:
> [...]
> > One inconvenience I have run into with the current API is that it should
> be
> > easier to clear the mask from an "ignored" value without taking a new
> view or
> > assigning known data.
>
> AFAIR, the inability to directly access a "mask" attribute was intentional
> to
> make bit-patterns and masks indistinguishable from the POV of the array
> user.
>
> What's the workflow that leads you to un-ignore specific elements?
>
>
>
Because they are not 'unknown', just (temporarily) 'ignored'. This might be
the case if you are experimenting with what happens if certain data is left
out of a fit. The current implementation tries to handle both these cases,
and can do so; I would just like the 'ignored' use to be more convenient
than it is.


> > So maybe two types of masks (different payloads), or an additional flag
> could
> > be helpful.
>
> Do you mean different NA values? If that's the case, I think it was taken
> into
> account when implementing the current mechanisms (and was also mentioned
> in the
> NEP), so that it could be supported by both bit-patterns and masks (as one
> of
> the main design points was to make them indistinguishable in the common
> case).
>
>
No, the mask as currently implemented is eight bits and can be extended to
handle different mask values, aka, payloads.


> I think the name was "parametrized dtypes".
>
>
They don't interest me in the least. But that is a whole different area of
discussion.


>
> > The process of assigning masks could also be made a bit easier than using
> > fancy indexing.
>
> I don't get what you mean here, sorry.
>
>
Suppose I receive a data set, say an hdf file, that also includes a mask.
I'd like to load the data and apply the mask directly without doing
something like

data[mask] = np.NA
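
For comparison, with numpy.ma the mask can be attached without touching the
data at all. A rough sketch -- h5py and the 'data'/'mask' dataset names here
are only illustrative assumptions about the file layout:

import h5py
import numpy as np

f = h5py.File('measurements.h5', 'r')      # hypothetical file
data = f['data'][...]                      # read the values
mask = f['mask'][...].astype(bool)         # read the accompanying mask
marr = np.ma.MaskedArray(data, mask=mask)  # attach it; data is left as-is
f.close()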


> Do you mean here that this is too cumbersome to use?
>
>>>> a[a < 5] = np.NA
>
> (obviously oversimplified example where everything looks sufficiently
> simple :))
>
>
Mostly speed and memory.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Lluís
Charles R Harris writes:
[...]
> One inconvenience I have run into with the current API is that it should be
> easier to clear the mask from an "ignored" value without taking a new view or
> assigning known data.

AFAIR, the inability to directly access a "mask" attribute was intentional to
make bit-patterns and masks indistinguishable from the POV of the array user.

What's the workflow that leads you to un-ignore specific elements?


> So maybe two types of masks (different payloads), or an additional flag could
> be helpful.

Do you mean different NA values? If that's the case, I think it was taken into
account when implementing the current mechanisms (and was also mentioned in the
NEP), so that it could be supported by both bit-patterns and masks (as one of
the main design points was to make them indistinguishable in the common case).

I think the name was "parametrized dtypes".


> The process of assigning masks could also be made a bit easier than using
> fancy indexing.

I don't get what you mean here, sorry.

Do you mean here that this is too cumbersome to use?

>>> a[a < 5] = np.NA

(obviously oversimplified example where everything looks sufficiently simple :))




Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig wrote:

> Hi,
>
> Thanks you very much for your lights !
>
> Le 06/03/2012 21:59, Nathaniel Smith a écrit :
> > Right -- R has a very impoverished type system as compared to numpy.
> > There's basically four types: "numeric" (meaning double precision
> > float), "integer", "logical" (boolean), and "character" (string). And
> > in practice the integer type is essentially unused, because R parses
> > numbers like "1" as being floating point, not integer; the only way to
> > get an integer value is to explicitly cast to it. Each of these types
> > has a specific bit-pattern set aside for representing NA. And...
> > that's it. It's very simple when it works, but also very limited.
> I also suspected R to be less powerful in terms of types.
> However, I think  the fact that "It's very simple when it works" is
> important to take into account. At the end of the day, when using all
> the fanciness it is not only about "can I have some NAs in my array ?"
> but also "how *easily* can I have some NAs in my array ?". It's about
> balancing the "how easy" and the "how powerful".
>
> The ease of use is the reason for my concern about having separate
> types "nafloatNN" and "floatNN". Of course, I won't argue that "not
> breaking everything" is even more important !!
>
> Coming back to Travis proposition "bit-pattern approaches to missing
> data (*at least* for float64 and int32) need to be implemented.", I
> wonder what is the amount of extra work to go from nafloat64 to
> nafloat32/16 ? Is there hardware support for NaN payloads with these
> smaller floats ? If not, or if it is too complicated, I feel it is
> acceptable to say "it's too complicated" and fall back to mask. One may
> have to choose between fancy types and fancy NAs...
>
>
I'm in agreement here, and that was a major consideration in making a
'masked' implementation first. Also, different folks adopt different values
for 'missing' data, and distributing one or several masks along with the
data is another common practice.

One inconvenience I have run into with the current API is that it should be
easier to clear the mask from an "ignored" value without taking a new view
or assigning known data. So maybe two types of masks (different payloads),
or an additional flag could be helpful. The process of assigning masks
could also be made a bit easier than using fancy indexing.
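
For what it's worth, numpy.ma's soft mask already supports that un-ignore
workflow, e.g. (a quick sketch):

>>> import numpy as np
>>> a = np.ma.array([1.0, 2.0, 3.0])
>>> a[1] = np.ma.masked   # ignore the value; 2.0 is kept underneath
>>> a.sum()               # 1.0 + 3.0
4.0
>>> a.mask[1] = False     # clear the mask without assigning new data
>>> a[1]
2.0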

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 4:35 PM, Pierre Haessig  wrote:
> Hi,
>
> Thanks you very much for your lights !
>
> Le 06/03/2012 21:59, Nathaniel Smith a écrit :
>> Right -- R has a very impoverished type system as compared to numpy.
>> There's basically four types: "numeric" (meaning double precision
>> float), "integer", "logical" (boolean), and "character" (string). And
>> in practice the integer type is essentially unused, because R parses
>> numbers like "1" as being floating point, not integer; the only way to
>> get an integer value is to explicitly cast to it. Each of these types
>> has a specific bit-pattern set aside for representing NA. And...
>> that's it. It's very simple when it works, but also very limited.
> I also suspected R to be less powerful in terms of types.
> However, I think  the fact that "It's very simple when it works" is
> important to take into account. At the end of the day, when using all
> the fanciness it is not only about "can I have some NAs in my array ?"
> but also "how *easily* can I have some NAs in my array ?". It's about
> balancing the "how easy" and the "how powerful".
>
> The ease of use is the reason for my concern about having separate
> types "nafloatNN" and "floatNN". Of course, I won't argue that "not
> breaking everything" is even more important !!

It's a good point, I just don't see how we can really tell what the
trade-offs are at this point. You should bring this up again once more
of the big picture stuff is hammered out.

> Coming back to Travis proposition "bit-pattern approaches to missing
> data (*at least* for float64 and int32) need to be implemented.", I
> wonder what is the amount of extra work to go from nafloat64 to
> nafloat32/16 ? Is there hardware support for NaN payloads with these
> smaller floats ? If not, or if it is too complicated, I feel it is
> acceptable to say "it's too complicated" and fall back to mask. One may
> have to choose between fancy types and fancy NAs...

All modern floating point formats can represent NaNs with payloads, so
in principle there's no difficulty in supporting NA the same way for
all of them. If you're using float16 because you want to offload
computation to a GPU then I would test carefully before trusting the
GPU to handle NaNs correctly, and there may need to be a bit of care
to make sure that casts between these types properly map NAs to NAs,
but generally it should be fine.
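
A quick float32 sanity check of the payload idea (hardware- and
compiler-dependent, so worth re-running on the actual target):

>>> import numpy as np
>>> x = np.array([0x7fc00001], dtype=np.uint32).view(np.float32)
>>> x                          # a float32 NaN carrying a nonzero payload
array([ nan], dtype=float32)
>>> (x + 1.0).view(np.uint32)  # on typical hardware the payload survives
array([2143289345], dtype=uint32)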

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Pierre Haessig
Hi,

Thanks you very much for your lights !

Le 06/03/2012 21:59, Nathaniel Smith a écrit :
> Right -- R has a very impoverished type system as compared to numpy.
> There's basically four types: "numeric" (meaning double precision
> float), "integer", "logical" (boolean), and "character" (string). And
> in practice the integer type is essentially unused, because R parses
> numbers like "1" as being floating point, not integer; the only way to
> get an integer value is to explicitly cast to it. Each of these types
> has a specific bit-pattern set aside for representing NA. And...
> that's it. It's very simple when it works, but also very limited.
I also suspected R to be less powerful in terms of types.
However, I think  the fact that "It's very simple when it works" is
important to take into account. At the end of the day, when using all
the fanciness it is not only about "can I have some NAs in my array ?"
but also "how *easily* can I have some NAs in my array ?". It's about
balancing the "how easy" and the "how powerful".

The ease of use is the reason for my concern about having separate
types "nafloatNN" and "floatNN". Of course, I won't argue that "not
breaking everything" is even more important !!

Coming back to Travis proposition "bit-pattern approaches to missing
data (*at least* for float64 and int32) need to be implemented.", I
wonder what is the amount of extra work to go from nafloat64 to
nafloat32/16 ? Is there hardware support for NaN payloads with these
smaller floats ? If not, or if it is too complicated, I feel it is
acceptable to say "it's too complicated" and fall back to mask. One may
have to choose between fancy types and fancy NAs...

Best,
Pierre



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Tue, Mar 6, 2012 at 9:14 PM, Ralf Gommers
 wrote:
> On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith  wrote:
>> On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant 
>> wrote:
>> > Hi all,
>>
>> Hi Travis,
>>
>> Thanks for bringing this back up.
>>
>> Have you looked at the summary from the last thread?
>>  https://github.com/njsmith/numpy/wiki/NA-discussion-status
>
> Re-reading that summary and the main documents and threads linked from it, I
> could find either examples of statistical software that treats missing and
> ignored data explicitly separately, or links to relevant literature. Those
> would probably help the discussion a lot.

(I think you mean "couldn't find"?)

I'm not aware of any software that supports the IGNORED concept at
all, whether in combination with missing data or not. np.ma is
probably the closest example. I think we'd be breaking new ground
there. This is also probably why it is less clear how it should work
:-).

IIUC, the basic reason that people want IGNORED in the core is that it
provides convenience and syntactic sugar for efficient "in place"
operation on subsets of large arrays. So there are actually two parts
there -- the efficient operation, and the convenience/syntactic sugar.
The key feature for efficient operation is the where= feature, which
is not controversial at all. So, there's an argument that for now we
should focus on where=, give people some time to work with it, and
then use that experience to decide what kind of convenience/sugar
would be useful, if any. But, that's just my own idea; I definitely
can't claim any consensus on it.
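
To make the where= part concrete, a small sketch (the ufunc where=/out=
form is the one from the 1.7 work; the reduction form only appeared in much
later NumPy releases):

>>> import numpy as np
>>> a = np.arange(5.0)
>>> keep = np.array([True, False, True, True, False])
>>> np.add(a, 10.0, out=a, where=keep)   # in place, only where keep is True
array([ 10.,   1.,  12.,  13.,   4.])
>>> a.sum(where=keep)                    # reduction form, newer NumPy only
35.0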

>> In project management terms, I see three options:
>> 1) Put a big warning label on the functionality and leave it for now
>> ("If this option is given, np.asarray returns a masked array. NOTE: IN
>> THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY
>> WEASELS. NO GUARANTEES.")
>
> I've opened http://projects.scipy.org/numpy/ticket/2072 for that.

Cool, thanks.

> Assuming
> we stick with this option, I'd appreciate it if you could check in the first
> beta that comes out whether or not the warnings are obvious enough and in
> all the right places. There probably won't be weasels though:)

Of course. I've added myself to the CC list. (Err, if the beta won't
be for a bit, though, then please remind me if you remember? I'm
juggling a lot of balls right now.)

>> 2) Move the code back out of mainline and into a branch until until
>> there's consensus.
>> 3) Hold up the release until this is all sorted.
>>
>> I come from the project-management school that says you should always
>> have a releasable mainline, keep unready code in branches, and never
>> hold up the release for features, so (2) seems obvious to me.
>
> While it may sound obvious, I hope you've understood why in practice it's
> not at all obvious and why you got such strong reactions to your proposal of
> taking out all that code. If not, just look at what happened with the
> numpy-refactor work.

Of course, and that's why I'm not pressing the point. These trade-offs
might be worth talking about at some point -- there are reasons that
basically all the major FOSS projects have moved towards time-based
releases :-) -- but that'd be a huge discussion at a time when we
already have more than enough of those on our plate...

>> But I seem to be very much in the minority on that[1], so oh well :-). I
>> don't have any objection to (1), personally. (3) seems like a bad
>> idea. Just my 2 pence.
>
>
> Agreed that (3) is a bad idea. +1 for (1).
>
> Ralf
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

Cheers,
-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Ralf Gommers
On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith  wrote:

> On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant 
> wrote:
> > Hi all,
>
> Hi Travis,
>
> Thanks for bringing this back up.
>
> Have you looked at the summary from the last thread?
>  https://github.com/njsmith/numpy/wiki/NA-discussion-status
>

Re-reading that summary and the main documents and threads linked from it,
I could find either examples of statistical software that treats missing
and ignored data explicitly separately, or links to relevant literature.
Those would probably help the discussion a lot.

The goal was to try and at least work out what points we all *could*
> agree on, to have some common footing for further discussion. I won't
> copy the whole thing here, but I'd summarize the state as:
>  -- It's pretty clear that there are two fairly different conceptual
> models/use cases in play here. For one of them (R-style "missing data"
> cases) it's pretty clear what the desired semantics would be. For the
> other (temporary "ignored values") there's still substantive
> disagreement.
>  -- We *haven't* yet established what we want numpy to actually support.
>
> IMHO the critical next step is this latter one -- maybe we want to
> fully support both use cases. Maybe it's really only one of them
> that's worth trying to support in the numpy core right now. Maybe it's
> just one of them, but it's worth doing so thoroughly that it should
> have multiple implementations. Or whatever.
>
> I fear that if we don't talk about these big picture questions and
> just wade directly back into round-and-round arguments about API
> details then we'll never get anywhere.
>
> [...]
> > Because it is slated to go into release 1.7, we need to re-visit the
> masked array discussion again.The NEP process is the appropriate one
> and I'm glad we are taking that route for these discussions.   My goal is
> to get consensus in order for code to get into NumPy (regardless of who
> writes the code).It may be that we don't come to a consensus
> (reasonable and intelligent people can disagree on things --- look at the
> coming election...).   We can represent different parts of what is
> fortunately a very large user-base of NumPy users.
> >
> > First of all, I want to be clear that I think there is much great work
> that has been done in the current missing data code.  There are some nice
> features in the where clause of the ufunc and the machinery for the
> iterator that allows re-using ufunc loops that are not re-written to check
> for missing data.   I'm sure there are other things as well that I'm not
> quite aware of yet.However, I don't think the API presented to the
> numpy user presently is the correct one for NumPy 1.X.
> >
> > A few particulars:
> >
> >* the reduction operations need to default to "skipna" --- this
> is the most common use case which has been re-inforced again to me today by
> a new user to Python who is using masked arrays presently
>
> This is one of the points where the two conceptual models disagree
> (see also Skipper's point down-thread). If you have "missing data",
> then propagation has to be the default -- the sum of 1, 2, and
> I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there
> but you've asked numpy to temporarily ignore it, then, well, duh, of
> course it should ignore it.
>
> >* the mask needs to be visible to the user if they use that
> approach to missing data (people should be able to get a hold of the mask
> and work with it in Python)
>
> This is also a point where the two conceptual models disagree.
>
> Actually this is one of the original arguments we made against the NEP
> design -- that if you want missing data, then having a mask at all is
> counterproductive, and if you are ignoring data, then of course it
> should be easy to manipulate the ignore mask. The rationale for the
> current design is to compromise between these two approaches -- there
> is a mask, but it's hidden behind a curtain. Mostly. (This may be a
> compromise in the Solomonic sense.)
>
> >* bit-pattern approaches to missing data (at least for float64
> and int32) need to be implemented.
> >
> >* there should be some way when using "masks" (even if it's
> hidden from most users) for missing data to separate the low-level ufunc
> operation from the operation
> >   on the masks...
>
> I don't understand what this means.
>
> > I have heard from several users that they will *not use the missing
> data* in NumPy as currently implemented, and I can now see why.For
> better or for worse, my approach to software is generally very user-driven
> and very pragmatic.  On the other hand, I'm also a mathematician and
> appreciate the cognitive compression that can come out of well-formed
> structure.None-the-less, I'm an *applied* mathematician and am
> ultimately motivated by applications.
> >
> > I will get a hold of the NEP and spend some time with it to discuss some
> of this in 

Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Tue, Mar 6, 2012 at 4:38 PM, Mark Wiebe  wrote:
> On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig 
> wrote:
>> >From a potential user perspective, I feel it would be nice to have NA
>> and non-NA cases look as similar as possible. Your code example is
>> particularly striking : two different dtypes to store (from a user
>> perspective) the exact same content ! If this *could* be avoided, it
>> would be great...
>
> The biggest reason to keep the two types separate is performance. The
> straight float dtypes map directly to hardware floating-point operations,
> which can be very fast. The NA-float dtypes have to use additional logic to
> handle the NA values correctly. NA is treated as a particular NaN, and if
> the hardware float operations were used directly, NA would turn into NaN.
> This additional logic usually means more branches, so is slower.

Actually, no -- hardware float operations preserve NA-as-NaN. You
might well need to be careful around more exotic code like optimized
BLAS kernels, but all the basic ufuncs should Just Work at full speed.
Demo:

>>> def hexify(x): return hex(np.float64(x).view(np.int64))
>>> hexify(np.nan)
'0x7ff8000000000000L'
# IIRC this is R's NA bitpattern (presumably 1974 is someone's birthday)
>>> NA = np.int64(0x7ff8000000000000 + 1974).view(np.float64)
# It is an NaN...
>>> NA
nan
# But it has a distinct bitpattern:
>>> hexify(NA)
'0x7ff80000000007b6L'
# Like any NaN, it propagates through floating point operations:
>>> NA + 3
nan
# But, critically, so does the bitpattern; ordinary Python "+" is
# returning NA on this operation:
>>> hexify(NA + 3)
'0x7ff80000000007b6L'

This is how R does it, which is more evidence that this actually works
on real hardware.

There is one place where it fails. In a binary operation with *two*
NaN values, there's an ambiguity about which payload should be
returned. IEEE754 recommends just returning the first one. This means
that NA + NaN = NA, NaN + NA = NaN. This is ugly, but it's an obscure
case that nobody cares about, so it's probably worth it for the speed
gain. (In fact, if you type those two expressions at the R prompt,
then that's what you get, and I can't find any reference to anyone
even noticing this.)

>> I don't know how the NA machinery is working in R. Does it work with a
>> kind of "nafloat64" all the time or is there some type inference
>> mechanics involved in choosing the appropriate type ?
>
> My understanding of R is that it works with the "nafloat64" for all its
> operations, yes.

Right -- R has a very impoverished type system as compared to numpy.
There's basically four types: "numeric" (meaning double precision
float), "integer", "logical" (boolean), and "character" (string). And
in practice the integer type is essentially unused, because R parses
numbers like "1" as being floating point, not integer; the only way to
get an integer value is to explicitly cast to it. Each of these types
has a specific bit-pattern set aside for representing NA. And...
that's it. It's very simple when it works, but also very limited.

I'm still skeptical that we could make the floating point types
NA-aware by default -- until we have an implementation in hand, I'm
nervous there'd be some corner case that broke everything. (Maybe
ufuncs are fine but np.dot has an unavoidable overhead, or maybe it
would mess up casting from float types to non-NA-aware types, etc.)
But who knows. Probably not something we can really make a meaningful
decision about yet.

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant  wrote:
> Hi all,

Hi Travis,

Thanks for bringing this back up.

Have you looked at the summary from the last thread?
  https://github.com/njsmith/numpy/wiki/NA-discussion-status
The goal was to try and at least work out what points we all *could*
agree on, to have some common footing for further discussion. I won't
copy the whole thing here, but I'd summarize the state as:
  -- It's pretty clear that there are two fairly different conceptual
models/use cases in play here. For one of them (R-style "missing data"
cases) it's pretty clear what the desired semantics would be. For the
other (temporary "ignored values") there's still substantive
disagreement.
  -- We *haven't* yet established what we want numpy to actually support.

IMHO the critical next step is this latter one -- maybe we want to
fully support both use cases. Maybe it's really only one of them
that's worth trying to support in the numpy core right now. Maybe it's
just one of them, but it's worth doing so thoroughly that it should
have multiple implementations. Or whatever.

I fear that if we don't talk about these big picture questions and
just wade directly back into round-and-round arguments about API
details then we'll never get anywhere.

[...]
> Because it is slated to go into release 1.7, we need to re-visit the masked 
> array discussion again.    The NEP process is the appropriate one and I'm 
> glad we are taking that route for these discussions.   My goal is to get 
> consensus in order for code to get into NumPy (regardless of who writes the 
> code).    It may be that we don't come to a consensus (reasonable and 
> intelligent people can disagree on things --- look at the coming 
> election...).   We can represent different parts of what is fortunately a 
> very large user-base of NumPy users.
>
> First of all, I want to be clear that I think there is much great work that 
> has been done in the current missing data code.  There are some nice features 
> in the where clause of the ufunc and the machinery for the iterator that 
> allows re-using ufunc loops that are not re-written to check for missing 
> data.   I'm sure there are other things as well that I'm not quite aware of 
> yet.    However, I don't think the API presented to the numpy user presently 
> is the correct one for NumPy 1.X.
>
> A few particulars:
>
>        * the reduction operations need to default to "skipna" --- this is the 
> most common use case which has been re-inforced again to me today by a new 
> user to Python who is using masked arrays presently

This is one of the points where the two conceptual models disagree
(see also Skipper's point down-thread). If you have "missing data",
then propagation has to be the default -- the sum of 1, 2, and
I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there
but you've asked numpy to temporarily ignore it, then, well, duh, of
course it should ignore it.
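
The NaN-based helpers that already exist show the two defaults side by side:

>>> import numpy as np
>>> x = np.array([1.0, 2.0, np.nan])
>>> x.sum()        # propagating: the total is unknown
nan
>>> np.nansum(x)   # skipping: ignore the missing entry
3.0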

>        * the mask needs to be visible to the user if they use that approach 
> to missing data (people should be able to get a hold of the mask and work 
> with it in Python)

This is also a point where the two conceptual models disagree.

Actually this is one of the original arguments we made against the NEP
design -- that if you want missing data, then having a mask at all is
counterproductive, and if you are ignoring data, then of course it
should be easy to manipulate the ignore mask. The rationale for the
current design is to compromise between these two approaches -- there
is a mask, but it's hidden behind a curtain. Mostly. (This may be a
compromise in the Solomonic sense.)

>        * bit-pattern approaches to missing data (at least for float64 and 
> int32) need to be implemented.
>
>        * there should be some way when using "masks" (even if it's hidden 
> from most users) for missing data to separate the low-level ufunc operation 
> from the operation
>           on the masks...

I don't understand what this means.

> I have heard from several users that they will *not use the missing data* in 
> NumPy as currently implemented, and I can now see why.    For better or for 
> worse, my approach to software is generally very user-driven and very 
> pragmatic.  On the other hand, I'm also a mathematician and appreciate the 
> cognitive compression that can come out of well-formed structure.    
> None-the-less, I'm an *applied* mathematician and am ultimately motivated by 
> applications.
>
> I will get a hold of the NEP and spend some time with it to discuss some of 
> this in that document.   This will take several weeks (as PyCon is next week 
> and I have a tutorial I'm giving there).    For now, I do not think 1.7 can 
> be released unless the masked array is labeled *experimental*.

In project management terms, I see three options:
1) Put a big warning label on the functionality and leave it for now
("If this option is given, np.asarray returns a masked array. NOTE: IN
THE NEXT RELEASE

Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Mark Wiebe
Hi Pierre,

On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig wrote:

> Hi Mark,
>
> I went through the NA NEP a few days ago, but only too quickly so that
> my question is probably a rather dumb one. It's about the usability of
> bitpatter-based NAs, based on your recent post :
>
> Le 03/03/2012 22:46, Mark Wiebe a écrit :
> > Also, here's a thought for the usability of NA-float64. As much as
> > global state is a bad idea, something which determines whether
> > implicit float dtypes are NA-float64 or float64 could help. In
> > IPython, "pylab" mode would default to float64, and "statlab" or
> > "pystat" would default to NA-float64. One way to write this might be:
> >
> > >>> np.set_default_float(np.nafloat64)
> > >>> np.array([1.0, 2.0, 3.0])
> > array([ 1.,  2.,  3.], dtype=nafloat64)
> > >>> np.set_default_float(np.float64)
> > >>> np.array([1.0, 2.0, 3.0])
> > array([ 1.,  2.,  3.], dtype=float64)
>
> Q: Is it an *absolute* necessity to have two separate dtypes "nafloatNN"
> and "floatNN" to enable NA bitpattern storage ?
>
> From a potential user perspective, I feel it would be nice to have NA
> and non-NA cases look as similar as possible. Your code example is
> particularly striking : two different dtypes to store (from a user
> perspective) the exact same content ! If this *could* be avoided, it
> would be great...
>

The biggest reason to keep the two types separate is performance. The
straight float dtypes map directly to hardware floating-point operations,
which can be very fast. The NA-float dtypes have to use additional logic to
handle the NA values correctly. NA is treated as a particular NaN, and if
the hardware float operations were used directly, NA would turn into NaN.
This additional logic usually means more branches, so is slower.

One possibility we could consider is to automatically convert an array's
dtype from "float64" to "nafloat64" the first time an NA is assigned. This
would have good performance when there are no NAs, but would transparently
switch on NA support when it's needed.
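
As a point of comparison (pandas' existing behaviour, not the proposed
nafloat64 mechanism), pandas already does this kind of promotion as soon as
a missing value shows up:

>>> import pandas as pd
>>> pd.Series([1, 2, 3]).dtype
dtype('int64')
>>> pd.Series([1, 2, None]).dtype   # missing value forces a NaN-capable dtype
dtype('float64')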


> I don't know how the NA machinery is working in R. Does it work with a
> kind of "nafloat64" all the time or is there some type inference
> mechanics involved in choosing the appropriate type ?
>

My understanding of R is that it works with the "nafloat64" for all its
operations, yes.

Cheers,
Mark


> Best,
> Pierre
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Pierre Haessig
Hi Mark,

I went through the NA NEP a few days ago, but only too quickly so that
my question is probably a rather dumb one. It's about the usability of
bitpattern-based NAs, based on your recent post :

Le 03/03/2012 22:46, Mark Wiebe a écrit :
> Also, here's a thought for the usability of NA-float64. As much as
> global state is a bad idea, something which determines whether
> implicit float dtypes are NA-float64 or float64 could help. In
> IPython, "pylab" mode would default to float64, and "statlab" or
> "pystat" would default to NA-float64. One way to write this might be:
>
> >>> np.set_default_float(np.nafloat64)
> >>> np.array([1.0, 2.0, 3.0])
> array([ 1.,  2.,  3.], dtype=nafloat64)
> >>> np.set_default_float(np.float64)
> >>> np.array([1.0, 2.0, 3.0])
> array([ 1.,  2.,  3.], dtype=float64)

Q: Is it an *absolute* necessity to have two separate dtypes "nafloatNN"
and "floatNN" to enable NA bitpattern storage ?

From a potential user perspective, I feel it would be nice to have NA
and non-NA cases look as similar as possible. Your code example is
particularly striking : two different dtypes to store (from a user
perspective) the exact same content ! If this *could* be avoided, it
would be great...

I don't know how the NA machinery is working in R. Does it work with a
kind of "nafloat64" all the time or is there some type inference
mechanics involved in choosing the appropriate type ?

Best,
Pierre



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Skipper Seabold
On Sat, Mar 3, 2012 at 4:46 PM, Mark Wiebe  wrote:
> On Sat, Mar 3, 2012 at 12:30 PM, Travis Oliphant 

>>
>>        * the reduction operations need to default to "skipna" --- this is
>> the most common use case which has been re-inforced again to me today by a
>> new user to Python who is using masked arrays presently
>
>
> This is a completely trivial change. I went with the default as I did
> because it's what R, the primary inspiration for the NA design, does. We'll
> have to be sure this is well-marked in the documentation about "NumPy NA for
> R users".
>

It may be trivial to change the code, but this isn't a trivial change.
"Most common use case" is hard for me to swallow, since there are so
many. Of the different statistical software packages I've used, none that I
recall ignores missing data (silently) by default. This sounds
dangerous to me. It's one thing to make it convenient to work with missing
data, but it's another to try to sweep the problem under the rug. I
imagine the choice of the R developers was a thoughtful one.

Perhaps something like np.seterr should also be implemented for
missing data since there's probably no resolution to what's most
sensible here.
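
For comparison, np.seterr already provides this kind of switch for
floating-point errors; a missing-data analogue (the setna spelling below is
purely illustrative, not an existing or proposed API) might read:

>>> old = np.seterr(invalid='warn')   # existing control for FP error handling
>>> # np.setna(reduce='propagate')    # illustrative only: NAs poison reductions
>>> # np.setna(reduce='skip')         # illustrative only: reductions skip NAs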

Skipper
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Travis Oliphant
> 
> Mind, Mark only had a few weeks to write code. I think the unfinished state 
> is a direct function of that.
>  
> I have heard from several users that they will *not use the missing data* in 
> NumPy as currently implemented, and I can now see why.For better or for 
> worse, my approach to software is generally very user-driven and very 
> pragmatic.  On the other hand, I'm also a mathematician and appreciate the 
> cognitive compression that can come out of well-formed structure.
> None-the-less, I'm an *applied* mathematician and am ultimately motivated by 
> applications.
> 
> 
> I think that would be Wes. I thought the current state wasn't that far away 
> from what he wanted in the only post where he was somewhat explicit. I think 
> it would be useful for him to sit down with Mark at some time and thrash 
> things out since I think there is some misunderstanding involved.
>  

Actually it wasn't Wes.  It was 3 other people.   I'm already well aware of 
Wes's perspective and actually think his concerns have been handled already.
Also, the person who showed me their use-case was a new user.

But, your point about getting people together is well-taken.  I also recognize 
the fact that there have been (and likely continue to be) misunderstandings on 
multiple fronts.   Fortunately, many of us will be at PyCon later this week.   
We tried really hard to get Mark Wiebe here this weekend as well --- but he 
could only sacrifice a week away from his degree work to join us for PyCon. 

It would be great if you could come to PyCon as well.   Perhaps we can apply to 
NumFOCUS for a travel grant to bring NumPy developers together with other 
interested people to finish the masked array design and implementation.

-Travis


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Charles R Harris
On Sat, Mar 3, 2012 at 1:30 PM, Travis Oliphant  wrote:

> Hi all,
>
> I've been thinking a lot about the masked array implementation lately.
> I finally had the time to look hard at what has been done and now am of the
> opinion that I do not think that 1.7 can be released with the current state
> of the masked array implementation *unless* it is clearly marked as
> experimental and may be changed in 1.8
>
>
That was the intention.


> I wish I had been able to be a bigger part of this conversation last year.
>   But, that is why I took the steps I took to try and figure out another
> way to feed my family *and* stay involved in the NumPy community.   I would
> love to stay involved in what is happening in the SciPy community, but I am
> more satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles,
> Stefan, and others are doing there right now, and don't have time to keep
> up with everything.Even though SciPy was the heart and soul of why I
> even got involved with Python for open source in the first place and took
> many years of my volunteer labor, I won't be able to spend significant time
> on SciPy code over the coming months.   At some point, I really hope to be
> able to make contributions again to that code-base.   Time will tell
> whether or not my aspirations will be realized.  It depends quite a bit on
> whether or not my kids have what they need from me (which right now is
> money and time).
>
> NumPy, on the other hand, is not in a position where I can feel
> comfortable leaving my "baby" to others.  I recognize and value the
> contributions from many people to make NumPy what it is today (e.g. code
> contributions, code rearrangement and standardization, build and install
> improvement, and most recently some architectural changes).But, I feel
> a personal responsibility for the code base as I spent a great many months
> writing NumPy in the first place, and I've spent a great deal of time
> interacting with NumPy users and feel like I have at least some sense of
> their stories.Of course, I built on the shoulders of giants, and much
> of what is there is *because of* where the code was adapted from (it was
> not created de-novo).   Currently,  there remains much that needs to be
> communicated, improved, and worked on, and I have specific opinions about
> what some changes and improvements should be, how they should be written,
> and how the resulting users need to be benefited.
>  It will take time to discuss all of this, and that's where I will spend
> my open-source time in the coming months.
>
> In that vein:
>
> Because it is slated to go into release 1.7, we need to re-visit the
> masked array discussion again.The NEP process is the appropriate one
> and I'm glad we are taking that route for these discussions.   My goal is
> to get consensus in order for code to get into NumPy (regardless of who
> writes the code).It may be that we don't come to a consensus
> (reasonable and intelligent people can disagree on things --- look at the
> coming election...).   We can represent different parts of what is
> fortunately a very large user-base of NumPy users.
>
> First of all, I want to be clear that I think there is much great work
> that has been done in the current missing data code.  There are some nice
> features in the where clause of the ufunc and the machinery for the
> iterator that allows re-using ufunc loops that are not re-written to check
> for missing data.   I'm sure there are other things as well that I'm not
> quite aware of yet.However, I don't think the API presented to the
> numpy user presently is the correct one for NumPy 1.X.
>

> A few particulars:
>
>* the reduction operations need to default to "skipna" --- this is
> the most common use case which has been re-inforced again to me today by a
> new user to Python who is using masked arrays presently
>
>* the mask needs to be visible to the user if they use that
> approach to missing data (people should be able to get a hold of the mask
> and work with it in Python)
>
>* bit-pattern approaches to missing data (at least for float64 and
> int32) need to be implemented.
>
>* there should be some way when using "masks" (even if it's hidden
> from most users) for missing data to separate the low-level ufunc operation
> from the operation
>   on the masks...
>
>
Mind, Mark only had a few weeks to write code. I think the unfinished state
is a direct function of that.


> I have heard from several users that they will *not use the missing data*
> in NumPy as currently implemented, and I can now see why.For better or
> for worse, my approach to software is generally very user-driven and very
> pragmatic.  On the other hand, I'm also a mathematician and appreciate the
> cognitive compression that can come out of well-formed structure.
>  None-the-less, I'm an *applied* mathematician and am ultimately motivated
> by applications.
>
>
I think that would be Wes. I thought the current state wasn't that far away
from what he wanted in the only post where he was somewhat explicit. I think
it would be useful for him to sit down with Mark at some time and thrash
things out since I think there is some misunderstanding involved.

Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Ralf Gommers
On Sat, Mar 3, 2012 at 9:30 PM, Travis Oliphant  wrote:

> Hi all,
>
> I've been thinking a lot about the masked array implementation lately.
> I finally had the time to look hard at what has been done and now am of the
> opinion that I do not think that 1.7 can be released with the current state
> of the masked array implementation *unless* it is clearly marked as
> experimental and may be changed in 1.8
>

We had already decided to put an "experimental" label on the
implementation. Also on datetime. I will open a ticket for this now to make
sure we won't forget.

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Mark Wiebe
On Sat, Mar 3, 2012 at 12:30 PM, Travis Oliphant wrote:

> 
>


> First of all, I want to be clear that I think there is much great work
> that has been done in the current missing data code.  There are some nice
> features in the where clause of the ufunc and the machinery for the
> iterator that allows re-using ufunc loops that are not re-written to check
> for missing data.   I'm sure there are other things as well that I'm not
> quite aware of yet.However, I don't think the API presented to the
> numpy user presently is the correct one for NumPy 1.X.
>

I thought I might chime in with some implementation-detail notes: while
Travis has dug into the code, I'm still the person who knows it best.

> A few particulars:
>
>* the reduction operations need to default to "skipna" --- this is
> the most common use case which has been re-inforced again to me today by a
> new user to Python who is using masked arrays presently
>

This is a completely trivial change. I went with the default as I did
because it's what R, the primary inspiration for the NA design, does. We'll
have to be sure this is well-marked in the documentation about "NumPy NA
for R users".
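
For concreteness, the two defaults differ only in what a reduction returns
when an NA is present (maskna= and skipna= are the NEP's proposed spellings,
shown here as a sketch rather than working code):

>>> a = np.array([1.0, np.NA, 3.0], maskna=True)
>>> a.sum()                # R-style default: the NA propagates
NA
>>> a.sum(skipna=True)     # the alternative default: NA entries are ignored
4.0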


>* the mask needs to be visible to the user if they use that
> approach to missing data (people should be able to get a hold of the mask
> and work with it in Python)
>

This is relatively easy. Probably the way to do it is with an
ndarray.maskna property. It could be in 1.7 if we really push. For the
multi-NA future, I think the NPY_MASK dtype, currently an alias for
NPY_UBYTE, would need to become its own dtype with separate .exposed and
.payload attributes.
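
A sketch of what that property might look like (the maskna attribute and the
boolean convention shown are only the suggested spelling, not implemented
behavior):

>>> a = np.array([1.0, np.NA, 3.0], maskna=True)
>>> a.maskna               # True where the value is exposed, False where NA
array([ True, False,  True], dtype=bool)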


>* bit-pattern approaches to missing data (at least for float64 and
> int32) need to be implemented.
>

I strongly wanted to do masks first, because of the greater generality and
because the bit-patterns would best be implemented sharing mask
implementation details. I still believe this was the correct choice, and it
set the stage for bit-patterns. It will be possible to make inner loops
that specialize for the default hard-coded bit-pattern dtypes. I paid very
careful attention in the design making sure high performance is possible
without significant rework. The immense scale of the required code changes
meant I couldn't actually implement high performance in the time frame.

The place I think this affects 1.7 the most is in the default choice for
what np.array([1.0, np.NA, 3.0]) and np.array([1, np.NA, 3]) mean. In 1.7,
both mean an NA-masked array. In 1.8, I can see a strong case that the
first should mean an NA-dtype, and the second an NA-masked array.

Also, here's a thought for the usability of NA-float64. As much as global
state is a bad idea, something which determines whether implicit float
dtypes are NA-float64 or float64 could help. In IPython, "pylab" mode would
default to float64, and "statlab" or "pystat" would default to NA-float64.
One way to write this might be:

>>> np.set_default_float(np.nafloat64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1.,  2.,  3.], dtype=nafloat64)
>>> np.set_default_float(np.float64)
>>> np.array([1.0, 2.0, 3.0])
array([ 1.,  2.,  3.], dtype=float64)


>* there should be some way when using "masks" (even if it's hidden
> from most users) for missing data to separate the low-level ufunc operation
> from the operation
>   on the masks...
>

This is completely trivial to implement. Maybe
ndarray.view(maskna='ignore') is a reasonable way to spell direct access
without a mask.

Cheers,
Mark


> I have heard from several users that they will *not use the missing data*
> in NumPy as currently implemented, and I can now see why.For better or
> for worse, my approach to software is generally very user-driven and very
> pragmatic.  On the other hand, I'm also a mathematician and appreciate the
> cognitive compression that can come out of well-formed structure.
>  None-the-less, I'm an *applied* mathematician and am ultimately motivated
> by applications.
>
> I will get a hold of the NEP and spend some time with it to discuss some
> of this in that document.   This will take several weeks (as PyCon is next
> week and I have a tutorial I'm giving there).For now, I do not think
> 1.7 can be released unless the masked array is labeled *experimental*.
>
> Thanks,
>
> -Travis
>
>
>
>
>
>
>
>
>
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Missing data again

2012-03-03 Thread Travis Oliphant
Hi all, 

I've been thinking a lot about the masked array implementation lately. I 
finally had the time to look hard at what has been done and now am of the 
opinion that I do not think that 1.7 can be released with the current state of 
the masked array implementation *unless* it is clearly marked as experimental 
and may be changed in 1.8  

I wish I had been able to be a bigger part of this conversation last year.   
But, that is why I took the steps I took to try and figure out another way to 
feed my family *and* stay involved in the NumPy community.   I would love to 
stay involved in what is happening in the SciPy community, but I am more 
satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles, Stefan, and 
others are doing there right now, and don't have time to keep up with 
everything.Even though SciPy was the heart and soul of why I even got 
involved with Python for open source in the first place and took many years of 
my volunteer labor, I won't be able to spend significant time on SciPy code 
over the coming months.   At some point, I really hope to be able to make 
contributions again to that code-base.   Time will tell whether or not my 
aspirations will be realized.  It depends quite a bit on whether or not my kids 
have what they need from me (which right now is money and time). 
 
NumPy, on the other hand, is not in a position where I can feel comfortable 
leaving my "baby" to others.  I recognize and value the contributions from many 
people to make NumPy what it is today (e.g. code contributions, code 
rearrangement and standardization, build and install improvement, and most 
recently some architectural changes).But, I feel a personal responsibility 
for the code base as I spent a great many months writing NumPy in the first 
place, and I've spent a great deal of time interacting with NumPy users and 
feel like I have at least some sense of their stories.Of course, I built on 
the shoulders of giants, and much of what is there is *because of* where the 
code was adapted from (it was not created de-novo).   Currently,  there remains 
much that needs to be communicated, improved, and worked on, and I have 
specific opinions about what some changes and improvements should be, how they 
should be written, and how the resulting users need to be benefited.   
 It will take time to discuss all of this, and that's where I will spend my 
open-source time in the coming months. 

In that vein: 

Because it is slated to go into release 1.7, we need to re-visit the masked 
array discussion again.The NEP process is the appropriate one and I'm glad 
we are taking that route for these discussions.   My goal is to get consensus 
in order for code to get into NumPy (regardless of who writes the code).It 
may be that we don't come to a consensus (reasonable and intelligent people can 
disagree on things --- look at the coming election...).   We can represent 
different parts of what is fortunately a very large user-base of NumPy users.   
 

First of all, I want to be clear that I think there is much great work that has 
been done in the current missing data code.  There are some nice features in 
the where clause of the ufunc and the machinery for the iterator that allows 
re-using ufunc loops that are not re-written to check for missing data.   I'm 
sure there are other things as well that I'm not quite aware of yet.
However, I don't think the API presented to the numpy user presently is the 
correct one for NumPy 1.X.   

A few particulars: 

* the reduction operations need to default to "skipna" --- this is the 
most common use case which has been re-inforced again to me today by a new user 
to Python who is using masked arrays presently 

* the mask needs to be visible to the user if they use that approach to 
missing data (people should be able to get a hold of the mask and work with it 
in Python)

* bit-pattern approaches to missing data (at least for float64 and 
int32) need to be implemented. 

* there should be some way when using "masks" (even if it's hidden from 
most users) for missing data to separate the low-level ufunc operation from the 
operation
   on the masks...

I have heard from several users that they will *not use the missing data* in 
NumPy as currently implemented, and I can now see why.For better or for 
worse, my approach to software is generally very user-driven and very 
pragmatic.  On the other hand, I'm also a mathematician and appreciate the 
cognitive compression that can come out of well-formed structure.
None-the-less, I'm an *applied* mathematician and am ultimately motivated by 
applications.

I will get a hold of the NEP and spend some time with it to discuss some of 
this in that document.   This will take several weeks (as PyCon is next week 
and I have a tutorial I'm giving there).For now, I do not think 1.7 can be 
released unless the masked array is labeled *experimental*.

[Numpy-discussion] Missing Data development plan

2011-07-07 Thread Mark Wiebe
It's been a day less than two weeks since I posted my first feedback request
on a masked array implementation of missing data. I'd like to thank everyone
that contributed to the discussion, and that continues to contribute.

I believe my design is very solid thanks to all the feedback, and I
understand at the same time there are still concerns that people have about
the design. I sincerely hope that those concerns are further discussed and
made more clear just as I have spent a lot of effort making sure my ideas
are clear and understood by everyone in the discussion.

Travis has directed me to for the moment focus a majority of my attention on
the implementation. He will post further thoughts on the design issues in
the next few days when he has enough of a break in his schedule.

With the short time available for this implementation, my plan is as
follows:

1) Implement the masked implementation of NA nearly to completion. This is
the quickest way to get something that people can provide hands-on feedback
with, and the NA dtype in my design uses the machinery of the masked
implementation for all the computational kernels.

2) Assuming there is enough time left, implement the NA[] parameterized
dtype in concert with a derived[] dtype and cleanups of the datetime64[]
dtype, with the goal of creating some good structure for the possibility of
creating more parameterized dtypes in the future. The derived[] dtype idea
is based on an idea Travis had which he called computed columns, but
generalized to apply in more contexts. When the time comes, I will post a
proposal for feedback on this idea as well.

Thanks once again for all the great feedback, and I look forward to getting
a prototype into your hands to test as quickly as possible!

-Mark
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Nathaniel Smith
On Thu, Jun 30, 2011 at 12:27 PM, Eric Firing  wrote:
> On 06/30/2011 08:53 AM, Nathaniel Smith wrote:
>> On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing  wrote:
>>> In addition, for new code, the full-blown masked array module may not be
>>> needed.  A convenience it adds, however, is the automatic masking of
>>> invalid values:
>>>
>>> In [1]: np.ma.log(-1)
>>> Out[1]: masked
>>>
>>> I'm sure this horrifies some, but there are times and places where it is
>>> a genuine convenience, and preferable to having to use a separate
>>> operation to replace nan or inf with NA or whatever it ends up being.
>>
>> Err, but what would this even get you? NA, NaN, and Inf basically all
>> behave the same WRT floating point operations anyway, i.e., they all
>> propagate?
>
> Not exactly. First, it depends on np.seterr;

IIUC, you're proposing to make this conversion depend on np.seterr
too, though, right?

> second, calculations on NaN
> can be very slow, so are better avoided entirely

They're slow because inside the processor they require a branch and a
separate code path (which doesn't get a lot of transistors allocated
to it). In any of the NA proposals we're talking about, handling an NA
would require a software branch and a separate code path (which is in
ordinary software, now, so it doesn't get any special transistors
allocated to it...). I don't think masking support is likely to give
you a speedup over the processor's NaN handling.

And if it did, that would mean that we speed up FP operations in
general by checking for NaN in software, so then we should do that
everywhere anyway instead of making it an NA-specific feature...

> third, if an array is
> passed to extension code, it is much nicer if that code only has one NA
> value to handle, instead of having to check for all possible "bad" values.

I'm pretty sure that Mark's proposal does not work this way -- he's
saying that the NA-checking code in numpy could optionally check for
all these different "bad" values and handle them the same in ufuncs,
not that we would check the outputs of all FP operations for "bad"
values and then replace them by NA. So your extension code would still
have the same problem. Sorry :-(

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Eric Firing
On 06/30/2011 08:53 AM, Nathaniel Smith wrote:
> On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing  wrote:
>> In addition, for new code, the full-blown masked array module may not be
>> needed.  A convenience it adds, however, is the automatic masking of
>> invalid values:
>>
>> In [1]: np.ma.log(-1)
>> Out[1]: masked
>>
>> I'm sure this horrifies some, but there are times and places where it is
>> a genuine convenience, and preferable to having to use a separate
>> operation to replace nan or inf with NA or whatever it ends up being.
>
> Err, but what would this even get you? NA, NaN, and Inf basically all
> behave the same WRT floating point operations anyway, i.e., they all
> propagate?

Not exactly. First, it depends on np.seterr; second, calculations on NaN 
can be very slow, so are better avoided entirely; third, if an array is 
passed to extension code, it is much nicer if that code only has one NA 
value to handle, instead of having to check for all possible "bad" values.

>
> Is the idea that if ufunc's gain a skipna=True flag, you'd also like
> to be able to turn it into a skipna_and_nan_and_inf=True flag?

No, it is to have a situation where skipna_and_nan_and_inf would not be 
needed, because an operation generating a nan or inf would turn those 
values into NA or IGNORE or whatever right away.

Eric

>
> -- Nathaniel
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Nathaniel Smith
On Wed, Jun 29, 2011 at 2:21 PM, Eric Firing  wrote:
> In addition, for new code, the full-blown masked array module may not be
> needed.  A convenience it adds, however, is the automatic masking of
> invalid values:
>
> In [1]: np.ma.log(-1)
> Out[1]: masked
>
> I'm sure this horrifies some, but there are times and places where it is
> a genuine convenience, and preferable to having to use a separate
> operation to replace nan or inf with NA or whatever it ends up being.

Err, but what would this even get you? NA, NaN, and Inf basically all
behave the same WRT floating point operations anyway, i.e., they all
propagate?

Is the idea that if ufunc's gain a skipna=True flag, you'd also like
to be able to turn it into a skipna_and_nan_and_inf=True flag?

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Thu, Jun 30, 2011 at 11:54 AM, Lluís  wrote:

> Mark Wiebe writes:
> > Why is one "magic" and the other "real"? All of this is already
> > sitting on 100 layers of abstraction above electrons and atoms. If
> > we're talking about "real," maybe we should be programming in machine
> > code or using breadboards with individual transistors.
>
> M-x butterfly RET
>
> http://xkcd.com/378/


Ok, I've run this, how long does it take to execute?

-Mark


>
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Thu, Jun 30, 2011 at 11:42 AM, Matthew Brett wrote:

> Hi,
>
> On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe  wrote:
> > On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman
> >  wrote:
> >>
> >>>  Clearly there are some overlaps between what masked arrays are
> >>>  trying to achieve and what R's NA mechanisms are trying to achieve.
> >>>   Are they really similar enough that they should function using
> >>>  the same API?
> >>>
> >>> Yes.
> >>>
> >>>  And if so, won't that be confusing?
> >>>
> >>> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
> >>> already
> >>> confusing.
> >>
> >> As one who's been silently following (most of) this thread, and a heavy R
> >> and numpy user, perhaps I should chime in briefly here with a use case. I
> >> more-or-less always work with partially masked data, like Matthew, but not
> >> numpy masked arrays because the memory overhead is prohibitive. And, sad to
> >> say, my experiments don't always go perfectly. I therefore have arrays in
> >> which there is /both/ (1) data that is simply missing (np.NA?)--it never had
> >> a value and never will--as well as simultaneously (2) data that is
> >> temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask
> >> different portions for different purposes/analyses. I consider these two
> >> separate, completely independent issues and I unfortunately currently have
> >> to kluge a lot to handle this.
> >>
> >> Concretely, consider a list of 100,000 observations (rows), with 12
> >> measures per observation-row (a 100,000 x 12 array). Every now and then,
> >> sprinkled throughout this array, I have missing values (someone didn't
> >> answer a question, or a computer failed to record a response, or whatever).
> >> For some analyses I want to mask the whole row (e.g., complete-case
> >> analysis), leaving me with array entries that should be tagged with all 4
> >> possible labels:
> >>
> >> 1) not masked, not missing
> >> 2) masked, not missing
> >> 3) not masked, missing
> >> 4) masked, missing
> >>
> >> Obviously #4 is "overkill" ... but only until I want to unmask that row.
> >> At that point, I need to be sure that missing values remain missing when
> >> unmasked. Can a single API really handle this?
> >
> > The single API does support a masked array with an NA dtype, and the
> > behavior in this case will be that the value is considered NA if either it
> > is masked or the value is the NA bit pattern. So you could add a mask to an
> > array with an NA dtype to temporarily treat the data as if more values were
> > missing.
>
> Right - but I think the separated API is cleaner and easier to
> explain.  Do you disagree?
>

Kind of, yeah. I think the important things to understand from the Python
perspective are that there are two ways of doing missing values with NA that
look exactly the same except for how you create the arrays. Since you know
that the mask way takes more memory, and that's important for your
application, you can decide to use the NA dtype without any additional
depth.

Understanding that one of them has a special signal for NA while the other
uses masks in the background probably isn't even that important to
understand to be able to use it. I bet lots of people who use R regularly
couldn't come up with a correct explanation of how it works there.

If someone doesn't understand masks, they can use their intuition based on
the special signal idea without any difficulty. The idea that you can
temporarily make some values NA without overwriting your data may not be
intuitive at first glance, but I expect people will find it useful even if
they don't fully understand the subtle details of the masking mechanism.

> > One important reason I'm doing it this way is so that each NumPy algorithm
> > and any 3rd party code only needs to be updated once to support both forms
> > of missing data.
>
> Could you explain what you mean?  Maybe a couple of examples?
>

Yeah, I've started adding some implementation notes to the NEP. First I need
volunteers to review my current pull requests though. ;)

-Mark


>
> Whatever API results, it will surely be with us for a long time, and
> so it would be good to make sure we have the right one even if it
> costs a bit more to update current code.
>
> Cheers,
>
> Matthew
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data: semantics

2011-06-30 Thread Charles R Harris
On Thu, Jun 30, 2011 at 11:51 AM, Matthew Brett wrote:

> Hi,
>
> On Thu, Jun 30, 2011 at 6:46 PM, Lluís  wrote:
> > Ok, I think it's time to step back and reformulate the problem by
> > completely ignoring the implementation.
> >
> > Here we have 2 "generic" concepts (i.e., applicable to R), plus another
> > extra concept that is exclusive to numpy:
> >
> > * Assigning np.NA to an array, cannot be undone unless through explicit
> >  assignment (i.e., assigning a new arbitrary value, or saving a copy of
> >  the original array before assigning np.NA).
> >
> > * np.NA values propagate by default, unless ufuncs have the "skipna =
> >  True" argument (or the other way around, it doesn't really matter to
> >  this discussion). In order to avoid passing the argument on each
> >  ufunc, we either have some per-array variable for the default "skipna"
> >  value (undesirable) or we can make a trivial ndarray subclass that
> >  will set the "skipna" argument on all ufuncs through the
> >  "_ufunc_wrapper_" mechanism.
> >
> >
> >
> > Now, numpy has the concept of views, which adds some more goodies to the
> > list of concepts:
> >
> > * With views, two arrays can share the same physical data, so that
> >  assignments to any of them will be seen by others (including NA
> >  values).
> >
> > The creation of a view is explicitly stated by the user, so its
> > behaviour should not be perceived as odd (after all, you asked for a
> > view).
> >
> > The good thing is that with views you can avoid costly array copies if
> > you're careful when writing into these views.
> >
> >
> >
> > Now, you can add a new concept: local/temporal/transient missing data.
> >
> > We can take an existing array and create a view with the new argument
> > "transientna = True".
> >
> > Here, both the view and the "transientna = True" are explicitly stated
> > by the user, so it is assumed that she already knows what this is all
> > about.
> >
> > The difference with a regular view is that you also explicitly asked for
> > local/temporal/transient NA values.
> >
> > * Assigning np.NA to an array view with "transientna = True" will
> >  *not* be seen by any of the other views (nor the "original" array),
> >  but anything else will still work "as usual".
> >
> > After all, this is what *you* asked for when using the "transientna =
> > True" argument.
> >
> >
> >
> > To conclude, say that others *must not* care about whether the arrays
> > they're working with have transient NA values. This way, I can create a
> > view with transient NAs, set to NA some uninteresting data, and pass it
> > to a routine written by someone else that sets to NA elements that, for
> > example, are beyond certain threshold from the mean of the elements.
> >
> > This would be equivalent to storing a copy of the original array before
> > passing it to this 3rd party function, only that "transientna", just as
> > views, provide some handy shortcuts to avoid copies.
> >
> >
> > My main point here is that views and local/temporal/transient NAs are
> > all *explicitly* requested, so that its behaviour should not appear as
> > something unexpected.
> >
> > Is there an agreement on this?
>
> Absolutely, if by 'transientna' you mean 'masked'.  The discussion is
> whether the NA API should be the same as the masking API.   The thing
> you are describing is what masking is for, and what it's always been
> for, as far as I can see.   We're arguing that to call this
> 'transientna' instead of 'masked' confuses two concepts that are
> different, to no good purpose.
>
>
It's a hammer. If you want to hammer nails, fine; if you want to hammer a bit
of tubing flat, fine. It's a tool, the hammer concept if you will.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data: semantics

2011-06-30 Thread Charles R Harris
On Thu, Jun 30, 2011 at 11:46 AM, Lluís  wrote:

> Ok, I think it's time to step back and reformulate the problem by
> completely ignoring the implementation.
>
> Here we have 2 "generic" concepts (i.e., applicable to R), plus another
> extra concept that is exclusive to numpy:
>
> * Assigning np.NA to an array, cannot be undone unless through explicit
>  assignment (i.e., assigning a new arbitrary value, or saving a copy of
>  the original array before assigning np.NA).
>
> * np.NA values propagate by default, unless ufuncs have the "skipna =
>  True" argument (or the other way around, it doesn't really matter to
>  this discussion). In order to avoid passing the argument on each
>  ufunc, we either have some per-array variable for the default "skipna"
>  value (undesirable) or we can make a trivial ndarray subclass that
>  will set the "skipna" argument on all ufuncs through the
>  "_ufunc_wrapper_" mechanism.
>
>
>
> Now, numpy has the concept of views, which adds some more goodies to the
> list of concepts:
>
> * With views, two arrays can share the same physical data, so that
>  assignments to any of them will be seen by others (including NA
>  values).
>
> The creation of a view is explicitly stated by the user, so its
> behaviour should not be perceived as odd (after all, you asked for a
> view).
>
> The good thing is that with views you can avoid costly array copies if
> you're careful when writing into these views.
>
>
>
> Now, you can add a new concept: local/temporal/transient missing data.
>
> We can take an existing array and create a view with the new argument
> "transientna = True".
>
>
This is already there: x.view(masked=1), although the keyword transientna
has appeal, not least because it avoids the word 'mask', which seems a
source of endless confusion. Note that currently this is only supposed to
work if the original array is unmasked.

> Here, both the view and the "transientna = True" are explicitly stated
> by the user, so it is assumed that she already knows what this is all
> about.
>
> The difference with a regular view is that you also explicitly asked for
> local/temporal/transient NA values.
>
> * Assigning np.NA to an array view with "transientna = True" will
>  *not* be seen by any of the other views (nor the "original" array),
>  but anything else will still work "as usual".
>
> After all, this is what *you* asked for when using the "transientna =
> True" argument.
>
>
>
> To conclude, say that others *must not* care about whether the arrays
> they're working with have transient NA values. This way, I can create a
> view with transient NAs, set to NA some uninteresting data, and pass it
> to a routine written by someone else that sets to NA elements that, for
> example, are beyond certain threshold from the mean of the elements.
>
> This would be equivalent to storing a copy of the original array before
> passing it to this 3rd party function, only that "transientna", just as
> views, provide some handy shortcuts to avoid copies.
>
>
> My main point here is that views and local/temporal/transient NAs are
> all *explicitly* requested, so that its behaviour should not appear as
> something unexpected.
>
> Is there an agreement on this?
>
>
Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data: semantics

2011-06-30 Thread Matthew Brett
Hi,

On Thu, Jun 30, 2011 at 6:46 PM, Lluís  wrote:
> Ok, I think it's time to step back and reformulate the problem by
> completely ignoring the implementation.
>
> Here we have 2 "generic" concepts (i.e., applicable to R), plus another
> extra concept that is exclusive to numpy:
>
> * Assigning np.NA to an array, cannot be undone unless through explicit
>  assignment (i.e., assigning a new arbitrary value, or saving a copy of
>  the original array before assigning np.NA).
>
> * np.NA values propagate by default, unless ufuncs have the "skipna =
>  True" argument (or the other way around, it doesn't really matter to
>  this discussion). In order to avoid passing the argument on each
>  ufunc, we either have some per-array variable for the default "skipna"
>  value (undesirable) or we can make a trivial ndarray subclass that
>  will set the "skipna" argument on all ufuncs through the
>  "_ufunc_wrapper_" mechanism.
>
>
>
> Now, numpy has the concept of views, which adds some more goodies to the
> list of concepts:
>
> * With views, two arrays can share the same physical data, so that
>  assignments to any of them will be seen by others (including NA
>  values).
>
> The creation of a view is explicitly stated by the user, so its
> behaviour should not be perceived as odd (after all, you asked for a
> view).
>
> The good thing is that with views you can avoid costly array copies if
> you're careful when writing into these views.
>
>
>
> Now, you can add a new concept: local/temporal/transient missing data.
>
> We can take an existing array and create a view with the new argument
> "transientna = True".
>
> Here, both the view and the "transientna = True" are explicitly stated
> by the user, so it is assumed that she already knows what this is all
> about.
>
> The difference with a regular view is that you also explicitly asked for
> local/temporal/transient NA values.
>
> * Assigning np.NA to an array view with "transientna = True" will
>  *not* be seen by any of the other views (nor the "original" array),
>  but anything else will still work "as usual".
>
> After all, this is what *you* asked for when using the "transientna =
> True" argument.
>
>
>
> To conclude, say that others *must not* care about whether the arrays
> they're working with have transient NA values. This way, I can create a
> view with transient NAs, set to NA some uninteresting data, and pass it
> to a routine written by someone else that sets to NA elements that, for
> example, are beyond certain threshold from the mean of the elements.
>
> This would be equivalent to storing a copy of the original array before
> passing it to this 3rd party function, only that "transientna", just as
> views, provide some handy shortcuts to avoid copies.
>
>
> My main point here is that views and local/temporal/transient NAs are
> all *explicitly* requested, so that its behaviour should not appear as
> something unexpected.
>
> Is there an agreement on this?

Absolutely, if by 'transientna' you mean 'masked'.  The discussion is
whether the NA API should be the same as the masking API.   The thing
you are describing is what masking is for, and what it's always been
for, as far as I can see.   We're arguing that to call this
'transientna' instead of 'masked' confuses two concepts that are
different, to no good purpose.

Best,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] missing data: semantics

2011-06-30 Thread Lluís
Ok, I think it's time to step back and reformulate the problem by
completely ignoring the implementation.

Here we have 2 "generic" concepts (i.e., applicable to R), plus another
extra concept that is exclusive to numpy:

* Assigning np.NA to an array, cannot be undone unless through explicit
  assignment (i.e., assigning a new arbitrary value, or saving a copy of
  the original array before assigning np.NA).

* np.NA values propagate by default, unless ufuncs have the "skipna =
  True" argument (or the other way around, it doesn't really matter to
  this discussion). In order to avoid passing the argument on each
  ufunc, we either have some per-array variable for the default "skipna"
  value (undesirable) or we can make a trivial ndarray subclass that
  will set the "skipna" argument on all ufuncs through the
  "_ufunc_wrapper_" mechanism.



Now, numpy has the concept of views, which adds some more goodies to the
list of concepts:

* With views, two arrays can share the same physical data, so that
  assignments to any of them will be seen by others (including NA
  values).

The creation of a view is explicitly stated by the user, so its
behaviour should not be perceived as odd (after all, you asked for a
view).

The good thing is that with views you can avoid costly array copies if
you're careful when writing into these views.



Now, you can add a new concept: local/temporal/transient missing data.

We can take an existing array and create a view with the new argument
"transientna = True".

Here, both the view and the "transientna = True" are explicitly stated
by the user, so it is assumed that she already knows what this is all
about.

The difference with a regular view is that you also explicitly asked for
local/temporal/transient NA values.

* Assigning np.NA to an array view with "transientna = True" will
  *not* be seen by any of the other views (nor the "original" array),
  but anything else will still work "as usual".

After all, this is what *you* asked for when using the "transientna =
True" argument.



To conclude, say that others *must not* care about whether the arrays
they're working with have transient NA values. This way, I can create a
view with transient NAs, set to NA some uninteresting data, and pass it
to a routine written by someone else that sets to NA elements that, for
example, are beyond certain threshold from the mean of the elements.

This would be equivalent to storing a copy of the original array before
passing it to this 3rd party function, only that "transientna", just as
views, provide some handy shortcuts to avoid copies.
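
A sketch of the intended behaviour (the transientna keyword is the proposal
itself, and the reprs shown are only illustrative):

>>> base = np.array([1.0, 2.0, 3.0])
>>> v = base.view(transientna=True)   # the view carries its own NA layer
>>> v[1] = np.NA
>>> v
array([ 1., NA,  3.])
>>> base                              # the original data is untouched
array([ 1.,  2.,  3.])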


My main point here is that views and local/temporal/transient NAs are
all *explicitly* requested, so that its behaviour should not appear as
something unexpected.

Is there an agreement on this?


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Lluís
Mark Wiebe writes:
> Why is one "magic" and the other "real"? All of this is already
> sitting on 100 layers of abstraction above electrons and atoms. If
> we're talking about "real," maybe we should be programming in machine
> code or using breadboards with individual transistors.

M-x butterfly RET

http://xkcd.com/378/

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Matthew Brett
Hi,

On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe  wrote:
> On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman
>  wrote:
>>
>>>      Clearly there are some overlaps between what masked arrays are
>>>      trying to achieve and what R's NA mechanisms are trying to achieve.
>>>       Are they really similar enough that they should function using
>>>      the same API?
>>>
>>> Yes.
>>>
>>>      And if so, won't that be confusing?
>>>
>>> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
>>> already
>>> confusing.
>>
>> As one who's been silently following (most of) this thread, and a heavy R
>> and numpy user, perhaps I should chime in briefly here with a use case. I
>> more-or-less always work with partially masked data, like Matthew, but not
>> numpy masked arrays because the memory overhead is prohibitive. And, sad to
>> say, my experiments don't always go perfectly. I therefore have arrays in
>> which there is /both/ (1) data that is simply missing (np.NA?)--it never had
>> a value and never will--as well as simultaneously (2) data that is
>> temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask
>> different portions for different purposes/analyses. I consider these two
>> separate, completely independent issues and I unfortunately currently have
>> to kluge a lot to handle this.
>>
>> Concretely, consider a list of 100,000 observations (rows), with 12
>> measures per observation-row (a 100,000 x 12 array). Every now and then,
>> sprinkled throughout this array, I have missing values (someone didn't
>> answer a question, or a computer failed to record a response, or whatever).
>> For some analyses I want to mask the whole row (e.g., complete-case
>> analysis), leaving me with array entries that should be tagged with all 4
>> possible labels:
>>
>> 1) not masked, not missing
>> 2) masked, not missing
>> 3) not masked, missing
>> 4) masked, missing
>>
>> Obviously #4 is "overkill" ... but only until I want to unmask that row.
>> At that point, I need to be sure that missing values remain missing when
>> unmasked. Can a single API really handle this?
>
> The single API does support a masked array with an NA dtype, and the
> behavior in this case will be that the value is considered NA if either it
> is masked or the value is the NA bit pattern. So you could add a mask to an
> array with an NA dtype to temporarily treat the data as if more values were
> missing.

Right - but I think the separated API is cleaner and easier to
explain.  Do you disagree?

> One important reason I'm doing it this way is so that each NumPy algorithm
> and any 3rd party code only needs to be updated once to support both forms
> of missing data.

Could you explain what you mean?  Maybe a couple of examples?

Whatever API results, it will surely be with us for a long time, and
so it would be good to make sure we have the right one even if it
costs a bit more to update current code.

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman  wrote:

>
>   Clearly there are some overlaps between what masked arrays are
>>  trying to achieve and what R's NA mechanisms are trying to achieve.
>>   Are they really similar enough that they should function using
>>  the same API?
>>
>> Yes.
>>
>>  And if so, won't that be confusing?
>>
>> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
>> already
>> confusing.
>>
>
> As one who's been silently following (most of) this thread, and a heavy R
> and numpy user, perhaps I should chime in briefly here with a use case. I
> more-or-less always work with partially masked data, like Matthew, but not
> numpy masked arrays because the memory overhead is prohibitive. And, sad to
> say, my experiments don't always go perfectly. I therefore have arrays in
> which there is /both/ (1) data that is simply missing (np.NA?)--it never had
> a value and never will--as well as simultaneously (2) data that that is
> temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask
> different portions for different purposes/analyses. I consider these two
> separate, completely independent issues and I unfortunately currently have
> to kluge a lot to handle this.
>
> Concretely, consider a list of 100,000 observations (rows), with 12
> measures per observation-row (a 100,000 x 12 array). Every now and then,
> sprinkled throughout this array, I have missing values (someone didn't
> answer a question, or a computer failed to record a response, or whatever).
> For some analyses I want to mask the whole row (e.g., complete-case
> analysis), leaving me with array entries that should be tagged with all 4
> possible labels:
>
> 1) not masked, not missing
> 2) masked, not missing
> 3) not masked, missing
> 4) masked, missing
>
> Obviously #4 is "overkill" ... but only until I want to unmask that row. At
> that point, I need to be sure that missing values remain missing when
> unmasked. Can a single API really handle this?
>

The single API does support a masked array with an NA dtype, and the
behavior in this case will be that the value is considered NA if either it
is masked or the value is the NA bit pattern. So you could add a mask to an
array with an NA dtype to temporarily treat the data as if more values were
missing.
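
Today's numpy.ma can already express the combined rule as a rough sketch,
using NaN purely as a stand-in for a bit-pattern NA:

>>> data = np.array([1.0, np.nan, 3.0])      # nan plays the bit-pattern role
>>> extra = np.array([False, False, True])   # temporary mask layered on top
>>> m = np.ma.array(data, mask=np.isnan(data) | extra)
>>> m.sum()                                  # an element is NA if either applies
1.0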

One important reason I'm doing it this way is so that each NumPy algorithm
and any 3rd party code only needs to be updated once to support both forms
of missing data. The C API with masks is also a lot cleaner to work with
than one for NA dtypes with the ability to have different NA bit patterns.

-Mark


>
> -best
> Gary
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Lluís
Mark Wiebe writes:

> On Wed, Jun 29, 2011 at 1:20 PM, Lluís  wrote:
> [...]
>> As far as I can tell, the only required difference between them is
>> that NA bit patterns must destroy the data. Nothing else. Everything
>> on top of that is a choice of API and interface mechanisms. I want
>> them to behave exactly the same except for that necessary difference,
>> so that it will be possible to use the *exact same Python code* with
>> either approach.
   
> I completely agree. What I'd suggest is a global and/or per-object
> "ndarray.flags.skipna" for people like me that just want to ignore these
> entries without caring about setting it on each operation (or the other
> way around, depends on the default behaviour).
   
> The downside is that it adds yet another tweaking knob, which is not
> desirable...

> One way around this would be to create an ndarray subclass which
> changes that default. Currently this would not be possible to do
> nicely, but with the _numpy_ufunc_ idea I proposed in a separate
> thread a while back, this could work.

That does indeed sound good :)


Lluis


-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Gary Strangman



  Clearly there are some overlaps between what masked arrays are
  trying to achieve and what R's NA mechanisms are trying to achieve.
   Are they really similar enough that they should function using
  the same API?

Yes.

  And if so, won't that be confusing?

No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already
confusing.


As one who's been silently following (most of) this thread, and a heavy R 
and numpy user, perhaps I should chime in briefly here with a use case. I 
more-or-less always work with partially masked data, like Matthew, but not 
numpy masked arrays because the memory overhead is prohibitive. And, sad 
to say, my experiments don't always go perfectly. I therefore have arrays 
in which there is /both/ (1) data that is simply missing (np.NA?)--it 
never had a value and never will--as well as simultaneously (2) data that
is temporarily masked (np.IGNORE? np.MASKED?) where I want to
mask/unmask different portions for different purposes/analyses. I consider 
these two separate, completely independent issues and I unfortunately 
currently have to kluge a lot to handle this.


Concretely, consider a list of 100,000 observations (rows), with 12 
measures per observation-row (a 100,000 x 12 array). Every now and then, 
sprinkled throughout this array, I have missing values (someone didn't 
answer a question, or a computer failed to record a response, or 
whatever). For some analyses I want to mask the whole row (e.g., 
complete-case analysis), leaving me with array entries that should be 
tagged with all 4 possible labels:


1) not masked, not missing
2) masked, not missing
3) not masked, missing
4) masked, missing

Obviously #4 is "overkill" ... but only until I want to unmask that row. 
At that point, I need to be sure that missing values remain missing when 
unmasked. Can a single API really handle this?
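
One way to picture the four states with plain NumPy today is two independent
boolean layers over the data (the names here are just for illustration):

>>> data    = np.zeros((100000, 12))
>>> missing = np.zeros(data.shape, dtype=bool)   # states (3)/(4): never had a value
>>> hidden  = np.zeros(data.shape, dtype=bool)   # states (2)/(4): temporarily masked
>>> missing[5, 3] = True                         # an unanswered question
>>> hidden[5, :]  = True                         # complete-case analysis hides the row
>>> hidden[5, :]  = False                        # unmasking the row...
>>> (missing | hidden)[5, 3]                     # ...still leaves the missing cell missing
True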


-best
Gary


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Thu, Jun 30, 2011 at 1:49 AM, Chris Barker  wrote:

> On 6/27/11 9:53 AM, Charles R Harris wrote:
> > Some discussion of disk storage might also help. I don't see how the
> > rules can be enforced if two files are used, one for the mask and
> > another for the data, but that may just be something we need to live
> with.
>
> It seems it wouldn't be too big  deal to extend the *.npy format to
> include the mask.
>
> Could one memmap both the data array and the mask?
>

This I haven't thought about too much yet, but I don't see why not. This
does provide a back door into the mask which violates the abstractions, so I
would want it to be an extremely narrow special case.
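
As a very rough sketch of what that narrow case might look like with today's
tools -- two side-by-side memmapped buffers wrapped by numpy.ma, with made-up
file names:

import numpy as np

data = np.memmap('data.bin', dtype='f8', mode='w+', shape=(1000,))
mask = np.memmap('mask.bin', dtype=bool, mode='w+', shape=(1000,))

data[:5] = [1.0, 2.0, 3.0, 4.0, 5.0]
mask[2] = True                         # hide the third element

# Both buffers live on disk; the masked array just wraps them.
a = np.ma.MaskedArray(data, mask=mask, copy=False)
print(a[:5])                           # [1.0 2.0 -- 4.0 5.0]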

-Mark


>
> Netcdf (and assume hdf) have ways to support masks as well.
>
> -Chris
>
>
>
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R(206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 5:42 PM, Nathaniel Smith  wrote:

> On Wed, Jun 29, 2011 at 2:40 PM, Lluís  wrote:
> > I'm for the option of having a single API when you want to have NA
> > elements, regardless of whether it's using masks or bit patterns.
>
> I understand the desire to avoid having two different APIs...
>
> [snip]
> > My concern is now about how to set the "skipna" in a "comfortable" way,
> > so that I don't have to set it again and again as ufunc arguments:
> >
> > >>> a
> > array([NA, 2, 3])
> > >>> b
> > array([1, 2, NA])
> > >>> a + b
> > array([NA, 2, NA])
> > >>> a.flags.skipna=True
> > >>> b.flags.skipna=True
> > >>> a + b
> > array([1, 4, 3])
>
> ...But... now you're introducing two different kinds of arrays with
> different APIs again? Ones where .skipna==True, and ones where
> .skipna==False?
>
> I know that this way it's not keyed on the underlying storage format,
> but if we support both bit patterns and mask arrays at the
> implementation level, then the only way to make them have identical
> APIs is if we completely disallow unmasking, and shared masks, and so
> forth.


The right set of these conditions has been in the NEP from the beginning.
Unmasking without value assignment is disallowed - the only way to "see
behind the mask" or to share masks is with views. My impression is than more
people are concerned with sharing the same data between different masks,
something also supported through views.
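
For example, a minimal sketch of sharing one buffer between two different
masks, written here with plain numpy.ma rather than the NEP machinery:

import numpy as np

data = np.arange(6.0)
a = np.ma.MaskedArray(data, mask=[0, 1, 0, 0, 0, 0], copy=False)
b = np.ma.MaskedArray(data, mask=[0, 0, 0, 1, 1, 0], copy=False)

data[0] = 99.0               # the change is visible through both a and b
print(a.sum(), b.sum())      # 113.0 107.0 -- same data, different masks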

-Mark


> Which doesn't seem like it'd be very popular (and would make
> including the mask-based implementation pretty pointless). So I think
> we have to assume that they will have APIs that are at least somewhat
> different. And then it seems like with this proposal then we'd
> actually end up with *4* different APIs that any particular array
> might follow... (or maybe more, depending on how arrays that had both
> a bit-pattern and mask ended up working).
>
> That's why I was thinking the best solution might be to just bite the
> bullet and make the APIs *totally* different and non-overlapping, so
> it was always obvious which you were using and how they'd interact.
> But I don't know -- for my work I'd be happy to just pass skipna
> everywhere I needed it, and never unmask anything, and so forth, so
> maybe there's some reason why it's really important for the
> bit-pattern NA API to overlap more with the masked array API?
>
> -- Nathaniel
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 4:21 PM, Eric Firing  wrote:

> On 06/29/2011 09:32 AM, Matthew Brett wrote:
> > Hi,
> >
> [...]
> >
> > Clearly there are some overlaps between what masked arrays are trying
> > to achieve and what R's NA mechanisms are trying to achieve.  Are they
> > really similar enough that they should function using the same API?
> > And if so, won't that be confusing?  I think that's the question
> > that's being asked.
>
> And I think the answer is "no".  No more confusing to people coming from
> R to numpy than views already are--with or without the NEP--and not
> *requiring* people to use any NA-related functionality beyond what they
> are used to from R.
>
> My understanding of the NEP is that it directly yields an API closely
> matching that of R, but with the opportunity, via views, to do more with
> less work, if one so desires.  The present masked array module could be
> made more efficient if the NEP is implemented; regardless of whether
> this is done, the masked array module is not about to vanish, so anyone
> wanting precisely the masked array API will have it; and others remain
> free to ignore it (except for those of us involved in developing
> libraries such as matplotlib, which will have to support all variations
> of the new API along with the already-supported masked arrays).
>
> In addition, for new code, the full-blown masked array module may not be
> needed.  A convenience it adds, however, is the automatic masking of
> invalid values:
>
> In [1]: np.ma.log(-1)
> Out[1]: masked
>
> I'm sure this horrifies some, but there are times and places where it is
> a genuine convenience, and preferable to having to use a separate
> operation to replace nan or inf with NA or whatever it ends up being.
>

I added a mechanism to support this idea with the NA dtypes approach,
spelled 'NA[f8,InfNan]'. Here, all Infs and NaNs are treated as NA by the
system.
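
The nearest thing available today (not the parameterized dtype itself, just
the numpy.ma analogue of the same idea) looks like:

import numpy as np

x = np.array([1.0, np.nan, np.inf, 4.0])
m = np.ma.masked_invalid(x)    # masks every non-finite entry
print(m)                       # [1.0 -- -- 4.0]
print(m.sum())                 # 5.0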

-Mark

If np.seterr were extended to allow such automatic masking as an option,
> then the need for a separate masked array module would shrink further.
> I wouldn't mind having to use an explicit kwarg for ignoring NA in
> reduction methods.
>
> Eric
>
>
> >
> > See you,
> >
> > Matthew
> > ___
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 1:20 PM, Lluís  wrote:

> Mark Wiebe writes:
>
> > There seems to be a general idea that masks and NA bit patterns imply
> > particular differing semantics, something which I think is simply
> > false.
>
> Well, my example contained a difference (the need for the "skipna=True"
> argument) precisely because it seemed that there was some need for
> different defaults.
>
> Honestly, I think this difference breaks the POLA (principle of least
> astonishment).
>
>
> [...]
> > As far as I can tell, the only required difference between them is
> > that NA bit patterns must destroy the data. Nothing else. Everything
> > on top of that is a choice of API and interface mechanisms. I want
> > them to behave exactly the same except for that necessary difference,
> > so that it will be possible to use the *exact same Python code* with
> > either approach.
>
> I completely agree. What I'd suggest is a global and/or per-object
> "ndarray.flags.skipna" for people like me that just want to ignore these
> entries without caring about setting it on each operation (or the other
> way around, depends on the default behaviour).
>
> The downside is that it adds yet another tweaking knob, which is not
> desirable...
>

One way around this would be to create an ndarray subclass which changes
that default. Currently this would not be possible to do nicely, but with
the _numpy_ufunc_ idea I proposed in a separate thread a while back, this
could work.
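
Very roughly, such a subclass might take this shape. This is only a sketch:
it uses the __array_ufunc__ hook that NumPy later grew (playing the role of
the proposed _numpy_ufunc_), and it fakes the missing skipna machinery with
nansum:

import numpy as np

class SkipNAArray(np.ndarray):
    # Treat NaN as a stand-in for NA and skip it in add-reductions.
    # kwargs are ignored in the special-cased branch for brevity.
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        plain = [np.asarray(x) for x in inputs]
        if ufunc is np.add and method == "reduce":
            return np.nansum(plain[0])
        return getattr(ufunc, method)(*plain, **kwargs)

a = np.array([1.0, np.nan, 3.0]).view(SkipNAArray)
print(np.add.reduce(a))        # 4.0 rather than nan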

-Mark


>
>
> Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 2:32 PM, Matthew Brett wrote:

> Hi,
>
> On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe  wrote:
> > On Wed, Jun 29, 2011 at 8:20 AM, Lluís  wrote:
> >>
> >> Matthew Brett writes:
> >>
> >> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
> >> >> the idea that the entry is still there, but we're just ignoring it.
>  Of
> >> >> course, that goes against common convention, but it might be easier
> to
> >> >> explain.
> >>
> >> > I think Nathaniel's point is that np.IGNORE is a different idea than
> >> > np.NA, and that is why joining the implementations can lead to
> >> > conceptual confusion.
> >>
> >> This is how I see it:
> >>
> >> >>> a = np.array([0, 1, 2], dtype=int)
> >> >>> a[0] = np.NA
> >> ValueError
> >> >>> e = np.array([np.NA, 1, 2], dtype=int)
> >> ValueError
> >> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
> >> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
> >> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
> >> >>> b[1] = np.NA
> >> >>> np.sum(b)
> >> np.NA
> >> >>> np.sum(b, skipna=True)
> >> 2
> >> >>> b.mask
> >> None
> >> >>> m[1] = np.NA
> >> >>> np.sum(m)
> >> 2
> >> >>> np.sum(m, skipna=True)
> >> 2
> >> >>> m.mask
> >> [False, False, True]
> >> >>> bm[1] = np.NA
> >> >>> np.sum(bm)
> >> 2
> >> >>> np.sum(bm, skipna=True)
> >> 2
> >> >>> bm.mask
> >> [False, False, True]
> >>
> >> So:
> >>
> >> * Mask takes precedence over bit pattern on element assignment. There's
> >>  still the question of how to assign a bit pattern NA when the mask is
> >>  active.
> >>
> >> * When using mask, elements are automagically skipped.
> >>
> >> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
> >>
> >> * When using bit pattern + mask, it might make sense to have the initial
> >>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
> >>  False, True]" and "np.sum(bm) == np.NA")
> >
> > There seems to be a general idea that masks and NA bit patterns imply
> > particular differing semantics, something which I think is simply false.
>
> Well - first - it's helpful surely to separate the concepts and the
> implementation.
>
> Concepts / use patterns (as delineated by Nathaniel):
> A) missing values == 'np.NA' in my emails.  Can we call that CMV
> (concept missing values)?
> B) masks == np.IGNORE in my emails . CMSK (concept masks)?
>

This is a different conceptual model than I'm proposing in the NEP. This is
also exactly what I was trying to clarify in the first email in this thread
under the headings "Missing Data Abstraction" and "Implementation
Techniques". Masks are *just* an implementation technique. They imply
nothing more, except through previously established conventions such as in
various bitmasks, image masks, numpy.ma and others.

masks != np.IGNORE
bit patterns != np.NA

Masks vs bit patterns and R's default NA vs na.rm NA semantics are
completely independent, except where design choices are made that they
should be related. I think they should be unrelated, masks and bit patterns
are two approaches to solving the same problem.
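
A toy illustration of that point, with NaN standing in for a real NA bit
pattern and numpy.ma standing in for a real mask implementation:

import numpy as np

values = [1.0, 2.0, 3.0]

# Technique 1: a separate mask alongside untouched data.
masked = np.ma.MaskedArray(values, mask=[False, True, False])

# Technique 2: a special bit pattern written into the data itself
# (NaN as a stand-in here); the old value is necessarily destroyed.
bitpat = np.array(values)
bitpat[1] = np.nan

print(masked.sum())          # 4.0
print(np.nansum(bitpat))     # 4.0 -- same missing-data concept, two storages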


>
> Implementations
> 1) bit-pattern == na-dtype - how about we call that IBP
> (implementation bit patten)?
> 2) array.mask.  IM (implementation mask)?
>
> Nathaniel implied that:
>
> CMV implies: sum([np.NA, 1]) == np.NA
> CMSK implies sum([np.NA, 1]) == 1
>
> and indeed, that's how R and masked arrays respectively behave.


R and numpy.ma.  If we're trying to be clear about our concepts and
implementations, numpy.ma is just one possible implementation of masked
arrays.


> So I
> think it's reasonable to say that at least R thought that the bitmask
> implied the first and Pierre and others thought the mask meant the
> second.
>

R's model is based on years of experience and a model of what missing values
imply; the bitmask itself implies nothing about the behavior of NA.


>
> The NEP as it stands thinks of CMV and CMSK as being different views
> of the same thing.  Please correct me if I'm wrong.
>
> > Both NaN and Inf are implemented in hardware with the same idea as the NA
> > bit pattern, but they do not follow NA missing value semantics.
>
> Right - and that doesn't affect the argument, because the argument is
> about the concepts and not the implementation.
>

You just said R thought bitmasks implied something, and you're saying masked
arrays imply something. If the argument is just about the missing value
concepts, neither of these should be in the present discussion.


>
> > As far as I can tell, the only required difference between them is that
> NA
> > bit patterns must destroy the data. Nothing else.
>
> I think Nathaniel's point was about the expected default behavior in
> the different concepts.
>
> > Everything on top of that
> > is a choice of API and interface mechanisms. I want them to behave
> exactly
> > the same except for that necessary difference, so that it will be
> possible
> > to use the *exact same Python code* with either approach.
>

Re: [Numpy-discussion] missing data discussion round 2

2011-06-30 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 1:07 PM, Dag Sverre Seljebotn <
d.s.seljeb...@astro.uio.no> wrote:

> On 06/29/2011 07:38 PM, Mark Wiebe wrote:
> > On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn
> > mailto:d.s.seljeb...@astro.uio.no>> wrote:
> >
> > On 06/29/2011 03:45 PM, Matthew Brett wrote:
> >  > Hi,
> >  >
> >  > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe > >  wrote:
> >  >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew
> > Brettmailto:matthew.br...@gmail.com>>
> >  >> wrote:
> >  >>>
> >  >>> Hi,
> >  >>>
> >  >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith > >  wrote:
> >  >>> ...
> >   (You might think, what difference does it make if you *can*
> > unmask an
> >   item? Us missing data folks could just ignore this feature.
> But:
> >   whatever we end up implementing is something that I will have
> to
> >   explain over and over to different people, most of them not
> >   particularly sophisticated programmers. And there's just no
> > sensible
> >   way to explain this idea that if you store some particular
> > value, then
> >   it replaces the old value, but if you store NA, then the old
> > value is
> >   still there.
> >  >>>
> >  >>> Ouch - yes.  No question, that is difficult to explain.   Well,
> I
> >  >>> think the explanation might go like this:
> >  >>>
> >  >>> "Ah, yes, well, that's because in fact numpy records missing
> > values by
> >  >>> using a 'mask'.   So when you say `a[3] = np.NA', what you mean
> is,
> >  >>> 'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
> >  >>>
> >  >>> Is that fair?
> >  >>
> >  >> My favorite way of explaining it would be to have a grid of
> > numbers written
> >  >> on paper, then have several cardboards with holes poked in them
> > in different
> >  >> configurations. Placing these cardboard masks in front of the
> > grid would
> >  >> show different sets of non-missing data, without affecting the
> > values stored
> >  >> on the paper behind them.
> >  >
> >  > Right - but here of course you are trying to explain the mask, and
> >  > this is Nathaniel's point, that in order to explain NAs, you have
> to
> >  > explain masks, and so, even at a basic level, the fusion of the
> two
> >  > ideas is obvious, and already confusing.  I mean this:
> >  >
> >  > a[3] = np.NA
> >  >
> >  > "Oh, so you just set the a[3] value to have some missing value
> code?"
> >  >
> >  > "Ah - no - in fact what I did was set a associated mask in
> position
> >  > a[3] so that you can't any longer see the previous value of a[3]"
> >  >
> >  > "Huh.  You mean I have a mask for every single value in order to
> be
> >  > able to blank out a[3]?  It looks like an assignment.  I mean, it
> >  > looks just like a[3] = 4.  But I guess it isn't?"
> >  >
> >  > "Er..."
> >  >
> >  > I think Nathaniel's point is a very good one - these are separate
> >  > ideas, np.NA and np.IGNORE, and a joint implementation is bound to
> >  > draw them together in the mind of the user.  Apart from anything
> >  > else, the user has to know that, if they want a single NA value in
> an
> >  > array, they have to add a mask size array.shape in bytes.  They
> have
> >  > to know then, that NA is implemented by masking, and then the 'NA
> for
> >  > free by adding masking' idea breaks down and starts to feel like a
> >  > kludge.
> >  >
> >  > The counter argument is of course that, in time, the
> > implementation of
> >  > NA with masking will seem as obvious and intuitive, as, say,
> >  > broadcasting, and that we are just reacting from lack of
> experience
> >  > with the new API.
> >
> > However, no matter how used we get to this, people coming from almost
> > any other tool (in particular R) will keep thinking it is
> > counter-intuitive. Why set up a major semantic incompatibility that
> > people then have to overcome in order to start using NumPy.
> >
> >
> > I'm not aware of a semantic incompatibility. I believe R doesn't support
> > views like NumPy does, so the things you have to do to see masking
> > semantics aren't even possible in R.
>
> Well, whether the same feature is possible or not in R is irrelevant to
> whether a semantic incompatibility would exist.
>
> Views themselves are a *major* semantic incompatibility, and are highly
> confusing at first to MATLAB/Fortran/R people. However they have major
> advantages outweighing the disadvantage of having to caution new users.
>
> But there's simply no precedent anywhere for an assignment that doesn't
> erase the old value for a particular input value, and the advantages
> seem pretty minor (well, I think it is ugly in its own right, but that
> is beside the point...)

Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Chris Barker
On 6/27/11 9:53 AM, Charles R Harris wrote:
> Some discussion of disk storage might also help. I don't see how the
> rules can be enforced if two files are used, one for the mask and
> another for the data, but that may just be something we need to live with.

It seems it wouldn't be too big a deal to extend the *.npy format to 
include the mask.

Could one memmap both the data array and the mask?

Netcdf (and assume hdf) have ways to support masks as well.

-Chris




-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Nathaniel Smith
On Wed, Jun 29, 2011 at 2:40 PM, Lluís  wrote:
> I'm for the option of having a single API when you want to have NA
> elements, regardless of whether it's using masks or bit patterns.

I understand the desire to avoid having two different APIs...

[snip]
> My concern is now about how to set the "skipna" in a "comfortable" way,
> so that I don't have to set it again and again as ufunc arguments:
>
> >>> a
> array([NA, 2, 3])
> >>> b
> array([1, 2, NA])
> >>> a + b
> array([NA, 2, NA])
> >>> a.flags.skipna=True
> >>> b.flags.skipna=True
> >>> a + b
> array([1, 4, 3])

...But... now you're introducing two different kinds of arrays with
different APIs again? Ones where .skipna==True, and ones where
.skipna==False?

I know that this way it's not keyed on the underlying storage format,
but if we support both bit patterns and mask arrays at the
implementation level, then the only way to make them have identical
APIs is if we completely disallow unmasking, and shared masks, and so
forth. Which doesn't seem like it'd be very popular (and would make
including the mask-based implementation pretty pointless). So I think
we have to assume that they will have APIs that are at least somewhat
different. And then it seems like with this proposal then we'd
actually end up with *4* different APIs that any particular array
might follow... (or maybe more, depending on how arrays that had both
a bit-pattern and mask ended up working).

That's why I was thinking the best solution might be to just bite the
bullet and make the APIs *totally* different and non-overlapping, so
it was always obvious which you were using and how they'd interact.
But I don't know -- for my work I'd be happy to just pass skipna
everywhere I needed it, and never unmask anything, and so forth, so
maybe there's some reason why it's really important for the
bit-pattern NA API to overlap more with the masked array API?

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Lluís
Nathaniel Smith writes:
> I know that the part 1 of that proposal would satisfy my needs, but I
> don't know as much about your use case, so I'm curious. Would that
> proposal (in particular, part 2, the classic masked-array part) work
> for you?

I'm for the option of having a single API when you want to have NA
elements, regardless of whether it's using masks or bit patterns.

My question is whether your ufuncs should react differently depending on
the type of array you're using (bit pattern vs mask).

In the beginning I thought it could make sense, as you know how you have
created the array. So if you're using masks, you're probably going to
ignore the NAs (because you've explicitly set them, and you don't want a
NA as the result of your summation).

*But*, the more API/semantics both approaches share, the better; so I'd
say that its better that they show the *very same* behaviour
(w.r.t. "skipna").

My concern is now about how to set the "skipna" in a "comfortable" way,
so that I don't have to set it again and again as ufunc arguments:

>>> a
array([NA, 2, 3])
>>> b
array([1, 2, NA])
>>> a + b
array([NA, 2, NA])
>>> a.flags.skipna=True
>>> b.flags.skipna=True
>>> a + b
array([1, 4, 3])


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Eric Firing
On 06/29/2011 09:32 AM, Matthew Brett wrote:
> Hi,
>
[...]
>
> Clearly there are some overlaps between what masked arrays are trying
> to achieve and what R's NA mechanisms are trying to achieve.  Are they
> really similar enough that they should function using the same API?
> And if so, won't that be confusing?  I think that's the question
> that's being asked.

And I think the answer is "no".  No more confusing to people coming from 
R to numpy than views already are--with or without the NEP--and not 
*requiring* people to use any NA-related functionality beyond what they 
are used to from R.

My understanding of the NEP is that it directly yields an API closely 
matching that of R, but with the opportunity, via views, to do more with 
less work, if one so desires.  The present masked array module could be 
made more efficient if the NEP is implemented; regardless of whether 
this is done, the masked array module is not about to vanish, so anyone 
wanting precisely the masked array API will have it; and others remain 
free to ignore it (except for those of us involved in developing 
libraries such as matplotlib, which will have to support all variations 
of the new API along with the already-supported masked arrays).

In addition, for new code, the full-blown masked array module may not be 
needed.  A convenience it adds, however, is the automatic masking of 
invalid values:

In [1]: np.ma.log(-1)
Out[1]: masked

I'm sure this horrifies some, but there are times and places where it is 
a genuine convenience, and preferable to having to use a separate 
operation to replace nan or inf with NA or whatever it ends up being.

If np.seterr were extended to allow such automatic masking as an option, 
then the need for a separate masked array module would shrink further. 
I wouldn't mind having to use an explicit kwarg for ignoring NA in 
reduction methods.
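
Something along these lines, perhaps -- the 'mask' mode below is purely
hypothetical (np.seterr has no such option), with the closest current
workaround shown for comparison:

import numpy as np

# Hypothetical extension (does not exist):
#   np.seterr(invalid='mask')
#   np.log(np.array([-1.0, 1.0]))   ->  [-- 0.0]

# What one actually does today:
with np.errstate(invalid='ignore'):
    x = np.log(np.array([-1.0, 1.0]))   # nan where the log is invalid
m = np.ma.masked_invalid(x)
print(m)                                # [-- 0.0]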

Eric


>
> See you,
>
> Matthew
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Nathaniel Smith
On Wed, Jun 29, 2011 at 11:20 AM, Lluís  wrote:
> I completely agree. What I'd suggest is a global and/or per-object
> "ndarray.flags.skipna" for people like me that just want to ignore these
> entries without caring about setting it on each operation (or the other
> way around, depends on the default behaviour).

I agree with Matthew that this approach would end up having
horrible side-effects, but I can see why you'd want some way to
accomplish this...

I suggested another approach to handling both NA-style and mask-style
missing data by making them totally separate features. It's buried at
the bottom of this over-long message (you can search for "my
proposal"):
  http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057251.html

I know that the part 1 of that proposal would satisfy my needs, but I
don't know as much about your use case, so I'm curious. Would that
proposal (in particular, part 2, the classic masked-array part) work
for you?

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Hi,

On Wed, Jun 29, 2011 at 9:17 PM, Charles R Harris
 wrote:
>
>
> On Wed, Jun 29, 2011 at 1:32 PM, Matthew Brett 
> wrote:
>>
>> Hi,
>>
>> On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe  wrote:
>> > On Wed, Jun 29, 2011 at 8:20 AM, Lluís  wrote:
>> >>
>> >> Matthew Brett writes:
>> >>
>> >> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of
>> >> >> conveys
>> >> >> the idea that the entry is still there, but we're just ignoring it.
>> >> >>  Of
>> >> >> course, that goes against common convention, but it might be easier
>> >> >> to
>> >> >> explain.
>> >>
>> >> > I think Nathaniel's point is that np.IGNORE is a different idea than
>> >> > np.NA, and that is why joining the implementations can lead to
>> >> > conceptual confusion.
>> >>
>> >> This is how I see it:
>> >>
>> >> >>> a = np.array([0, 1, 2], dtype=int)
>> >> >>> a[0] = np.NA
>> >> ValueError
>> >> >>> e = np.array([np.NA, 1, 2], dtype=int)
>> >> ValueError
>> >> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
>> >> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
>> >> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
>> >> >>> b[1] = np.NA
>> >> >>> np.sum(b)
>> >> np.NA
>> >> >>> np.sum(b, skipna=True)
>> >> 2
>> >> >>> b.mask
>> >> None
>> >> >>> m[1] = np.NA
>> >> >>> np.sum(m)
>> >> 2
>> >> >>> np.sum(m, skipna=True)
>> >> 2
>> >> >>> m.mask
>> >> [False, False, True]
>> >> >>> bm[1] = np.NA
>> >> >>> np.sum(bm)
>> >> 2
>> >> >>> np.sum(bm, skipna=True)
>> >> 2
>> >> >>> bm.mask
>> >> [False, False, True]
>> >>
>> >> So:
>> >>
>> >> * Mask takes precedence over bit pattern on element assignment. There's
>> >>  still the question of how to assign a bit pattern NA when the mask is
>> >>  active.
>> >>
>> >> * When using mask, elements are automagically skipped.
>> >>
>> >> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
>> >>
>> >> * When using bit pattern + mask, it might make sense to have the
>> >> initial
>> >>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
>> >>  False, True]" and "np.sum(bm) == np.NA")
>> >
>> > There seems to be a general idea that masks and NA bit patterns imply
>> > particular differing semantics, something which I think is simply false.
>>
>> Well - first - it's helpful surely to separate the concepts and the
>> implementation.
>>
>> Concepts / use patterns (as delineated by Nathaniel):
>> A) missing values == 'np.NA' in my emails.  Can we call that CMV
>> (concept missing values)?
>> B) masks == np.IGNORE in my emails . CMSK (concept masks)?
>>
>> Implementations
>> 1) bit-pattern == na-dtype - how about we call that IBP
>> (implementation bit patten)?
>> 2) array.mask.  IM (implementation mask)?
>>
>
> Remember that the masks are invisible, you can't see them, they are an
> implementation detail. A good reason to hide the implementation is so it can
> be changed without impacting software that depends on the API.

It's not true that you can't see them, because masks use the same API
as missing values.  Because they share the same API, the person using
the CMV stuff will soon find out about the masks, accidentally or not;
they will then need to understand masking, and that is the problem
we're discussing here.

See you,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Charles R Harris
On Wed, Jun 29, 2011 at 1:32 PM, Matthew Brett wrote:

> Hi,
>
> On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe  wrote:
> > On Wed, Jun 29, 2011 at 8:20 AM, Lluís  wrote:
> >>
> >> Matthew Brett writes:
> >>
> >> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
> >> >> the idea that the entry is still there, but we're just ignoring it.
>  Of
> >> >> course, that goes against common convention, but it might be easier
> to
> >> >> explain.
> >>
> >> > I think Nathaniel's point is that np.IGNORE is a different idea than
> >> > np.NA, and that is why joining the implementations can lead to
> >> > conceptual confusion.
> >>
> >> This is how I see it:
> >>
> >> >>> a = np.array([0, 1, 2], dtype=int)
> >> >>> a[0] = np.NA
> >> ValueError
> >> >>> e = np.array([np.NA, 1, 2], dtype=int)
> >> ValueError
> >> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
> >> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
> >> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
> >> >>> b[1] = np.NA
> >> >>> np.sum(b)
> >> np.NA
> >> >>> np.sum(b, skipna=True)
> >> 2
> >> >>> b.mask
> >> None
> >> >>> m[1] = np.NA
> >> >>> np.sum(m)
> >> 2
> >> >>> np.sum(m, skipna=True)
> >> 2
> >> >>> m.mask
> >> [False, False, True]
> >> >>> bm[1] = np.NA
> >> >>> np.sum(bm)
> >> 2
> >> >>> np.sum(bm, skipna=True)
> >> 2
> >> >>> bm.mask
> >> [False, False, True]
> >>
> >> So:
> >>
> >> * Mask takes precedence over bit pattern on element assignment. There's
> >>  still the question of how to assign a bit pattern NA when the mask is
> >>  active.
> >>
> >> * When using mask, elements are automagically skipped.
> >>
> >> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
> >>
> >> * When using bit pattern + mask, it might make sense to have the initial
> >>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
> >>  False, True]" and "np.sum(bm) == np.NA")
> >
> > There seems to be a general idea that masks and NA bit patterns imply
> > particular differing semantics, something which I think is simply false.
>
> Well - first - it's helpful surely to separate the concepts and the
> implementation.
>
> Concepts / use patterns (as delineated by Nathaniel):
> A) missing values == 'np.NA' in my emails.  Can we call that CMV
> (concept missing values)?
> B) masks == np.IGNORE in my emails . CMSK (concept masks)?
>
> Implementations
> 1) bit-pattern == na-dtype - how about we call that IBP
> (implementation bit patten)?
> 2) array.mask.  IM (implementation mask)?
>
>
Remember that the masks are invisible, you can't see them, they are an
implementation detail. A good reason to hide the implementation is so it can
be changed without impacting software that depends on the API.



Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Hi,

On Wed, Jun 29, 2011 at 7:20 PM, Lluís  wrote:
> Mark Wiebe writes:
>
>> There seems to be a general idea that masks and NA bit patterns imply
>> particular differing semantics, something which I think is simply
>> false.
>
> Well, my example contained a difference (the need for the "skipna=True"
> argument) precisely because it seemed that there was some need for
> different defaults.
>
> Honestly, I think this difference breaks the POLA (principle of least
> astonishment).
>
>
> [...]
>> As far as I can tell, the only required difference between them is
>> that NA bit patterns must destroy the data. Nothing else. Everything
>> on top of that is a choice of API and interface mechanisms. I want
>> them to behave exactly the same except for that necessary difference,
>> so that it will be possible to use the *exact same Python code* with
>> either approach.
>
> I completely agree. What I'd suggest is a global and/or per-object
> "ndarray.flags.skipna" for people like me that just want to ignore these
> entries without caring about setting it on each operation (or the other
> way around, depends on the default behaviour).
>
> The downside is that it adds yet another tweaking knob, which is not
> desirable...

Oh - dear - that would be horrible, if, depending on the tweak
somewhere in the distant past of your script, this:

>>> a = np.array([np.NA, 1.0], masked=True)
>>> np.sum(a)

could return either np.NA or 1.0...

Imagine someone twiddled the knob the other way and ran your script...

See you,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Oops,

On Wed, Jun 29, 2011 at 8:32 PM, Matthew Brett  wrote:
> Hi,
>
> On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe  wrote:
>> On Wed, Jun 29, 2011 at 8:20 AM, Lluís  wrote:
>>>
>>> Matthew Brett writes:
>>>
>>> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
>>> >> the idea that the entry is still there, but we're just ignoring it.  Of
>>> >> course, that goes against common convention, but it might be easier to
>>> >> explain.
>>>
>>> > I think Nathaniel's point is that np.IGNORE is a different idea than
>>> > np.NA, and that is why joining the implementations can lead to
>>> > conceptual confusion.
>>>
>>> This is how I see it:
>>>
>>> >>> a = np.array([0, 1, 2], dtype=int)
>>> >>> a[0] = np.NA
>>> ValueError
>>> >>> e = np.array([np.NA, 1, 2], dtype=int)
>>> ValueError
>>> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
>>> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
>>> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
>>> >>> b[1] = np.NA
>>> >>> np.sum(b)
>>> np.NA
>>> >>> np.sum(b, skipna=True)
>>> 2
>>> >>> b.mask
>>> None
>>> >>> m[1] = np.NA
>>> >>> np.sum(m)
>>> 2
>>> >>> np.sum(m, skipna=True)
>>> 2
>>> >>> m.mask
>>> [False, False, True]
>>> >>> bm[1] = np.NA
>>> >>> np.sum(bm)
>>> 2
>>> >>> np.sum(bm, skipna=True)
>>> 2
>>> >>> bm.mask
>>> [False, False, True]
>>>
>>> So:
>>>
>>> * Mask takes precedence over bit pattern on element assignment. There's
>>>  still the question of how to assign a bit pattern NA when the mask is
>>>  active.
>>>
>>> * When using mask, elements are automagically skipped.
>>>
>>> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
>>>
>>> * When using bit pattern + mask, it might make sense to have the initial
>>>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
>>>  False, True]" and "np.sum(bm) == np.NA")
>>
>> There seems to be a general idea that masks and NA bit patterns imply
>> particular differing semantics, something which I think is simply false.
>
> Well - first - it's helpful surely to separate the concepts and the
> implementation.
>
> Concepts / use patterns (as delineated by Nathaniel):
> A) missing values == 'np.NA' in my emails.  Can we call that CMV
> (concept missing values)?
> B) masks == np.IGNORE in my emails . CMSK (concept masks)?
>
> Implementations
> 1) bit-pattern == na-dtype - how about we call that IBP
> (implementation bit patten)?
> 2) array.mask.  IM (implementation mask)?
>
> Nathaniel implied that:
>
> CMV implies: sum([np.NA, 1]) == np.NA
> CMSK implies sum([np.NA, 1]) == 1
>
> and indeed, that's how R and masked arrays respectively behave.  So I
> think it's reasonable to say that at least R thought that the bitmask
> implied the first and Pierre and others thought the mask meant the
> second.
>
> The NEP as it stands thinks of CMV and CMSK as being different views
> of the same thing.  Please correct me if I'm wrong.
>
>> Both NaN and Inf are implemented in hardware with the same idea as the NA
>> bit pattern, but they do not follow NA missing value semantics.
>
> Right - and that doesn't affect the argument, because the argument is
> about the concepts and not the implementation.
>
>> As far as I can tell, the only required difference between them is that NA
>> bit patterns must destroy the data. Nothing else.
>
> I think Nathaniel's point was about the expected default behavior in
> the different concepts.
>
>> Everything on top of that
>> is a choice of API and interface mechanisms. I want them to behave exactly
>> the same except for that necessary difference, so that it will be possible
>> to use the *exact same Python code* with either approach.
>
> Right.  And Nathaniel's point is that that desire leads to fusion of
> the two ideas into one when they should be separated.  For example, if
> I understand correctly:
>
> >>> a = np.array([1.0, 2.0, 3, 7.0], masked=True)
> >>> b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
> >>> a[3] = np.NA  # actual real hand-on-heart assignment
> >>> b[3] = np.NA # magic mask setting although it looks the same

I meant:

>>> a = np.array([1.0, 2.0, 3.0, 7.0], masked=True)
>>> b = np.array([1.0, 2.0, 3.0, 7.0], dtype='NA[f8]')
>>> b[3] = np.NA  # actual real hand-on-heart assignment
>>> a[3] = np.NA # magic mask setting although it looks the same

Sorry,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Matthew Brett
Hi,

On Wed, Jun 29, 2011 at 6:22 PM, Mark Wiebe  wrote:
> On Wed, Jun 29, 2011 at 8:20 AM, Lluís  wrote:
>>
>> Matthew Brett writes:
>>
>> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
>> >> the idea that the entry is still there, but we're just ignoring it.  Of
>> >> course, that goes against common convention, but it might be easier to
>> >> explain.
>>
>> > I think Nathaniel's point is that np.IGNORE is a different idea than
>> > np.NA, and that is why joining the implementations can lead to
>> > conceptual confusion.
>>
>> This is how I see it:
>>
>> >>> a = np.array([0, 1, 2], dtype=int)
>> >>> a[0] = np.NA
>> ValueError
>> >>> e = np.array([np.NA, 1, 2], dtype=int)
>> ValueError
>> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
>> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
>> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
>> >>> b[1] = np.NA
>> >>> np.sum(b)
>> np.NA
>> >>> np.sum(b, skipna=True)
>> 2
>> >>> b.mask
>> None
>> >>> m[1] = np.NA
>> >>> np.sum(m)
>> 2
>> >>> np.sum(m, skipna=True)
>> 2
>> >>> m.mask
>> [False, False, True]
>> >>> bm[1] = np.NA
>> >>> np.sum(bm)
>> 2
>> >>> np.sum(bm, skipna=True)
>> 2
>> >>> bm.mask
>> [False, False, True]
>>
>> So:
>>
>> * Mask takes precedence over bit pattern on element assignment. There's
>>  still the question of how to assign a bit pattern NA when the mask is
>>  active.
>>
>> * When using mask, elements are automagically skipped.
>>
>> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
>>
>> * When using bit pattern + mask, it might make sense to have the initial
>>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
>>  False, True]" and "np.sum(bm) == np.NA")
>
> There seems to be a general idea that masks and NA bit patterns imply
> particular differing semantics, something which I think is simply false.

Well - first - it's helpful surely to separate the concepts and the
implementation.

Concepts / use patterns (as delineated by Nathaniel):
A) missing values == 'np.NA' in my emails.  Can we call that CMV
(concept missing values)?
B) masks == np.IGNORE in my emails . CMSK (concept masks)?

Implementations
1) bit-pattern == na-dtype - how about we call that IBP
(implementation bit patten)?
2) array.mask.  IM (implementation mask)?

Nathaniel implied that:

CMV implies: sum([np.NA, 1]) == np.NA
CMSK implies sum([np.NA, 1]) == 1

and indeed, that's how R and masked arrays respectively behave.  So I
think it's reasonable to say that at least R thought that the bitmask
implied the first and Pierre and others thought the mask meant the
second.

The NEP as it stands thinks of CMV and CMSK as being different views
of the same thing.  Please correct me if I'm wrong.

> Both NaN and Inf are implemented in hardware with the same idea as the NA
> bit pattern, but they do not follow NA missing value semantics.

Right - and that doesn't affect the argument, because the argument is
about the concepts and not the implementation.

> As far as I can tell, the only required difference between them is that NA
> bit patterns must destroy the data. Nothing else.

I think Nathaniel's point was about the expected default behavior in
the different concepts.

> Everything on top of that
> is a choice of API and interface mechanisms. I want them to behave exactly
> the same except for that necessary difference, so that it will be possible
> to use the *exact same Python code* with either approach.

Right.  And Nathaniel's point is that that desire leads to fusion of
the two ideas into one when they should be separated.  For example, if
I understand correctly:

>>> a = np.array([1.0, 2.0, 3, 7.0], masked=True)
>>> b = np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
>>> a[3] = np.NA  # actual real hand-on-heart assignment
>>> b[3] = np.NA # magic mask setting although it looks the same

> Say you're using NA dtypes, and suddenly you think, "what if I temporarily
> treated these as NA too". Now you have to copy your whole array to avoid
> destroying your data! The NA bit pattern didn't save you memory here... Say
> you're using masks, and it turns out you didn't actually need masking
> semantics. If they're different, you now have to do lots of code changes to
> switch to NA dtypes!

I personally have not run across that case.  I'd imagine that, if you
knew you wanted to do something so explicitly masking-like, you'd
start with the masking interface.

Clearly there are some overlaps between what masked arrays are trying
to achieve and what R's NA mechanisms are trying to achieve.  Are they
really similar enough that they should function using the same API?
And if so, won't that be confusing?  I think that's the question
that's being asked.

See you,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Bruce Southey
On 06/29/2011 01:07 PM, Dag Sverre Seljebotn wrote:
> On 06/29/2011 07:38 PM, Mark Wiebe wrote:
>> On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn
>> mailto:d.s.seljeb...@astro.uio.no>>  wrote:
>>
>>  On 06/29/2011 03:45 PM, Matthew Brett wrote:
>>   >  Hi,
>>   >
>>   >  On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe>  >   wrote:
>>   >>  On Tue, Jun 28, 2011 at 5:20 PM, Matthew
>>  Brettmailto:matthew.br...@gmail.com>>
>>   >>  wrote:
>>   >>>
>>   >>>  Hi,
>>   >>>
>>   >>>  On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith>  >   wrote:
>>   >>>  ...
>>     (You might think, what difference does it make if you *can*
>>  unmask an
>>     item? Us missing data folks could just ignore this feature. But:
>>     whatever we end up implementing is something that I will have to
>>     explain over and over to different people, most of them not
>>     particularly sophisticated programmers. And there's just no
>>  sensible
>>     way to explain this idea that if you store some particular
>>  value, then
>>     it replaces the old value, but if you store NA, then the old
>>  value is
>>     still there.
>>   >>>
>>   >>>  Ouch - yes.  No question, that is difficult to explain.   Well, I
>>   >>>  think the explanation might go like this:
>>   >>>
>>   >>>  "Ah, yes, well, that's because in fact numpy records missing
>>  values by
>>   >>>  using a 'mask'.   So when you say `a[3] = np.NA', what you mean 
>> is,
>>   >>>  'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
>>   >>>
>>   >>>  Is that fair?
>>   >>
>>   >>  My favorite way of explaining it would be to have a grid of
>>  numbers written
>>   >>  on paper, then have several cardboards with holes poked in them
>>  in different
>>   >>  configurations. Placing these cardboard masks in front of the
>>  grid would
>>   >>  show different sets of non-missing data, without affecting the
>>  values stored
>>   >>  on the paper behind them.
>>   >
>>   >  Right - but here of course you are trying to explain the mask, and
>>   >  this is Nathaniel's point, that in order to explain NAs, you have to
>>   >  explain masks, and so, even at a basic level, the fusion of the two
>>   >  ideas is obvious, and already confusing.  I mean this:
>>   >
>>   >  a[3] = np.NA
>>   >
>>   >  "Oh, so you just set the a[3] value to have some missing value 
>> code?"
>>   >
>>   >  "Ah - no - in fact what I did was set a associated mask in position
>>   >  a[3] so that you can't any longer see the previous value of a[3]"
>>   >
>>   >  "Huh.  You mean I have a mask for every single value in order to be
>>   >  able to blank out a[3]?  It looks like an assignment.  I mean, it
>>   >  looks just like a[3] = 4.  But I guess it isn't?"
>>   >
>>   >  "Er..."
>>   >
>>   >  I think Nathaniel's point is a very good one - these are separate
>>   >  ideas, np.NA and np.IGNORE, and a joint implementation is bound to
>>   >  draw them together in the mind of the user.  Apart from anything
>>   >  else, the user has to know that, if they want a single NA value in 
>> an
>>   >  array, they have to add a mask size array.shape in bytes.  They have
>>   >  to know then, that NA is implemented by masking, and then the 'NA 
>> for
>>   >  free by adding masking' idea breaks down and starts to feel like a
>>   >  kludge.
>>   >
>>   >  The counter argument is of course that, in time, the
>>  implementation of
>>   >  NA with masking will seem as obvious and intuitive, as, say,
>>   >  broadcasting, and that we are just reacting from lack of experience
>>   >  with the new API.
>>
>>  However, no matter how used we get to this, people coming from almost
>>  any other tool (in particular R) will keep thinking it is
>>  counter-intuitive. Why set up a major semantic incompatibility that
>>  people then have to overcome in order to start using NumPy.
>>
>>
>> I'm not aware of a semantic incompatibility. I believe R doesn't support
>> views like NumPy does, so the things you have to do to see masking
>> semantics aren't even possible in R.
> Well, whether the same feature is possible or not in R is irrelevant to
> whether a semantic incompatibility would exist.
>
> Views themselves are a *major* semantic incompatibility, and are highly
> confusing at first to MATLAB/Fortran/R people. However they have major
> advantages outweighing the disadvantage of having to caution new users.
>
> But there's simply no precedent anywhere for an assignment that doesn't
> erase the old value for a particular input value, and the advantages
> seem pretty minor (well, I think it is ugly in its own right, but that
> is beside the point...)

Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Lluís
Mark Wiebe writes:

> There seems to be a general idea that masks and NA bit patterns imply
> particular differing semantics, something which I think is simply
> false.

Well, my example contained a difference (the need for the "skipna=True"
argument) precisely because it seemed that there was some need for
different defaults.

Honestly, I think this difference breaks the POLA (principle of least
astonishment).


[...]
> As far as I can tell, the only required difference between them is
> that NA bit patterns must destroy the data. Nothing else. Everything
> on top of that is a choice of API and interface mechanisms. I want
> them to behave exactly the same except for that necessary difference,
> so that it will be possible to use the *exact same Python code* with
> either approach.

I completely agree. What I'd suggest is a global and/or per-object
"ndarray.flags.skipna" for people like me that just want to ignore these
entries without caring about setting it on each operation (or the other
way around, depends on the default behaviour).

The downside is that it adds yet another tweaking knob, which is not
desirable...


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Dag Sverre Seljebotn
On 06/29/2011 07:38 PM, Mark Wiebe wrote:
> On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn
> mailto:d.s.seljeb...@astro.uio.no>> wrote:
>
> On 06/29/2011 03:45 PM, Matthew Brett wrote:
>  > Hi,
>  >
>  > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe >  wrote:
>  >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew
> Brettmailto:matthew.br...@gmail.com>>
>  >> wrote:
>  >>>
>  >>> Hi,
>  >>>
>  >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith >  wrote:
>  >>> ...
>   (You might think, what difference does it make if you *can*
> unmask an
>   item? Us missing data folks could just ignore this feature. But:
>   whatever we end up implementing is something that I will have to
>   explain over and over to different people, most of them not
>   particularly sophisticated programmers. And there's just no
> sensible
>   way to explain this idea that if you store some particular
> value, then
>   it replaces the old value, but if you store NA, then the old
> value is
>   still there.
>  >>>
>  >>> Ouch - yes.  No question, that is difficult to explain.   Well, I
>  >>> think the explanation might go like this:
>  >>>
>  >>> "Ah, yes, well, that's because in fact numpy records missing
> values by
>  >>> using a 'mask'.   So when you say `a[3] = np.NA', what you mean is,
>  >>> 'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
>  >>>
>  >>> Is that fair?
>  >>
>  >> My favorite way of explaining it would be to have a grid of
> numbers written
>  >> on paper, then have several cardboards with holes poked in them
> in different
>  >> configurations. Placing these cardboard masks in front of the
> grid would
>  >> show different sets of non-missing data, without affecting the
> values stored
>  >> on the paper behind them.
>  >
>  > Right - but here of course you are trying to explain the mask, and
>  > this is Nathaniel's point, that in order to explain NAs, you have to
>  > explain masks, and so, even at a basic level, the fusion of the two
>  > ideas is obvious, and already confusing.  I mean this:
>  >
>  > a[3] = np.NA
>  >
>  > "Oh, so you just set the a[3] value to have some missing value code?"
>  >
>  > "Ah - no - in fact what I did was set a associated mask in position
>  > a[3] so that you can't any longer see the previous value of a[3]"
>  >
>  > "Huh.  You mean I have a mask for every single value in order to be
>  > able to blank out a[3]?  It looks like an assignment.  I mean, it
>  > looks just like a[3] = 4.  But I guess it isn't?"
>  >
>  > "Er..."
>  >
>  > I think Nathaniel's point is a very good one - these are separate
>  > ideas, np.NA and np.IGNORE, and a joint implementation is bound to
>  > draw them together in the mind of the user.  Apart from anything
>  > else, the user has to know that, if they want a single NA value in an
>  > array, they have to add a mask size array.shape in bytes.  They have
>  > to know then, that NA is implemented by masking, and then the 'NA for
>  > free by adding masking' idea breaks down and starts to feel like a
>  > kludge.
>  >
>  > The counter argument is of course that, in time, the
> implementation of
>  > NA with masking will seem as obvious and intuitive, as, say,
>  > broadcasting, and that we are just reacting from lack of experience
>  > with the new API.
>
> However, no matter how used we get to this, people coming from almost
> any other tool (in particular R) will keep thinking it is
> counter-intuitive. Why set up a major semantic incompatibility that
> people then have to overcome in order to start using NumPy.
>
>
> I'm not aware of a semantic incompatibility. I believe R doesn't support
> views like NumPy does, so the things you have to do to see masking
> semantics aren't even possible in R.

Well, whether the same feature is possible or not in R is irrelevant to 
whether a semantic incompatibility would exist.

Views themselves are a *major* semantic incompatibility, and are highly
confusing at first to MATLAB/Fortran/R people. However they have major
advantages outweighing the disadvantage of having to caution new users.

But there's simply no precedent anywhere for an assignment that doesn't
erase the old value for a particular input value, and the advantages
seem pretty minor (well, I think it is ugly in its own right, but that
is beside the point...)

Dag Sverre
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn <
d.s.seljeb...@astro.uio.no> wrote:

> On 06/29/2011 03:45 PM, Matthew Brett wrote:
> > Hi,
> >
> > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe  wrote:
> >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith
>  wrote:
> >>> ...
>  (You might think, what difference does it make if you *can* unmask an
>  item? Us missing data folks could just ignore this feature. But:
>  whatever we end up implementing is something that I will have to
>  explain over and over to different people, most of them not
>  particularly sophisticated programmers. And there's just no sensible
>  way to explain this idea that if you store some particular value, then
>  it replaces the old value, but if you store NA, then the old value is
>  still there.
> >>>
> >>> Ouch - yes.  No question, that is difficult to explain.   Well, I
> >>> think the explanation might go like this:
> >>>
> >>> "Ah, yes, well, that's because in fact numpy records missing values by
> >>> using a 'mask'.   So when you say `a[3] = np.NA', what you mean is,
> >>> 'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
> >>>
> >>> Is that fair?
> >>
> >> My favorite way of explaining it would be to have a grid of numbers
> written
> >> on paper, then have several cardboards with holes poked in them in
> different
> >> configurations. Placing these cardboard masks in front of the grid would
> >> show different sets of non-missing data, without affecting the values
> stored
> >> on the paper behind them.
> >
> > Right - but here of course you are trying to explain the mask, and
> > this is Nathaniel's point, that in order to explain NAs, you have to
> > explain masks, and so, even at a basic level, the fusion of the two
> > ideas is obvious, and already confusing.  I mean this:
> >
> > a[3] = np.NA
> >
> > "Oh, so you just set the a[3] value to have some missing value code?"
> >
> > "Ah - no - in fact what I did was set a associated mask in position
> > a[3] so that you can't any longer see the previous value of a[3]"
> >
> > "Huh.  You mean I have a mask for every single value in order to be
> > able to blank out a[3]?  It looks like an assignment.  I mean, it
> > looks just like a[3] = 4.  But I guess it isn't?"
> >
> > "Er..."
> >
> > I think Nathaniel's point is a very good one - these are separate
> > ideas, np.NA and np.IGNORE, and a joint implementation is bound to
> > draw them together in the mind of the user.  Apart from anything
> > else, the user has to know that, if they want a single NA value in an
> > array, they have to add a mask of size array.shape in bytes.  They have
> > to know then, that NA is implemented by masking, and then the 'NA for
> > free by adding masking' idea breaks down and starts to feel like a
> > kludge.
> >
> > The counter argument is of course that, in time, the implementation of
> > NA with masking will seem as obvious and intuitive as, say,
> > broadcasting, and that we are just reacting from lack of experience
> > with the new API.
>
> However, no matter how used we get to this, people coming from almost
> any other tool (in particular R) will keep thinking it is
> counter-intuitive. Why set up a major semantic incompatibility that
> people then have to overcome in order to start using NumPy?
>

I'm not aware of a semantic incompatibility. I believe R doesn't support
views like NumPy does, so the things you have to do to see masking semantics
aren't even possible in R.
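
(For concreteness, a rough analogue of the view behaviour can be sketched with
today's numpy.ma, which by default wraps an ndarray without copying it. This is
only an illustration of the idea, not the proposed NEP API, and it assumes
numpy.ma's default soft mask:)

>>> data = np.array([1.0, 2.0, 3.0])      # the underlying buffer
>>> m = np.ma.masked_array(data)          # wrapper sharing that buffer (copy=False)
>>> m[1] = np.ma.masked                   # "assign NA": only the mask is touched
>>> m.mask[1]
True
>>> data[1]                               # a plain view of the same memory still sees it
2.0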

> I really don't see what's wrong with some more explicit API like
> a.mask[3] = True. "Explicit is better than implicit".
>

I agree, but initial feedback was that the way R deals with NA values is
very nice, and I've come to agree that it's worth emulating.

-Mark


>
> Dag Sverre
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 8:45 AM, Matthew Brett wrote:

> Hi,
>
> On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe  wrote:
> > On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett 
> > wrote:
> >>
> >> Hi,
> >>
> >> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith  wrote:
> >> ...
> >> > (You might think, what difference does it make if you *can* unmask an
> >> > item? Us missing data folks could just ignore this feature. But:
> >> > whatever we end up implementing is something that I will have to
> >> > explain over and over to different people, most of them not
> >> > particularly sophisticated programmers. And there's just no sensible
> >> > way to explain this idea that if you store some particular value, then
> >> > it replaces the old value, but if you store NA, then the old value is
> >> > still there.
> >>
> >> Ouch - yes.  No question, that is difficult to explain.   Well, I
> >> think the explanation might go like this:
> >>
> >> "Ah, yes, well, that's because in fact numpy records missing values by
> >> using a 'mask'.   So when you say `a[3] = np.NA', what you mean is,
> >> 'a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
> >>
> >> Is that fair?
> >
> > My favorite way of explaining it would be to have a grid of numbers
> written
> > on paper, then have several cardboards with holes poked in them in
> different
> > configurations. Placing these cardboard masks in front of the grid would
> > show different sets of non-missing data, without affecting the values
> stored
> > on the paper behind them.
>
> Right - but here of course you are trying to explain the mask, and
> this is Nathaniel's point, that in order to explain NAs, you have to
> explain masks, and so, even at a basic level, the fusion of the two
> ideas is obvious, and already confusing.  I mean this:
>
> a[3] = np.NA
>
> "Oh, so you just set the a[3] value to have some missing value code?"
>

I would answer "Yes, that's basically true." The abstraction works that way,
and there's no reason to confuse people with those implementation details
right off the bat. When you introduce a new user to floating point numbers,
it would seem odd to first point out that addition isn't associative. That
kind of detail is important when you're learning more about the system and
digging deeper.

I think it was in a Knuth book that I read the idea that the best teaching
is a series of lies that successively correct the previous lies.


> "Ah - no - in fact what I did was set a associated mask in position
> a[3] so that you can't any longer see the previous value of a[3]"
>
> "Huh.  You mean I have a mask for every single value in order to be
> able to blank out a[3]?  It looks like an assignment.  I mean, it
> looks just like a[3] = 4.  But I guess it isn't?"
>
> "Er..."
>
> I think Nathaniel's point is a very good one - these are separate
> ideas, np.NA and np.IGNORE, and a joint implementation is bound to
> draw them together in the mind of the user.


R jointly implements them with the na.rm=TRUE parameter, and that's our model
system for missing data.
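
(For concreteness, the R behaviour being taken as the model is that reductions
propagate NA unless asked to skip it: sum(x) gives NA, while sum(x, na.rm=TRUE)
drops the NAs. The nearest thing in released NumPy is NaN handling, sketched
below; the skipna= argument proposed above plays the same role:)

>>> x = np.array([1.0, np.nan, 3.0])      # NaN standing in for a missing value
>>> np.sum(x)
nan
>>> np.nansum(x)                          # analogous to R's na.rm=TRUE
4.0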


> Apart from anything
> else, the user has to know that, if they want a single NA value in an
> array, they have to add a mask of size array.shape in bytes.  They have
> to know then, that NA is implemented by masking, and then the 'NA for
> free by adding masking' idea breaks down and starts to feel like a
> kludge.
>
> The counter argument is of course that, in time, the implementation of
> NA with masking will seem as obvious and intuitive as, say,
> broadcasting, and that we are just reacting from lack of experience
> with the new API.
>

It will literally work the same as the implementation with NA dtypes, except
for the masking semantics, which require the extra step of taking views.


>
> Of course, that does happen, but here, unless I am mistaken, the
> primary drive to fuse NA and masking is because of ease of
> implementation.


That's not the case, and I've tried to give a slightly better justification
for this in my answer to Lluís' email.


> That doesn't necessarily mean that they don't go
> together - if something is easy to implement, sometimes it means it
> will also feel natural in use, but at least we might say that there is
> some risk of the implementation driving the API, and that that can
> lead to problems.
>

In the design process I'm doing, the implementation concerns are affecting
the interface concerns and vice versa, but the missing data semantics are
the main driver.

-Mark


>
> See you,
>
> Matthew
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] missing data discussion round 2

2011-06-29 Thread Mark Wiebe
On Wed, Jun 29, 2011 at 8:20 AM, Lluís  wrote:

> Matthew Brett writes:
>
> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
> >> the idea that the entry is still there, but we're just ignoring it.  Of
> >> course, that goes against common convention, but it might be easier to
> >> explain.
>
> > I think Nathaniel's point is that np.IGNORE is a different idea than
> > np.NA, and that is why joining the implementations can lead to
> > conceptual confusion.
>
> This is how I see it:
>
> >>> a = np.array([0, 1, 2], dtype=int)
> >>> a[0] = np.NA
> ValueError
> >>> e = np.array([np.NA, 1, 2], dtype=int)
> ValueError
> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
> >>> b[1] = np.NA
> >>> np.sum(b)
> np.NA
> >>> np.sum(b, skipna=True)
> 2
> >>> b.mask
> None
> >>> m[1] = np.NA
> >>> np.sum(m)
> 2
> >>> np.sum(m, skipna=True)
> 2
> >>> m.mask
> [False, False, True]
> >>> bm[1] = np.NA
> >>> np.sum(bm)
> 2
> >>> np.sum(bm, skipna=True)
> 2
> >>> bm.mask
> [False, False, True]
>
> So:
>
> * Mask takes precedence over bit pattern on element assignment. There's
>  still the question of how to assign a bit pattern NA when the mask is
>  active.
>
> * When using mask, elements are automagically skipped.
>
> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
>
> * When using bit pattern + mask, it might make sense to have the initial
>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
>  False, True]" and "np.sum(bm) == np.NA")
>

There seems to be a general idea that masks and NA bit patterns imply
particular differing semantics, something which I think is simply false.
Both NaN and Inf are implemented in hardware with the same idea as the NA
bit pattern, but they do not follow NA missing value semantics.

As far as I can tell, the only required difference between them is that NA
bit patterns must destroy the data. Nothing else. Everything on top of that
is a choice of API and interface mechanisms. I want them to behave exactly
the same except for that necessary difference, so that it will be possible
to use the *exact same Python code* with either approach.
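
(A minimal sketch of that one required difference, using today's numpy.ma and
NaN as stand-ins for the masked and bit-pattern NAs; note that numpy.ma uses
True to mean "masked", the opposite polarity of some examples above, and the
spelling differs from the proposed API:)

>>> a = np.ma.array([1.0, 2.0, 3.0])
>>> a[1] = np.ma.masked       # mask-based NA: the payload survives underneath
>>> a.mask[1] = False         # unmask again (default soft mask)
>>> a[1]
2.0
>>> b = np.array([1.0, 2.0, 3.0])
>>> b[1] = np.nan             # bit-pattern-style NA: the old value is destroyed
>>> b[1]
nan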

Say you're using NA dtypes, and suddenly you think, "what if I temporarily
treated these as NA too". Now you have to copy your whole array to avoid
destroying your data! The NA bit pattern didn't save you memory here... Say
you're using masks, and it turns out you didn't actually need masking
semantics. If they're different, you now have to do lots of code changes to
switch to NA dtypes!

-Mark



>
> Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

