Re: [Numpy-discussion] Missing data again

2012-03-15 Thread Nathaniel Smith
Hi Chuck,

I think I let my frustration get the better of me, and the message
below is too confrontational. I apologize.

I truly would like to understand where you're coming from on this,
though, so I'll try to make this more productive. My summary of points
that no-one has disagreed with yet is here:
  https://github.com/njsmith/numpy/wiki/NA-discussion-status
Of course, this means that there's lots that's left out. Instead of
getting into all those contentious details, I'll stick to just a few
basic questions that might let us get at least a bit of common
ground:
1) Do you disagree with anything that is stated there?
2) Do you feel like that document accurately summarises your basic
idea of what this feature is supposed to do (I assume under the
IGNORED heading)?

Thanks,
-- Nathaniel

On Wed, Mar 7, 2012 at 11:10 PM, Nathaniel Smith n...@pobox.com wrote:
 On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:


 On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote:
 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


 Maybe for float, for other things, no. And we have lots of other things.

 It would be easier to discuss this if you'd, like, discuss :-(. If you
 know of some advantage that masks have over bitpatterns when it comes
 to missing data, can you please share it, instead of just asserting
 it?

 Not that I'm immune... I perhaps should have been more explicit
 myself; when I said "performance characteristics", let me clarify that
 I was thinking of both speed (for floats) and memory (for
 most-but-not-all things).

 The
 performance is a strawman,

 How many users need to speak up to say that this is a serious problem
 they have with the current implementation before you stop calling it a
 strawman? Because when Wes says that it's not going to fly for his
 stats/econometrics cases, and the neuroimaging folk like Gary and Matt
 say it's not going to fly for their use cases... surely just waving
 that away is a bit dismissive?

 I'm not saying that we *have* to implement bitpatterns because
 performance is *the most important feature* -- I'm just saying, well,
 what I said. For *missing data use* cases, bitpatterns have better
 performance characteristics than masks. If we decide that these use
 cases are important, then we should take this into account and weigh
 it against other considerations. Maybe what you think is that these
 use cases shouldn't be the focus of this feature and it should focus
 on the ignored use cases instead? That would be a legitimate
 argument... but if that's what you want to say, say it, don't just
 dismiss your users!

 and it *isn't* easier to implement.

 If I thought bitpatterns would be easier to implement, I would have
 said so... What I said was that they're not harder. You have some
 extra complexity, mostly in casting, and some reduced complexity -- no
 need to allocate and manipulate the mask. (E.g., simple same-type
 assignments and slicing require special casing for masks, but not for
 bitpatterns.) In many places the complexity is identical -- printing
 routines need to check for either special bitpatterns or masked
 values, whatever. Ufunc loops need to either find the appropriate part
 of the mask, or create a temporary mask buffer by calling a dtype
 func, whatever. On net they seem about equivalent, complexity-wise.

 ...I assume you disagree with this analysis, since I've said it
 before, wrote up a sketch for how the implementation would work at the
 C level, etc., and you continue to claim that simplicity is a
 compelling advantage for the masked approach. But I still don't know
 why you think that :-(.

  Also, different folks adopt different values
  for 'missing' data, and distributing one or several masks along with the
  data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

  One inconvenience I have run into with the current API is that it should
  be easier to clear the mask from an ignored value without taking a new
  view or assigning known data. So maybe two types of masks (different
  payloads), or an additional flag could be helpful. The process of
  assigning masks could also be made a bit easier than using fancy
  indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then 

Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Pierre Haessig
Hi,

Thank you very much for your insights!

On 06/03/2012 21:59, Nathaniel Smith wrote:
 Right -- R has a very impoverished type system as compared to numpy.
 There's basically four types: "numeric" (meaning double precision
 float), "integer", "logical" (boolean), and "character" (string). And
 in practice the integer type is essentially unused, because R parses
 numbers like "1" as being floating point, not integer; the only way to
 get an integer value is to explicitly cast to it. Each of these types
 has a specific bit-pattern set aside for representing NA. And...
 that's it. It's very simple when it works, but also very limited.
I also suspected R to be less powerful in terms of types.
However, I think the fact that "it's very simple when it works" is
important to take into account. At the end of the day, when using all
the fanciness, it is not only about "can I have some NAs in my array?"
but also "how *easily* can I have some NAs in my array?". It's about
balancing the "how easy" and the "how powerful".

The ease of use is the reason for my concern about having separate
types nafloatNN and floatNN. Of course, I won't argue that not
breaking everything is even more important!

Coming back to Travis' proposition that "bit-pattern approaches to missing
data (*at least* for float64 and int32) need to be implemented", I
wonder what is the amount of extra work to go from nafloat64 to
nafloat32/16? Is there hardware support for NaN payloads with these
smaller floats? If not, or if it is too complicated, I feel it is
acceptable to say "it's too complicated" and fall back to masks. One may
have to choose between fancy types and fancy NAs...

Best,
Pierre



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 4:35 PM, Pierre Haessig pierre.haes...@crans.org wrote:
 Hi,

 Thank you very much for your insights!

 On 06/03/2012 21:59, Nathaniel Smith wrote:
 Right -- R has a very impoverished type system as compared to numpy.
 There's basically four types: "numeric" (meaning double precision
 float), "integer", "logical" (boolean), and "character" (string). And
 in practice the integer type is essentially unused, because R parses
 numbers like "1" as being floating point, not integer; the only way to
 get an integer value is to explicitly cast to it. Each of these types
 has a specific bit-pattern set aside for representing NA. And...
 that's it. It's very simple when it works, but also very limited.
 I also suspected R to be less powerful in terms of types.
 However, I think the fact that "it's very simple when it works" is
 important to take into account. At the end of the day, when using all
 the fanciness, it is not only about "can I have some NAs in my array?"
 but also "how *easily* can I have some NAs in my array?". It's about
 balancing the "how easy" and the "how powerful".

 The ease of use is the reason for my concern about having separate
 types nafloatNN and floatNN. Of course, I won't argue that not
 breaking everything is even more important!

It's a good point, I just don't see how we can really tell what the
trade-offs are at this point. You should bring this up again once more
of the big picture stuff is hammered out.

 Coming back to Travis' proposition that "bit-pattern approaches to missing
 data (*at least* for float64 and int32) need to be implemented", I
 wonder what is the amount of extra work to go from nafloat64 to
 nafloat32/16? Is there hardware support for NaN payloads with these
 smaller floats? If not, or if it is too complicated, I feel it is
 acceptable to say "it's too complicated" and fall back to masks. One may
 have to choose between fancy types and fancy NAs...

All modern floating point formats can represent NaNs with payloads, so
in principle there's no difficulty in supporting NA the same way for
all of them. If you're using float16 because you want to offload
computation to a GPU then I would test carefully before trusting the
GPU to handle NaNs correctly, and there may need to be a bit of care
to make sure that casts between these types properly map NAs to NAs,
but generally it should be fine.
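[Editor's illustration: the encoding Nathaniel describes is easy to sketch with plain NumPy bit-twiddling. This is a hedged example only -- the proposed NA dtypes never shipped -- and the payload value 1954 is merely the one R happens to use to tag NA_real_:]

```python
import numpy as np

# A float64 NaN is any value whose exponent bits are all ones and whose
# 52-bit mantissa is non-zero; the mantissa bits are the "payload".
PAYLOAD = 1954  # the payload R uses to distinguish NA_real_ from a plain NaN

na_bits = np.array([0x7FF0000000000000 + PAYLOAD], dtype=np.uint64)
na = na_bits.view(np.float64)

print(np.isnan(na[0]))        # True: to the FPU this is an ordinary NaN
mantissa = int(na.view(np.uint64)[0]) & ((1 << 52) - 1)
print(mantissa == PAYLOAD)    # True: bit-level copies keep NA distinguishable
```

The same construction works for float32 (23 mantissa bits) and float16 (10 mantissa bits); the payload space shrinks but never disappears, which is the point made above.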

-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.orgwrote:

 Hi,

 Thank you very much for your insights!

 On 06/03/2012 21:59, Nathaniel Smith wrote:
  Right -- R has a very impoverished type system as compared to numpy.
  There's basically four types: "numeric" (meaning double precision
  float), "integer", "logical" (boolean), and "character" (string). And
  in practice the integer type is essentially unused, because R parses
  numbers like "1" as being floating point, not integer; the only way to
  get an integer value is to explicitly cast to it. Each of these types
  has a specific bit-pattern set aside for representing NA. And...
  that's it. It's very simple when it works, but also very limited.
 I also suspected R to be less powerful in terms of types.
 However, I think the fact that "it's very simple when it works" is
 important to take into account. At the end of the day, when using all
 the fanciness, it is not only about "can I have some NAs in my array?"
 but also "how *easily* can I have some NAs in my array?". It's about
 balancing the "how easy" and the "how powerful".

 The ease of use is the reason for my concern about having separate
 types nafloatNN and floatNN. Of course, I won't argue that not
 breaking everything is even more important!

 Coming back to Travis' proposition that "bit-pattern approaches to missing
 data (*at least* for float64 and int32) need to be implemented", I
 wonder what is the amount of extra work to go from nafloat64 to
 nafloat32/16? Is there hardware support for NaN payloads with these
 smaller floats? If not, or if it is too complicated, I feel it is
 acceptable to say "it's too complicated" and fall back to masks. One may
 have to choose between fancy types and fancy NAs...


I'm in agreement here, and that was a major consideration in making a
'masked' implementation first. Also, different folks adopt different values
for 'missing' data, and distributing one or several masks along with the
data is another common practice.

One inconvenience I have run into with the current API is that it should be
easier to clear the mask from an ignored value without taking a new view
or assigning known data. So maybe two types of masks (different payloads),
or an additional flag could be helpful. The process of assigning masks
could also be made a bit easier than using fancy indexing.

Chuck


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Lluís
Charles R Harris writes:
[...]
 One inconvenience I have run into with the current API is that it should be
 easier to clear the mask from an ignored value without taking a new view or
 assigning known data.

AFAIR, the inability to directly access a mask attribute was intentional to
make bit-patterns and masks indistinguishable from the POV of the array user.

What's the workflow that leads you to un-ignore specific elements?


 So maybe two types of masks (different payloads), or an additional flag could
 be helpful.

Do you mean different NA values? If that's the case, I think it was taken into
account when implementing the current mechanisms (and was also mentioned in the
NEP), so that it could be supported by both bit-patterns and masks (as one of
the main design points was to make them indistinguishable in the common case).

I think the name was "parametrized dtypes".


 The process of assigning masks could also be made a bit easier than using
 fancy indexing.

I don't get what you mean here, sorry.

Do you mean here that this is too cumbersome to use?

 a[a > 5] = np.NA

(obviously oversimplified example where everything looks sufficiently simple :))
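[Editor's note: np.NA from the NEP never shipped in a NumPy release; for readers trying this today, a hedged sketch of the same operation with the existing numpy.ma module:]

```python
import numpy as np

# np.ma.masked plays the role that np.NA plays in the NEP syntax above.
x = np.arange(10)
a = np.ma.masked_array(x)
a[x > 5] = np.ma.masked      # ignore every element greater than 5

print(a.mask.sum())          # 4 elements (6, 7, 8, 9) are now masked
print(a.data[6])             # 6 -- the underlying datum is untouched
```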




Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 11:21 AM, Lluís xscr...@gmx.net wrote:

 Charles R Harris writes:
 [...]
 One inconvenience I have run into with the current API is that it should
 be easier to clear the mask from an ignored value without taking a new
 view or assigning known data.

 AFAIR, the inability to directly access a mask attribute was intentional to
 make bit-patterns and masks indistinguishable from the POV of the array
 user.

 What's the workflow that leads you to un-ignore specific elements?



Because they are not 'unknown', just (temporarily) 'ignored'. This might be
the case if you are experimenting with what happens if certain data is left
out of a fit. The current implementation tries to handle both these cases,
and can do so; I would just like the 'ignored' use to be more convenient
than it is.


  So maybe two types of masks (different payloads), or an additional flag
 could
  be helpful.

 Do you mean different NA values? If that's the case, I think it was taken
 into account when implementing the current mechanisms (and was also
 mentioned in the NEP), so that it could be supported by both bit-patterns
 and masks (as one of the main design points was to make them
 indistinguishable in the common case).


No, the mask as currently implemented is eight bits and can be extended to
handle different mask values, aka, payloads.


 I think the name was "parametrized dtypes".


They don't interest me in the least. But that is a whole different area of
discussion.



  The process of assigning masks could also be made a bit easier than using
  fancy indexing.

 I don't get what you mean here, sorry.


Suppose I receive a data set, say an HDF file, that also includes a mask.
I'd like to load the data and apply the mask directly without doing
something like

data[mask] = np.NA
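[Editor's illustration: numpy.ma already supports attaching an external mask directly, which gives a feel for what is being asked; a hedged sketch, since np.NA itself never shipped:]

```python
import numpy as np

# Attach an externally supplied mask (e.g. loaded alongside the data from
# the same HDF file) without overwriting any values.
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, True])

a = np.ma.masked_array(data, mask=mask)

# Because the values survive underneath, un-ignoring is just clearing
# a mask entry -- no sentinel ever clobbered the datum.
a.mask[1] = False
print(a[1])   # 2.0
```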


Do you mean here that this is too cumbersome to use?

 a[a > 5] = np.NA

 (obviously oversimplified example where everything looks sufficiently
 simple :))


Mostly speed and memory.

Chuck


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
charlesr.har...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org
 Coming back to Travis' proposition that "bit-pattern approaches to missing
 data (*at least* for float64 and int32) need to be implemented", I
 wonder what is the amount of extra work to go from nafloat64 to
 nafloat32/16? Is there hardware support for NaN payloads with these
 smaller floats? If not, or if it is too complicated, I feel it is
 acceptable to say "it's too complicated" and fall back to masks. One may
 have to choose between fancy types and fancy NAs...

 I'm in agreement here, and that was a major consideration in making a
 'masked' implementation first.

When it comes to missing data, bitpatterns can do everything that
masks can do, are no more complicated to implement, and have better
performance characteristics.

 Also, different folks adopt different values
 for 'missing' data, and distributing one or several masks along with the
 data is another common practice.

True, but not really relevant to the current debate, because you have
to handle such issues as part of your general data import workflow
anyway, and none of these is any more complicated no matter which
implementations are available.

 One inconvenience I have run into with the current API is that it should be
 easier to clear the mask from an ignored value without taking a new view
 or assigning known data. So maybe two types of masks (different payloads),
 or an additional flag could be helpful. The process of assigning masks could
 also be made a bit easier than using fancy indexing.

So this, uh... this was actually the whole goal of the alterNEP
design for masks -- making all this stuff easy for people (like you,
apparently?) that want support for ignored values, separately from
missing data, and want a nice clean API for it. Basically having a
separate .mask attribute which was an ordinary, assignable array
broadcastable to the attached array's shape. Nobody seemed interested
in talking about it much then but maybe there's interest now?
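[Editor's illustration: numpy.ma already behaves roughly as the alterNEP sketch describes, which conveys the intended feel; a hedged example, not the alterNEP implementation itself:]

```python
import numpy as np

# .mask is an ordinary, assignable array: setting and clearing entries
# ignores/un-ignores elements without ever touching the data.
a = np.ma.masked_array(np.arange(6.0))
a.mask = [True, False, False, True, False, False]

a.mask[0] = False        # un-ignore element 0
print(a[0])              # 0.0 -- the original datum was never destroyed
print(a.mask.sum())      # 1 element still ignored
```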

-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Charles R Harris
On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote:

 On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org
 
  Coming back to Travis' proposition that "bit-pattern approaches to missing
  data (*at least* for float64 and int32) need to be implemented", I
  wonder what is the amount of extra work to go from nafloat64 to
  nafloat32/16? Is there hardware support for NaN payloads with these
  smaller floats? If not, or if it is too complicated, I feel it is
  acceptable to say "it's too complicated" and fall back to masks. One may
  have to choose between fancy types and fancy NAs...
 
  I'm in agreement here, and that was a major consideration in making a
  'masked' implementation first.

 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


Maybe for float, for other things, no. And we have lots of other things. The
performance is a strawman, and it *isn't* easier to implement.


  Also, different folks adopt different values
  for 'missing' data, and distributing one or several masks along with the
  data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

  One inconvenience I have run into with the current API is that it should
  be easier to clear the mask from an ignored value without taking a new
  view or assigning known data. So maybe two types of masks (different
  payloads), or an additional flag could be helpful. The process of
  assigning masks could also be made a bit easier than using fancy
  indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then but maybe there's interest now?


Come off it, Nathaniel, the problem is minor and fixable. The intent of the
initial implementation was to discover such things. These things are less
accessible with the current API *precisely* because of the feedback from R
users. It didn't start that way.

We now have something to evolve into what we want. That is a heck of a lot
more useful than endless discussion.

Chuck


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Benjamin Root
On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote:

 On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig pierre.haes...@crans.org
 
  Coming back to Travis' proposition that "bit-pattern approaches to missing
  data (*at least* for float64 and int32) need to be implemented", I
  wonder what is the amount of extra work to go from nafloat64 to
  nafloat32/16? Is there hardware support for NaN payloads with these
  smaller floats? If not, or if it is too complicated, I feel it is
  acceptable to say "it's too complicated" and fall back to masks. One may
  have to choose between fancy types and fancy NAs...
 
  I'm in agreement here, and that was a major consideration in making a
  'masked' implementation first.

 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


Not true.  Bitpatterns inherently destroy the data, while masks do not.
For matplotlib, we cannot use bitpatterns because they could overwrite user
data (or we would have to copy the data).  I would imagine other extension
writers would have similar issues when they need to play around with input
data in a safe manner.

Also, I doubt that the performance characteristics for strings and integers
are the same as it is for masks.

Ben Root


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Matthew Brett
Hi,

On Wed, Mar 7, 2012 at 11:37 AM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote:

 On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessig
  pierre.haes...@crans.org
  Coming back to Travis' proposition that "bit-pattern approaches to missing
  data (*at least* for float64 and int32) need to be implemented", I
  wonder what is the amount of extra work to go from nafloat64 to
  nafloat32/16? Is there hardware support for NaN payloads with these
  smaller floats? If not, or if it is too complicated, I feel it is
  acceptable to say "it's too complicated" and fall back to masks. One may
  have to choose between fancy types and fancy NAs...
 
  I'm in agreement here, and that was a major consideration in making a
  'masked' implementation first.

 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


 Maybe for float, for other things, no. And we have lots of other things. The
 performance is a strawman, and it *isn't* easier to implement.


  Also, different folks adopt different values
  for 'missing' data, and distributing one or several masks along with the
  data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

  One inconvenience I have run into with the current API is that it should
  be easier to clear the mask from an ignored value without taking a new
  view or assigning known data. So maybe two types of masks (different
  payloads), or an additional flag could be helpful. The process of
  assigning masks could also be made a bit easier than using fancy
  indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then but maybe there's interest now?


 Come off it, Nathaniel, the problem is minor and fixable. The intent of the
 initial implementation was to discover such things. These things are less
 accessible with the current API *precisely* because of the feedback from R
 users. It didn't start that way.

 We now have something to evolve into what we want. That is a heck of a lot
 more useful than endless discussion.

The endless discussion is for the following reason:

- The discussion was never adequately resolved.

The discussion was never adequately resolved because there was not
enough work done to understand the various arguments.   In particular,
you've several times said things that indicate to me, as to Nathaniel,
that you either have not read or have not understood the points that
Nathaniel was making.

Travis' recent email - to me - also indicates that there is still a
genuine problem here that has not been adequately explored.

There is no future in trying to stop discussion, and trying to do so
will only prolong it and make it less useful.  It will make the
discussion - endless.

If you want to help - read the alterNEP, respond to it directly, and
further the discussion by engaged debate.

Best,

Matthew


Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Eric Firing
On 03/07/2012 09:26 AM, Nathaniel Smith wrote:
 On Wed, Mar 7, 2012 at 5:17 PM, Charles R Harris
 charlesr.har...@gmail.com  wrote:
 On Wed, Mar 7, 2012 at 9:35 AM, Pierre Haessigpierre.haes...@crans.org
 Coming back to Travis' proposition that "bit-pattern approaches to missing
 data (*at least* for float64 and int32) need to be implemented", I
 wonder what is the amount of extra work to go from nafloat64 to
 nafloat32/16? Is there hardware support for NaN payloads with these
 smaller floats? If not, or if it is too complicated, I feel it is
 acceptable to say "it's too complicated" and fall back to masks. One may
 have to choose between fancy types and fancy NAs...

 I'm in agreement here, and that was a major consideration in making a
 'masked' implementation first.

 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.

 Also, different folks adopt different values
 for 'missing' data, and distributing one or several masks along with the
 data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

 One inconvenience I have run into with the current API is that it should be
 easier to clear the mask from an ignored value without taking a new view
 or assigning known data. So maybe two types of masks (different payloads),
 or an additional flag could be helpful. The process of assigning masks could
 also be made a bit easier than using fancy indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then but maybe there's interest now?

In other words, good low-level support for numpy.ma functionality?  With 
a migration path so that a separate numpy.ma might wither away?  Yes, 
there is interest; this is exactly what I think is needed for my own 
style of applications (which I think are common at least in geoscience), 
and for matplotlib.  The question is how to achieve it as simply and 
cleanly as possible while also satisfying the needs of the R users, and 
while making it easy for matplotlib, for example, to handle *any* 
reasonable input: ma, other masking, nan, or NA-bitpattern.

It may be that a rather pragmatic approach to implementation will prove 
better than a highly idealized set of data models.  Or, it may be that a 
dual approach is best, in which the "flag value" missing data 
implementation is tightly bound to the R model and the mask 
implementation is explicitly designed for the numpy.ma model. In any 
case, a reasonable level of agreement on the goals is needed.  I presume 
Travis's involvement will facilitate a clarification of the goals and of 
the implementation; and I expect that much of Mark's work will end up 
serving well, even if much needs to be added and the API evolves 
considerably.

Eric


 -- Nathaniel



Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Pierre Haessig
Hi,
On 07/03/2012 20:57, Eric Firing wrote:
 In other words, good low-level support for numpy.ma functionality?
Coming back to *existing* ma support, I was just wondering whether it
was now possible to np.save a masked array.
(I'm using numpy 1.5)
In the end, this is the most annoying problem I have with the existing
ma module which otherwise is pretty useful to me. I'm happy not to need
to process 100% of my data though.

Best,
Pierre





Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Eric Firing
On 03/07/2012 11:15 AM, Pierre Haessig wrote:
 Hi,
 On 07/03/2012 20:57, Eric Firing wrote:
 In other words, good low-level support for numpy.ma functionality?
 Coming back to *existing* ma support, I was just wondering whether it
 was now possible to np.save a masked array.
 (I'm using numpy 1.5)

No, not with the mask preserved.  This is one of the improvements I am 
hoping for with the upcoming missing data work.

Eric
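[Editor's illustration: until mask-preserving save support lands, a common workaround is to store the data and the mask as separate arrays and reassemble on load; a hedged sketch:]

```python
import os
import tempfile

import numpy as np

a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# np.save would silently drop the mask, so store both pieces explicitly.
path = os.path.join(tempfile.mkdtemp(), 'arr.npz')
np.savez(path, data=a.data, mask=np.ma.getmaskarray(a))

f = np.load(path)
restored = np.ma.masked_array(f['data'], mask=f['mask'])
print((restored.mask == np.ma.getmaskarray(a)).all())   # True: mask round-tripped
```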

 In the end, this is the most annoying problem I have with the existing
 ma module which otherwise is pretty useful to me. I'm happy not to need
 to process 100% of my data though.

 Best,
 Pierre







Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith n...@pobox.com wrote:
 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


 Maybe for float, for other things, no. And we have lots of other things.

It would be easier to discuss this if you'd, like, discuss :-(. If you
know of some advantage that masks have over bitpatterns when it comes
to missing data, can you please share it, instead of just asserting
it?

Not that I'm immune... I perhaps should have been more explicit
myself, when I said performance characteristics, let me clarify that
I was thinking of both speed (for floats) and memory (for
most-but-not-all things).

 The
 performance is a strawman,

How many users need to speak up to say that this is a serious problem
they have with the current implementation before you stop calling it a
strawman? Because when Wes says that it's not going to fly for his
stats/econometics cases, and the neuroimaging folk like Gary and Matt
say it's not going to fly for their use cases... surely just waving
that away is a bit dismissive?

I'm not saying that we *have* to implement bitpatterns because
performance is *the most important feature* -- I'm just saying, well,
what I said. For *missing data use* cases, bitpatterns have better
performance characteristics than masks. If we decide that these use
cases are important, then we should take this into account and weigh
it against other considerations. Maybe what you think is that these
use cases shouldn't be the focus of this feature and it should focus
on the ignored use cases instead? That would be a legitimate
argument... but if that's what you want to say, say it, don't just
dismiss your users!
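To put a rough number on the memory side of that claim, a quick
back-of-the-envelope sketch (the actual masked implementation might pack
masks differently, so treat the 1-byte-per-element figure as an assumption):

```python
import numpy as np

n = 1_000_000
data = np.empty(n, dtype=np.float64)

# Bitpattern NA: the NA marker lives inside the 8 data bytes per element.
bitpattern_bytes = data.nbytes

# Mask-based NA: a separate boolean mask, one extra byte per element.
mask = np.empty(n, dtype=np.bool_)
masked_bytes = data.nbytes + mask.nbytes  # ~12% overhead for float64
```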

 and it *isn't* easier to implement.

If I thought bitpatterns would be easier to implement, I would have
said so... What I said was that they're not harder. You have some
extra complexity, mostly in casting, and some reduced complexity -- no
need to allocate and manipulate the mask. (E.g., simple same-type
assignments and slicing require special casing for masks, but not for
bitpatterns.) In many places the complexity is identical -- printing
routines need to check for either special bitpatterns or masked
values, whatever. Ufunc loops need to either find the appropriate part
of the mask, or create a temporary mask buffer by calling a dtype
func, whatever. On net they seem about equivalent, complexity-wise.

...I assume you disagree with this analysis, since I've said it
before, wrote up a sketch for how the implementation would work at the
C level, etc., and you continue to claim that simplicity is a
compelling advantage for the masked approach. But I still don't know
why you think that :-(.

  Also, different folks adopt different values
  for 'missing' data, and distributing one or several masks along with the
  data is another common practice.

 True, but not really relevant to the current debate, because you have
 to handle such issues as part of your general data import workflow
 anyway, and none of these is any more complicated no matter which
 implementations are available.

  One inconvenience I have run into with the current API is that it should
  be
  easier to clear the mask from an ignored value without taking a new
  view
  or assigning known data. So maybe two types of masks (different
  payloads),
  or an additional flag could be helpful. The process of assigning masks
  could
  also be made a bit easier than using fancy indexing.

 So this, uh... this was actually the whole goal of the alterNEP
 design for masks -- making all this stuff easy for people (like you,
 apparently?) that want support for ignored values, separately from
 missing data, and want a nice clean API for it. Basically having a
 separate .mask attribute which was an ordinary, assignable array
 broadcastable to the attached array's shape. Nobody seemed interested
 in talking about it much then but maybe there's interest now?
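For what it's worth, numpy.ma already exposes the mask as an ordinary
assignable array, so the "clear an ignored value without reassigning data"
workflow described above looks roughly like this today (a sketch with the
existing ma API):

```python
import numpy as np
import numpy.ma as ma

a = ma.array([1.0, 2.0, 3.0])
a[1] = ma.masked        # ignore a value; the underlying data stays intact
ignored = a.mask.copy() # mask is a plain boolean array you can inspect
a.mask[1] = False       # clear the ignore flag -- no need to reassign data
```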


 Come off it, Nathaniel, the problem is minor and fixable. The intent of the
 initial implementation was to discover such things.

Implementation can be wonderful, I absolutely agree. But you
understand that I'd be more impressed by this example if your
discovery weren't something I had been arguing for since before the
implementation began :-).

 These things are less
 accessible with the current API *precisely* because of the feedback from R
 users. It didn't start that way.

 We now have something to evolve into what we want. That is a heck of a lot
 more useful than endless discussion.

No, you are still missing the point completely! There is no what *we*
want, because what you want is different than what I want. The
masking stuff in the alterNEP was an attempt to give people like you
who wanted ignored support what they wanted, and the bitpattern
stuff was to give people like me who wanted missing data support what
we wanted.

Re: [Numpy-discussion] Missing data again

2012-03-07 Thread Nathaniel Smith
On Wed, Mar 7, 2012 at 7:39 PM, Benjamin Root ben.r...@ou.edu wrote:
 On Wed, Mar 7, 2012 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote:
 When it comes to missing data, bitpatterns can do everything that
 masks can do, are no more complicated to implement, and have better
 performance characteristics.


  Not true.  bitpatterns inherently destroy the data, while masks do not.

Yes, that's why I only wrote that this is true for missing data, not
in general :-). If you have data that is being destroyed, then that's
not missing data, by definition. We don't have consensus yet on
whether that's the use case we are aiming for, but it's the one that
Pierre was worrying about.

 For matplotlib, we can not use bitpatterns because it could over-write user
 data (or we have to copy the data).  I would imagine other extension writers
 would have similar issues when they need to play around with input data in a
 safe manner.

Right. You clearly need some sort of masking, either an explicit mask
array that you keep somewhere, or one that gets attached to the
underlying ndarray in some non-destructive way.
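For example, with the existing numpy.ma tools the mask lives alongside the
caller's array, which is never modified (a minimal sketch of the matplotlib
use case):

```python
import numpy as np
import numpy.ma as ma

user_data = np.array([1.0, -5.0, 3.0])

# Mask out invalid points for plotting without touching the user's array:
plot_data = ma.masked_less(user_data, 0.0)
```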

 Also, I doubt that the performance characteristics for strings and integers
 are the same as it is for masks.

Not sure what you mean by this, but I'd be happy to hear more.

-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Pierre Haessig
Hi Mark,

I went through the NA NEP a few days ago, but only too quickly so that
my question is probably a rather dumb one. It's about the usability of
bitpattern-based NAs, based on your recent post:

Le 03/03/2012 22:46, Mark Wiebe a écrit :
 Also, here's a thought for the usability of NA-float64. As much as
 global state is a bad idea, something which determines whether
 implicit float dtypes are NA-float64 or float64 could help. In
 IPython, pylab mode would default to float64, and statlab or
 pystat would default to NA-float64. One way to write this might be:

 >>> np.set_default_float(np.nafloat64)
 >>> np.array([1.0, 2.0, 3.0])
 array([ 1.,  2.,  3.], dtype=nafloat64)
 >>> np.set_default_float(np.float64)
 >>> np.array([1.0, 2.0, 3.0])
 array([ 1.,  2.,  3.], dtype=float64)

Q: Is it an *absolute* necessity to have two separate dtypes nafloatNN
and floatNN to enable NA bitpattern storage?

From a potential user perspective, I feel it would be nice to have NA
and non-NA cases look as similar as possible. Your code example is
particularly striking: two different dtypes to store (from a user
perspective) the exact same content! If this *could* be avoided, it
would be great...

I don't know how the NA machinery works in R. Does it work with a
kind of nafloat64 all the time, or is there some type-inference
mechanism involved in choosing the appropriate type?

Best,
Pierre





Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Mark Wiebe
Hi Pierre,

On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig pierre.haes...@crans.orgwrote:

 Hi Mark,

 I went through the NA NEP a few days ago, but only too quickly so that
 my question is probably a rather dumb one. It's about the usability of
 bitpattern-based NAs, based on your recent post:

 Le 03/03/2012 22:46, Mark Wiebe a écrit :
  Also, here's a thought for the usability of NA-float64. As much as
  global state is a bad idea, something which determines whether
  implicit float dtypes are NA-float64 or float64 could help. In
  IPython, pylab mode would default to float64, and statlab or
  pystat would default to NA-float64. One way to write this might be:
 
  >>> np.set_default_float(np.nafloat64)
  >>> np.array([1.0, 2.0, 3.0])
  array([ 1.,  2.,  3.], dtype=nafloat64)
  >>> np.set_default_float(np.float64)
  >>> np.array([1.0, 2.0, 3.0])
  array([ 1.,  2.,  3.], dtype=float64)

 Q: Is it an *absolute* necessity to have two separate dtypes nafloatNN
 and floatNN to enable NA bitpattern storage?

 From a potential user perspective, I feel it would be nice to have NA
 and non-NA cases look as similar as possible. Your code example is
 particularly striking: two different dtypes to store (from a user
 perspective) the exact same content! If this *could* be avoided, it
 would be great...


The biggest reason to keep the two types separate is performance. The
straight float dtypes map directly to hardware floating-point operations,
which can be very fast. The NA-float dtypes have to use additional logic to
handle the NA values correctly. NA is treated as a particular NaN, and if
the hardware float operations were used directly, NA would turn into NaN.
This additional logic usually means more branches, so is slower.

One possibility we could consider is to automatically convert an array's
dtype from float64 to nafloat64 the first time an NA is assigned. This
would have good performance when there are no NAs, but would transparently
switch on NA support when it's needed.
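As a toy sketch of that switch-on-first-NA idea -- np.nafloat64 does not
exist, so every name below is hypothetical and NaN merely stands in for the
NA bitpattern:

```python
import numpy as np

NA = object()  # hypothetical NA singleton, for illustration only

class AutoNAArray:
    """Sketch: plain float64 storage (fast hardware path) until the first
    NA is assigned, then flip to an 'NA-aware' mode.  A real implementation
    would swap the dtype to nafloat64; here a flag plus NaN simulates it."""

    def __init__(self, values):
        self.data = np.asarray(values, dtype=np.float64)
        self.na_aware = False  # stays False on the no-NA fast path

    def __setitem__(self, idx, value):
        if value is NA:
            self.na_aware = True      # one-time, transparent "dtype switch"
            self.data[idx] = np.nan   # stand-in for the NA bitpattern
        else:
            self.data[idx] = value

a = AutoNAArray([1.0, 2.0, 3.0])
a[1] = NA  # first NA assignment silently switches on NA support
```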


 I don't know how the NA machinery works in R. Does it work with a
 kind of nafloat64 all the time, or is there some type-inference
 mechanism involved in choosing the appropriate type?


My understanding of R is that it works with the nafloat64 for all its
operations, yes.

Cheers,
Mark


 Best,
 Pierre




Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io wrote:
 Hi all,

Hi Travis,

Thanks for bringing this back up.

Have you looked at the summary from the last thread?
  https://github.com/njsmith/numpy/wiki/NA-discussion-status
The goal was to try and at least work out what points we all *could*
agree on, to have some common footing for further discussion. I won't
copy the whole thing here, but I'd summarize the state as:
  -- It's pretty clear that there are two fairly different conceptual
models/use cases in play here. For one of them (R-style missing data
cases) it's pretty clear what the desired semantics would be. For the
other (temporary ignored values) there's still substantive
disagreement.
  -- We *haven't* yet established what we want numpy to actually support.

IMHO the critical next step is this latter one -- maybe we want to
fully support both use cases. Maybe it's really only one of them
that's worth trying to support in the numpy core right now. Maybe it's
just one of them, but it's worth doing so thoroughly that it should
have multiple implementations. Or whatever.

I fear that if we don't talk about these big picture questions and
just wade directly back into round-and-round arguments about API
details then we'll never get anywhere.

[...]
 Because it is slated to go into release 1.7, we need to re-visit the masked 
 array discussion again.    The NEP process is the appropriate one and I'm 
 glad we are taking that route for these discussions.   My goal is to get 
 consensus in order for code to get into NumPy (regardless of who writes the 
 code).    It may be that we don't come to a consensus (reasonable and 
 intelligent people can disagree on things --- look at the coming 
 election...).   We can represent different parts of what is fortunately a 
 very large user-base of NumPy users.

 First of all, I want to be clear that I think there is much great work that 
 has been done in the current missing data code.  There are some nice features 
 in the where clause of the ufunc and the machinery for the iterator that 
 allows re-using ufunc loops that are not re-written to check for missing 
 data.   I'm sure there are other things as well that I'm not quite aware of 
 yet.    However, I don't think the API presented to the numpy user presently 
 is the correct one for NumPy 1.X.

 A few particulars:

        * the reduction operations need to default to skipna --- this is the 
 most common use case which has been re-inforced again to me today by a new 
 user to Python who is using masked arrays presently

This is one of the points where the two conceptual models disagree
(see also Skipper's point down-thread). If you have missing data,
then propagation has to be the default -- the sum of 1, 2, and
I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there
but you've asked numpy to temporarily ignore it, then, well, duh, of
course it should ignore it.
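To see the two defaults side by side with today's tools (NaN standing in
for NA, since that's the closest existing analogue):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan])  # NaN as a stand-in for a missing value

propagated = np.sum(x)     # missing-data semantics: the total is unknown
skipped = np.nansum(x)     # ignored semantics: leave the value out -> 3.0
```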

        * the mask needs to be visible to the user if they use that approach 
 to missing data (people should be able to get a hold of the mask and work 
 with it in Python)

This is also a point where the two conceptual models disagree.

Actually this is one of the original arguments we made against the NEP
design -- that if you want missing data, then having a mask at all is
counterproductive, and if you are ignoring data, then of course it
should be easy to manipulate the ignore mask. The rationale for the
current design is to compromise between these two approaches -- there
is a mask, but it's hidden behind a curtain. Mostly. (This may be a
compromise in the Solomonic sense.)

        * bit-pattern approaches to missing data (at least for float64 and 
 int32) need to be implemented.

        * there should be some way when using masks (even if it's hidden 
 from most users) for missing data to separate the low-level ufunc operation 
 from the operation
           on the masks...

I don't understand what this means.

 I have heard from several users that they will *not use the missing data* in 
 NumPy as currently implemented, and I can now see why.    For better or for 
 worse, my approach to software is generally very user-driven and very 
 pragmatic.  On the other hand, I'm also a mathematician and appreciate the 
 cognitive compression that can come out of well-formed structure.    
 None-the-less, I'm an *applied* mathematician and am ultimately motivated by 
 applications.

 I will get a hold of the NEP and spend some time with it to discuss some of 
 this in that document.   This will take several weeks (as PyCon is next week 
 and I have a tutorial I'm giving there).    For now, I do not think 1.7 can 
 be released unless the masked array is labeled *experimental*.

In project management terms, I see three options:
1) Put a big warning label on the functionality and leave it for now
(If this option is given, np.asarray returns a masked array. NOTE: IN
THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY
WEASELS. NO GUARANTEES.)

Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Tue, Mar 6, 2012 at 4:38 PM, Mark Wiebe mwwi...@gmail.com wrote:
 On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig pierre.haes...@crans.org
 wrote:
 From a potential user perspective, I feel it would be nice to have NA
 and non-NA cases look as similar as possible. Your code example is
 particularly striking: two different dtypes to store (from a user
 perspective) the exact same content! If this *could* be avoided, it
 would be great...

 The biggest reason to keep the two types separate is performance. The
 straight float dtypes map directly to hardware floating-point operations,
 which can be very fast. The NA-float dtypes have to use additional logic to
 handle the NA values correctly. NA is treated as a particular NaN, and if
 the hardware float operations were used directly, NA would turn into NaN.
 This additional logic usually means more branches, so is slower.

Actually, no -- hardware float operations preserve NA-as-NaN. You
might well need to be careful around more exotic code like optimized
BLAS kernels, but all the basic ufuncs should Just Work at full speed.
Demo:

>>> def hexify(x): return hex(np.float64(x).view(np.int64))
>>> hexify(np.nan)
'0x7ff8000000000000L'
# IIRC this is R's NA bitpattern (presumably 1974 is someone's birthday)
>>> NA = np.int64(0x7ff8000000000000 + 1974).view(np.float64)
# It is an NaN...
>>> NA
nan
# But it has a distinct bitpattern:
>>> hexify(NA)
'0x7ff80000000007b6L'
# Like any NaN, it propagates through floating point operations:
>>> NA + 3
nan
# But, critically, so does the bitpattern; ordinary Python + is
# returning NA on this operation:
>>> hexify(NA + 3)
'0x7ff80000000007b6L'

This is how R does it, which is more evidence that this actually works
on real hardware.

There is one place where it fails. In a binary operation with *two*
NaN values, there's an ambiguity about which payload should be
returned. IEEE754 recommends just returning the first one. This means
that NA + NaN = NA, NaN + NA = NaN. This is ugly, but it's an obscure
case that nobody cares about, so it's probably worth it for the speed
gain. (In fact, if you type those two expressions at the R prompt,
then that's what you get, and I can't find any reference to anyone
even noticing this.)

 I don't know how the NA machinery works in R. Does it work with a
 kind of nafloat64 all the time, or is there some type-inference
 mechanism involved in choosing the appropriate type?

 My understanding of R is that it works with the nafloat64 for all its
 operations, yes.

Right -- R has a very impoverished type system as compared to numpy.
There's basically four types: numeric (meaning double precision
float), integer, logical (boolean), and character (string). And
in practice the integer type is essentially unused, because R parses
numbers like 1 as being floating point, not integer; the only way to
get an integer value is to explicitly cast to it. Each of these types
has a specific bit-pattern set aside for representing NA. And...
that's it. It's very simple when it works, but also very limited.

I'm still skeptical that we could make the floating point types
NA-aware by default -- until we have an implementation in hand, I'm
nervous there'd be some corner case that broke everything. (Maybe
ufuncs are fine but np.dot has an unavoidable overhead, or maybe it
would mess up casting from float types to non-NA-aware types, etc.)
But who knows. Probably not something we can really make a meaningful
decision about yet.

-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Ralf Gommers
On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith n...@pobox.com wrote:

 On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io
 wrote:
  Hi all,

 Hi Travis,

 Thanks for bringing this back up.

 Have you looked at the summary from the last thread?
  https://github.com/njsmith/numpy/wiki/NA-discussion-status


Re-reading that summary and the main documents and threads linked from it,
I could find either examples of statistical software that treats missing
and ignored data explicitly separately, or links to relevant literature.
Those would probably help the discussion a lot.

The goal was to try and at least work out what points we all *could*
 agree on, to have some common footing for further discussion. I won't
 copy the whole thing here, but I'd summarize the state as:
  -- It's pretty clear that there are two fairly different conceptual
 models/use cases in play here. For one of them (R-style missing data
 cases) it's pretty clear what the desired semantics would be. For the
 other (temporary ignored values) there's still substantive
 disagreement.
  -- We *haven't* yet established what we want numpy to actually support.

 IMHO the critical next step is this latter one -- maybe we want to
 fully support both use cases. Maybe it's really only one of them
 that's worth trying to support in the numpy core right now. Maybe it's
 just one of them, but it's worth doing so thoroughly that it should
 have multiple implementations. Or whatever.

 I fear that if we don't talk about these big picture questions and
 just wade directly back into round-and-round arguments about API
 details then we'll never get anywhere.

 [...]
  Because it is slated to go into release 1.7, we need to re-visit the
 masked array discussion again.The NEP process is the appropriate one
 and I'm glad we are taking that route for these discussions.   My goal is
 to get consensus in order for code to get into NumPy (regardless of who
 writes the code).It may be that we don't come to a consensus
 (reasonable and intelligent people can disagree on things --- look at the
 coming election...).   We can represent different parts of what is
 fortunately a very large user-base of NumPy users.
 
  First of all, I want to be clear that I think there is much great work
 that has been done in the current missing data code.  There are some nice
 features in the where clause of the ufunc and the machinery for the
 iterator that allows re-using ufunc loops that are not re-written to check
 for missing data.   I'm sure there are other things as well that I'm not
 quite aware of yet.However, I don't think the API presented to the
 numpy user presently is the correct one for NumPy 1.X.
 
  A few particulars:
 
 * the reduction operations need to default to skipna --- this
 is the most common use case which has been re-inforced again to me today by
 a new user to Python who is using masked arrays presently

 This is one of the points where the two conceptual models disagree
 (see also Skipper's point down-thread). If you have missing data,
 then propagation has to be the default -- the sum of 1, 2, and
 I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there
 but you've asked numpy to temporarily ignore it, then, well, duh, of
 course it should ignore it.

 * the mask needs to be visible to the user if they use that
 approach to missing data (people should be able to get a hold of the mask
 and work with it in Python)

 This is also a point where the two conceptual models disagree.

 Actually this is one of the original arguments we made against the NEP
 design -- that if you want missing data, then having a mask at all is
 counterproductive, and if you are ignoring data, then of course it
 should be easy to manipulate the ignore mask. The rationale for the
 current design is to compromise between these two approaches -- there
 is a mask, but it's hidden behind a curtain. Mostly. (This may be a
 compromise in the Solomonic sense.)

 * bit-pattern approaches to missing data (at least for float64
 and int32) need to be implemented.
 
 * there should be some way when using masks (even if it's
 hidden from most users) for missing data to separate the low-level ufunc
 operation from the operation
on the masks...

 I don't understand what this means.

  I have heard from several users that they will *not use the missing
 data* in NumPy as currently implemented, and I can now see why.For
 better or for worse, my approach to software is generally very user-driven
 and very pragmatic.  On the other hand, I'm also a mathematician and
 appreciate the cognitive compression that can come out of well-formed
 structure.None-the-less, I'm an *applied* mathematician and am
 ultimately motivated by applications.
 
  I will get a hold of the NEP and spend some time with it to discuss some
 of this in that document.   This will take several weeks (as PyCon is next
 week and I have a tutorial I'm giving there).

Re: [Numpy-discussion] Missing data again

2012-03-06 Thread Nathaniel Smith
On Tue, Mar 6, 2012 at 9:14 PM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:
 On Tue, Mar 6, 2012 at 9:25 PM, Nathaniel Smith n...@pobox.com wrote:
 On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant tra...@continuum.io
 wrote:
  Hi all,

 Hi Travis,

 Thanks for bringing this back up.

 Have you looked at the summary from the last thread?
  https://github.com/njsmith/numpy/wiki/NA-discussion-status

 Re-reading that summary and the main documents and threads linked from it, I
 could find either examples of statistical software that treats missing and
 ignored data explicitly separately, or links to relevant literature. Those
 would probably help the discussion a lot.

(I think you mean couldn't find?)

I'm not aware of any software that supports the IGNORED concept at
all, whether in combination with missing data or not. np.ma is
probably the closest example. I think we'd be breaking new ground
there. This is also probably why it is less clear how it should work
:-).

IIUC, the basic reason that people want IGNORED in the core is that it
provides convenience and syntactic sugar for efficient in place
operation on subsets of large arrays. So there are actually two parts
there -- the efficient operation, and the convenience/syntactic sugar.
The key feature for efficient operation is the where= feature, which
is not controversial at all. So, there's an argument that for now we
should focus on where=, give people some time to work with it, and
then use that experience to decide what kind of convenience/sugar
would be useful, if any. But, that's just my own idea; I definitely
can't claim any consensus on it.
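For reference, the where= feature in question looks roughly like this in
the 1.7-era ufunc API (a sketch; treat the exact spelling as provisional):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
ignored = np.array([False, True, False, True])

out = np.zeros_like(x)
# Operate only where the element is NOT ignored; ignored positions keep
# whatever `out` already held (here, the zeros it was initialized with).
np.multiply(x, 10.0, out=out, where=~ignored)
```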

 In project management terms, I see three options:
 1) Put a big warning label on the functionality and leave it for now
 (If this option is given, np.asarray returns a masked array. NOTE: IN
 THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY
 WEASELS. NO GUARANTEES.)

 I've opened http://projects.scipy.org/numpy/ticket/2072 for that.

Cool, thanks.

 Assuming
 we stick with this option, I'd appreciate it if you could check in the first
 beta that comes out whether or not the warnings are obvious enough and in
 all the right places. There probably won't be weasels though:)

Of course. I've added myself to the CC list. (Err, if the beta won't
be for a bit, though, then please remind me if you remember? I'm
juggling a lot of balls right now.)

 2) Move the code back out of mainline and into a branch until until
 there's consensus.
 3) Hold up the release until this is all sorted.

 I come from the project-management school that says you should always
 have a releasable mainline, keep unready code in branches, and never
 hold up the release for features, so (2) seems obvious to me.

 While it may sound obvious, I hope you've understood why in practice it's
 not at all obvious and why you got such strong reactions to your proposal of
 taking out all that code. If not, just look at what happened with the
 numpy-refactor work.

Of course, and that's why I'm not pressing the point. These trade-offs
might be worth talking about at some point -- there are reasons that
basically all the major FOSS projects have moved towards time-based
releases :-) -- but that'd be a huge discussion at a time when we
already have more than enough of those on our plate...

 But I seem to be very much in the minority on that[1], so oh well :-). I
 don't have any objection to (1), personally. (3) seems like a bad
 idea. Just my 2 pence.


 Agreed that (3) is a bad idea. +1 for (1).

 Ralf




Cheers,
-- Nathaniel


Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Charles R Harris
On Sat, Mar 3, 2012 at 1:30 PM, Travis Oliphant tra...@continuum.io wrote:

 Hi all,

 I've been thinking a lot about the masked array implementation lately.
 I finally had the time to look hard at what has been done and now am of the
 opinion that I do not think that 1.7 can be released with the current state
 of the masked array implementation *unless* it is clearly marked as
 experimental and may be changed in 1.8


That was the intention.


 I wish I had been able to be a bigger part of this conversation last year.
   But, that is why I took the steps I took to try and figure out another
 way to feed my family *and* stay involved in the NumPy community.   I would
 love to stay involved in what is happening in the SciPy community, but I am
 more satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles,
 Stefan, and others are doing there right now, and don't have time to keep
 up with everything.Even though SciPy was the heart and soul of why I
 even got involved with Python for open source in the first place and took
 many years of my volunteer labor, I won't be able to spend significant time
 on SciPy code over the coming months.   At some point, I really hope to be
 able to make contributions again to that code-base.   Time will tell
 whether or not my aspirations will be realized.  It depends quite a bit on
 whether or not my kids have what they need from me (which right now is
 money and time).

 NumPy, on the other hand, is not in a position where I can feel
 comfortable leaving my baby to others.  I recognize and value the
 contributions from many people to make NumPy what it is today (e.g. code
 contributions, code rearrangement and standardization, build and install
 improvement, and most recently some architectural changes).But, I feel
 a personal responsibility for the code base as I spent a great many months
 writing NumPy in the first place, and I've spent a great deal of time
 interacting with NumPy users and feel like I have at least some sense of
 their stories.Of course, I built on the shoulders of giants, and much
 of what is there is *because of* where the code was adapted from (it was
 not created de-novo).   Currently,  there remains much that needs to be
 communicated, improved, and worked on, and I have specific opinions about
 what some changes and improvements should be, how they should be written,
 and how the resulting users need to be benefited.
  It will take time to discuss all of this, and that's where I will spend
 my open-source time in the coming months.

 In that vein:

 Because it is slated to go into release 1.7, we need to re-visit the
 masked array discussion again.The NEP process is the appropriate one
 and I'm glad we are taking that route for these discussions.   My goal is
 to get consensus in order for code to get into NumPy (regardless of who
 writes the code).It may be that we don't come to a consensus
 (reasonable and intelligent people can disagree on things --- look at the
 coming election...).   We can represent different parts of what is
 fortunately a very large user-base of NumPy users.

 First of all, I want to be clear that I think there is much great work in
 the current missing data code.  There are some nice features in the where
 clause of the ufuncs and in the iterator machinery that allows re-using
 ufunc loops that have not been re-written to check for missing data.  I'm
 sure there are other things as well that I'm not yet aware of.  However, I
 don't think the API presently presented to the NumPy user is the correct
 one for NumPy 1.X.
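(For concreteness, this is the kind of thing the ufunc where-clause
machinery enables --- a minimal sketch using the `where=` keyword from the
1.7 development work; the array values here are invented for illustration.)

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
valid = np.array([True, False, True, False])

# The ufunc inner loop runs only where `valid` is True; the other
# slots keep whatever `out` already holds (zeros here), so the loop
# itself never has to check for missing data.
out = np.zeros_like(a)
np.add(a, b, out=out, where=valid)
print(out)  # [11.  0. 33.  0.]
```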


A few particulars:

* the reduction operations need to default to skipna --- this is the most
  common use case, which was reinforced to me again today by a new Python
  user who is presently using masked arrays

* the mask needs to be visible to users who take that approach to missing
  data (people should be able to get hold of the mask and work with it in
  Python)

* bit-pattern approaches to missing data (at least for float64 and int32)
  need to be implemented

* when using masks (even if they are hidden from most users), there needs
  to be some way to separate the low-level ufunc operation on the data from
  the operation on the masks...
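(To make those bullets concrete, here is a rough sketch of the two
approaches using tools that already exist today --- NaN as a float64
bit pattern, and numpy.ma for a user-visible mask. The data values are
invented for the example.)

```python
import numpy as np
import numpy.ma as ma

# Bit-pattern approach: NaN doubles as the missing-data sentinel for
# float64, and a skipna-style reduction ignores it.  (int32 has no NaN;
# a reserved sentinel bit pattern, as R uses, would be needed there.)
arr = np.array([1.0, np.nan, 3.0, 4.0])
print(np.nansum(arr))   # 8.0 -- the skipna default argued for above

# Masked-array approach: the mask is an ordinary boolean array that the
# user can inspect and manipulate directly.
marr = ma.masked_invalid(arr)
print(marr.mask)        # [False  True False False]
print(marr.sum())       # 8.0 -- reductions skip masked slots
```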


Mind, Mark only had a few weeks to write code. I think the unfinished state
is a direct function of that.


 I have heard from several users that they will *not use the missing data*
 support in NumPy as currently implemented, and I can now see why.  For
 better or for worse, my approach to software is generally very user-driven
 and very pragmatic.  On the other hand, I'm also a mathematician and
 appreciate the cognitive compression that can come from well-formed
 structure.  Nonetheless, I'm an *applied* mathematician and am ultimately
 motivated by applications.


I think that would be Wes. I thought the current state wasn't that far away
from what he wanted in the only post where he was somewhat explicit.

Re: [Numpy-discussion] Missing data again

2012-03-03 Thread Travis Oliphant
 
 Mind, Mark only had a few weeks to write code. I think the unfinished state
 is a direct function of that.

 I have heard from several users that they will *not use the missing data*
 support in NumPy as currently implemented, and I can now see why.  For
 better or for worse, my approach to software is generally very user-driven
 and very pragmatic.  On the other hand, I'm also a mathematician and
 appreciate the cognitive compression that can come from well-formed
 structure.  Nonetheless, I'm an *applied* mathematician and am ultimately
 motivated by applications.


 I think that would be Wes. I thought the current state wasn't that far away
 from what he wanted in the only post where he was somewhat explicit. I think
 it would be useful for him to sit down with Mark at some time and thrash
 things out, since I think there is some misunderstanding involved.
  

Actually it wasn't Wes.  It was three other people.  I'm well aware of Wes's 
perspective and think his concerns have already been addressed.  Also, the 
person who showed me their use case was a new user.

But your point about getting people together is well-taken.  I also recognize 
that there have been (and likely continue to be) misunderstandings on multiple 
fronts.  Fortunately, many of us will be at PyCon later this week.  We tried 
really hard to get Mark Wiebe here this weekend as well --- but he could only 
sacrifice a week away from his degree work to join us for PyCon.

It would be great if you could come to PyCon as well.   Perhaps we can apply to 
NumFOCUS for a travel grant to bring NumPy developers together with other 
interested people to finish the masked array design and implementation.

-Travis


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion