Hi Bruce,

I'm replying on the list instead of on github, to make it easier for
others to join in the discussion if they want. [For those joining in:
this was a comment posted at https://gist.github.com/1068264 ]

On Fri, Jul 8, 2011 at 10:36 AM, bsouthey wrote:
> I presume missing float values could be addressed with one of the 'special' 
> ranges such as 'Indeterminate' in IEEE 754 
> (http://babbage.cs.qc.edu/IEEE-754/References.xhtml). The outcome should be 
> determined by the IEEE special operations.

Right. An IEEE 754 double has IIRC about 2^53 distinct bit-patterns
that all mean "not a number". A few of these are used to signal
different invalid operations:

In [20]: hex(np.asarray([np.nan]).view(dtype=np.uint64)[0])
Out[20]: '0x7ff8000000000000L'
In [21]: hex(np.log([0]).view(dtype=np.uint64)[0])
Out[21]: '0xfff0000000000000L'
In [22]: hex(np.divide([0.], [0,]).view(dtype=np.uint64)[0])
Out[22]: '0xfff8000000000000L'

...but that only accounts for, like, 10 of the 2^53 or something. (And
strictly speaking, the log(0) result above is -inf rather than a NaN,
but the point stands: each special result gets its own bit pattern.)
The rest are simply unused. So what R does, and what we would do for
dtype-style NAs, is just pick one of the unused patterns (ideally the
same one R uses), and declare that it is *not* not-a-number; it's NA.
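
To make that concrete, here's a sketch of what an isna check for such
an NA-float could look like, done by hand in Python (the name 'isna'
and the approach are just for illustration; and IIRC R encodes
NA_real_ as the NaN whose payload is decimal 1954, i.e.
0x7ff00000000007a2, which is what I use below):

import numpy as np

# np.isnan can't tell NA apart from a garden-variety NaN, and NaN != NaN
# makes == useless, so compare the underlying uint64 bit patterns.
R_NA_BITS = np.uint64(0x7FF00000000007A2)   # R's NA_real_, IIRC

def isna(arr):
    return np.asarray(arr, dtype=np.float64).view(np.uint64) == R_NA_BITS

x = np.array([1.0, 0.0, np.nan])
x.view(np.uint64)[1] = R_NA_BITS    # poke the NA bit pattern in directly

print(isna(x))       # [False  True False]
print(np.isnan(x))   # [False  True  True] -- to the FPU, NA is still a NaN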

> So my real concern is handling integer arrays:
> 1) How will you find where the missing values are in an array? If there is a 
> variable that denotes missing values are present (NA_flags?) then do you have 
> to duplicate code to avoid this searching when an array has no missing values?

Each dtype has a bunch of C functions associated with it that say how
to do comparisons, assignment, etc. In the miniNEP design, we add a
new function to this list called 'isna', which every dtype that wants
to support NAs has to define.

Yes, this does mean that code which wants to treat NAs separately has
to check for and call this function if it's present, but that seems to
be inevitable... *all* of the dtype C functions are supposedly
optional, so we have to check for them before calling them and do
something sensible if they aren't defined. We could define a wrapper
that calls the function if it's defined, or else just fills the
provided buffer with zeros (to mean "there are no NAs"), and then code
which wanted to avoid a special case could use that. But in general we
probably do want to handle arrays that might have NAs differently from
arrays which don't have NAs, because if there are no NAs present then
it's quicker to skip the handling altogether. That's true for any NA
implementation.
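
To sketch that wrapper in Python (made-up names; the real slot would
live in the dtype's C function table):

import numpy as np

def isna_or_no_nas(dtype_isna, arr):
    # 'dtype_isna' stands in for the optional C-level slot; None means
    # the dtype never defined it, i.e. it doesn't support NAs at all,
    # so we can just report "no NAs anywhere".
    if dtype_isna is not None:
        return dtype_isna(arr)
    return np.zeros(np.shape(arr), dtype=bool)

# e.g. isna_or_no_nas(None, np.arange(3)) -> array([False, False, False])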

> 2) What happens if a normal operation equates to that value: If you use 
> max(np.int8), such as when adding 1 to an array with an element of 126 or 
> when overflow occurs:
>>>> np.arange(120,127, dtype=np.int8)+2
> array([ 122,  123,  124,  125,  126,  127, -128], dtype=int8)
> The -128 corresponds to the missing element but is the second to last element 
> now missing? This is worse if the overflow is larger.

Yeah, in the design as written, overflow (among other things) can
create accidental NAs. Which kind of sucks. There are a few options:

-- Just live with it.

-- We could add a flag like NPY_NA_AUTO_CHECK, and when this flag is
set, the ufunc loop runs 'isna' on its output buffer before returning.
If there are any NAs there that did not arise from NAs in the input,
then it raises an error. (The reason we would want to make it a flag
is that this checking is pointless for dtypes like NA-string, and
mostly pointless for dtypes like NA-float.) Also, we'd only want to
enable this if we were using the NPY_NA_AUTO_UFUNC ufunc-delegation
logic, because if you registered a special ufunc loop *specifically
for your NA-dtype*, then presumably it knows what it's doing. This
would also allow such an NA-dtype-specific ufunc loop to return NAs on
purpose if it wanted to. (I sketch this check below, after the list of
options.)

-- Use a dtype that adds a separate flag next to the actual integer to
indicate NA-ness, instead of stealing one of the integer's values. So
your NA-int8 would actually be 2 bytes, where the first byte was 1 to
indicate NA, or 0 to indicate that the second byte contains an actual
int8. If you do this with larger integers, say an int32, then you have
a choice: you could store your int32 in 8 bytes, in which case
arithmetic etc. is fast, but you waste a bit of memory. Or you could
store your int32 in 5 bytes, in which case arithmetic etc. become
somewhat slower, but you don't waste any memory. (This latter case
would basically be like using an unaligned or byteswapped array in
current numpy, in terms of mechanisms and speed.) These layouts are
sketched below, after the list.

-- Nothing in this design rules out a second implementation of NAs
based on masking. Personally, as you know, I'm not a big fan, but if
it were added anyway, then you could use that for your integers as
well.
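
To make the NPY_NA_AUTO_CHECK option concrete, here's roughly the
check it would do, written in Python rather than in the C ufunc
machinery (made-up names; assume same-shaped inputs):

import numpy as np

def autocheck_loop(loop, isna, *inputs):
    out = loop(*inputs)
    # NAs propagated from the inputs are expected and fine...
    expected = np.zeros(out.shape, dtype=bool)
    for arr in inputs:
        expected |= isna(arr)
    # ...but an NA in the output with no NA in the inputs means some
    # ordinary result (e.g. int8 overflow landing on -128) collided
    # with the NA bit pattern.
    if (isna(out) & ~expected).any():
        raise ValueError("operation accidentally produced NA")
    return out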

A related issue is: of the many ways we *can* do an integer NA-dtype,
which one *should* we use by default? I don't have a strong opinion,
really; I haven't heard anyone say that they have huge quantities of
integer-plus-NA data that they want to manipulate and
memory/speed/allowing the full range of values are all really
important. (Maybe that's you?) In the design as written, they're all
pretty trivial to implement (you just tweak a few magic numbers in the
dtype structure), and probably we should support all of them via
more-or-less exotic invocations of np.withNA. (E.g.,
'np.withNA(np.int32, useflag=True, flagsize=1)' to get a 5-byte
int32.)
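
Just to picture the flag-plus-value layouts, here they are mocked up
with today's structured dtypes (only an illustration of the memory
layout; the real thing would be a proper NA-aware dtype):

import numpy as np

# 2-byte NA-int8: one flag byte plus the actual value.
na_int8 = np.dtype([('isna', np.uint8), ('value', np.int8)])

# NA-int32, packed vs. padded: trade memory for aligned (fast) access.
packed = np.dtype([('isna', np.uint8), ('value', np.int32)])              # 5 bytes
padded = np.dtype([('isna', np.uint8), ('value', np.int32)], align=True)  # 8 bytes

print(na_int8.itemsize, packed.itemsize, padded.itemsize)   # 2 5 8

a = np.zeros(3, dtype=na_int8)
a['value'] = [126, 127, 5]
a['isna'][1] = 1                   # mark the middle element as NA
print(a['value'][a['isna'] == 0])  # [126   5] -- non-NA values only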

...I kind of like that NPY_NA_AUTO_CHECK idea, it's pretty clean and
would definitely make things safer. I think I'll add it.

-- Nathaniel