Hi Bruce, I'm replying on the list instead of on GitHub, to make it easier for others to join the discussion if they want. [For those joining in: this was a comment posted at https://gist.github.com/1068264 ]
On Fri, Jul 8, 2011 at 10:36 AM, bsouthey wrote:
> I presume missing float values could be addressed with one of the
> 'special' ranges such as 'Indeterminate' in IEEE 754
> (http://babbage.cs.qc.edu/IEEE-754/References.xhtml). The outcome
> should be determined by the IEEE special operations.

Right. An IEEE 754 double has, IIRC, about 2^53 distinct bit patterns that all mean "not a number". A few of these are used to signal different invalid operations:

In [20]: hex(np.asarray([np.nan]).view(dtype=np.uint64)[0])
Out[20]: '0x7ff8000000000000L'

In [21]: hex(np.log([0]).view(dtype=np.uint64)[0])
Out[21]: '0xfff0000000000000L'

In [22]: hex(np.divide([0.], [0,]).view(dtype=np.uint64)[0])
Out[22]: '0xfff8000000000000L'

...but that only accounts for, like, 10 of the 2^53 or something. The rest are simply unused. So what R does, and what we would do for dtype-style NAs, is just pick one of those (ideally the same one R uses) and declare that it is *not* not-a-number; it's NA.

> So my real concern is handling integer arrays:
> 1) How will you find where the missing values are in an array? If
> there is a variable that denotes missing values are present
> (NA_flags?) then do you have to duplicate code to avoid this
> searching when an array has no missing values?

Each dtype has a bunch of C functions associated with it that say how to do comparisons, assignment, etc. In the miniNEP design, we add a new function to this list, called 'isna', which every dtype that wants to support NAs has to define. Yes, this does mean that code which wants to treat NAs separately has to check for this function and call it if it's present, but that seems to be inevitable: *all* of the dtype C functions are supposedly optional, so we have to check for them before calling them and do something sensible if they aren't defined.
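For concreteness, here's what the bit-pattern trick looks like from Python today. The specific payload below is an assumption on my part (R's NA_real_ is reportedly the NaN whose low word is 1954); any otherwise-unused NaN encoding would behave the same way:

```python
import numpy as np

# Pick one unused NaN bit pattern and declare it to be NA. The exact
# payload here (low word 1954, said to be R's choice) is an assumption
# for illustration, not part of the miniNEP.
NA_BITS = 0x7FF00000000007A2

na = np.array([NA_BITS], dtype=np.uint64).view(np.float64)
canonical_nan = np.array([np.nan]).view(np.uint64)

# To the FPU, our NA is just another NaN...
print(np.isnan(na[0]))                            # True
# ...but its bit pattern is distinct from the canonical NaN's, which is
# exactly the distinction a dtype-level 'isna' function would test for.
print(na.view(np.uint64)[0] != canonical_nan[0])  # True
```

The key point: ordinary float operations treat this value as a NaN, so the hardware does most of the NA propagation for free; only 'isna' needs to look at the bits.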
We could define a wrapper that calls the function if it's defined, and otherwise just fills the provided buffer with zeros (to mean "there are no NAs"); code that wanted to avoid the special case could use that. But in general we probably do want to handle arrays that might contain NAs differently from arrays that don't, because when no NAs are present it's quicker to skip the NA handling altogether. That's true for any NA implementation.

> 2) What happens if a normal operation equates to that value: If you
> use max(np.int8), such as when adding 1 to an array with an element
> of 126 or when overflow occurs:
>>>> np.arange(120,127, dtype=np.int8)+2
> array([ 122, 123, 124, 125, 126, 127, -128], dtype=int8)
> The -128 corresponds to the missing element but is the second to last
> element now missing? This is worse if the overflow is larger.

Yeah, in the design as written, overflow (among other things) can create accidental NAs, which kind of sucks. There are a few options:

-- Just live with it.

-- We could add a flag like NPY_NA_AUTO_CHECK; when this flag is set, the ufunc loop runs 'isna' on its output buffer before returning, and if there are any NAs there that did not arise from NAs in the input, it raises an error. (The reason we would make it a flag is that this check is pointless for dtypes like NA-string, and mostly pointless for dtypes like NA-float.) Also, we'd only want to enable it when using the NPY_NA_AUTO_UFUNC ufunc-delegation logic, because if you registered a special ufunc loop *specifically for your NA-dtype*, then presumably it knows what it's doing. This would also allow such an NA-dtype-specific ufunc loop to return NAs on purpose if it wanted to.

-- Use a dtype that adds a separate flag next to the actual integer to indicate NA-ness, instead of stealing one of the integer's values.
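As a rough mock-up of that last option using today's structured dtypes (the field names are invented for illustration; the real thing would live at the C dtype level):

```python
import numpy as np

# Mock-up of a flag-based NA-int8: one flag byte plus one payload byte.
# The field names 'isna' and 'value' are made up for this sketch.
na_int8 = np.dtype([('isna', np.uint8), ('value', np.int8)])

a = np.zeros(3, dtype=na_int8)
a['value'] = [10, 0, -5]
a['isna'][1] = 1                 # mark the middle element as NA

print(a.itemsize)                # 2 -- two bytes per element
print(a['isna'].astype(bool))    # [False  True False]
```

No integer value is sacrificed this way, and overflow can never accidentally produce an NA; the cost is the extra byte per element.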
So your NA-int8 would actually be 2 bytes: the first byte is 1 to indicate NA, or 0 to indicate that the second byte contains an actual int8. If you do this with larger integers, say an int32, then you have a choice: you could store your int32 in 8 bytes, in which case arithmetic etc. is fast but you waste a bit of memory, or you could store it in 5 bytes, in which case arithmetic etc. becomes somewhat slower but you don't waste any memory. (The latter would basically be like using an unaligned or byteswapped array in current numpy, in terms of mechanism and speed.)

-- Nothing in this design rules out a second implementation of NAs based on masking. Personally, as you know, I'm not a big fan, but if it were added anyway, then you could use that for your integers as well.

A related issue is: of the many ways we *can* do an integer NA-dtype, which one *should* we do by default? I don't have a strong opinion, really; I haven't heard anyone say that they have huge quantities of integer-plus-NA data to manipulate, for which memory, speed, and allowing the full range of values are all really important. (Maybe that's you?) In the design as written, all of these variants are pretty trivial to implement (you just tweak a few magic numbers in the dtype structure), and we should probably support all of them via more-or-less exotic invocations of np.withNA. (E.g., 'np.withNA(np.int32, useflag=True, flagsize=1)' to get a 5-byte int32.)

...I kind of like that NPY_NA_AUTO_CHECK idea; it's pretty clean and would definitely make things safer. I think I'll add it.

-- Nathaniel

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion