On Tue, Mar 6, 2012 at 4:38 PM, Mark Wiebe <mwwi...@gmail.com> wrote:
> On Tue, Mar 6, 2012 at 5:48 AM, Pierre Haessig <pierre.haes...@crans.org>
> wrote:
>> From a potential user perspective, I feel it would be nice to have NA
>> and non-NA cases look as similar as possible. Your code example is
>> particularly striking: two different dtypes to store (from a user
>> perspective) the exact same content! If this *could* be avoided, it
>> would be great...
>
> The biggest reason to keep the two types separate is performance. The
> straight float dtypes map directly to hardware floating-point operations,
> which can be very fast. The NA-float dtypes have to use additional logic
> to handle the NA values correctly. NA is treated as a particular NaN, and
> if the hardware float operations were used directly, NA would turn into
> NaN. This additional logic usually means more branches, so is slower.
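[Editor's sketch of the "additional logic" described in the quote above; this is an illustration, not NumPy's actual implementation. The function name `na_aware_add` and the particular NA bitpattern are assumptions made for the example.]

```python
import numpy as np

# A NaN with a custom payload, standing in for NA (illustrative choice):
NA_BITS = np.int64(0x7FF8000000000000 + 1974)
NA = NA_BITS.view(np.float64)

def na_aware_add(a, b):
    """Hypothetical NA-aware add: plain hardware add plus an extra
    masking pass that a straight float64 add doesn't need."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Extra pass over both inputs, looking for the NA bitpattern:
    na_mask = (a.view(np.int64) == NA_BITS) | (b.view(np.int64) == NA_BITS)
    out = a + b                            # fast hardware path
    out.view(np.int64)[na_mask] = NA_BITS  # force NA back into the output
    return out

result = na_aware_add([1.0, NA], [2.0, 2.0])
# result[0] is an ordinary 3.0; result[1] carries the NA bitpattern
```

The extra comparison and masked store over every element is the kind of overhead being described, compared to a single hardware add loop.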
Actually, no -- hardware float operations preserve NA-as-NaN. You might well
need to be careful around more exotic code like optimized BLAS kernels, but
all the basic ufuncs should Just Work at full speed. Demo:

>>> def hexify(x): return hex(np.float64(x).view(np.int64))
>>> hexify(np.nan)
'0x7ff8000000000000L'
# IIRC this is R's NA bitpattern (presumably 1974 is someone's birthday)
>>> NA = np.int64(0x7ff8000000000000 + 1974).view(np.float64)
# It is a NaN...
>>> NA
nan
# But it has a distinct bitpattern:
>>> hexify(NA)
'0x7ff80000000007b6L'
# Like any NaN, it propagates through floating point operations:
>>> NA + 3
nan
# But, critically, so does the bitpattern; ordinary Python "+" is
# returning NA on this operation:
>>> hexify(NA + 3)
'0x7ff80000000007b6L'

This is how R does it, which is more evidence that this actually works on
real hardware.

There is one place where it fails: in a binary operation with *two* NaN
values, there's an ambiguity about which payload should be returned. IEEE 754
recommends just returning the first one. This means that NA + NaN = NA, but
NaN + NA = NaN. This is ugly, but it's an obscure case that nobody cares
about, so it's probably worth it for the speed gain. (In fact, if you type
those two expressions at the R prompt, that's exactly what you get, and I
can't find any reference to anyone even noticing this.)

>> I don't know how the NA machinery works in R. Does it work with a
>> kind of "nafloat64" all the time, or is there some type inference
>> mechanism involved in choosing the appropriate type?
>
> My understanding of R is that it works with the "nafloat64" for all its
> operations, yes.

Right -- R has a very impoverished type system as compared to numpy. There
are basically four types: "numeric" (meaning double precision float),
"integer", "logical" (boolean), and "character" (string).
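[Editor's note: the claims above can be double-checked on a given machine with a condensed, self-contained version of the demo. This script is mine, not from the original post; note that payload preservation is a property of common hardware (x86, ARM), not an IEEE 754 guarantee, and the two-NaN ordering results are hardware-dependent, so they are only printed, not asserted.]

```python
import numpy as np

# The NA bitpattern from the demo above: a quiet NaN with payload 1974.
PAYLOAD = 0x7FF8000000000000 + 1974
NA = np.int64(PAYLOAD).view(np.float64)

def hexify(x):
    """Show the raw bit pattern of a float64 as hex."""
    return hex(np.float64(x).view(np.int64))

assert np.isnan(NA)                              # NA is a NaN...
assert np.float64(NA).view(np.int64) == PAYLOAD  # ...with a distinct payload

# Single-NaN operations preserve the payload on common hardware:
assert np.float64(NA + 3).view(np.int64) == PAYLOAD
assert np.float64(NA * 2.0).view(np.int64) == PAYLOAD

# The two-NaN case is the ambiguous one: IEEE 754 only *recommends*
# returning the first operand's payload, so don't rely on these values.
print(hexify(NA + np.nan), hexify(np.nan + NA))
```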
And in practice the integer type is essentially unused, because R parses
numbers like "1" as being floating point, not integer; the only way to get
an integer value is to explicitly cast to it. Each of these types has a
specific bit-pattern set aside for representing NA. And... that's it. It's
very simple when it works, but also very limited.

I'm still skeptical that we could make the floating point types NA-aware by
default -- until we have an implementation in hand, I'm nervous there'd be
some corner case that broke everything. (Maybe ufuncs are fine but np.dot
has an unavoidable overhead, or maybe it would mess up casting from float
types to non-NA-aware types, etc.) But who knows. Probably not something we
can really make a meaningful decision about yet.

-- Nathaniel

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion