Thanks Tim, this is exactly the explanation I was hoping to see. Much appreciated!

On Mon, Feb 10, 2014 at 7:21 AM, Tim Hesterberg <timhesterb...@gmail.com> wrote:
> This isn't quite what you were asking, but might inform your choice.
>
> R doesn't try to maintain the distinction between NA and NaN when
> doing calculations, e.g.:
> > NA + NaN
> [1] NA
> > NaN + NA
> [1] NaN
> So for the aggregate package, I didn't attempt to treat them differently.
>
> The aggregate package is available at
> http://www.timhesterberg.net/r-packages
>
> Here is the inst/doc/missingValues.txt file from that package:
>
> --------------------------------------------------
> Copyright 2012 Google Inc. All Rights Reserved.
> Author: Tim Hesterberg <roc...@google.com>
> Distributed under GPL 2 or later.
>
>
> Handling of missing values and not-a-numbers.
>
>
> Here I'll note how this package handles missing values.
> I do it the way R handles them, rather than the more strict way that S+ does.
>
> First, for terminology,
>   NaN = "not-a-number", e.g. the result of 0/0
>   NA  = "missing value" or "true missing value", e.g. survey non-response
>   xx  = I'll use this for the union of those, or "missing value of any kind".
>
> For background, at the hardware level there is an IEEE standard that
> specifies that certain bit patterns are NaN, and specifies that
> operations involving an NaN result in another NaN.
>
> That standard doesn't say anything about missing values, which are
> important in statistics.
>
> So what R and S+ do is to pick one of the bit patterns and declare
> that to be a NA. In other words, the NA bit pattern is a subset of
> the NaN bit patterns.
>
> At the user level, the reverse seems to hold.
> You can assign either NA or NaN to an object.
> But:
>   is.na(x) returns TRUE for both
>   is.nan(x) returns TRUE for NaN and FALSE for NA
> Based on that, you'd think that NaN is a subset of NA.
> To tell whether something is a true missing value do:
>   (is.na(x) & !is.nan(x))
>
> The S+ convention is that any operation involving NA results in an NA;
> otherwise any operation involving NaN results in NaN.
>
> The R convention is that any operation involving xx results in an xx;
> a missing value of any kind results in another missing value of any
> kind. R considers NA and NaN equivalent for testing purposes:
>   all.equal(NA_real_, NaN)
> gives TRUE.
>
> Some R functions follow the S+ convention, e.g. the Math2 functions
> in src/main/arithmetic.c use this macro:
>   #define if_NA_Math2_set(y,a,b) \
>     if (ISNA (a) || ISNA (b)) y = NA_REAL; \
>     else if (ISNAN(a) || ISNAN(b)) y = R_NaN;
>
> Other R functions, like the basic arithmetic operations +-/*^,
> do not (search for PLUSOP in src/main/arithmetic.c).
> They just let the hardware do the calculations.
> As a result, you can get odd results like
> > is.nan(NA_real_ + NaN)
> [1] FALSE
> > is.nan(NaN + NA_real_)
> [1] TRUE
>
> The R help files help(is.na) and help(is.nan) suggest that
> computations involving NA and NaN are indeterminate.
>
> It is faster to use the R convention; most operations are just
> handled by the hardware, without extra work.
>
> In cases like sum(x, na.rm=TRUE), the help file specifies that both NA
> and NaN are removed.
>
>
>
>
>>There is one NA but multiple NaNs.
>>
>>And please re-read 'man memcmp': your cast is wrong.
>>
>>On 10/02/2014 06:52, Kevin Ushey wrote:
>>> Hi R-devel,
>>>
>>> I have a question about the differentiation between NA and NaN values
>>> as implemented in R. In arithmetic.c, we have
>>>
>>> int R_IsNA(double x)
>>> {
>>>     if (isnan(x)) {
>>>         ieee_double y;
>>>         y.value = x;
>>>         return (y.word[lw] == 1954);
>>>     }
>>>     return 0;
>>> }
>>>
>>> ieee_double is just used for type punning so we can check the final
>>> bits and see if they're equal to 1954; if they are, x is NA, if
>>> they're not, x is NaN (as defined for R_IsNaN).
>>>
>>> My question is -- I can see a substantial increase in speed (on my
>>> computer, in certain cases) if I replace this check with
>>>
>>> int R_IsNA(double x)
>>> {
>>>     return memcmp(
>>>         (char*)(&x),
>>>         (char*)(&NA_REAL),
>>>         sizeof(double)
>>>     ) == 0;
>>> }
>>>
>>> IIUC, there is only one bit pattern used to encode R NA values, so
>>> this should be safe. But I would like to be sure:
>>>
>>> Is there any guarantee that the different functions in R would return
>>> NA as identical to the bit pattern defined for NA_REAL, for a given
>>> architecture? Similarly for NaN value(s) and R_NaN?
>>>
>>> My guess is that it is possible some functions used internally by R
>>> might encode NaN values differently; ie, setting the lower word to a
>>> value different than 1954 (hence being NaN, but potentially not
>>> identical to R_NaN), or perhaps this is architecture-dependent.
>>> However, NA should be one specific bit pattern (?). And, I wonder if
>>> there is any guarantee that the different functions used in R would
>>> return an NaN value as identical to R_NaN (which appears to be the
>>> 'IEEE NaN')?
>>>
>>> (interested parties can see + run a simple benchmark from the gist at
>>> https://gist.github.com/kevinushey/8911432)
>>>
>>> Thanks,
>>> Kevin
>>>
>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>>--
>>Brian D. Ripley, rip...@stats.ox.ac.uk
>>Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
>>University of Oxford, Tel: +44 1865 272861 (self)
>>1 South Parks Road, +44 1865 272866 (PA)
>>Oxford OX1 3TG, UK Fax: +44 1865 272595

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel