Consider the following:

> set.seed(42)
> ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
> x <- runif(42)
> tapply(x,ff,sum)
       1        2        3        4        5
3.675436       NA 7.519675       NA 9.094210

I got bitten by those NAs in the result of tapply().  Effectively
one is summing over the empty set, and consequently (according to what
I learned as a child) I thought that the result would be 0.

And that's what one gets if one does the sum "by hand":

> sum(x[ff==1])
[1] 3.675436
> sum(x[ff==2])
[1] 0
> sum(x[ff==4])
[1] 0

On reflection I realized that since tapply() needs to work with arbitrary
functions, and since there is no way to determine what an arbitrary function
will do to the empty set, this is the Way It's Got to Be.
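
To make that concrete, here is a small sketch of what a few common summary
functions return for an empty numeric vector. The variable names are made up
for illustration. The point is only that there is no single "empty" answer
that tapply() could fill in on a function's behalf.

```r
# What some common summary functions return for an empty numeric vector.
empty <- numeric(0)

s  <- sum(empty)                     # additive identity
p  <- prod(empty)                    # multiplicative identity
n  <- length(empty)                  # number of elements
m  <- mean(empty)                    # 0/0
mx <- suppressWarnings(max(empty))   # max() warns on empty input

print(list(sum = s, prod = p, length = n, mean = m, max = mx))
```

Here sum() gives 0, prod() gives 1, mean() gives NaN and max() gives -Inf
(with a warning), so the sensible fill-in value depends entirely on the
function.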

But it's a trap for young players, and so I thought I'd post my experience
as a warning to others to be careful about this.

To work around the problem one *could* do something like

> result <- tapply(x,ff,sum)
> result[is.na(result)] <- 0

but that's very infra dig in my book.  I figured out something I like
much better:

        sapply(tapply(x,ff,I,simplify=FALSE),sum)

That simplify=FALSE is needed in case there is at most one entry of x for
each level of ff.  In that case every non-empty cell has length 1, so by
default tapply() simplifies the result to an array with NAs in the empty
cells, rather than returning a list with NULL entries corresponding to
the empty cells.
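
A minimal sketch of that corner case, using made-up toy data (y and g are
hypothetical, not the x and ff above) in which each non-empty level has
exactly one entry.  The last line assumes R 3.4.0 or later, where tapply()
has a `default` argument for the value placed in empty cells; on older
versions that line will not work.

```r
# Hypothetical toy data: levels 1, 3, 5 each have one entry; 2 and 4 are empty.
y <- c(10, 20, 30)
g <- factor(c(1, 3, 5), levels = 1:5)

# Default simplify=TRUE: every non-empty cell has length 1, so the result
# is simplified to a numeric array and the empty cells become NA.
simplified <- tapply(y, g, I)

# simplify=FALSE: the result stays a list, with NULL for the empty cells.
as_list <- tapply(y, g, I, simplify = FALSE)

# sum(NULL) is 0, so summing each list entry gives 0 for the empty levels.
sums <- sapply(as_list, sum)

# Assumes R >= 3.4.0: fill empty cells with 0 directly.
sums_default <- tapply(y, g, sum, default = 0)

print(simplified)
print(sums)
print(sums_default)
```

With simplify=TRUE, levels 2 and 4 come back as NA, whereas the list version
summed with sapply() gives 0 for them, as does the `default = 0` call.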

        cheers,

                Rolf Turner
