Recently, I came across a strange and potentially troublesome behaviour of the lm
and aov functions that ask questions about calculation accuracy. Let us consider
the 2 following datasets dat1 & dat2 :
(dat1 <- data.frame(Y=c(1:3, 10+1:3), F=c(rep("A",3), rep("B",3))))
Y F
1 1 A
2 2 A
3 3 A
4 11 B
5 12 B
6 13 B
(dat2 <- data.frame(Y=c(10+1:3, 1:3), F=c(rep("A",3), rep("B",3))))
Y F
1 11 A
2 12 A
3 13 A
4 1 B
5 2 B
6 3 B
They only differ in the order of values that were exchanged between samples A
and B. Thus the sd is 1 for each sample in either data sets, and the absolute
mean difference |A-B| is 10 in both datasets.
Now, let us perform an anova to compare samples A and B in both datasets (of
course, in such simple case, a bilateral T test would do the job, but an anova
is nevertheless allowed and should give the same probability than Student's
test):
(anova1 <- anova(lm(Y~F, dat1)))
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
F 1 150 150 150 0.0002552 ***
Residuals 4 4 1
(anova2 <- anova(lm(Y~F, dat2)))
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
F 1 150 150 150 0.0002552 ***
Residuals 4 4 1
As expected, both datasets give a same anova table, but this is only apparent.
Indeed :
anova1$F[1] == anova2$F[1]
[1] FALSE
anova1$F[1] - anova2$F[1]
[1] 5.684342e-14
In fact the F values differ slightly, and this holds also for the aov function.
I checked also (not shown) that both the residual and factorial sums of squares
differ between dat1 and dat2. Thus, for some undocumented reason (at least for
the end user), the F values depend on the order of data!
While such tiny differences (e-14 in this example) are devoid of consequences
on the risk evaluation by Fisher's distribution, they may have huge
consequences on the risk evaluation by the permutation method. Indeed, the
shift from continuous to discrete distributions is far from being insignificant.
For instance, the following code in R is at the basis of many permutation
algorithms found in the internet and in teaching because it seems quite
straightforward (see for example
http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Permutation%20Anova/PermTestsAnova.html
http://www.usna.edu/Users/math/jager/courses/sm439/labs/Lab7.pdf
http://biostatistics1014.blogspot.fr/2013/04/one-way-anova-permutation-test-and.html
http://adn.biol.umontreal.ca/~numericalecology/Rcode/):
nperm <- 1000 # number of permutations
Fperm <- replicate(n=nperm, anova(lm(sample(Y) ~ F, dat1))$F[1]) # calculates
nperm F values
(prob <- (sum(anova1$F[1] <= Fperm) + 1)/(nperm +1)) # risk calculation
[1] 0.04695305
Of course, because of the sample function, repeating this code gives different
prob values, but they remain always close to 5% instead of the exact
probability of 10%. Indeed, there are only choose(6,3) = 20 possible
permutations, but because they are symmetric, they give only 10 absolute mean
differences. Thus, the only exact probabilities are 10%, 20% ... 100%. In our
example where samples do not overlap, 10% is obviously the right answer.
Thus, the use of lm and aov functions in permutation methods does not seem a
good idea as it results in biases that underestimate the exact risk. In the
simple case of one-way anova, it is quite simple to remedy this problem. As the
total Sums of squares and the degrees of freedom do not change with
permutations, it is easier and much faster to compare the residual sums of
squares. For instance, the exact probabilities can be calculated that way :
combi <- combn(6, 3, FUN=function(i) {append(dat1$Y[i], dat1$Y[-i])}) # all
permutations
SCEResid <- apply(combi, 2, FUN=function(x) sum(tapply(x, dat1$F, function(x)
sum((x - mean(x))^2)))) # all resi SS
(prob <- mean(SCEResid <= SCEResid[1])) # risk calculation
[1] 0.1
10% is indeed the exact risk.
Finally, my problem is : How can we know if R libraries that use randomization
procedures are not biased ? In the basic case of one way anova, it seems easy
to submit the above example (by the way, the defunct lmPerm library does not
succeed ...), but how can we check more complex anova models ?
[[alternative HTML version deleted]]
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.