Re: [R] ANOVA and permutation tests : beware of traps

2014-09-23 Thread Joshua Wiley
Hi Stephane,

This is the well-known result of limited floating point precision (e.g.,
http://www.validlab.com/goldberg/addendum.html).  Using a test of
approximate rather than exact equality shows that R yields the correct answer:

nperm <- 10000  # number of permutations
Fperm <- replicate(n=nperm, anova(lm(sample(Y) ~ F, dat1))$F[1])  # calculates nperm F values
(prob <- (sum(anova1$F[1] <= (Fperm + .Machine$double.eps^0.5)) + 1)/(nperm + 1))  # risk calculation

yields

0.10009

I picked .Machine$double.eps^0.5 somewhat arbitrarily, as it is the default
tolerance for equality testing in ?all.equal.
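
For illustration, a minimal sketch (using the anova1 and anova2 objects from
Stéphane's message quoted below): the exact comparison fails, while the
tolerance-based ones agree.

anova1$F[1] == anova2$F[1]                                 # FALSE: they differ by about 6e-14
abs(anova1$F[1] - anova2$F[1]) < .Machine$double.eps^0.5   # TRUE: equal within the tolerance
isTRUE(all.equal(anova1$F[1], anova2$F[1]))                # TRUE: all.equal's default tolerance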

Cheers,

Josh



On Tue, Sep 23, 2014 at 5:14 PM, Stéphane Adamowicz 
stephane.adamow...@avignon.inra.fr wrote:

 Recently, I came across a strange and potentially troublesome behaviour of
 the lm and aov functions that raises questions about calculation accuracy.
 Let us consider the two following datasets, dat1 & dat2:

  (dat1 <- data.frame(Y=c(1:3, 10+1:3), F=c(rep("A",3), rep("B",3))))
    Y F
 1  1 A
 2  2 A
 3  3 A
 4 11 B
 5 12 B
 6 13 B
  (dat2 <- data.frame(Y=c(10+1:3, 1:3), F=c(rep("A",3), rep("B",3))))
    Y F
 1 11 A
 2 12 A
 3 13 A
 4  1 B
 5  2 B
 6  3 B

 They differ only in the order of the values, which were exchanged between
 samples A and B. Thus the sd is 1 for each sample in both datasets, and the
 absolute mean difference |A-B| is 10 in both datasets.
 Now, let us perform an anova to compare samples A and B in both datasets
 (of course, in such a simple case a two-sided t test would do the job, but
 an anova is nevertheless allowed and should give the same probability as
 Student's test):

  (anova1 <- anova(lm(Y ~ F, dat1)))
 Analysis of Variance Table

 Response: Y
            Df Sum Sq Mean Sq F value    Pr(>F)
 F           1    150     150     150 0.0002552 ***
 Residuals   4      4       1

  (anova2 <- anova(lm(Y ~ F, dat2)))
 Analysis of Variance Table

 Response: Y
            Df Sum Sq Mean Sq F value    Pr(>F)
 F           1    150     150     150 0.0002552 ***
 Residuals   4      4       1

 As expected, both datasets give the same anova table, but this equality is
 only apparent. Indeed:

  anova1$F[1] == anova2$F[1]
 [1] FALSE
  anova1$F[1] - anova2$F[1]
 [1] 5.684342e-14

 In fact the F values differ slightly, and the same holds for the aov
 function. I also checked (not shown) that both the residual and factorial
 sums of squares differ between dat1 and dat2. Thus, for some reason that is
 undocumented (at least for the end user), the F values depend on the order
 of the data!
 While such tiny differences (about 1e-14 in this example) have no consequence
 on the risk evaluation based on Fisher's distribution, they may have huge
 consequences on the risk evaluation by the permutation method. Indeed, the
 shift from a continuous to a discrete distribution is far from insignificant.
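
 To see why this matters once the risk is evaluated by a discrete count, note
 that the two F statistics above, although equal for any statistical purpose,
 already fall on opposite sides of a strict inequality (a minimal sketch using
 the anova1 and anova2 objects defined above):

  anova1$F[1] <= anova2$F[1]                             # FALSE: anova1's F is larger by ~6e-14
  anova1$F[1] <= anova2$F[1] + .Machine$double.eps^0.5   # TRUE once a small tolerance is allowed

 Each such flip adds or removes a whole permutation from the count, which is
 how the estimated risk drifts away from the exact value.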

 For instance, the following R code is at the basis of many permutation
 algorithms found on the internet and used in teaching, because it seems quite
 straightforward (see for example
 http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Permutation%20Anova/PermTestsAnova.html
 http://www.usna.edu/Users/math/jager/courses/sm439/labs/Lab7.pdf
 http://biostatistics1014.blogspot.fr/2013/04/one-way-anova-permutation-test-and.html
 http://adn.biol.umontreal.ca/~numericalecology/Rcode/):

  nperm <- 1000  # number of permutations
  Fperm <- replicate(n=nperm, anova(lm(sample(Y) ~ F, dat1))$F[1])  # calculates nperm F values
  (prob <- (sum(anova1$F[1] <= Fperm) + 1)/(nperm + 1))  # risk calculation
 [1] 0.04695305

 Of course, because of the sample function, repeating this code gives
 different prob values, but they always remain close to 5% instead of the
 exact probability of 10%. Indeed, there are only choose(6,3) = 20 possible
 permutations, and because they come in symmetric pairs they yield only 10
 absolute mean differences. Thus, the only exact probabilities are 10%,
 20%, ..., 100%. In our example, where the samples do not overlap, 10% is
 obviously the right answer.
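
 As a check of this claim, here is a minimal sketch (the names splits and
 absdiff are mine) that enumerates the 20 assignments directly on the absolute
 mean difference; it is the same idea as the residual-sum-of-squares
 enumeration given further below:

  splits <- combn(6, 3)   # the 20 ways of choosing which 3 observations form sample A
  absdiff <- apply(splits, 2, function(i) abs(mean(dat1$Y[i]) - mean(dat1$Y[-i])))
  mean(absdiff >= absdiff[1])   # 0.1: only the original split and its mirror reach |A-B| = 10
                                # (10 is computed exactly here, so >= needs no tolerance)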

 Thus, the use of the lm and aov functions in permutation methods does not
 seem a good idea, as it introduces a bias that underestimates the exact risk.
 In the simple case of one-way anova, it is quite simple to remedy this
 problem. Since the total sum of squares and the degrees of freedom do not
 change under permutation, it is easier and much faster to compare the
 residual sums of squares. For instance, the exact probabilities can be
 calculated this way:

  combi <- combn(6, 3, FUN=function(i) {append(dat1$Y[i], dat1$Y[-i])})  # all permutations
  SCEResid <- apply(combi, 2, FUN=function(x) sum(tapply(x, dat1$F, function(x) sum((x - mean(x))^2))))  # all residual SS
  (prob <- mean(SCEResid <= SCEResid[1]))  # risk calculation
 [1] 0.1

 10% is indeed the exact risk.

 Finally, my problem is: how can we know whether R libraries that use
 randomization procedures are unbiased? In the basic case of one-way anova, it
 seems easy to submit the above example to them (by the way, the defunct
 lmPerm library does not succeed ...), but how can we check more complex
 anova models?
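
 One way to at least screen such libraries is to compare their answer on tiny
 two-group datasets against a full enumeration of the kind shown above. A
 sketch, with a hypothetical helper of my own (exact.perm.p is not part of any
 package):

  exact.perm.p <- function(Y, F) {
    # exact one-way, two-group permutation p-value by full enumeration of residual SS
    F   <- factor(F)
    rss <- function(y, g) sum(tapply(y, g, function(v) sum((v - mean(v))^2)))
    obs <- rss(Y, F)                                # observed residual sum of squares
    nA  <- sum(F == levels(F)[1])                   # size of the first group
    all.rss <- combn(length(Y), nA,
                     FUN=function(i) rss(c(Y[i], Y[-i]), sort(F)))
    mean(all.rss <= obs + .Machine$double.eps^0.5)  # tolerance guards against rounding ties
  }
  exact.perm.p(dat1$Y, dat1$F)   # 0.1, the value an unbiased procedure should approach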

Re: [R] ANOVA and permutation tests : beware of traps

2014-09-23 Thread Prof Brian Ripley
Beware of the trap of listening to people with no knowledge of basic 
numerical methods!


It really is basic that the results of floating-point computer 
calculations depend on the order in which they are done (and the 
compiler can change the order).  Using == on such calculations is warned 
about on its help page.


The same warning applies to using <= or >= if equality is important.

It bears repeating that the problems are exacerbated on platforms which 
use extended-precision registers for some but not all of their 
calculations (and most R platforms do).  The classic reference is pp. 
248ff of http://www.validlab.com/goldberg/paper.pdf (part of a Sun manual).
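
A minimal base-R illustration of both points (order dependence, and why == is
dangerous on such results):

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)                    # FALSE: same numbers, different order
(0.1 + 0.2) + 0.3 - (0.1 + (0.2 + 0.3))                   # about 1.1e-16
isTRUE(all.equal((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3)))   # TRUE: compare with a tolerance instead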


