Hi,

I have just encountered strange behaviour in 'cor' regarding the treatment of NAs when calculating Spearman correlations. I suspect it is a subtle bug.

If I understand the help page correctly, the two modes 'complete.obs' and 'pairwise.complete.obs' specify how to deal with missing values (NAs) when calculating a correlation _matrix_: the first drops every observation that has an NA in any column, the second drops observations separately for each pair of columns. When calculating a single (scalar) correlation coefficient for two data vectors x and y, there is only one pair, so both should give the same result.

For Pearson correlation, this is in fact the case:

x <- runif( 10 )
y <- runif( 10 )
y[5] <- NA

cor( x, y, use="complete.obs" )
[1] 0.407858
cor( x, y, use="pairwise.complete.obs" )
[1] 0.407858

For Spearman correlation, we do NOT get the same results:

cor( x, y, method="spearman", use="complete.obs" )
[1] 0.3416009
cor( x, y, method="spearman", use="pairwise.complete.obs" )
[1] 0.3333333

To see the likely reason for this possible bug, compare ranking before and after removing the incomplete observations:

goodobs <- !is.na(x) & !is.na(y)

cor( rank(x)[goodobs], rank(y)[goodobs] )
[1] 0.3416009
cor( rank(x[goodobs]), rank(y[goodobs]) )
[1] 0.3333333
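
The difference between the two orders of operation can be reproduced without random numbers. The following is only a minimal sketch: the data are made up for illustration (they are not the values from the session above), contain no ties, and the variable names rank_then_subset and subset_then_rank are my own.

```r
# Made-up illustration data, no ties; y has one NA at position 3.
x <- c(0.9, 0.2, 0.5, 0.7, 0.1, 0.3)
y <- c(0.4, 0.8, NA, 0.6, 0.2, 0.9)

goodobs <- !is.na(x) & !is.na(y)

# Rank over all 6 values of x, then drop the incomplete pair:
# the surviving x ranks are 6 2 5 1 3, so rank 4 is missing.
rank_then_subset <- cor(rank(x)[goodobs],
                        rank(y, na.last = "keep")[goodobs])

# Drop the incomplete pair first, then rank the 5 remaining values:
# the x ranks are 5 2 4 1 3, a complete set 1..5.
subset_then_rank <- cor(rank(x[goodobs]), rank(y[goodobs]))

rank_then_subset   # about 0 for these data
subset_then_rank   # 0.1 for these data
```

Only the second value agrees with the textbook formula 1 - 6*sum(d^2)/(n*(n^2-1)) applied to the 5 complete pairs.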

I would claim that only the calculation resulting in 0.3333 is a proper Spearman correlation, while the line resulting in 0.3416 is not. After all, the following is not a complete set of ranks: it contains 9 values taken from the range 1 to 10, skipping the 3:

rank(x)[goodobs]
[1] 10  6  8  7  4  5  1  9  2
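
Whether a vector is a complete set of ranks can also be checked mechanically. This is a sketch only: is_complete_rank_set is a hypothetical helper of mine, not an R function, and it assumes there are no ties (with ties, rank() returns averaged, possibly non-integer ranks and this test does not apply).

```r
# Hypothetical helper: TRUE if r is a permutation of 1..length(r).
# Assumes no ties among the ranked values.
is_complete_rank_set <- function(r) {
  setequal(r, seq_along(r))
}

# The ranks shown above: 9 values, but ranging up to 10.
r <- c(10, 6, 8, 7, 4, 5, 1, 9, 2)

is_complete_rank_set(r)            # FALSE
missing_ranks <- setdiff(seq_len(max(r)), r)
missing_ranks                      # 3
```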

Would you hence agree that 'method="spearman"' with 'use="complete.obs"', the combination that returned 0.3416 above, is incorrect?
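
In the meantime, a possible way to avoid depending on the 'use' argument at all is to drop the incomplete pairs by hand before ranking. The sketch below uses a hypothetical helper name, spearman_complete, and made-up data; it is meant as an illustration, not as a proposed patch to 'cor'.

```r
# Hypothetical helper: Spearman correlation on complete pairs only.
# Incomplete pairs are removed BEFORE ranking, so the ranks of the
# remaining values always form a complete set.
spearman_complete <- function(x, y) {
  ok <- complete.cases(x, y)
  cor(rank(x[ok]), rank(y[ok]))
}

# Made-up illustration data, no ties; y has one NA.
x <- c(0.9, 0.2, 0.5, 0.7, 0.1, 0.3)
y <- c(0.4, 0.8, NA, 0.6, 0.2, 0.9)

rho <- spearman_complete(x, y)
rho   # 0.1 for these data
```

Because the vectors passed on contain no NAs, the result should not depend on which 'use' mode 'cor' would otherwise have applied.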

Cheers
  Simon


sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pspearman_0.2-5 SuppDists_1.1-8

loaded via a namespace (and not attached):
[1] tools_2.12.0




+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sand...@fs.tum.de

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
