Dear R-ussers,

While looking for a way to calculate the association between a countinuous 
and a binary variable, 
i found a procedure called point biserial corralation.
Me, not being a mathematicion, i did my very best to understand what it 
was all about, and then i found a easily understandable paper (by steve 
simon) on ow to calculate this. ref ## 
http://www.childrens-mercy.org/stats/definitions/biserial.htm (this page 
has the same example)
Further i discovered the polycor package in R.
Now i'm having troubles with the fact that the polycor pkg never gives me 
the same output as the manuals aplication of the formula.

In the example below found, manualy  r(biserial) = 0.49 between fb an age, 
and ussing function polyserial (polycor pkg)  r(biserial) =-0.8591.
This is a rather big difference, no due to abriviation or flootingpoints.

Is there someone whom is familiar with biserial correlation, and the 
appropriate way to calculate it?

Kind regards,
Tom.

here is the example, at the end is the R file. 

1e I create the input

> library(abind) 
> library (polycor)
> ### data input
> no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8)
> fb <- c(19, 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 14, 14, 22, 
17) 
> ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 12, 14, 12, 
18)
> age <-c('elderly', 'elderly', 'elderly', 'elderly', 'elderly', 
'elderly', 'elderly', 'elderly', 'elderly', 'young', 'young', 'young', 
'young', 'young', 'young', 'young', 'young') 
> 
> dataset <- data.frame(no,fb,ss,age)
> dataset <- subset(dataset,select=c(fb:age))
> nrow(dataset) 
[1] 17
> data_eld <- subset(dataset,age=='elderly',select=fb)
> data_young <- subset(dataset,age=='young',select=fb)
> 

here i calculate the R_bis (biserial corelation) manualy

> ### point biserial correlation
> 
>  fb <- subset(dataset,select=fb)
>  fb0 <- subset(dataset,age!='elderly',select=fb)
>  fb1 <- subset(dataset,age=='elderly',select=fb)
>  meanfb0 <- mean(fb0,na.rm=T)
>  meanfb1 <- mean(fb1,na.rm=T)
>  sdfb<- sd(dataset$fb,na.rm=T)
> 
>  ss <- subset(dataset, select=ss)
>  ss0 <- subset(dataset,age!='elderly',select=ss)
>  ss1 <- subset(dataset,age=='elderly',select=ss)
>  meanss0 <- mean(ss0,na.rm=T)
>  meanss1 <- mean(ss1,na.rm=T)
>  sdss<- sd(dataset$ss,na.rm=T) 
>  age <- subset(dataset,select=age) 
>   n <- nrow(dataset)
> 
this is the formula from  ref ## 
http://www.childrens-mercy.org/stats/definitions/biserial.htm

> > R_bis <- function(x,x1,x0,n){  p <- (nrow(x1)/n)
>+        (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) 
*sqrt(p*(1-p)))  } 

this is the corrected formula from  ref ## 
http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient
>> R_bis2 <- function(x,x1,x0,n){ 
+        ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))  * (sqrt( 
(nrow(x1)*nrow(x0))/(n*(n-1))))}
> 
>>  R_bis(fb,fb1,fb0,n) 
>       fb 
>0.4798873 

result in paper was 0.49 
 
>>   R_bis2(fb,fb1,fb0,n)
>       fb 
>0.4946565 

equals result in paper  0.49 

Then i use the polycor package,
function hetcor will give all the different correlation ressults

 
>> hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE)
>
>Maximum-Likelihood Estimates
>
>Correlations/Type of Correlation:
 >           dataset$fb dataset.ss dataset.age
>dataset$fb           1    Pearson  Polyserial
>dataset.ss       0.703          1  Polyserial
>dataset.age    -0.8591    -0.6685           1
>
>Standard Errors:
>           dataset$fb dataset.ss
>dataset$fb 
>dataset.ss      0.1215 
>dataset.age     0.1106     0.2497
>
>n = 17 
>
>P-values for Tests of Bivariate Normality:
>            dataset$fb dataset.ss
>dataset$fb 
>dataset.ss      0.1782 
>dataset.age     0.4269     0.4034
> hetcor(dataset,ML=TRUE)
>
>Maximum-Likelihood Estimates
>
>Correlations/Type of Correlation:
 >        fb      ss        age
>fb        1 Pearson Polyserial
>ss    0.703       1 Polyserial
>age -0.8591 -0.6685          1
>
>Standard Errors:
 >       fb     ss
>fb 
>ss  0.1215 
>age 0.1106 0.2497
>
>n = 17 
>
>P-values for Tests of Bivariate Normality:
>       fb     ss
>fb 
>ss  0.1782 
>age 0.4269 0.4034

here a quick two step method is ussed to calculate the polyserial 
correlation

> >polyserial(dataset$fb,dataset$age)
>[1] -0.6205737
>> polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE) 

same method  as in hetcor, only for indecated variables

>Polyserial Correlation, ML est. = -0.8591 (0.1106)
>Test of bivariate normality: Chisquare = 4.91, df = 5, p = 0.4269
>
>               1
>Threshold 0.1811
>Std.Err.  0.1849
> 


> ### for side to side (ss)   incase no 9 is an outlier in fb, this will 
not be the case in ss 
> 
>> R_bis(ss,ss1,ss0,n)
>       ss 
>0.4153681 

result in paper was 0.43 

>> R_bis2(ss,ss1,ss0,n)
>       ss 
>0.4281516 

equals result in paper  0.43

>> polyserial(dataset$ss,dataset$age) 
>[1] -0.5371397

>> polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE)
>
>Polyserial Correlation, ML est. = -0.6685 (0.2497)
>Test of bivariate normality: Chisquare = 5.103, df = 5, p = 0.4034
>
>               1
>Threshold 0.1504
>Std.Err.  0.2583
#################################################################################

### R-file
#################################################################################

library(abind) 
library (polycor)
### data input
no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8)
fb <- c(19, 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 14, 14, 22, 
17) 
ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 12, 14, 12, 
18)
age <-c('elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 
'elderly', 'elderly', 'elderly', 'young', 'young', 'young', 'young', 
'young', 'young', 'young', 'young') 

dataset <- data.frame(no,fb,ss,age)
dataset <- subset(dataset,select=c(fb:age))
nrow(dataset) 
data_eld <- subset(dataset,age=='elderly',select=fb)
data_young <- subset(dataset,age=='young',select=fb)

### point biserial correlation

 fb <- subset(dataset,select=fb)
 fb0 <- subset(dataset,age!='elderly',select=fb)
 fb1 <- subset(dataset,age=='elderly',select=fb)
 meanfb0 <- mean(fb0,na.rm=T)
 meanfb1 <- mean(fb1,na.rm=T)
 sdfb<- sd(dataset$fb,na.rm=T)

 ss <- subset(dataset, select=ss)
 ss0 <- subset(dataset,age!='elderly',select=ss)
 ss1 <- subset(dataset,age=='elderly',select=ss)
 meanss0 <- mean(ss0,na.rm=T)
 meanss1 <- mean(ss1,na.rm=T)
 sdss<- sd(dataset$ss,na.rm=T) 
 age <- subset(dataset,select=age) 
  n <- nrow(dataset)
 
 R_bis <- function(x,x1,x0,n){  p <- (nrow(x1)/n)
       (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) 
*sqrt(p*(1-p)))  }
R_bis2 <- function(x,x1,x0,n){ 
       ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))  * (sqrt( 
(nrow(x1)*nrow(x0))/(n*(n-1))))}
 
 R_bis(fb,fb1,fb0,n)
  R_bis2(fb,fb1,fb0,n)

 
hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE)
hetcor(dataset,ML=TRUE)

polyserial(dataset$fb,dataset$age)
polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE) 

### for side to side (ss)   incase no 9 is an outlier in fb, this will not 
be the case in ss 

R_bis(ss,ss1,ss0,n)
R_bis2(ss,ss1,ss0,n)

polyserial(dataset$ss,dataset$age) 
polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE)

## END 
##########################################################################

Willems Tom

E-mail: [EMAIL PROTECTED]




Disclaimer: click here
        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to