Dear R-ussers, While looking for a way to calculate the association between a countinuous and a binary variable, i found a procedure called point biserial corralation. Me, not being a mathematicion, i did my very best to understand what it was all about, and then i found a easily understandable paper (by steve simon) on ow to calculate this. ref ## http://www.childrens-mercy.org/stats/definitions/biserial.htm (this page has the same example) Further i discovered the polycor package in R. Now i'm having troubles with the fact that the polycor pkg never gives me the same output as the manuals aplication of the formula.
In the example below found, manualy r(biserial) = 0.49 between fb an age, and ussing function polyserial (polycor pkg) r(biserial) =-0.8591. This is a rather big difference, no due to abriviation or flootingpoints. Is there someone whom is familiar with biserial correlation, and the appropriate way to calculate it? Kind regards, Tom. here is the example, at the end is the R file. 1e I create the input > library(abind) > library (polycor) > ### data input > no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8) > fb <- c(19, 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 14, 14, 22, 17) > ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 12, 14, 12, 18) > age <-c('elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'young', 'young', 'young', 'young', 'young', 'young', 'young', 'young') > > dataset <- data.frame(no,fb,ss,age) > dataset <- subset(dataset,select=c(fb:age)) > nrow(dataset) [1] 17 > data_eld <- subset(dataset,age=='elderly',select=fb) > data_young <- subset(dataset,age=='young',select=fb) > here i calculate the R_bis (biserial corelation) manualy > ### point biserial correlation > > fb <- subset(dataset,select=fb) > fb0 <- subset(dataset,age!='elderly',select=fb) > fb1 <- subset(dataset,age=='elderly',select=fb) > meanfb0 <- mean(fb0,na.rm=T) > meanfb1 <- mean(fb1,na.rm=T) > sdfb<- sd(dataset$fb,na.rm=T) > > ss <- subset(dataset, select=ss) > ss0 <- subset(dataset,age!='elderly',select=ss) > ss1 <- subset(dataset,age=='elderly',select=ss) > meanss0 <- mean(ss0,na.rm=T) > meanss1 <- mean(ss1,na.rm=T) > sdss<- sd(dataset$ss,na.rm=T) > age <- subset(dataset,select=age) > n <- nrow(dataset) > this is the formula from ref ## http://www.childrens-mercy.org/stats/definitions/biserial.htm > > R_bis <- function(x,x1,x0,n){ p <- (nrow(x1)/n) >+ (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) *sqrt(p*(1-p))) } this is the corrected formula from ref ## http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient >> R_bis2 <- function(x,x1,x0,n){ + ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) * (sqrt( (nrow(x1)*nrow(x0))/(n*(n-1))))} > >> R_bis(fb,fb1,fb0,n) > fb >0.4798873 result in paper was 0.49 >> R_bis2(fb,fb1,fb0,n) > fb >0.4946565 equals result in paper 0.49 Then i use the polycor package, function hetcor will give all the different correlation ressults >> hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE) > >Maximum-Likelihood Estimates > >Correlations/Type of Correlation: > dataset$fb dataset.ss dataset.age >dataset$fb 1 Pearson Polyserial >dataset.ss 0.703 1 Polyserial >dataset.age -0.8591 -0.6685 1 > >Standard Errors: > dataset$fb dataset.ss >dataset$fb >dataset.ss 0.1215 >dataset.age 0.1106 0.2497 > >n = 17 > >P-values for Tests of Bivariate Normality: > dataset$fb dataset.ss >dataset$fb >dataset.ss 0.1782 >dataset.age 0.4269 0.4034 > hetcor(dataset,ML=TRUE) > >Maximum-Likelihood Estimates > >Correlations/Type of Correlation: > fb ss age >fb 1 Pearson Polyserial >ss 0.703 1 Polyserial >age -0.8591 -0.6685 1 > >Standard Errors: > fb ss >fb >ss 0.1215 >age 0.1106 0.2497 > >n = 17 > >P-values for Tests of Bivariate Normality: > fb ss >fb >ss 0.1782 >age 0.4269 0.4034 here a quick two step method is ussed to calculate the polyserial correlation > >polyserial(dataset$fb,dataset$age) >[1] -0.6205737 >> polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE) same method as in hetcor, only for indecated variables >Polyserial Correlation, ML est. = -0.8591 (0.1106) >Test of bivariate normality: Chisquare = 4.91, df = 5, p = 0.4269 > > 1 >Threshold 0.1811 >Std.Err. 0.1849 > > ### for side to side (ss) incase no 9 is an outlier in fb, this will not be the case in ss > >> R_bis(ss,ss1,ss0,n) > ss >0.4153681 result in paper was 0.43 >> R_bis2(ss,ss1,ss0,n) > ss >0.4281516 equals result in paper 0.43 >> polyserial(dataset$ss,dataset$age) >[1] -0.5371397 >> polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE) > >Polyserial Correlation, ML est. = -0.6685 (0.2497) >Test of bivariate normality: Chisquare = 5.103, df = 5, p = 0.4034 > > 1 >Threshold 0.1504 >Std.Err. 0.2583 ################################################################################# ### R-file ################################################################################# library(abind) library (polycor) ### data input no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8) fb <- c(19, 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 14, 14, 22, 17) ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 12, 14, 12, 18) age <-c('elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'elderly', 'young', 'young', 'young', 'young', 'young', 'young', 'young', 'young') dataset <- data.frame(no,fb,ss,age) dataset <- subset(dataset,select=c(fb:age)) nrow(dataset) data_eld <- subset(dataset,age=='elderly',select=fb) data_young <- subset(dataset,age=='young',select=fb) ### point biserial correlation fb <- subset(dataset,select=fb) fb0 <- subset(dataset,age!='elderly',select=fb) fb1 <- subset(dataset,age=='elderly',select=fb) meanfb0 <- mean(fb0,na.rm=T) meanfb1 <- mean(fb1,na.rm=T) sdfb<- sd(dataset$fb,na.rm=T) ss <- subset(dataset, select=ss) ss0 <- subset(dataset,age!='elderly',select=ss) ss1 <- subset(dataset,age=='elderly',select=ss) meanss0 <- mean(ss0,na.rm=T) meanss1 <- mean(ss1,na.rm=T) sdss<- sd(dataset$ss,na.rm=T) age <- subset(dataset,select=age) n <- nrow(dataset) R_bis <- function(x,x1,x0,n){ p <- (nrow(x1)/n) (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) *sqrt(p*(1-p))) } R_bis2 <- function(x,x1,x0,n){ ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T)) * (sqrt( (nrow(x1)*nrow(x0))/(n*(n-1))))} R_bis(fb,fb1,fb0,n) R_bis2(fb,fb1,fb0,n) hetcor(dataset$fb,dataset$ss,dataset$age ,ML=TRUE) hetcor(dataset,ML=TRUE) polyserial(dataset$fb,dataset$age) polyserial(dataset$fb,dataset$age, ML=TRUE, std.err=TRUE) ### for side to side (ss) incase no 9 is an outlier in fb, this will not be the case in ss R_bis(ss,ss1,ss0,n) R_bis2(ss,ss1,ss0,n) polyserial(dataset$ss,dataset$age) polyserial(dataset$ss,dataset$age, ML=TRUE, std.err=TRUE) ## END ########################################################################## Willems Tom E-mail: [EMAIL PROTECTED] Disclaimer: click here [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.