Dear List,

Apologies for this off-topic post, but it is R-related in the sense that I am trying to understand what R is telling me with the data to hand.
ROC curves have recently been used to determine a dissimilarity threshold for identifying whether two samples are from the same "type" or not. Given the bashing that ROC curves get whenever anyone asks about them on this list (and having implemented the ROC methodology in my analogue package), I wanted to try directly modelling the probability that two sites are analogues for one another for a given dissimilarity using glm().

The data I have are a logical vector ('analogs') indicating whether the two sites come from the same vegetation, and a vector of the dissimilarity between the two sites ('Dij'). These are in a csv file currently in my university web space. Each 'row' in this file corresponds to a single comparison between 2 sites.

When I analyse these data using glm() I get the familiar "fitted probabilities numerically 0 or 1 occurred" warning. The data do not look linearly separable when plotted (code for which is below).

I have read Venables and Ripley's discussion of this in MASS4 and other sources that discuss this warning in R (Faraway's Extending the Linear Model with R and John Fox's new Applied Regression, Generalized Linear Models, and Related Methods, 2nd Ed), as well as some of the literature on Firth's bias reduction method. But I am still somewhat unsure what (quasi-)separation is and whether it is the reason for the warnings in this case.

My question then is: is this a separation issue with my data, or is it the quasi-separation that I have read a bit about whilst researching this problem? Or is this something completely different?

Code to reproduce my problem with the actual data is given below. I'd appreciate any comments or thoughts on this.
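Separately from the real data, here is a minimal sketch on made-up values of the distinction I am trying to pin down. It is not an analysis of dat.csv. The names x, y_sep, y_overlap and the helper is_separated() are hypothetical and invented purely for illustration. With a single predictor, complete separation should mean that the predictor ranges of the two classes do not overlap at all, so a crude check is to compare those ranges. The second case is built so that the classes do overlap slightly, yet the fitted curve is steep enough that, as far as I can tell, glm() still produces extreme fitted probabilities.

```r
## Illustration on made-up values only; this is NOT the data in dat.csv
x <- seq(0, 1, length.out = 200)

## Case 1: complete separation, every x below 0.5 is TRUE and the rest FALSE
y_sep <- x < 0.5

## Case 2: same labels, but two observations near the boundary are swapped
## so that the classes overlap slightly (no separation)
y_overlap <- y_sep
y_overlap[98]  <- FALSE
y_overlap[103] <- TRUE

## hypothetical helper: TRUE if the x ranges of the two classes do not overlap
is_separated <- function(x, y) {
    max(x[y]) < min(x[!y]) || max(x[!y]) < min(x[y])
}

is_separated(x, y_sep)      # TRUE
is_separated(x, y_overlap)  # FALSE

## The overlapping case has a finite but very steep fitted slope, so the
## fitted probabilities at the extremes of x are pushed towards 0 and 1
fit_overlap <- suppressWarnings(glm(y_overlap ~ x, family = binomial))
coef(fit_overlap)
range(fitted(fit_overlap))
```

If this sketch is right, the same range comparison applied to 'Dij' split by 'analogs' would say whether the real data are completely separated, and the warning on its own would not be enough to conclude that they are.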
#### Begin code snippet ################################################
## note data file is ~93Kb in size
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/dat.csv"))
head(dat)

## fit model --- produces warning
mod <- glm(analogs ~ Dij, data = dat, family = binomial)

## plot the data
plot(analogs ~ Dij, data = dat)
fit.mod <- fitted(mod)
ord <- with(dat, order(Dij))
with(dat, lines(Dij[ord], fit.mod[ord], col = "red", lwd = 2))
#### End code snippet ##################################################

Thanks in advance,

Gavin

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%