Re: [R] OT: (quasi-?) separation in a logistic GLM

David Winsemius Mon, 15 Dec 2008 18:11:51 -0800

If you look at the distribution of those with analogs==TRUE versus thewhole groups it is not surprising that the upper range of Dij's resultin a very low probability estimate:


 plot(density(dat$Dij))
lines(density(dat[dat$analogs == TRUE, 2]))

Appears as though more than 25% of the predict(mod, type="response")elements will satisfyall.equal(0, <predict>) which on my machine occurs when <predict> is(roughly) below 1e-9



Best;

David Winsemius

On Dec 15, 2008, at 1:03 PM, Gavin Simpson wrote:

Dear List,

Apologies for this off-topic post but it is R-related in the sensethat

I am trying to understand what R is telling me with the data to hand.

ROC curves have recently been used to determine a dissimilarity
threshold for identifying whether two samples are from the same "type"

or not. Given the bashing that ROC curves get whenever anyone asksabout

them on this list (and having implemented the ROC methodology in my
analogue package) I wanted to try directly modelling the probability
that two sites are analogues for one another for given dissimilarity
using glm().

The data I have then are a logical vector ('analogs') indicatingwhether

the two sites come from the same vegetation and a vector of the
dissimilarity between the two sites ('Dij'). These are in a csv file
currently in my university web space. Each 'row' in this file
corresponds to single comparison between 2 sites.

When I analyse these data using glm() I get the familiar "fitted

probabilities numerically 0 or 1 occurred" warning. The data do notlook

linearly separable when plotted (code for which is below). I have read

Venables and Ripley's discussion of this in MASS4 and other sourcesthatdiscuss this warning and R (Faraway's Extending the Linear Modelwith R

and John Fox's new Applied Regression, Generalized Linear Models, and
Related Methods, 2nd Ed) as well as some of the literature on Firth's
bias reduction method. But I am still somewhat unsure what

(quasi-)separation is and if this is the reason for the warnings inthis

case.

My question then is, is this a separation issue with my data, or is it
quasi-separation that I have read a bit about whilst researching this
problem? Or is this something completely different?

Code to reproduce my problem with the actual data is given below. I'd
appreciate any comments or thoughts on this.

#### Begin code snippet################################################


## note data file is ~93Kb in size

dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/dat.csv"))

head(dat)
## fit model --- produces warning
mod <- glm(analogs ~ Dij, data = dat, family = binomial)
## plot the data
plot(analogs ~ Dij, data = dat)
fit.mod <- fitted(mod)
ord <- with(dat, order(Dij))
with(dat, lines(Dij[ord], fit.mod[ord], col = "red", lwd = 2))

#### End code snippet##################################################


Thanks in advance

Gavin
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] OT: (quasi-?) separation in a logistic GLM

Reply via email to