Hi:
I think you created a problem for yourself in the way you generated your
data.
y <- rbinom(2000, 1, 0.7)
euro <- rnorm(2000, mean = 300 * y + 50 * (1 - y), sd = 20 * y + 12 * (1 - y))
# Create a 2000 x 2 matrix of probabilities
prmat <- cbind(0.8 * y + 0.2 * (1 - y), 0.2 * y + 0.8 * (1 - y))
# Sample sex for each row of prmat, with the row giving the
# sampling distribution
sex <- apply(prmat, 1, function(x) sample(c('m', 'f'), 1, prob = x))
df <- data.frame(euro, sex, y)
# Histogram of euro: notice the separation in distributions
hist(euro, nclass = 50)
# Generate an indicator between the two clusters of euro
spl <- euro > 150
# Now show a table of that split vs. response
table(spl, y)
This is what I get for my simulation:
table(spl, y)
       y
spl        0    1
  FALSE  572    0
  TRUE     0 1428
which in turn leads to
m <- glm(y ~ euro + sex, data = df, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
This is what is known as 'complete data separation' in the logistic
regression literature. Basically, you've generated data so that all the
successes are associated with a N(300, 20) distribution and all the failures
with a N(50, 12) distribution. If this is a classification problem,
congratulations - the margin on the support vector will be huge :) OTOH, if
you're trying to fit a logistic model for purposes of explanation, you've
created a problem, especially with respect to prediction.
...and it does matter whether this is a regression problem or a
classification problem. In the latter, separation is a good thing; in the
former, it creates convergence problems.
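To see the symptom in miniature, here is a toy data set (mine, not from the thread) with complete separation: every y = 1 has a larger x than every y = 0. glm() warns, the slope estimate runs off toward infinity, and the fitted probabilities pile up at 0 and 1:

```r
# A tiny, fully separated data set, just to show the symptom
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)
fit <- glm(y ~ x, family = binomial)  # expect the same two glm.fit warnings
coef(fit)["x"]     # the slope estimate is huge (diverging toward infinity)
range(fitted(fit)) # fitted probabilities essentially 0 and 1
```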
Since you have a continuous predictor in the model, there is an additional
complication: the logistic regression null deviance does not have an
asymptotic chi-square distribution, so tests involving reductions of
deviance from the null model are not guaranteed to have asymptotic
chi-square distributions *when the predictor x is truly continuous*.
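By contrast, with grouped (binomial-count) data and a fixed set of covariate patterns, the residual deviance does have an approximate chi-square reference distribution as the group sizes grow. A sketch, with made-up covariate patterns and coefficients:

```r
# Grouped binomial data: 10 covariate patterns, 50 trials each, so the
# residual deviance can be referred to a chi-square distribution
set.seed(1)
xg <- 1:10
n  <- rep(50, 10)
s  <- rbinom(10, n, plogis(-2 + 0.3 * xg))  # successes per pattern
fit <- glm(cbind(s, n - s) ~ xg, family = binomial)
# Approximate goodness-of-fit p-value: deviance vs. chi-square on 8 df
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
```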
More below.
On Wed, Dec 29, 2010 at 9:48 AM, Federico Bonofiglio <bonori...@gmail.com> wrote:
Dear Masters,
first I'd like to wish you all a great 2011 and happy holidays, and
second (here comes the boring stuff) I have a question which I hope you
can answer:
I run a logistic regression with glm() on data of the following form:
(y1 = 1, x1 = x1); (y2 = 0, x2 = x2); ... (yn = 0, xn = xn), where the response (y) is
a binary outcome on 0/1 and x is any explanatory variable (continuous or
not) observed with the i-th outcome.
This is indeed one of the most frequent cases when dealing with binary
responses, though I know R can also handle such responses in a slightly
different form (a vector for the success counts and one for the failures),
and I'm not sure whether my summary.glm gives me any meaningful answer at all.
For the purpose I have tried to emphasize the numbers so as to obtain
significant results:
y <- rbinom(2000, 1, 0.7)  # response
euro <- numeric(2000)      # preallocate; euro[i] <- ... fails otherwise
for (i in 1:2000) {
  euro[i] <- if (y[i] == 1) {  # explanatory 1
    rnorm(1, 300, 20)
  } else {
    rnorm(1, 50, 12)
  }
}
sex <- character(2000)
for (i in 1:2000) {
  sex[i] <- if (y[i] == 1) {   # explanatory 2
    sample(c('m', 'f'), 1, prob = c(0.8, 0.2))
  } else {
    sample(c('m', 'f'), 1, prob = c(0.2, 0.8))
  }
}
m <- glm(y ~ euro + factor(sex), family = binomial)
summary(m)
My worries:
- are the estimates correct?
The people who wrote the glm() routine were smart enough to anticipate the
data separation case and are warning you of potential instability in the
model estimates/predictions as a result of producing fitted probabilities of
exactly 0 or 1. This is a warning to take seriously - your generated data
produce these based on euro alone.
- the degrees of freedom increase dramatically (one per cell), so do I
risk never obtaining a significant result?
Comparisons between nested models will be the same whether the data are
grouped or ungrouped, in non-pathological situations.
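A sketch of that claim (the data here are simulated for illustration): the deviance *differences* between nested models agree exactly whether the data are entered ungrouped (one row per Bernoulli trial) or grouped (binomial counts per covariate pattern), even though the residual deviances themselves differ.

```r
set.seed(2)
x <- rep(1:5, each = 40)
y <- rbinom(200, 1, plogis(-1 + 0.4 * x))
# Ungrouped fits: one row per Bernoulli observation
m.u0 <- glm(y ~ 1, family = binomial)
m.u1 <- glm(y ~ x, family = binomial)
# Grouped fits: successes/failures per distinct x
s  <- tapply(y, x, sum)
n  <- tapply(y, x, length)
xg <- sort(unique(x))
m.g0 <- glm(cbind(s, n - s) ~ 1, family = binomial)
m.g1 <- glm(cbind(s, n - s) ~ xg, family = binomial)
# The likelihood-ratio statistics match
c(ungrouped = deviance(m.u0) - deviance(m.u1),
  grouped   = deviance(m.g0) - deviance(m.g1))
```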
I also take the chance to ask whether you know of any implemented method to
plot logistic curves directly from a glm() model.
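One simple way (a sketch with simulated data, not a dedicated package): evaluate predict(type = "response") on a grid of x values and draw the curve with lines():

```r
set.seed(3)
x <- runif(200, 0, 50)
y <- rbinom(200, 1, plogis(-3 + 0.15 * x))
fit <- glm(y ~ x, family = binomial)
# Fitted probabilities on a grid of x values
grid  <- data.frame(x = seq(0, 50, length.out = 200))
p.hat <- predict(fit, newdata = grid, type = "response")
plot(x, y, pch = "|", ylab = "P(y = 1)")
lines(grid$x, p.hat, lwd = 2)
```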
The following is an example to illustrate some of the questions you raised.
# Example to illustrate the difference between grouped and ungrouped
# logistic regression analyses
library(reshape2)
library(lattice)
# Sample 50 distinct x values 300 times
x <- sample(1:50, 300, replace = TRUE)
# P(Y = 1 | X) increases with x
y <- rbinom(300, 1, (10 + x)/80)
ind <- x > 25
# males sampled more heavily when x > 25
p <- cbind(0.7 * ind + 0.3 * (1 - ind), 0.3 * ind + 0.7 * (1 - ind))
sex <- apply(p, 1, function(x) sample(c('m', 'f'), 1, prob = x))
df <- data.frame(sex, x, y)
# Ungrouped logistic regression
# treat x as a continuous covariate
m1 <- glm(y ~ sex + x, data = df, family = binomial)
# Group the data by x * sex combinations
u <- as.data.frame(xtabs(~ y + sex + x, data = df))
# dcast() reshapes the data so that the 0/1 frequencies become separate columns
# dcast() comes from package reshape2
# In this case, sex and x are ID variables, and y is reshaped