[R] logistic regression with response 0,1

2010-12-29 Thread Federico Bonofiglio
Dear Masters,
first I'd like to wish you all a great 2011 and happy holidays,

second (here comes the boring stuff) I have a question which I hope you
can answer:

I run a logistic regression with glm() on data of the following type:
(y1=1, x1=x1); (y2=0, x2=x2); ... (yn=0, xn=xn), where the response (y) is
a binary outcome in {0, 1} and x is any explanatory variable (continuous or not)
observed with the i-th outcome.

This is indeed one of the most frequent cases when dealing with binary
responses, though I know R can also handle such responses slightly differently
(a vector of success counts and one of failure counts), and I'm not sure
whether my summary.glm gives me any sensible answer at all.

For the purpose of this example I have exaggerated the numbers so as to obtain
significant results:

y <- rbinom(2000, 1, .7)  # response

euro <- numeric(2000)
for (i in 1:2000) {
  euro[i] <- if (y[i] == 1) {  # explanatory 1
    rnorm(1, 300, 20)
  } else {
    rnorm(1, 50, 12)
  }
}

sex <- character(2000)
for (i in 1:2000) {
  sex[i] <- if (y[i] == 1) {  # explanatory 2
    sample(c("m", "f"), 1, prob = c(.8, .2))
  } else {
    sample(c("m", "f"), 1, prob = c(.2, .8))
  }
}



m <- glm(y ~ euro + factor(sex), family = binomial)

summary(m)




My worries:

   - are the estimates correct?
   - the degrees of freedom grow dramatically (one per observation), so do I
   risk never obtaining a significant result?

I also take the chance to ask whether you know of any implemented method to plot
logistic curves directly from a glm() model.


I would like to thank you all, by the way.

Federico Bonofiglio


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression with response 0,1

2010-12-29 Thread Dennis Murphy
Hi:

I think you created a problem for yourself in the way you generated your
data.

y <- rbinom(2000, 1, .7)
euro <- rnorm(2000, m = 300 * y + 50 * (1 - y), s = 20 * y + 12 * (1 - y))
# Create a 2000 x 2 matrix of probabilities
prmat <- cbind(0.8 * y + 0.2 * (1 - y), 0.2 * y + 0.8 * (1 - y))
# sample sex from each row of prmat, with the rows comprising the distribution
sex <- apply(prmat, 1, function(x) sample(c('m', 'f'), 1, prob = x))

df <- data.frame(euro, sex, y)

# Histogram of euro: notice the separation in distributions
hist(euro, nclass = 50)
# Generate an indicator between the two clusters of euro
spl <- euro > 150
# Now show a table of that split vs. response
table(spl, y)

This is what I get for my simulation:

> table(spl, y)
       y
spl        0    1
  FALSE  572    0
  TRUE     0 1428

which in turn leads to

m <- glm(y ~ euro + sex, data = df, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

This is what is known as 'complete data separation' in the logistic
regression literature. Basically, you've generated data so that all the
successes are associated with a N(300, 20) distribution and all the failures
with a N(50, 12) distribution. If this is a classification problem,
congratulations - the margin on the support vector will be huge :)  OTOH, if
you're trying to fit a logistic model for purposes of explanation, you've
created a problem, especially with respect to prediction.

...and it does matter whether this is a regression problem or a
classification problem. In the latter, separation is a good thing; in the
former, it creates convergence problems.
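Not part of the original reply, but one common remedy for complete separation is Firth's bias-reduced (penalized-likelihood) logistic regression; a minimal sketch, assuming the CRAN package 'logistf' is installed:

```r
# Sketch: Firth's penalized likelihood keeps coefficient estimates finite
# even when a predictor perfectly separates the two response classes.
library(logistf)  # assumed installed from CRAN

set.seed(1)
y <- rbinom(200, 1, 0.7)
# euro perfectly separates the classes, as in the simulation above
euro <- rnorm(200, mean = 300 * y + 50 * (1 - y), sd = 20 * y + 12 * (1 - y))
d <- data.frame(y, euro)

fit <- logistf(y ~ euro, data = d)  # penalized (Firth) logistic regression
summary(fit)                        # finite estimates despite separation
```

Ordinary glm() on the same data would warn about fitted probabilities of 0 or 1; the penalized fit shrinks the slope toward a finite value at the cost of some bias.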

Since you have a continuous predictor in the model, there is an additional
complication: the logistic regression null deviance does not have an
asymptotic chi-square distribution, so tests involving reductions of
deviance from the null model are not guaranteed to have asymptotic
chi-square distributions *when the predictor x is truly continuous*.

More below.

On Wed, Dec 29, 2010 at 9:48 AM, Federico Bonofiglio bonori...@gmail.com wrote:

 Dear Masters,
 first I'd like to wish you all a great 2011 and happy holidays,

 second (here comes the boring stuff) I have a question which I hope you
 can answer:

 I run a logistic regression with glm() on data of the following type:
 (y1=1, x1=x1); (y2=0, x2=x2); ... (yn=0, xn=xn), where the response (y) is
 a binary outcome in {0, 1} and x is any explanatory variable (continuous or
 not)
 observed with the i-th outcome.

 This is indeed one of the most frequent cases when dealing with binary
 responses, though I know R can also handle such responses slightly differently
 (a vector of success counts and one of failure counts), and I'm not sure
 whether my summary.glm gives me any sensible answer at all.

 For the purpose of this example I have exaggerated the numbers so as to obtain
 significant results:

 y <- rbinom(2000, 1, .7)  # response

 euro <- numeric(2000)
 for (i in 1:2000) {
   euro[i] <- if (y[i] == 1) {  # explanatory 1
     rnorm(1, 300, 20)
   } else {
     rnorm(1, 50, 12)
   }
 }

 sex <- character(2000)
 for (i in 1:2000) {
   sex[i] <- if (y[i] == 1) {  # explanatory 2
     sample(c("m", "f"), 1, prob = c(.8, .2))
   } else {
     sample(c("m", "f"), 1, prob = c(.2, .8))
   }
 }



 m <- glm(y ~ euro + factor(sex), family = binomial)

 summary(m)




 My worries:

   - are the estimates correct?


The people who wrote the glm() routine were smart enough to anticipate the
data separation case, and they warn you of potential instability in the model
estimates and predictions whenever fitted probabilities of exactly 0 or 1
occur. This is a warning to take seriously - your generated data produced
these based on euro alone.

  - the degrees of freedom grow dramatically (one per observation), so do I
   risk never obtaining a significant result?


Comparisons between nested models will give the same results whether the data
are grouped or ungrouped, in non-pathological situations.
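To illustrate the claim above, a small sketch (my construction, not from the original reply): fitting the same logistic model to ungrouped 0/1 data and to grouped success/failure counts yields identical coefficient estimates, even though the residual degrees of freedom differ.

```r
# Same logistic regression, ungrouped (one 0/1 row per observation)
# versus grouped (success/failure counts per distinct x value).
set.seed(7)
x <- rep(1:10, each = 30)
y <- rbinom(300, 1, plogis(-2 + 0.3 * x))
d <- data.frame(x, y)

# Ungrouped fit: residual df = 300 - 2
m_ungrouped <- glm(y ~ x, family = binomial, data = d)

# Grouped fit: count successes and failures per distinct x; residual df = 10 - 2
succ <- tapply(d$y, d$x, sum)
fail <- tapply(1 - d$y, d$x, sum)
xg   <- as.numeric(names(succ))
m_grouped <- glm(cbind(succ, fail) ~ xg, family = binomial)

coef(m_ungrouped)  # identical to coef(m_grouped), up to numerical error
```

The large residual df in the ungrouped fit is therefore not a problem for comparing nested models; what changes between the two forms is the deviance baseline, not the estimates or the deviance *differences*.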


 I also take the chance to ask whether you know of any implemented method to plot
 logistic curves directly from a glm() model.



The following is an example to illustrate some of the questions you raised.


# Example to illustrate the difference between grouped and ungrouped
# logistic regression analyses

library(reshape2)
library(lattice)

# Sample 50 distinct x values 300 times
x <- sample(1:50, 300, replace = TRUE)
# P(Y = 1 | X) increases with x
y <- rbinom(300, 1, (10 + x)/80)
ind <- x > 25
# males sampled more heavily when x > 25
p <- cbind(0.7 * ind + 0.3 * (1 - ind), 0.3 * ind + 0.7 * (1 - ind))
sex <- apply(p, 1, function(x) sample(c('m', 'f'), 1, prob = x))
df <- data.frame(sex, x, y)

# Ungrouped logistic regression
# treat x as a continuous covariate
m1 <- glm(y ~ sex + x, data = df, family = binomial)

# Group the data by x * sex combinations
u <- as.data.frame(xtabs(~ y + sex + x, data = df))
# cast() reshapes the data so that the 0/1 frequencies become separate columns
# cast() comes from package reshape2
# In this case, sex and x are ID variables, and y is reshaped
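On the earlier question about plotting logistic curves directly from a glm() fit, a minimal sketch (my addition, not from the thread) using predict(type = "response") on a grid of predictor values:

```r
# Sketch: drawing the fitted logistic curve from a glm() model
# with a single continuous predictor, using base graphics.
set.seed(42)
x <- runif(300, 0, 50)
y <- rbinom(300, 1, plogis(-3 + 0.15 * x))
fit <- glm(y ~ x, family = binomial)

# Evaluate fitted probabilities on a fine grid and overlay the curve
grid <- data.frame(x = seq(0, 50, length.out = 200))
grid$p <- predict(fit, newdata = grid, type = "response")

plot(x, y, pch = "|", col = "grey50",
     xlab = "x", ylab = "P(y = 1 | x)")
lines(grid$x, grid$p, lwd = 2)
```

With more than one predictor, the same idea applies: hold the other covariates fixed (e.g. at a reference level) in the newdata frame and vary only the predictor of interest.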