Hi: I think you created a problem for yourself in the way you generated your data.
y<-rbinom(2000,1,.7) euro <- rnorm(2000, m = 300 * y + 50 * (1 - y), s = 20 * y + 12 * (1 - y)) # Create a 2000 x 2 matrix of probabilities prmat <- cbind(0.8 * y + 0.2 * (1 - y), 0.2 * y + 0.8 * (1 - y)) # sample sex from each row of prmat with the rows comprising the distribution sex <- apply(prmat, 1, function(x) sample(c('m', 'f'), 1, prob = x)) df <- data.frame(euro, sex, y) # Histogram of euro: notice the separation in distributions hist(euro, nclass = 50) # Generate an indicator between the two clusters of euro spl <- euro > 150 # Now show a table of that split vs. response table(spl, y) This is what I get for my simulation: table(spl, y) y spl 0 1 FALSE 572 0 TRUE 0 1428 which in turn leads to m <- glm(y ~ euro + sex, data = df, family = binomial) Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: fitted probabilities numerically 0 or 1 occurred This is what is known as 'complete data separation' in the logistic regression literature. Basically, you've generated data so that all the successes are associated with a N(300, 20) distribution and all the failures with a N(50, 12) distribution. If this is a classification problem, congratulations - the margin on the support vector will be huge :) OTOH, if you're trying to fit a logistic model for purposes of explanation, you've created a problem, especially with respect to prediction. ...and it does matter whether this is a regression problem or a classification problem. In the latter, separation is a good thing; in the former, it creates convergence problems. Since you have a continuous predictor in the model, there is an additional complication: the logistic regression null deviance does not have an asymptotic chi-square distribution, so tests involving reductions of deviance from the null model are not guaranteed to have asymptotic chi-square distributions *when the predictor x is truly continuous*. More below. On Wed, Dec 29, 2010 at 9:48 AM, Federico Bonofiglio <bonori...@gmail.com>wrote: > Dear Masters, > first I'd like to wish u all a great 2011 and happy holydays by now, > > second (here it come the boring stuff) I have a question to which I hope u > would answer: > > I run a logistic regression by glm(), on the following data type > (y1=1,x1=x1); (y2=0,x2=x2);......(yn=0,xn=xn), where the response (y) is > abinary outcome on 0,1 amd x is any explanatory variable (continuous or > not) > observed with the i-th outcome. > > This is indeed one of the most frequent case when challenged with binary > responses, though I know R manages such responses slightly differently (a > vector for the successes counts and one for the failures) and I'm not sure > wheather my summary.glm gives me any senseful answer at all.... > > for the purpose I have tried to to emphasize the numbers so to obtain > significant results > > y<-rbinom(2000,1,.7)#response > > for(i in 1:2000){ > euro[i]<-if(y[i]==1){rnorm(1,300,20)#explanatory 1 > }else{rnorm(1,50,12)} > } > > for(i in 1:2000){ > sex[i]<-if(y[i]==1){sample(c("m","f"),1,prob=c(.8,.2))#explanatory 2 > }else{sample(c("m","f"),1,prob=c(.2,.8))} > } > > > > m<-glm(y~euro+factor(sex),family=binomial) > > summary(m) > > > > > My worries: > > - are the estimates correct? > The people who wrote the glm() routine were smart enough to anticipate the data separation case and are warning you of potential instability in the model estimates/predictions as a result of producing predictions of exactly 0 or 1. This is a warning to take seriously - your generated data produced these based on x alone. - degrees of freedom exponentiate dramatically (one per cell) , so may I > risk to never obtain a significant result? > When using grouped or ungrouped data, comparisons between nested models will be the same whether the data are grouped or ungrouped in non-pathological situations. > > I also take the chance to ask wheater u know any implemented method to plot > logistic curves directly out of a glm() model > The following is an example to illustrate some of the questions you raised. # Example to illustrate the difference between grouped and ungrouped # logistic regression analyses library(reshape2) library(lattice) # Sample 50 distinct x values 300 times x <- sample(1:50, 300, replace = TRUE) # P(Y = 1 | X) increases with x y <- rbinom(300, 1, (10 + x)/80) ind <- x > 25 # males sampled more heavily when x > 25 p <- cbind(0.7 * ind + 0.3 * (1 - ind), 0.3 * ind + 0.7 * (1 - ind)) sex <- apply(p, 1, function(x) sample(c('m', 'f'), 1, prob = x)) df <- data.frame(sex, x, y) # Ungrouped logistic regression # treat x as a continuous covariate m1 <- glm(y ~ sex + x, data = df, family = binomial) # Group the data by x * sex combinations u <- as.data.frame(xtabs(~ y + sex + x, data = df)) # cast() reshapes the data so that the 0/1 frequencies become separate columns # cast() comes from package reshape(2) # In this case, sex and x are ID variables, and y is reshaped from long to wide form v <- cast(u, sex + x ~ y, value = 'Freq') # rename new columns names(v)[3:4] <- c('f', 's') # x converted to factor in the process; undo and return to numeric v$x <- as.numeric(as.character(v$x)) # Grouped logistic regression # still treat x as a continuous covariate - same model as above m2 <- glm(cbind(s, f) ~ sex + x, data = v, family = binomial) summary(m1) # ungrouped data summary(m2) # grouped data # If all sex * x combinations are sampled, there should be 97 df for residual deviance; # anything less than that means some combinations had zero frequency # Prediction and plot of the estimated response probabilities: pframe <- data.frame(sex = rep(c('m', 'f'), each = 50), x = rep(1:50, 2)) pframe$pp <- predict(m2, newdata = pframe, type = 'response') xyplot(pp ~ x, data = pframe, groups = sex, type = 'l') # from package lattice Note that the coefficients and tests are the same in the two models, but the deviances (null and residual) are different with different degrees of freedom. Because of this, the AICs will also be different. Try this again with a sample of 3000; the predicted probabilities in the two groups will be closer since sex is not a significant factor, but the total and residual df will be close to the same as in the n = 300 case for the *grouped data*, but not the ungrouped data. Also observe that in the n = 3000 case, the estimates and tests of the model effects are again the same between the grouped and ungrouped data. In other words, increasing the sample size doesn't, in itself, create inferential problems. Observe, however, that the same 50 x values are being sampled in both cases, so grouping has a positive effect in this case, particularly wrt residual analysis. There is redundant information among the inputs. Even though we are acting as though x is continuous, it is (at least conditionally) an ordered factor. By behaving as though it is continuous, we are in effect saying that the linear contrast dominates the overall effect of x on the response. [Note that in the grouped data case, there are 50 x 2 = 100 possible x * sex combinations. With three parameters to estimate, the maximum residual df for grouped data is 100 - 3 = 97, assuming all combinations have frequency >= 1.] In contrast, if you sample x from a normal distribution, say, then grouping the data by x provides no real reduction unless you artificially bin the x's into intervals, as in a histogram. In that case, however, the estimates and tests will not be exactly the same between the ungrouped and grouped data because, in this situation, binning coarsens the information in x. For yet another perspective, tweak my code above as follows: sample x from a normal distribution instead of from 1:50 and use the same code to group the data. How much difference is there between the ungrouped and grouped data? What would happen if we repeated the exercise with a sample of size 3000 from a normal distribution and then grouped it? Would the same sized table of grouped sex * x occur in both cases? Some related topics to search on in regard to the issues you raised include: (i) separability of data (as mentioned earlier); (ii) the Hauck-Donner effect and (iii) the Neyman-Scott problem (which comes into play when considering the asymptotic distribution of the null deviance when a continuous covariate is present in a logistic regression model). Since I'm still learning this subject myself, corrections/amplifications would be welcome. HTH, Dennis ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.