Dear R-core,

I have found an edge case where the glm function falsely concludes that the 
model has converged. The issue is the following: my data contain a number of 
covariates, and one of these covariates has a very small variance. For all but 
one of the rows of this covariate the value is 0; in the single remaining row 
it is 1.


The glm function correctly determines the beta and standard error estimates for 
all other covariates.


I've placed the data here: http://www.harmjanwestra.nl/rtestdata.txt


The model I'm using is very simple:


data <- read.table("rtestdata.txt")

model <- glm(data[,1] ~ data[,2] + data[,3] + data[,4] + data[,5] + data[,6] + 
data[,7] + data[,8] + data[,9] + data[,10] + data[,11] + data[,12] + data[,13] 
+ data[,14], family=binomial("logit"))

summary(model)


You will see that for covariate data[,13], the coefficient (beta) estimate is 
around -9. Eight iterations were performed, and model$converged returns TRUE.


I've also run an alternative logistic regression implementation in C 
(https://github.com/czep/mlelr/blob/master/src/mlelr.c), which produces 
identical estimates for the other covariates and comparable deviance values. 
With that C code, however, the estimate for data[,13] ends up around -100: it 
grows with each iteration, and I allow a maximum of 100 MLE iterations. The 
conclusion there is that the model does not converge.


The difference between the two implementations is the convergence criterion. 
In R, the glm() function declares the whole model converged when the 
difference between the deviance of the current iteration and that of the prior 
iteration falls below a certain epsilon. In the C code, the model is declared 
converged only when none of the parameter estimates has changed markedly since 
the previous iteration.
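The effect is easy to reproduce without my data set. The following sketch uses 
made-up data (not the file linked above) with one covariate that is 1 in a 
single row and 0 elsewhere, where that lone row's outcome is perfectly 
"explained" by it (quasi-complete separation); the variable names and the 
epsilon/maxit values are illustrative, not taken from my original model:

```r
## Made-up data: a near-constant covariate z, like data[,13] above.
set.seed(42)
n <- 200
x <- rnorm(n)
z <- c(1, rep(0, n - 1))          # 1 in a single row, 0 everywhere else
y <- rbinom(n, 1, plogis(x))
y[1] <- 0                         # the lone z == 1 row separates on z

## Default fit: the deviance-based criterion is satisfied, so
## fit$converged is TRUE, even though the estimate for z is drifting
## toward -Inf (its MLE does not exist).
fit <- glm(y ~ x + z, family = binomial("logit"))
fit$converged
coef(fit)["z"]

## With a stricter deviance tolerance and more iterations allowed, the
## z estimate simply keeps growing in magnitude -- exactly what a
## parameter-change criterion would flag as non-convergence.
fit2 <- glm(y ~ x + z, family = binomial("logit"),
            control = glm.control(epsilon = 1e-14, maxit = 100))
coef(fit2)["z"]
```

(glm may additionally emit its "fitted probabilities numerically 0 or 1 
occurred" warning here, but it still reports converged = TRUE.)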


I think both approaches are valid, but the R criterion (while cheaper to 
evaluate) is vulnerable to falsely concluding convergence in edge cases such 
as the one presented above, resulting in wrong coefficient estimates. For 
people using logistic regression in a training/prediction setting, these 
estimates may degrade predictive performance.


The problem here is that the glm function does not emit any warning when one 
of the coefficient estimates in the model has failed to converge. A user who 
is not paying close attention may therefore conclude that nothing is wrong 
with their data. In my opinion, the default behavior in this case should be to 
report that the model did not converge, or at least to show a warning message.
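Until something like that exists, users can add a crude check of their own. 
The following sketch scans the fitted summary of the model above for 
implausibly large estimates or standard errors, the typical fingerprint of 
this failure; the cutoffs (10 for |beta|, 100 for the standard error) are 
arbitrary and data-dependent:

```r
## Crude post-fit sanity check on the model fitted above; thresholds
## are arbitrary and should be adapted to the scale of the covariates.
est <- summary(model)$coefficients
suspect <- abs(est[, "Estimate"]) > 10 | est[, "Std. Error"] > 100
if (any(suspect))
  warning("suspect coefficients (possible separation / non-convergence): ",
          paste(rownames(est)[suspect], collapse = ", "))
```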


Please let me know whether you believe this is an issue, and whether I can 
provide additional information.


With kind regards,


Harm-Jan Westra



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
