Dear Mark,

I agree that convergence is a problem that applies to optimization in general,
where the function you're trying to optimize may have more than one local
minimum. In your case, you would probably have to try different starting points
for the MLE procedure. This should not be the case for logistic regression,
however (unless, as in my data, something defies your model assumptions; see
Simon Bonner's response).

Still, I would find it a bit odd if the deviance didn't change after the next
MLE iteration but one of the model parameters did. That would tell me that the
parameter doesn't add to the model fit, which in my opinion would be useful
debugging information, even if I were hitting a local minimum (it could even
hint that another, more optimal solution exists). I should probably try to
figure out whether this observation also holds for other models/link functions
(I honestly don't know).

However, thanks to your response, I can see that my suggestion is probably not
applicable to all glm link functions, and I see how an implementation of my
proposed "warning system" could confuse the user. Thanks a lot!

With kind regards,

Harm-Jan

From: Mark Leeds <marklee...@gmail.com>
Sent: Thursday, July 20, 2017 14:54
To: Harm-Jan Westra <westra.harm...@outlook.com>
Cc: Joris Meys <jorism...@gmail.com>; r-devel@r-project.org
Subject: Re: [Rd] Wrongly converging glm()

Hi Harm-Jan. I've been following this thread to some degree and just want to
add that this issue is not specific to the GLM. It's a problem with
optimization of functions in general. I was using Rvmmin with constraints, an
extremely solid optimization package written by John Nash (it uses a modified
BFGS algorithm), and it took me two years to realize that, although my
optimization generally converged, there was an identifiability issue with my
model that basically meant the results meant nothing. I only eventually found
this out because, in the econometrics literature, the type of economic model I
was estimating (rational expectations) is known to have an identifiability
issue. I guess if I were an economics expert I might have known this, but, in
general, I think what you are asking optimization code to do is EXTREMELY
DIFFICULT.

John Nash can say more because he's THE optimization masteR, but it's much more
difficult to write optimization algorithms with convergence rules that are able
to identify when mathematical convergence (norm near zero, say) is not
necessarily model convergence. That I can tell you from experience!





On Thu, Jul 20, 2017 at 2:32 PM, Harm-Jan Westra
<westra.harm...@outlook.com> wrote:
My apologies if I seemed to "blame R". This was in no way my intention. I get
the feeling that you're missing my point as well.

I observed something that I thought was confusing when comparing two more or
less identical methods (while validating the C code), and wanted to make a
suggestion as to how to help future R users. Note that I already acknowledged
that my data was bad. Note that I also mentioned that the way R determines
convergence is a valid approach.

What strikes me as odd is that R warns you when your data is faulty for a
function such as cor(), but not for glm(). I don't see why you wouldn't want to
check both convergence criteria if you know multiple such criteria exist. It
would make the software more user friendly in the end.

It may be true that there are millions of edge cases causing issues with glm(), 
as you say, but here I am presenting an edge case that can be easily detected, 
by checking whether the difference in beta estimates between the current and 
previous iteration is bigger than a certain epsilon value.
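Such a check could look roughly like the sketch below; the function name and
the epsilon tolerance are illustrative, not part of glm():

```r
# Sketch of the proposed extra criterion: flag trouble when any coefficient
# still moves by more than some epsilon between two successive iterations,
# even though the deviance has stabilised. Names and tolerance are
# illustrative, not glm() arguments.
coefs_stable <- function(beta_new, beta_old, epsilon = 1e-6) {
  all(abs(beta_new - beta_old) < epsilon)
}

beta_old <- c(0.51, -1.20, -9.0)  # estimates at iteration k
beta_new <- c(0.51, -1.20, -9.7)  # a data[,13]-style runaway coefficient
coefs_stable(beta_new, beta_old)  # FALSE: a warning could be raised here
```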

I agree "that everybody using R should first do the effort of learning what
they're doing", but it is a bit of a non-argument, because we all know the
world just doesn't work that way. Plus, this is one of the arguments that held
the Linux community back for quite a while (i.e. let's not make the software
more user friendly because the user should be more knowledgeable).

Harm-Jan


From: Joris Meys <jorism...@gmail.com>
Sent: Thursday, July 20, 2017 13:16
To: Harm-Jan Westra <westra.harm...@outlook.com>
Cc: r-devel@r-project.org
Subject: Re: [Rd] Wrongly converging glm()



On Thu, Jul 20, 2017 at 6:21 PM, Harm-Jan Westra
<westra.harm...@outlook.com> wrote:
Dear Joris,


I agree that such a covariate should not be used in the analysis, and fully 
agree with your assessment. However, your response assumes that everybody who 
uses R knows what they're doing, which is a dangerous assumption to make. I bet 
there are a lot of people who blindly trust the output from R, even when 
there's clearly something wrong with the estimates.

You missed my point then. I don't assume that everybody who uses R knows what 
they're doing. Actually, I know for a fact quite a few people using R have 
absolutely no clue about what they are doing. My point is that everybody using 
R should first do the effort of learning what they're doing. And if they don't, 
one shouldn't blame R. There's a million different cases where both algorithms 
would converge and the resulting estimates are totally meaningless regardless. 
R cannot be held responsible for that.



In terms of your conclusion that the C estimate corresponds to a value within
the R-estimated confidence interval: if I allow the C code to run for 1000
iterations, its estimate would be around -1000. It simply never converges.

I didn't test that far, and you're right in the sense that -100 is indeed not
the final estimate. After looking at the C code, it appears the author combines
a Newton-Raphson approach with a different convergence rule, and then it's
quite understandable that it doesn't converge. You can wildly vary that
estimate; the effect on the Jacobian, log-likelihood or deviance will be
insignificant. So the model won't improve, it will just wander all over the
parameter space.



I think there's nothing wrong with letting the user know there might be 
something wrong with one of the estimates, especially if your code can easily 
figure it out for you, by adding an additional rule. Not everyone is always 
paying attention (even if they know what they're doing).

If R did that, it wouldn't start the fitting procedure but would just return an
error: "Your analysis died due to a lack of usable data." Because that's the
problem here.



With kind regards,


Harm-Jan


________________________________
From: Joris Meys <jorism...@gmail.com>
Sent: Thursday, July 20, 2017 11:38 AM
To: Harm-Jan Westra
Cc: r-devel@r-project.org
Subject: Re: [Rd] Wrongly converging glm()

Allow me to chime in. That's an interesting case you present, but as far as I'm
concerned the algorithm did converge. The estimate of -9.25 has an estimated
standard error of 72.4, meaning that frequentists would claim the true value
lies anywhere between approximately -151 and 132 (the 95% CI), and hence the
estimate from the glm algorithm is perfectly compatible with the one from the C
code. And as the glm algorithm uses a different convergence rule, it rightfully
reported that it converged. The fact that another algorithm, based on another
rule, doesn't converge does not mean the one glm uses didn't.
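The quoted interval is easy to verify as a Wald-type 95% CI from the estimate
and standard error given above:

```r
# Wald 95% confidence interval for the quoted estimate (-9.25, SE 72.4)
-9.25 + c(-1, 1) * qnorm(0.975) * 72.4
# approximately -151.2 and 132.7
```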

On top of that: in both cases the huge standard error on that estimate clearly
tells you that the estimate should not be trusted and the fit is unstable.
That's to be expected, given the insane imbalance in your data, especially for
the 13th column. If my students incorporated that variable in a generalized
linear model and tried to formulate a conclusion based on that coefficient,
they would fail the exam. So if somebody does this analysis and tries to draw
any conclusion whatsoever from that estimate, maybe they should leave the
analysis to somebody who does know what they're doing.

Cheers
Joris

On Thu, Jul 20, 2017 at 5:02 PM, Harm-Jan Westra
<westra.harm...@outlook.com> wrote:
Dear R-core,


I have found an edge case where the glm function falsely concludes that the
model has converged. The issue is the following: my data contains a number of
covariates, and one of these covariates has a very small variance. For most of
the rows, the value of this covariate is 0; for a single row it is 1.


The glm function correctly determines the beta and standard error estimates for 
all other covariates.


I've placed the data here: http://www.harmjanwestra.nl/rtestdata.txt


The model I'm using is very simple:


data <- read.table("rtestdata.txt")

model <- glm(data[,1] ~ data[,2] + data[,3] + data[,4] + data[,5] + data[,6] + 
data[,7] + data[,8] + data[,9] + data[,10] + data[,11] + data[,12] + data[,13] 
+ data[,14], family=binomial("logit"))

summary(model)


You will see that for covariate data[,13], the beta/coefficient estimate is
around -9. The number of iterations performed is 8, and model$converged
returns TRUE.


I've used some alternate logistic regression code in C 
(https://github.com/czep/mlelr/blob/master/src/mlelr.c), which produces 
identical estimates for the other covariates and comparable deviance values. 
However, using this C code, I'm seeing that the estimate for data[,13] is 
around -100 (since I'm allowing a maximum of 100 MLE iterations). There, the 
conclusion is that the model does not converge.


The difference between the two pieces of code is that in R, the glm() function
determines convergence of the whole model by measuring the difference between
the deviance of the current iteration and the deviance of the prior iteration,
and calls the model converged when this difference drops below a certain
epsilon value. In the C code, the model is considered converged only when none
of the parameters has changed markedly compared to the previous iteration.
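For concreteness, the two stopping rules can be sketched side by side. The
first mirrors the relative-deviance test used by R's glm.fit; the second is a
parameter-change test of the kind the C code uses. Function names, tolerances
and the example values are illustrative:

```r
# Deviance-based rule (the form used by glm.fit): relative change in deviance
deviance_converged <- function(dev, devold, eps = 1e-8) {
  abs(dev - devold) / (abs(dev) + 0.1) < eps
}

# Parameter-based rule (as in the C code): every coefficient must be stable
params_converged <- function(beta, betaold, eps = 1e-6) {
  max(abs(beta - betaold)) < eps
}

# In an edge case like the one above, the first rule can fire while the
# second does not (hypothetical values):
deviance_converged(20.1234567801, 20.1234567802)  # TRUE
params_converged(c(0.5, -9.7), c(0.5, -9.0))      # FALSE
```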


I think both approaches are valid, although the R variant, while faster, is
vulnerable to wrongly concluding convergence in edge cases such as the one
presented above, resulting in wrong coefficient estimates. For people wanting
to use logistic regression in a training/prediction setting, using these
estimates might hurt their predictive performance.


The problem here is that the glm function does not return any warnings when one 
of the covariates in the model does not converge. For someone who is not paying 
attention, this may lead them to conclude there is nothing wrong with their 
data. In my opinion, the default behavior in this case should therefore be to 
conclude that the model did not converge, or at least to show a warning message.


Please let me know whether you believe this is an issue, and whether I can 
provide additional information.


With kind regards,


Harm-Jan Westra










______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel: +32 (0)9 264 61 79
joris.m...@ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php





