Re: [R] Problem with lm.resid() when weights are provided
Dear Hamed, > -Original Message- > From: Hamed Ha [mailto:hamedhas...@gmail.com] > Sent: Monday, September 17, 2018 3:56 AM > To: Fox, John > Cc: r-help@r-project.org > Subject: Re: [R] Problem with lm.resid() when weights are provided > > H i John, > > > Thank you for your reply. > > > I see your point, thanks. I checked lm.wfit() and realised that there is a tol > parameter that is already set to 10^-7. This is not even the half decimal to > the > machine precision. Furthermore, plying with tol parameter does not solve the > problem, as far as I checked. tol plays a different role in lm.wfit(). It's for the QR decomposition (done in C code), I suppose to determine the rank of the weighted model matrix. Generally in this kind of context, you'd use something like the square root of the machine double epsilon to define a number that's effectively 0, and the tolerance used here isn't too far off that -- about an order of magnitude larger. I'm not an expert in computer arithmetic or numerical linear algebra, so I don't have anything more to say about this. > > > I still see this issue as critical and we should report it to the R core team > to be > investigated more. What do you think? I don't think that it's a critical issue because it isn't sensible to specify nonzero weights so close to 0. A simple solution is to change these weights to 0 in your code calling lm(). That said, I suppose that it might be better to make lm.wfit() more robust to near-zero weights. If you feel strongly about this, you can file a bug report, but I'm not interested in pursuing it. Best, John > > > Regards, > Hamed. > > > On Fri, 14 Sep 2018 at 22:46, Fox, John <mailto:j...@mcmaster.ca> > wrote: > > > Dear Hamed, > > When you post a question to r-help, generally you should cc > subsequent messages there as well, as I've done to this response. > > The algorithm that lm() uses is much more numerically stable than > inverting the weighted sum-of-squares-and-product matrix. If you want to see > how the computations are done, look at lm.wfit(), in which the residuals and > fits are computed as > > z$residuals <- z$residuals/wts > z$fitted.values <- y - z$residuals > > Zero weights are handled specially, and your tiny weights are thus the > source of the problem. When you divide by a number less than the machine > double-epsilon, you can't expect numerically stable results. I suppose that > lm.wfit() could check for 0 weights to a tolerance rather than exactly. > > John > > > -----Original Message- > > From: Hamed Ha [mailto:hamedhas...@gmail.com > <mailto:hamedhas...@gmail.com> ] > > Sent: Friday, September 14, 2018 5:34 PM > > To: Fox, John mailto:j...@mcmaster.ca> > > > Subject: Re: [R] Problem with lm.resid() when weights are provided > > > > Hi John, > > > > Thank you for your reply. > > > > I agree that the small weights are the potential source of the > instability in the > > result. I also suspected that there are some failure/bugs in the > actual > > algorithm that R uses for fitting the model. I remember that at some > points I > > checked the theoretical estimation of the parameters, solve(t(x) > %*% w %*% > > x) %*% t(x) %*% w %*% y, (besides the point that I had to set tol > parameter in > > solve() to a super small value) and realised that lm() and the > theoretical > > results match together. That is the parameter estimation is right in > R. > > Moreover, I checked the predictions, predict(lm.fit), and it was > right. > Then the > > only source of error remained was resid() function. I further checked > this > > function and it is nothing more than calling a sub-element from and > lm() fit. > > Putting all together, I think that there is something wrong/bug/miss- > > configuration in the lm() algorithm and I highly recommend the R > core team to > > fix that. > > > > Please feel free to contact me for more details if required. > > > > Warm regards, > > Hamed. > > > > > > > > > > > > > > > > > > > > On Fri, 14 Sep 2018 at 13:35, Fox, John <mailto:j...@mcmaster.ca> > > <mailto:j...@mcmaster.ca <mailto:j...@mcmaster.ca> > > wrote: > > > &g
Re: [R] Problem with lm.resid() when weights are provided
H i John, Thank you for your reply. I see your point, thanks. I checked lm.wfit() and realised that there is a tol parameter that is already set to 10^-7. This is not even the half decimal to the machine precision. Furthermore, plying with tol parameter does not solve the problem, as far as I checked. I still see this issue as critical and we should report it to the R core team to be investigated more. What do you think? Regards, Hamed. On Fri, 14 Sep 2018 at 22:46, Fox, John wrote: > Dear Hamed, > > When you post a question to r-help, generally you should cc subsequent > messages there as well, as I've done to this response. > > The algorithm that lm() uses is much more numerically stable than > inverting the weighted sum-of-squares-and-product matrix. If you want to > see how the computations are done, look at lm.wfit(), in which the > residuals and fits are computed as > > z$residuals <- z$residuals/wts > z$fitted.values <- y - z$residuals > > Zero weights are handled specially, and your tiny weights are thus the > source of the problem. When you divide by a number less than the machine > double-epsilon, you can't expect numerically stable results. I suppose that > lm.wfit() could check for 0 weights to a tolerance rather than exactly. > > John > > > -Original Message- > > From: Hamed Ha [mailto:hamedhas...@gmail.com] > > Sent: Friday, September 14, 2018 5:34 PM > > To: Fox, John > > Subject: Re: [R] Problem with lm.resid() when weights are provided > > > > Hi John, > > > > Thank you for your reply. > > > > I agree that the small weights are the potential source of the > instability in the > > result. I also suspected that there are some failure/bugs in the actual > > algorithm that R uses for fitting the model. I remember that at some > points I > > checked the theoretical estimation of the parameters, solve(t(x) %*% w > %*% > > x) %*% t(x) %*% w %*% y, (besides the point that I had to set tol > parameter in > > solve() to a super small value) and realised that lm() and the > theoretical > > results match together. That is the parameter estimation is right in R. > > Moreover, I checked the predictions, predict(lm.fit), and it was right. > Then the > > only source of error remained was resid() function. I further checked > this > > function and it is nothing more than calling a sub-element from and lm() > fit. > > Putting all together, I think that there is something wrong/bug/miss- > > configuration in the lm() algorithm and I highly recommend the R core > team to > > fix that. > > > > Please feel free to contact me for more details if required. > > > > Warm regards, > > Hamed. > > > > > > > > > > > > > > > > > > > > On Fri, 14 Sep 2018 at 13:35, Fox, John > <mailto:j...@mcmaster.ca> > wrote: > > > > > > Dear Hamed, > > > > I don't think that anyone has picked up on this problem. > > > > What's peculiar about your weights is that several are 0 within > > rounding error but not exactly 0: > > > > > head(df) > > y x weight > > 1 1.5115614 0.5520924 2.117337e-34 > > 2 -0.6365313 -0.1259932 2.117337e-34 > > 3 0.3778278 0.4209538 4.934135e-31 > > 4 3.0379232 1.4031545 2.679495e-24 > > 5 1.5364652 0.4607686 2.679495e-24 > > 6 -2.3772787 -0.7396358 6.244160e-21 > > > > I can reproduce the results that you report: > > > > > (mod.1 <- lm(y ~ x, data=df)) > > > > Call: > > lm(formula = y ~ x, data = df) > > > > Coefficients: > > (Intercept)x > > -0.04173 2.03790 > > > > > max(resid(mod.1)) > > [1] 1.14046 > > > (mod.2 <- lm(y ~ x, data=df, weights=weight)) > > > > Call: > > lm(formula = y ~ x, data = df, weights = weight) > > > > Coefficients: > > (Intercept)x > > -0.05786 1.96087 > > > > > max(resid(mod.2)) > > [1] 36.84939 > > > > But the problem disappears when the tiny nonzero weight are set to > 0: > > > > > df2 <- df > > > df2$weight <- zapsmall(df2$weight) > > > head(df2) > > y x weight > > 1 1.5115614 0.5520924 0 > > 2 -0.6365313 -0.1259932 0 > > 3 0.3778278 0.4209
Re: [R] Problem with lm.resid() when weights are provided
Dear Hamed, When you post a question to r-help, generally you should cc subsequent messages there as well, as I've done to this response. The algorithm that lm() uses is much more numerically stable than inverting the weighted sum-of-squares-and-product matrix. If you want to see how the computations are done, look at lm.wfit(), in which the residuals and fits are computed as z$residuals <- z$residuals/wts z$fitted.values <- y - z$residuals Zero weights are handled specially, and your tiny weights are thus the source of the problem. When you divide by a number less than the machine double-epsilon, you can't expect numerically stable results. I suppose that lm.wfit() could check for 0 weights to a tolerance rather than exactly. John > -Original Message- > From: Hamed Ha [mailto:hamedhas...@gmail.com] > Sent: Friday, September 14, 2018 5:34 PM > To: Fox, John > Subject: Re: [R] Problem with lm.resid() when weights are provided > > Hi John, > > Thank you for your reply. > > I agree that the small weights are the potential source of the instability in > the > result. I also suspected that there are some failure/bugs in the actual > algorithm that R uses for fitting the model. I remember that at some points I > checked the theoretical estimation of the parameters, solve(t(x) %*% w %*% > x) %*% t(x) %*% w %*% y, (besides the point that I had to set tol parameter in > solve() to a super small value) and realised that lm() and the theoretical > results match together. That is the parameter estimation is right in R. > Moreover, I checked the predictions, predict(lm.fit), and it was right. Then > the > only source of error remained was resid() function. I further checked this > function and it is nothing more than calling a sub-element from and lm() fit. > Putting all together, I think that there is something wrong/bug/miss- > configuration in the lm() algorithm and I highly recommend the R core team to > fix that. > > Please feel free to contact me for more details if required. > > Warm regards, > Hamed. > > > > > > > > > > On Fri, 14 Sep 2018 at 13:35, Fox, John <mailto:j...@mcmaster.ca> > wrote: > > > Dear Hamed, > > I don't think that anyone has picked up on this problem. > > What's peculiar about your weights is that several are 0 within > rounding error but not exactly 0: > > > head(df) > y x weight > 1 1.5115614 0.5520924 2.117337e-34 > 2 -0.6365313 -0.1259932 2.117337e-34 > 3 0.3778278 0.4209538 4.934135e-31 > 4 3.0379232 1.4031545 2.679495e-24 > 5 1.5364652 0.4607686 2.679495e-24 > 6 -2.3772787 -0.7396358 6.244160e-21 > > I can reproduce the results that you report: > > > (mod.1 <- lm(y ~ x, data=df)) > > Call: > lm(formula = y ~ x, data = df) > > Coefficients: > (Intercept)x > -0.04173 2.03790 > > > max(resid(mod.1)) > [1] 1.14046 > > (mod.2 <- lm(y ~ x, data=df, weights=weight)) > > Call: > lm(formula = y ~ x, data = df, weights = weight) > > Coefficients: > (Intercept)x > -0.05786 1.96087 > > > max(resid(mod.2)) > [1] 36.84939 > > But the problem disappears when the tiny nonzero weight are set to 0: > > > df2 <- df > > df2$weight <- zapsmall(df2$weight) > > head(df2) > y x weight > 1 1.5115614 0.5520924 0 > 2 -0.6365313 -0.1259932 0 > 3 0.3778278 0.4209538 0 > 4 3.0379232 1.4031545 0 > 5 1.5364652 0.4607686 0 > 6 -2.3772787 -0.7396358 0 > > (mod.3 <- update(mod.2, data=df2)) > > Call: > lm(formula = y ~ x, data = df2, weights = weight) > > Coefficients: > (Intercept)x > -0.05786 1.96087 > > > max(resid(mod.3)) > [1] 1.146663 > > I don't know exactly why this happens, but suspect numerical > instability produced by the near-zero weights, which are smaller than the > machine double-epsilon > > > .Machine$double.neg.eps > [1] 1.110223e-16 > > The problem also disappears, e.g., if the tiny weight are set to 1e-15 > rather than 0. > > I hope this helps, >John > > - > John Fox > Professor Emeritus > McMaster University > Hami
Re: [R] Problem with lm.resid() when weights are provided
Dear Hamed, I don't think that anyone has picked up on this problem. What's peculiar about your weights is that several are 0 within rounding error but not exactly 0: > head(df) y x weight 1 1.5115614 0.5520924 2.117337e-34 2 -0.6365313 -0.1259932 2.117337e-34 3 0.3778278 0.4209538 4.934135e-31 4 3.0379232 1.4031545 2.679495e-24 5 1.5364652 0.4607686 2.679495e-24 6 -2.3772787 -0.7396358 6.244160e-21 I can reproduce the results that you report: > (mod.1 <- lm(y ~ x, data=df)) Call: lm(formula = y ~ x, data = df) Coefficients: (Intercept)x -0.04173 2.03790 > max(resid(mod.1)) [1] 1.14046 > (mod.2 <- lm(y ~ x, data=df, weights=weight)) Call: lm(formula = y ~ x, data = df, weights = weight) Coefficients: (Intercept)x -0.05786 1.96087 > max(resid(mod.2)) [1] 36.84939 But the problem disappears when the tiny nonzero weight are set to 0: > df2 <- df > df2$weight <- zapsmall(df2$weight) > head(df2) y x weight 1 1.5115614 0.5520924 0 2 -0.6365313 -0.1259932 0 3 0.3778278 0.4209538 0 4 3.0379232 1.4031545 0 5 1.5364652 0.4607686 0 6 -2.3772787 -0.7396358 0 > (mod.3 <- update(mod.2, data=df2)) Call: lm(formula = y ~ x, data = df2, weights = weight) Coefficients: (Intercept)x -0.05786 1.96087 > max(resid(mod.3)) [1] 1.146663 I don't know exactly why this happens, but suspect numerical instability produced by the near-zero weights, which are smaller than the machine double-epsilon > .Machine$double.neg.eps [1] 1.110223e-16 The problem also disappears, e.g., if the tiny weight are set to 1e-15 rather than 0. I hope this helps, John - John Fox Professor Emeritus McMaster University Hamilton, Ontario, Canada Web: https://socialsciences.mcmaster.ca/jfox/ > -Original Message- > From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Hamed Ha > Sent: Tuesday, September 11, 2018 8:39 AM > To: r-help@r-project.org > Subject: [R] Problem with lm.resid() when weights are provided > > Dear R Help Team. > > I get some weird results when I use the lm function with weight. The issue can > be reproduced by the example below: > > > The input data is (weights are intentionally designed to reflect some > structures in the data) > > > > df > y x weight > 1.51156139 0.55209240 2.117337e-34 > -0.63653132 -0.12599316 2.117337e-34 > 0.37782776 0.42095384 4.934135e-31 > 3.03792318 1.40315446 2.679495e-24 > 1.53646523 0.46076858 2.679495e-24 > -2.37727874 -0.73963576 6.244160e-21 > 0.37183065 0.20407468 1.455107e-17 > -1.53917553 -0.95519361 1.455107e-17 > 1.10926675 0.03897129 3.390908e-14 > -0.37786333 -0.17523593 3.390908e-14 > 2.43973603 0.97970095 7.902000e-11 > -0.35432394 -0.03742559 7.902000e-11 > 2.19296613 1.00355263 4.289362e-04 > 0.49845532 0.34816207 4.289362e-04 > 1.25005260 0.76306225 5.00e-01 > 0.84360691 0.45152356 5.00e-01 > 0.29565993 0.53880068 5.00e-01 > -0.54081334 -0.28104525 5.00e-01 > 0.83612836 -0.12885659 9.995711e-01 > -1.42526769 -0.87107631 9.98e-01 > 0.10204789 -0.11649899 1.00e+00 > 1.14292898 0.37249631 1.00e+00 > -3.02942081 -1.28966997 1.00e+00 > -1.37549764 -0.74676145 1.00e+00 > -2.00118016 -0.55182759 1.00e+00 > -4.24441674 -1.94603608 1.00e+00 > 1.17168144 1.00868008 1.00e+00 > 2.64007761 1.26333069 1.00e+00 > 1.98550114 1.18509599 1.00e+00 > -0.58941683 -0.61972416 9.98e-01 > -4.57559611 -2.30914920 9.995711e-01 > -0.82610544 -0.39347576 9.995711e-01 > -0.02768220 0.20076910 9.995711e-01 > 0.78186399 0.25690215 9.995711e-01 > -0.88314153 -0.20200148 5.00e-01 > -4.17076452 -2.03547588 5.00e-01 > 0.93373070 0.54190626 4.289362e-04 > -0.08517734 0.17692491 4.289362e-04 > -4.47546619 -2.14876688 4.289362e-04 > -1.65509103 -0.76898087 4.289362e-04 > -0.39403030 -0.12689705 4.289362e-04 > 0.01203300 -0.18689898 1.841442e-07 > -4.82762639 -2.31391121 1.841442e-07 > -0.72658380 -0.39751171 3.397282e-14 > -2.35886866 -1.01082109 0.00e+00 > -2.03762707 -0.96439902 0.00e+00 > 0.90115123 0.60172286 0.00e+00 > 1.55999194 0.83433953 0.00e+00 > 3.07994058 1.30942776 0.00e+00 > 1.78871462 1.10605530 0.00e+00 > > > > Running simple linear model returns: > > > lm(y~x,data=df) > > Call: > lm(formula = y ~ x, data = df) > > Coefficients: > (Intercept)x >-0.04173 2.03790 > > and > > max(resid(lm(y~x,data=df))) > [1] 1.14046 > > > *HOWEVER if I use the weighted model then:* > > lm(formula = y ~ x, data = df, weights = df$weights) > > Coefficients: > (Intercept)x >-0.05786 1.96087 > > and > > max(resid(lm(y~x,data=df,weights=df$weights))) > [1] 60.91888 > > > as you see, the estimation of the coefficients are nearly the same but the > resid() function returns