Re: [R] Potential Issue with lm.influence
Hey John,

Seems fair, and I agree that a more explicit warning (i.e., one giving users
an indication as to why and when lm.influence is going to misfit the data)
makes sense in context.

Sincerely,
Eric

On Wed, Apr 3, 2019 at 10:18 AM Fox, John wrote:

> Dear Eric,
>
> I'm afraid that your argument doesn't make sense to me. As you saw when
> you tried
>
>     fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
>
> glm.nb() effectively wasn't able to estimate the theta parameter of the
> negative binomial model. So why would it be better to base deletion
> diagnostics on actually refitting the model?
>
> The lesson to me here is that if you fit a sufficiently unreasonable model
> to data, the computations may break down. Other than drawing attention to
> the NaN with an explicit warning, I don't see what more could usefully be
> done.
>
> Best,
> John
>
> > On Apr 2, 2019, at 9:08 PM, Eric Bridgeford wrote:
> >
> > Hey John,
> >
> > I am aware they are high-leverage points, and that the model is not the
> > best for them. The purpose of this dataset was to explore high-leverage
> > points, and the diagnostic statistics through which one would identify
> > them.
> >
> > What I am saying is that the current behavior of the function seems a
> > little non-specific to me; the influence for this problem is finite and
> > computable manually by fitting n models to n-1 points (holding out each
> > point individually to obtain the leave-one-out variance, and computing
> > the influence in the non-approximate way).
> >
> > I am just suggesting that the function could be improved by, say,
> > throwing specific warnings when NaNs may arise, e.g., "You have points
> > that are very high leverage. The approximation technique is not
> > numerically stable for these points and the results should be used with
> > caution." (I am sure there are also other pre-hoc approaches to diagnose
> > other ways in which this function could fail.) It just seems peculiar
> > that the approximation technique, which does not behave well for points
> > of ultra-high leverage, would return an NaN with no other
> > recommendations, advice, or specific warnings, especially since the
> > influence is frequently used to diagnose this specific issue.
> >
> > Alternatively, one could afford an optional argument type="manual" that
> > computes the held-out variance manually rather than in the approximate
> > fashion, and add a comment to the help page recommending this when you
> > have high-leverage points (this is what I ended up doing to obtain the
> > true influence and the externally studentized residual).
> >
> > I just think some more specificity could be of use for future users, to
> > make the R:stats community even better :) Does that make sense?
> >
> > Sincerely,
> > Eric
> >
> > On Tue, Apr 2, 2019 at 7:53 PM Fox, John wrote:
> >
> >> Dear Eric,
> >>
> >> Have you looked at your data? -- for example:
> >>
> >>     plot(log(Moons) ~ Volume, data = moon_data)
> >>     text(log(Moons) ~ Volume, data = moon_data, labels = Name, adj = 1,
> >>          subset = Volume > 400)
> >>
> >> The negative-binomial model doesn't look reasonable, does it?
> >>
> >> After you eliminate Jupiter there's one very high-leverage point left,
> >> Saturn. Computing studentized residuals entails an approximation to
> >> deleting that as well from the model, so try fitting
> >>
> >>     fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
> >>     summary(fit3)
> >>
> >> which runs into numeric difficulties.
> >>
> >> Then look at:
> >>
> >>     plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
> >>
> >> Finally, try
> >>
> >>     plot(log(Moons) ~ log(Volume), data = moon_data)
> >>     fit4 <- update(fit2, . ~ log(Volume))
> >>     rstudent(fit4)
> >>
> >> I hope this helps,
> >> John
> >>
> >> --------------------------------------
> >> John Fox
> >> Professor Emeritus
> >> McMaster University
> >> Hamilton, Ontario, Canada
> >> Web: https://socialsciences.mcmaster.ca/jfox/
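The manual computation Eric describes (fitting n models to n-1 points each) can be sketched in a few lines. A minimal illustration using lm() on the built-in mtcars data rather than the thread's glm.nb() fit on moon_data, which is not reproduced here:

```r
## Minimal sketch of the manual leave-one-out influence computation,
## using lm() on mtcars (the thread's model is a glm.nb() fit, not shown).
fit <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)

## Refit n times, each time holding out one observation, to get the
## held-out (leave-one-out) residual standard error directly:
loo_sigma <- vapply(seq_len(n), function(i) {
  summary(update(fit, data = mtcars[-i, ]))$sigma
}, numeric(1))

## Externally studentized residuals computed the long way ...
h <- lm.influence(fit)$hat                 # leverages
r_manual <- residuals(fit) / (loo_sigma * sqrt(1 - h))

## ... which, for a linear model, agree with rstudent():
all.equal(unname(r_manual), unname(rstudent(fit)))
```

For lm() this agreement is exact because the leave-one-out variance has a closed form; the thread's disagreement arises for GLMs, where rstudent() relies on a one-step approximation rather than an actual refit.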
Re: [R] Potential Issue with lm.influence
Hey guys,

I appreciate the replies. I agree the issue is easy to catch, but wouldn't it
make sense to add a warning, given that these types of errors (I am sure there
are other ways to make lm.influence produce similar NaN output, simply due to
points radically not fitting the data) are relatively easy to forecast? The
output from lm.influence just seems a bit vague.

Sincerely,
Eric

On Wed, Apr 3, 2019 at 10:03 AM Fox, John wrote:

> Hi Peter,
>
> Yes, that's another reflection of the degree to which Jupiter and Saturn
> are out of line with the data for the other planets when you fit the very
> unreasonable negative binomial model with Volume untransformed.
>
> Best,
> John
>
> > On Apr 3, 2019, at 5:36 AM, peter dalgaard wrote:
> >
> > Yes, also notice that
> >
> > > predict(fit3, new = moon_data, type = "resp")
> >            1            2            3            4            5            6
> > 1.060694e+00 1.102008e+00 1.109695e+00 1.065515e+00 1.057896e+00 1.892312e+29
> >            7            8            9           10           11           12
> > 3.531271e+17 2.295015e+01 1.739889e+01 1.058165e+00 1.058041e+00 1.057957e+00
> >           13
> > 1.058217e+00
> >
> > so the model of fit3 predicts that Jupiter and Saturn should have
> > several bazillions of moons each!
> >
> > -pd
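The more specific warning Eric asks for can be prototyped in user code. A hypothetical wrapper sketching the idea; the function name rstudent_checked and the 0.9 leverage cutoff are illustrative choices, not part of the stats package:

```r
## Hypothetical wrapper sketching the warning Eric proposes; the name and
## the 0.9 leverage cutoff are illustrative, not part of the stats package.
rstudent_checked <- function(model, h_cut = 0.9) {
  r <- rstudent(model)
  bad <- which(!is.finite(r))
  if (length(bad) > 0) {
    h <- hatvalues(model)
    msg <- paste0("rstudent() returned non-finite values for observation(s) ",
                  paste(bad, collapse = ", "))
    if (any(h[bad] > h_cut)) {
      msg <- paste0(msg, "; these include very high-leverage points (hat > ",
                    h_cut, "), for which the one-step approximation may be ",
                    "numerically unstable")
    }
    warning(msg, call. = FALSE)
  }
  r
}

## On a well-behaved fit it is silent and matches rstudent():
fit <- lm(mpg ~ wt, data = mtcars)
identical(rstudent_checked(fit), rstudent(fit))
```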
Re: [R] Potential Issue with lm.influence
Second!

Bert Gunter

On Wed, Apr 3, 2019 at 9:35 AM Richard M. Heiberger wrote:

> fortune nomination.
>
> > The lesson to me here is that if you fit a sufficiently unreasonable
> > model to data, the computations may break down.
Re: [R] Potential Issue with lm.influence
fortune nomination.

> The lesson to me here is that if you fit a sufficiently unreasonable
> model to data, the computations may break down.
Re: [R] Potential Issue with lm.influence
Dear Eric,

I'm afraid that your argument doesn't make sense to me. As you saw when you
tried

    fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))

glm.nb() effectively wasn't able to estimate the theta parameter of the
negative binomial model. So why would it be better to base deletion
diagnostics on actually refitting the model?

The lesson to me here is that if you fit a sufficiently unreasonable model to
data, the computations may break down. Other than drawing attention to the
NaN with an explicit warning, I don't see what more could usefully be done.

Best,
John
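John's point, that the deletion refit itself breaks down, suggests guarding any "manual" refit explicitly. A sketch of that general pattern, using the quine data shipped with MASS as a stand-in since moon_data is not reproduced here; drop_rows is a placeholder for rows like Jupiter and Saturn:

```r
library(MASS)  # glm.nb()

## General pattern for guarding a deletion refit; quine stands in for the
## thread's moon_data, and drop_rows is a placeholder for the rows to delete.
fit_all   <- glm.nb(Days ~ Sex + Age, data = quine)
drop_rows <- c(1, 2)

fit_del <- tryCatch(
  withCallingHandlers(
    update(fit_all, data = quine[-drop_rows, ]),
    warning = function(w) {
      message("refit warning: ", conditionMessage(w))  # e.g. iteration limit
      invokeRestart("muffleWarning")
    }
  ),
  error = function(e) {
    message("refit failed outright: ", conditionMessage(e))
    NULL
  }
)

## A huge or missing standard error on theta is the symptom John describes:
if (!is.null(fit_del)) c(theta = fit_del$theta, SE.theta = fit_del$SE.theta)
```

With well-behaved data the refit is silent; on data like moon_data without Jupiter and Saturn, the handlers surface exactly where the theta estimation goes wrong instead of letting NaNs appear downstream.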
Re: [R] Potential Issue with lm.influence
Hi Peter,

Yes, that's another reflection of the degree to which Jupiter and Saturn are
out of line with the data for the other planets when you fit the very
unreasonable negative binomial model with Volume untransformed.

Best,
John

> On Apr 3, 2019, at 5:36 AM, peter dalgaard wrote:
>
> Yes, also notice that
>
> > predict(fit3, new = moon_data, type = "resp")
>            1            2            3            4            5            6
> 1.060694e+00 1.102008e+00 1.109695e+00 1.065515e+00 1.057896e+00 1.892312e+29
>            7            8            9           10           11           12
> 3.531271e+17 2.295015e+01 1.739889e+01 1.058165e+00 1.058041e+00 1.057957e+00
>           13
> 1.058217e+00
>
> so the model of fit3 predicts that Jupiter and Saturn should have several
> bazillions of moons each!
>
> -pd

>>> -----Original Message-----
>>> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Eric
>>> Bridgeford
>>> Sent: Tuesday, April 2, 2019 5:01 PM
>>> To: Bert Gunter
>>> Cc: R-help
>>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
>>>
>>> I agree the influence documentation suggests NaNs may result; however,
>>> as these can be manually computed and are, indeed, finite (i.e., by
>>> manually training n models on n-1 points each to obtain n leave-one-out
>>> influence measures), I don't see how the function SHOULD return NaN.
>>> Given that it is returning NaN, that suggests to me there should be
>>> either a) an alternative method that (though it may be slower) returns
>>> the correct results in the event that lm.influence does not return a
>>> good approximation (i.e., an argument type="approx" for the
>>> approximation strategy employed currently, and an alternative
>>> type="direct" or something like that which computes them manually), or
>>> b) a heuristic to suggest why NaNs might result from one's particular
>>> inputs and what can be done to fix it (if the approximation strategy is
>>> the source of the problem), or what the issue is with the data that
>>> will cause NaNs. Hence I was looking to start a discussion around the
>>> specific strategy employed to compute these elements.
>>>
>>> Below is the code:
>>>
>>> moon_data <- structure(list(
>>>     Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L, 11L, 12L, 9L, 10L,
>>>                        4L, 6L, 3L),
>>>                      .Label = c("Ceres ", "Earth", "Eris ", "Haumea ",
>>>                                 "Jupiter ", "Makemake ", "Mars ",
>>>                                 "Mercury ", "Neptune ", "Pluto ",
>>>                                 "Saturn ", "Uranus ", "Venus "),
>>>                      class = "factor"),
>>>     Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2, 9.54, 19.22, 30.06,
>>>                  39.5, 43.35, 45.8, 67.7),
>>>     Diameter = c(0.382, 0.949,