Re: [R] influence measures for multivariate linear models
I'm following up on a question I posted 8/10/2010, but my newsreader has lost this thread. Barrett & Ling, JASA, 1992, v.87(417), pp184-191 define general classes of influence measures for multivariate regression models, including analogs of Cook's D, Andrews & Pregibon COVRATIO, etc. As in univariate response models, these are based on leverage and residuals based on omitting one (or more) observations at a time and refitting, although, in the univariate case, the computations can be optimized, as they are in stats::influence() and related methods. I'm interested in exploring the multivariate extension in R. I tried the following, and was surprised to find that R returned a result rather than an error -- presumably because mlm objects are not trapped before they get to lm.influence() I've done a bit more testing, comparing what I get from R lm functions with some direct calculations in both R and SAS, and now conclude that there are bugs in lm.influence and the coef(update()) methods I tried before to get leave-one-out coefficients for multivariate response models fit with lm(). By bugs, I mean that results returned are silently wrong, or at best, misleading. My test case: a subset of the Rohwer data anayzed by Barrett & Ling (& others) > library(heplots) > data(Rohwer, package="heplots") > Rohwer1 <- Rohwer[Rohwer$SES=='Hi',] > rohwer.mod <- lm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss, data=Rohwer1) > (B <- coef(rohwer.mod)) SAT PPVT Raven (Intercept) -28.467471 39.6970896 13.2438356 n 3.257132 0.0672825 0.0593474 s 2.996583 0.3699843 0.4924440 ns -5.859063 -0.3743802 -0.1640219 na5.666223 1.5230090 0.1189805 ss -0.622653 0.4101567 -0.1211564 Attempt to get the leave-one-out coefficients, B(-i), for the first 2 cases from influence(): *wrong* -- influence() should probably issue stop('Not defined for mlm objects'), or the documentation should be made clearer for what happens in this case. > head(influence(rohwer.mod, do.coef=TRUE)$coefficients, 2) [,1] [,2] [,3] [,4] [,5] [,6] [1,] -5.56830 0.0510135 -0.486096 0.61811 0.0585384 -0.1339219 [2,] 2.30754 0.1092886 0.388349 -0.40717 -0.0600899 0.0477736 The correct result, for the first case is > # direct calculation > X <- cbind(1, as.matrix(Rohwer1[,6:10])) > Y <- as.matrix(Rohwer1[,3:5]) > keep <- 2:n > (B1 <- solve(t(X[keep,]) %*% X[keep,]) %*% t(X[keep,]) %*% Y[keep,]) SAT PPVT Raven -22.899170 41.7757793 13.5057156 n3.206119 0.0482388 0.0569482 s3.482679 0.5514477 0.5153053 ns -6.477172 -0.6051252 -0.1930919 na 5.607684 1.5011562 0.1162274 ss -0.488731 0.4601508 -0.1148580 OK, ?influence tells me that the 'coefficients' returned are actually B-B(-i), and I can see that it gives those for the first response variable (SAT) > B-B1 SAT PPVT Raven (Intercept) -5.5683009 -2.0786897 -0.26187994 n0.0510135 0.0190437 0.00239919 s -0.4860961 -0.1814634 -0.02286134 ns 0.6181096 0.2307451 0.02907000 na 0.0585384 0.0218528 0.00275309 ss -0.1339219 -0.0499941 -0.00629841 For SAT, same as re-fitting a univariate model: > rohwer.inf1 <- influence(lm(SAT ~ n + s + ns + na + ss, data=Rohwer1), do.coef=TRUE) > colnames(rohwer.inf1$coefficients) <- paste("SAT", colnames(rohwer.inf1$coefficients), sep=":") > head(rohwer.inf1$coefficients, 2) SAT:(Intercept) SAT:n SAT:s SAT:ns SAT:na SAT:ss 38-5.56830 0.0510135 -0.486096 0.61811 0.0585384 -0.1339219 39 2.30754 0.1092886 0.388349 -0.40717 -0.0600899 0.0477736 > But I'm looking for a method I can generalize. I also tried the following, using subset= in update() and lm(), giving a different wrong answer. (Am I using the subset= argument correctly?) > coef(update(rohwer.mod, subset=1:n !=1, data=Rohwer1)) SAT PPVT Raven (Intercept) -28.428163 51.762326 11.0052852 n 0.912497 -0.316177 0.1294643 s 5.478358 0.280319 0.6639180 ns -6.119679 -1.828516 -0.2995350 na5.383479 2.096793 0.1940462 ss -0.345764 0.448199 -0.0725977 Warning message: In 1:n : numerical expression has 32 elements: only the first used So, how can I calculate leave-one-out coefficients (and other quantities) efficiently for mlm-s? -Michael -- Michael Friendly Email: friendly AT yorku DOT ca Professor, Psychology Dept. York University Voice: 416 736-5115 x66249 Fax: 416 736-5814 4700 Keele StreetWeb: http://www.datavis.ca Toronto, ONT M3J 1P3 CANADA __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained,
Re: [R] influence measures for multivariate linear models
Michael Friendly wrote: > Barrett & Ling, JASA, 1992, v.87(417), pp184-191 define general classes > of influence measures for multivariate > regression models, including analogs of Cook's D, Andrews & Pregibon > COVRATIO, etc. As in univariate > response models, these are based on leverage and residuals based on > omitting one (or more) observations at > a time and refitting, although, in the univariate case, the computations > can be optimized, as they are in > stats::influence() and related methods. > > I'm interested in exploring the multivariate extension in R. I tried > the following, and was surprised to find that > R returned a result rather than an error -- presumably because mlm > objects are not trapped before they > get to lm.influence() > > > # multivariate model > > data(Rohwer, package="heplots") > > rohwer.mod <- lm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss, > data=Rohwer) > > > names(influence(rohwer.mod)) > [1] "hat" "coefficients" "sigma""wt.res" > > head(influence(rohwer.mod)$coefficients, 2) > [,1] [,2] [,3] [,4] [,5] [,6] > [1,] 2.25039 0.0254739 -0.025252 -0.06297 -0.121507 0.094355 > [2,] 0.84649 -0.0062656 -0.077430 0.08345 -0.022579 -0.059480 > > > > Of course, the correct calculations would result from refitting, > omitting each observation in turn, though doing this > directly would be horribly inefficient. > e.g, calculating B(i), deleting case i: > > > coef(update(rohwer.mod, subset=1:69 !=1, data=Rohwer)) > SAT PPVT Raven > (Intercept) -2.466079 35.68664 11.510068 > n1.888286 0.60949 0.075931 > s -0.034524 -0.53040 0.160328 > ns -2.739834 -0.67355 0.066392 > na 2.219340 1.20481 -0.037272 > ss 1.072300 0.99033 0.058509 > > coef(update(rohwer.mod, subset=1:69 !=2, data=Rohwer)) > SAT PPVT Raven > (Intercept) -1.062178 33.88199 10.8988006 > n1.920026 0.59735 0.0713976 > s0.017654 -0.47464 0.1774135 > ns -2.886254 -0.67905 0.0673686 > na 2.120411 1.29016 -0.0077484 > ss 1.226135 0.96430 0.0471764 > > Is there anything existing for this case that I've missed, or does > anyone have an interest in pursuing this topic? Hmm, fitted coefficients in this sort multivariate models are the same as those in the univariate ones, so as long as you do whole-case deletions, I would think that you should be able to reuse the 1D code. I would conjecture that the main problem with what you currently get is that it only pertains to the 1st column -- looks like the differences between the two rows from lm.influence matches the differences between the first two colums from coef(update(...)). Since lm() only handles complete cases, casewise deletion diagnostics is probably the best you can get, otherwise it would be interesting to see the effect of deleting each coordinate separately. (As you know, these matters are within my general sphere of interest, but I'm afraid my time is too constrained at them moment for more than a sideline view.) > > -Michael > -- Peter Dalgaard Center for Statistics, Copenhagen Business School Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] influence measures for multivariate linear models
Barrett & Ling, JASA, 1992, v.87(417), pp184-191 define general classes of influence measures for multivariate regression models, including analogs of Cook's D, Andrews & Pregibon COVRATIO, etc. As in univariate response models, these are based on leverage and residuals based on omitting one (or more) observations at a time and refitting, although, in the univariate case, the computations can be optimized, as they are in stats::influence() and related methods. I'm interested in exploring the multivariate extension in R. I tried the following, and was surprised to find that R returned a result rather than an error -- presumably because mlm objects are not trapped before they get to lm.influence() > # multivariate model > data(Rohwer, package="heplots") > rohwer.mod <- lm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss, data=Rohwer) > names(influence(rohwer.mod)) [1] "hat" "coefficients" "sigma""wt.res" > head(influence(rohwer.mod)$coefficients, 2) [,1] [,2] [,3] [,4] [,5] [,6] [1,] 2.25039 0.0254739 -0.025252 -0.06297 -0.121507 0.094355 [2,] 0.84649 -0.0062656 -0.077430 0.08345 -0.022579 -0.059480 > Of course, the correct calculations would result from refitting, omitting each observation in turn, though doing this directly would be horribly inefficient. e.g, calculating B(i), deleting case i: > coef(update(rohwer.mod, subset=1:69 !=1, data=Rohwer)) SAT PPVT Raven (Intercept) -2.466079 35.68664 11.510068 n1.888286 0.60949 0.075931 s -0.034524 -0.53040 0.160328 ns -2.739834 -0.67355 0.066392 na 2.219340 1.20481 -0.037272 ss 1.072300 0.99033 0.058509 > coef(update(rohwer.mod, subset=1:69 !=2, data=Rohwer)) SAT PPVT Raven (Intercept) -1.062178 33.88199 10.8988006 n1.920026 0.59735 0.0713976 s0.017654 -0.47464 0.1774135 ns -2.886254 -0.67905 0.0673686 na 2.120411 1.29016 -0.0077484 ss 1.226135 0.96430 0.0471764 Is there anything existing for this case that I've missed, or does anyone have an interest in pursuing this topic? -Michael -- Michael Friendly Email: friendly AT yorku DOT ca Professor, Psychology Dept. York University Voice: 416 736-5115 x66249 Fax: 416 736-5814 4700 Keele StreetWeb: http://www.datavis.ca Toronto, ONT M3J 1P3 CANADA __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.