Re: [R] influence measures for multivariate linear models

2010-09-14 Thread Michael Friendly
 I'm following up on a question I posted 8/10/2010, but my newsreader 
has lost this thread.


Barrett & Ling, JASA, 1992, v.87(417), pp184-191 define general 
classes of influence measures for multivariate
regression models, including analogs of Cook's D, Andrews & Pregibon 
COVRATIO, etc.  As in univariate
response models, these are based on leverage and residuals based on 
omitting one (or more) observations at
a time and refitting, although, in the univariate case, the 
computations can be optimized, as they are in

stats::influence() and related methods.

I'm interested in exploring the multivariate extension in R.  I tried 
the following, and was surprised to find that
R returned a result rather than an error -- presumably because mlm 
objects are not trapped before they

get to lm.influence()

I've done a bit more testing, comparing what I get from R lm functions 
with some direct calculations in both
R and SAS, and now conclude that there are bugs in lm.influence and the 
coef(update()) methods I tried
before to get leave-one-out coefficients for multivariate response 
models fit with lm().
By bugs, I mean that results returned are silently wrong, or at best, 
misleading.


My test case: a subset of the Rohwer data anayzed by Barrett & Ling (& 
others)


> library(heplots)
> data(Rohwer, package="heplots")
> Rohwer1 <- Rohwer[Rohwer$SES=='Hi',]
> rohwer.mod <- lm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss, 
data=Rohwer1)

> (B <- coef(rohwer.mod))
   SAT   PPVT  Raven
(Intercept) -28.467471 39.6970896 13.2438356
n 3.257132  0.0672825  0.0593474
s 2.996583  0.3699843  0.4924440
ns   -5.859063 -0.3743802 -0.1640219
na5.666223  1.5230090  0.1189805
ss   -0.622653  0.4101567 -0.1211564

Attempt to get the leave-one-out coefficients, B(-i),  for the first 2 
cases from influence(): *wrong* -- influence() should

probably issue
stop('Not defined for mlm objects'), or the documentation should be made 
clearer for what happens in this

case.

> head(influence(rohwer.mod, do.coef=TRUE)$coefficients, 2)
 [,1]  [,2]  [,3] [,4]   [,5]   [,6]
[1,] -5.56830 0.0510135 -0.486096  0.61811  0.0585384 -0.1339219
[2,]  2.30754 0.1092886  0.388349 -0.40717 -0.0600899  0.0477736

The correct result, for the first case is

> # direct calculation
> X <- cbind(1, as.matrix(Rohwer1[,6:10]))
> Y <- as.matrix(Rohwer1[,3:5])
> keep <- 2:n
> (B1 <- solve(t(X[keep,]) %*% X[keep,]) %*% t(X[keep,]) %*% Y[keep,])
  SAT   PPVT  Raven
   -22.899170 41.7757793 13.5057156
n3.206119  0.0482388  0.0569482
s3.482679  0.5514477  0.5153053
ns  -6.477172 -0.6051252 -0.1930919
na   5.607684  1.5011562  0.1162274
ss  -0.488731  0.4601508 -0.1148580

OK, ?influence tells me that the 'coefficients' returned are actually 
B-B(-i), and I can see that

it gives those for the first response variable (SAT)

> B-B1
   SAT   PPVT   Raven
(Intercept) -5.5683009 -2.0786897 -0.26187994
n0.0510135  0.0190437  0.00239919
s   -0.4860961 -0.1814634 -0.02286134
ns   0.6181096  0.2307451  0.02907000
na   0.0585384  0.0218528  0.00275309
ss  -0.1339219 -0.0499941 -0.00629841

For SAT, same as re-fitting a univariate model:

> rohwer.inf1 <- influence(lm(SAT ~ n + s + ns + na + ss, 
data=Rohwer1), do.coef=TRUE)
> colnames(rohwer.inf1$coefficients) <- paste("SAT", 
colnames(rohwer.inf1$coefficients), sep=":")

> head(rohwer.inf1$coefficients, 2)
   SAT:(Intercept) SAT:n SAT:s   SAT:ns SAT:na SAT:ss
38-5.56830 0.0510135 -0.486096  0.61811  0.0585384 -0.1339219
39 2.30754 0.1092886  0.388349 -0.40717 -0.0600899  0.0477736
>

But I'm looking for a method I can generalize.  I also tried the 
following,  using subset= in
update() and lm(), giving a different wrong answer. (Am I using the 
subset= argument

correctly?)

> coef(update(rohwer.mod, subset=1:n !=1, data=Rohwer1))
   SAT  PPVT  Raven
(Intercept) -28.428163 51.762326 11.0052852
n 0.912497 -0.316177  0.1294643
s 5.478358  0.280319  0.6639180
ns   -6.119679 -1.828516 -0.2995350
na5.383479  2.096793  0.1940462
ss   -0.345764  0.448199 -0.0725977
Warning message:
In 1:n : numerical expression has 32 elements: only the first used

So, how can I calculate leave-one-out coefficients (and other 
quantities) efficiently

for mlm-s?

-Michael



--
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University  Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele StreetWeb:   http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained,

Re: [R] influence measures for multivariate linear models

2010-08-10 Thread Peter Dalgaard
Michael Friendly wrote:
> Barrett & Ling, JASA, 1992, v.87(417), pp184-191 define general classes 
> of influence measures for multivariate
> regression models, including analogs of Cook's D, Andrews & Pregibon 
> COVRATIO, etc.  As in univariate
> response models, these are based on leverage and residuals based on 
> omitting one (or more) observations at
> a time and refitting, although, in the univariate case, the computations 
> can be optimized, as they are in
> stats::influence() and related methods.
> 
> I'm interested in exploring the multivariate extension in R.  I tried 
> the following, and was surprised to find that
> R returned a result rather than an error -- presumably because mlm 
> objects are not trapped before they
> get to lm.influence()
> 
>  > # multivariate model
>  > data(Rohwer, package="heplots")
>  > rohwer.mod <- lm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss, 
> data=Rohwer)
> 
>  > names(influence(rohwer.mod))
> [1] "hat"  "coefficients" "sigma""wt.res" 
>  > head(influence(rohwer.mod)$coefficients, 2)
> [,1]   [,2]  [,3] [,4]  [,5]  [,6]
> [1,] 2.25039  0.0254739 -0.025252 -0.06297 -0.121507  0.094355
> [2,] 0.84649 -0.0062656 -0.077430  0.08345 -0.022579 -0.059480
>  >
> 
> Of course, the correct calculations would result from refitting, 
> omitting each observation in turn, though doing this
> directly would be horribly inefficient.
> e.g, calculating B(i), deleting case i:
> 
>  > coef(update(rohwer.mod, subset=1:69 !=1, data=Rohwer))
>   SAT PPVT Raven
> (Intercept) -2.466079 35.68664 11.510068
> n1.888286  0.60949  0.075931
> s   -0.034524 -0.53040  0.160328
> ns  -2.739834 -0.67355  0.066392
> na   2.219340  1.20481 -0.037272
> ss   1.072300  0.99033  0.058509
>  > coef(update(rohwer.mod, subset=1:69 !=2, data=Rohwer))
>   SAT PPVT  Raven
> (Intercept) -1.062178 33.88199 10.8988006
> n1.920026  0.59735  0.0713976
> s0.017654 -0.47464  0.1774135
> ns  -2.886254 -0.67905  0.0673686
> na   2.120411  1.29016 -0.0077484
> ss   1.226135  0.96430  0.0471764
> 
> Is there anything existing for this case that I've missed, or does 
> anyone have an interest in pursuing this topic?

Hmm, fitted coefficients in this sort multivariate models are the same
as those in the univariate ones, so as long as you do whole-case
deletions, I would think that you should be able to reuse the 1D code. I
would conjecture that the main problem with what you currently get is
that it only pertains to the 1st column -- looks like the differences
between the two rows from lm.influence matches the differences between
the first two colums from coef(update(...)).

Since lm() only handles complete cases, casewise deletion diagnostics is
probably the best you can get, otherwise it would be interesting to see
the effect of deleting each coordinate separately.

(As you know, these matters are within my general sphere of interest,
but I'm afraid my time is too constrained at them moment for more than a
sideline view.)

> 
> -Michael
> 


-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] influence measures for multivariate linear models

2010-08-10 Thread Michael Friendly
Barrett & Ling, JASA, 1992, v.87(417), pp184-191 define general classes 
of influence measures for multivariate
regression models, including analogs of Cook's D, Andrews & Pregibon 
COVRATIO, etc.  As in univariate
response models, these are based on leverage and residuals based on 
omitting one (or more) observations at
a time and refitting, although, in the univariate case, the computations 
can be optimized, as they are in

stats::influence() and related methods.

I'm interested in exploring the multivariate extension in R.  I tried 
the following, and was surprised to find that
R returned a result rather than an error -- presumably because mlm 
objects are not trapped before they

get to lm.influence()

> # multivariate model
> data(Rohwer, package="heplots")
> rohwer.mod <- lm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss, 
data=Rohwer)


> names(influence(rohwer.mod))
[1] "hat"  "coefficients" "sigma""wt.res" 
> head(influence(rohwer.mod)$coefficients, 2)

   [,1]   [,2]  [,3] [,4]  [,5]  [,6]
[1,] 2.25039  0.0254739 -0.025252 -0.06297 -0.121507  0.094355
[2,] 0.84649 -0.0062656 -0.077430  0.08345 -0.022579 -0.059480
>

Of course, the correct calculations would result from refitting, 
omitting each observation in turn, though doing this

directly would be horribly inefficient.
e.g, calculating B(i), deleting case i:

> coef(update(rohwer.mod, subset=1:69 !=1, data=Rohwer))
 SAT PPVT Raven
(Intercept) -2.466079 35.68664 11.510068
n1.888286  0.60949  0.075931
s   -0.034524 -0.53040  0.160328
ns  -2.739834 -0.67355  0.066392
na   2.219340  1.20481 -0.037272
ss   1.072300  0.99033  0.058509
> coef(update(rohwer.mod, subset=1:69 !=2, data=Rohwer))
 SAT PPVT  Raven
(Intercept) -1.062178 33.88199 10.8988006
n1.920026  0.59735  0.0713976
s0.017654 -0.47464  0.1774135
ns  -2.886254 -0.67905  0.0673686
na   2.120411  1.29016 -0.0077484
ss   1.226135  0.96430  0.0471764

Is there anything existing for this case that I've missed, or does 
anyone have an interest in pursuing this topic?


-Michael

--
Michael Friendly Email: friendly AT yorku DOT ca 
Professor, Psychology Dept.

York University  Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele StreetWeb:   http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.