Dear Peter,

I'm sorry that I've taken a while to get back to you -- I was away for a few days.

In the example that you give from Belsley (1991), the predictors are essentially perfectly linearly related; for example

> summary(lm(x2a ~ x3a + x4a))

Call:
lm(formula = x2a ~ x3a + x4a)

Residuals:
         1          2          3          4          5          6          7          8
-0.0195624 -0.0152938  0.0078068  0.0323025 -0.0087845  0.0025448  0.0014472 -0.0004606

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept)  0.007943   0.010025    0.792    0.464
x3a         -6.181811   0.016069 -384.716 2.25e-12 ***
x4a         28.540996   0.066907  426.580 1.34e-12 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.01901 on 5 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1
F-statistic: 1.033e+05 on 2 and 5 DF,  p-value: 2.879e-12

In a case like this, the variance-inflation factors will also be very large:

> vif(lm(y~ x2a + x3a + x4a))
     x2a      x3a      x4a
41333.34 47141.19 57958.62

Any of several methods of discovering the linear relationship among the x's will work -- including the first regression above, a principal-components analysis, and Belsley's approach.
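For instance, a quick principal-components check along these lines (just a sketch, using the x2a, x3a, and x4a vectors from your message below) turns up one component with nearly zero variance, whose loadings describe the near-exact linear dependency:

X  <- cbind(x2a, x3a, x4a)
pc <- prcomp(X, scale. = TRUE)
pc$sdev           # the last value is nearly zero -- one near-exact dependency
pc$rotation[, 3]  # its loadings give the offending linear combination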

I'm not arguing that discovering the source of large standard errors in a regression is completely uninteresting, although in most circumstances there isn't much one can do about it short of collecting new data -- but this probably isn't the proper forum for a detailed discussion of collinearity (my fault for broaching the issue in the first place).

Except with respect to centering the data, I suspect that we largely agree about these matters.

Regards,
 John


At 11:03 AM 7/24/2003 -0400, Peter Flom wrote:
Dear John

An interesting discussion!

I would be the last to suggest ignoring such diagnostics as Cook's D;
as you point out, it diagnoses a problem which condition indices do not:
whether a point is influential.

OTOH, condition indices diagnose a problem which Cook's D does not:
whether shifting the data slightly would change the results.

Consider the data given in Belsley (1991) on p. 5

y   <- c( 3.3979, 1.6094, 3.7131, 1.6767, 0.0419, 3.3768, 1.1661, 0.4701)
x2a <- c(-3.138, -0.297, -4.582, 0.301, 2.729, -4.836, 0.065, 4.102)
x2b <- c(-3.136, -0.296, -4.581, 0.300, 2.730, -4.834, 0.064, 4.103)
x3a <- c(1.286, 0.250, 1.247, 0.498, -0.280, 0.350, 0.208, 1.069)
x3b <- c(1.288, 0.251, 1.246, 0.498, -0.281, 0.349, 0.206, 1.069)
x4a <- c(0.169, 0.044, 0.109, 0.117, 0.035, -0.094, 0.047, 0.375)
x4b <- c(0.170, 0.043, 0.108, 0.118, 0.036, -0.093, 0.048, 0.376)

where x2a, x3a, and x4a are very similar to x2b, x3b, and x4b,
respectively, and where both sets are generated from

y = 1.2 I - 0.4 x2 + 0.6 x3 + 0.9 x4 + e

e ~ N(0, 0.01)

Then
modela <- lm(y~ x2a + x3a + x4a)
and
modelb <- lm(y~x2b + x3b + x4b)

give radically different results, with neither having any significant
parameters other than the intercept.  Admittedly, the standard errors
for a couple of the parameters are large.  But why are they large? I
have certainly dealt with models with large standard errors that have
nothing to do with collinearity.
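A quick sketch, reusing the vectors and models defined above, makes the contrast concrete:

cbind(a = coef(modela), b = coef(modelb))        # the estimates differ wildly
rbind(a = coef(summary(modela))[, "Std. Error"],
      b = coef(summary(modelb))[, "Std. Error"]) # both sets of SEs are huge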

Here, the function PI.lm (supplied by Juergen Gross) gives huge
condition indices, and indicates that the nature of the problem is that
all three of the x variables are highly collinear.

Variance-Decomposition Proportions for
Scaled Condition Indexes:

    (Intercept) x2b x3b x4b
1        0.0494   0   0   0
1        0.0009   0   0   0
3        0.8101   0   0   0
464      0.1396   1   1   1
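For anyone without PI.lm at hand, essentially the same scaled condition indexes can be computed directly from the model matrix (a sketch along Belsley's lines, not Gross's function):

X  <- model.matrix(modelb)
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))  # columns scaled to unit length
d  <- svd(Xs)$d
round(max(d) / d, 1)  # scaled condition indexes; the largest is in the hundreds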





-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: [EMAIL PROTECTED]
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox
