Dear R-help,

This is a follow-up to my previous post here:
http://groups.google.com/group/r-help-archive/browse_thread/thread/d9b6f87ce06a9fb7/e9be30a4688f239c?lnk=gst&q=dobomode#e9be30a4688f239c
I am working on developing an open-source automated system for running batch regressions on very large datasets. In my previous post, I asked how to obtain VIF's from the output of biglm(). With a lot of help from Thomas Lumley, Associate Professor of Biostatistics at the University of Washington, I was able to make significant progress, but ultimately got stuck. This post describes the steps and reasoning I followed in trying to accomplish the task. Please note that I am not a statistician, so please ignore any commentary that seems naive to you.

A quick intro: the goal is to obtain VIF's (variance inflation factors) from the regression output of biglm(). Traditionally, this has been possible with the regular lm() function. A quick illustration follows (the model below is pretty silly, only for illustration purposes).

Example dataset:

> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Example model:

model <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

Regression:

reg_lm <- lm(model, mtcars)

Results:

> summary(reg_lm)

Call:
lm(formula = model, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869,     Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

VIF's:

> vif(reg_lm)
      cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873  4.648487  5.357452  7.908747

Here is the definition of vif() (courtesy of http://www.stat.sc.edu/~hitchcock/bodyfatRexample.txt):

> vif.lm
function(object, ...) {
    V <- summary(object)$cov.unscaled
    Vi <- crossprod(model.matrix(object))
    nam <- names(coef(object))
    if (k <- match("(Intercept)", nam, nomatch = FALSE)) {
        v1 <- diag(V)[-k]
        v2 <- (diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k])
        nam <- nam[-k]
    } else {
        v1 <- diag(V)
        v2 <- diag(Vi)
        warning("No intercept term detected. Results may surprise.")
    }
    structure(v1 * v2, names = nam)
}

My understanding is that the function uses the unscaled variance-covariance matrix V and the cross-product of the model matrix Vi to compute the VIF's from the standard definition VIF_j = 1/(1 - R_j^2).

Now, attempting to do the same using biglm():

> reg_biglm <- biglm(model, mtcars)
> summary(reg_biglm)
Large data regression model: biglm(model, mtcars)
Sample size =  32
               Coef    (95%     CI)     SE     p
(Intercept) 12.303 -25.132 49.739 18.718 0.511
cyl         -0.111  -2.201  1.979  1.045 0.915
disp         0.013  -0.022  0.049  0.018 0.455
hp          -0.021  -0.065  0.022  0.022 0.324
drat         0.787  -2.484  4.058  1.635 0.630
wt          -3.715  -7.504  0.074  1.894 0.050
qsec         0.821  -0.641  2.283  0.731 0.261
vs           0.318  -3.891  4.527  2.105 0.880
am           2.520  -1.593  6.634  2.057 0.220
gear         0.655  -2.331  3.642  1.493 0.661
carb        -0.199  -1.857  1.458  0.829 0.810

The challenge is to adapt the vif() function to work with biglm. I was able to obtain the cross-product matrix Vi using the following method. biglm supports the calculation of the Huber/White sandwich covariance matrix:

> reg_biglm <- biglm(model, mtcars, sandwich = TRUE)
> sandwich <- vcov(reg_biglm)

Out of this, I can extract the model-based (scaled) variance-covariance matrix:

> v.scaled <- attr(vcov(reg_biglm), "model-based")

Based on my research (big thanks to Thomas Lumley, Associate Professor of Biostatistics at the University of Washington):

sandwich = v.scaled %*% model_matrix %*% v.scaled

It is the model_matrix term that I need in order to compute the VIF's. I obtain it by inverting v.scaled and moving the two outer factors to the other side of the equation:

> v.scaled.inverted <- solve(v.scaled)
> model_matrix <- v.scaled.inverted %*% sandwich %*% v.scaled.inverted

At this point, I have both the variance-covariance matrix and the model matrix, so I should be able to compute the VIF's using the function above.
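For what it is worth, here is the matrix-only rewrite of vif() I am planning to use, only a sketch: it performs the same v1*v2 computation as vif.lm above, but takes the two matrices as explicit arguments instead of pulling them out of an lm object. The name vif.matrices and its argument names are my own, not part of any package, and so far I have only checked it against lm(); the scaling problem described below still stands in the way of feeding it the matrices recovered from biglm.

vif.matrices <- function(V, Vi, nam = colnames(Vi)) {
    ## V : unscaled variance-covariance matrix, e.g. summary(fit)$cov.unscaled
    ## Vi: cross-product of the model matrix,  e.g. crossprod(model.matrix(fit))
    if (k <- match("(Intercept)", nam, nomatch = FALSE)) {
        v1 <- diag(V)[-k]
        v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]   # centered sums of squares
        nam <- nam[-k]
    } else {
        v1 <- diag(V)
        v2 <- diag(Vi)
        warning("No intercept term detected. Results may surprise.")
    }
    structure(v1 * v2, names = nam)
}

## Sanity check against lm(): reproduces vif(reg_lm) above exactly
vif.matrices(summary(reg_lm)$cov.unscaled, crossprod(model.matrix(reg_lm)))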
However, on examining the contents of the two variance-covariance matrices, I see that the one from lm() (cov.unscaled) is unscaled, while the one I obtained from the sandwich calculations is scaled:

> head(v.scaled)
              (Intercept)           cyl          disp            hp          drat           wt          qsec           vs            am          gear
(Intercept) 350.359197487 -13.163824437 -0.0059101820 -0.0266100189 -12.941827906   3.156000750 -10.546511742  3.646214479 -8.8845922019 -11.357908146
cyl         -13.163824437   1.092073828 -0.0049669381 -0.0041817837   0.471219880   0.223389955   0.209717934  0.705445371  0.5605478034   0.550163881
disp         -0.005910182  -0.004966938  0.0003188903 -0.0002040364  -0.003384402  -0.026026366   0.003725863  0.003764231  0.0009902057  -0.002094405
hp           -0.026610019  -0.004181784 -0.0002040364  0.0004738710   0.003095272   0.009914533   0.001724661 -0.012474163 -0.0021256566  -0.002907768
drat        -12.941827906   0.471219880 -0.0033844018  0.0030952720   2.674445074   0.521759369   0.043438912 -0.102380147 -0.5260445782  -0.182954006
wt            3.156000750   0.223389955 -0.0260263658  0.0099145334   0.521759369   3.588805538  -0.702017025  0.338026038  0.3675570022   0.513001888

> head(summary(reg_lm)$cov.unscaled)
              (Intercept)           cyl           disp            hp          drat           wt          qsec            vs            am          gear
(Intercept) 49.8835321876 -1.8742423910  -8.414814e-04 -3.788688e-03 -1.8426349117  0.449345889 -1.5015939690  0.5191416655 -1.2649727601 -1.6171191755
cyl         -1.8742423910  0.1554875692  -7.071840e-04 -5.953951e-04  0.0670914657  0.031805873  0.0298592741  0.1004400830  0.0798098197  0.0783313749
disp        -0.0008414814 -0.0007071840   4.540305e-05 -2.905034e-05 -0.0004818652 -0.003705589  0.0005304819  0.0005359447  0.0001409838 -0.0002981977
hp          -0.0037886881 -0.0005953951  -2.905034e-05  6.746893e-05  0.0004406994  0.001411614  0.0002455543 -0.0017760496 -0.0003026473 -0.0004140029
drat        -1.8426349117  0.0670914657  -4.818652e-04  4.406994e-04  0.3807828305  0.074287190  0.0061847567 -0.0145767070 -0.0748973106 -0.0260486727
wt           0.4493458888  0.0318058726  -3.705589e-03  1.411614e-03  0.0742871900  0.510967881 -0.0999519610  0.0481275585  0.0523321257  0.0730403152

They differ by a constant factor:

> head(v.scaled) / head(summary(reg_lm)$cov.unscaled)
            (Intercept)      cyl     disp       hp     drat       wt     qsec       vs       am     gear     carb
(Intercept)    7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
cyl            7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
disp           7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
hp             7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
drat           7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
wt             7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544

This factor turns out to be the square of the regression's sigma:

> summary(reg_lm)$sigma ^ 2
[1] 7.023544

Sigma is defined (see ?summary.lm) as the square root of the estimated variance of the random error:

sigma^2 = 1/(n-p) * Sum(w[i] * R[i]^2),

where R[i] is the i-th residual, residuals[i].

The ultimate challenge is that sigma is returned by lm() but not by biglm(). To calculate sigma, it looks like I would need to compute the residuals for the entire dataset, which defeats the purpose of biglm(). Does anybody know if there is an easier way of obtaining sigma? Any guidance on this subject would be greatly appreciated!
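In case it helps the discussion, the only workaround I have come up with so far is sketched below (untested beyond this toy example): since RSS = y'y - b'X'y for unweighted least squares, both y'y and X'y can be accumulated chunk by chunk alongside the biglm()/update() calls, so sigma^2 = RSS / (n - p) can be recovered without ever storing the residuals. The split() of mtcars into two pieces only mimics a chunked reader, and all of the variable names are my own.

library(biglm)

## Two fake "chunks" standing in for a chunked reader over a large dataset
chunk.list <- split(mtcars, rep(1:2, each = 16))

yty <- 0        # running y'y
Xty <- 0        # running X'y
n   <- 0        # running row count
reg_biglm <- NULL

for (chunk in chunk.list) {
    X <- model.matrix(model, chunk)
    y <- chunk$mpg                      # response of the example model
    if (is.null(reg_biglm)) {
        reg_biglm <- biglm(model, chunk)
    } else {
        reg_biglm <- update(reg_biglm, chunk)
    }
    yty <- yty + sum(y^2)
    Xty <- Xty + crossprod(X, y)
    n   <- n + nrow(chunk)
}

b      <- coef(reg_biglm)
RSS    <- yty - sum(b * Xty)            # RSS = y'y - b'X'y for least squares
sigma2 <- RSS / (n - length(b))
sigma2                                  # 7.023544, matches summary(reg_lm)$sigma^2

This adds one more pass of bookkeeping per chunk but no extra pass over the data, so it should not defeat the purpose of biglm; I would still prefer a way to get sigma directly from the biglm object if one exists.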