Hello folks, I am very excited to have discovered R and have been exploring its capabilities. R's regression models are of great interest to me as my company is in the business of running thousands of linear regressions on large datasets.
I am using biglm to run linear regressions on datasets as large as several GB. I have been pleasantly surprised by how fast biglm runs the regressions (a regression that takes minutes in SPSS may take seconds in R). I have been trying to wrap my head around biglm and have a couple of questions.

1. How can I get VIFs (Variance Inflation Factors) using biglm? I was able to get VIFs from the regular lm function using this piece of code I found through Google, but I have not been able to adapt it to work with biglm. Has anyone been successful in this?

    vif.lm <- function(object, ...) {
      # Unscaled covariance of the coefficients, i.e. (X'X)^-1
      V <- summary(object)$cov.unscaled
      # Cross-product of the design matrix, X'X
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      if (k <- match("(Intercept)", nam, nomatch = FALSE)) {
        # Drop the intercept and centre the sums of squares
        v1 <- diag(V)[-k]
        v2 <- (diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k])
        nam <- nam[-k]
      } else {
        v1 <- diag(V)
        v2 <- diag(Vi)
        warning("No intercept term detected. Results may surprise.")
      }
      structure(v1 * v2, names = nam)
    }

2. How reliable / stable is biglm's update() function? I was experimenting with running regressions on individual chunks of my large dataset, but the coefficients I got were different from those obtained from running biglm on the whole dataset. Am I mistaken in thinking that update() is intended to run a regression in chunks (when memory becomes an issue with datasets that are too large) and to produce results identical to running a single regression on the dataset as a whole?

Thanks!
Dobo
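
On question 1, one possible route is to avoid model.matrix() altogether, since that is the part of vif.lm that needs the full design matrix in memory. For single-column predictors, the VIFs are the diagonal of the inverse of the correlation matrix of the coefficient estimates (intercept dropped), so a coefficient covariance matrix is enough. The sketch below is only that, a sketch: vif_from_vcov is a made-up helper name, and it is checked here against an ordinary lm fit on mtcars rather than against biglm. It assumes the covariance matrix carries coefficient names and that each term is a single column (no multi-level factors).

```r
# Sketch: VIFs from a coefficient covariance matrix alone.
# vif_from_vcov is a hypothetical helper, not part of biglm or base R.
vif_from_vcov <- function(V, intercept = "(Intercept)") {
  V <- as.matrix(V)
  stopifnot(!is.null(rownames(V)))
  keep <- rownames(V) != intercept
  if (all(keep)) {
    warning("No intercept term detected. Results may surprise.")
  }
  # Drop the intercept row and column
  V <- V[keep, keep, drop = FALSE]
  # Correlation matrix of the coefficient estimates; the residual
  # variance cancels, so scaled or unscaled covariance both work
  R <- cov2cor(V)
  # VIF_j is the j-th diagonal element of the inverse of R
  structure(diag(solve(R)), names = rownames(V))
}

# Check against an ordinary lm fit on a built-in dataset
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vifs <- vif_from_vcov(vcov(fit))

# Textbook definition for one predictor: 1 / (1 - R^2), where R^2
# comes from regressing that predictor on the others
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
all.equal(unname(vifs["wt"]), 1 / (1 - r2_wt))
```

If vcov() works on a biglm fit the same way it does on an lm fit (an assumption worth checking in the biglm documentation), then vif_from_vcov(vcov(bigfit)) should give the VIFs without ever building the model matrix.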