I have been interested in what people have been saying about Solas, as it fits in with my biases against standalone statistical software.
I have been running a lot of simulations in S-Plus comparing MICE (which has had excellent performance in my tests) with a new S-Plus and R function I've recently developed called aregImpute. So far, the much faster bootstrap-based aregImpute function seems to work as well as the approximate Bayesian predictive distribution approach implemented in MICE, in terms of bias and MSE of regression coefficients. I have attached documentation to aregImpute and would appreciate any reactions to or questions about the algorithm proposed there. Sincerely, Frank Harrell -- Frank E Harrell Jr Prof. of Biostatistics & Statistics Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostatTitle: R: Multiple Imputation using Additive Regression, Bootstrap, and Predictive Mean Matching
aregImpute {Hmisc} | R Documentation |
Multiple Imputation using Additive Regression, Bootstrap, and Predictive Mean Matching
Description
The transcan
function creates flexible additive imputation models
but provides only an approximation to true multiple imputation as the
imputation models are fixed before all multiple imputations are
drawn. This ignores variability caused by having to fit the
imputation models. aregImpute
takes all aspects of uncertainty in
the imputations into account by using the bootstrap to approximate the
process of drawing predicted values from a full Bayesian predictive
distribution. Different bootstrap resamples are used for each of the
multiple imputations, i.e., for the i
th imputation of a sometimes
missing variable, i=1,2,...n.impute
, a flexible additive
model is fitted on a sample with replacement
from the original data and this model is used to predict all of the
original missing and non-missing values for the target variable.
Two methods are used to fit the imputation models, ace
and
avas
. Unless the identity transformation is specified, these
methods simultaneously find transformations of the target variable and
of all of the predictors, to get a good fit assuming additivity.
ace
maximizes R-squared, and avas
attempts to maximize R-squared
while stabilizing the variance of residuals. When a categorical
variable is being predicted, only ace
is used. Like transcan
s use
of canonical regression, this is Fisher's optimum scoring method for
categorical variables. For continuous variables, monotonic
transformations of the target variable are assumed when avas
is
used. For ace
, the default allows nonmonotonic transformations of
target variables. When variables are used as predictors, the nonparametric
transformations derived by ace
or avas
can be restricted by the
user to be monotonic.
Instead of taking random draws from fitted imputation models using
random residuals as is done by transcan
, aregImpute
uses
predictive mean matching. This works for binary, categorical, and
continuous variables without the need for iterative maximum likelihood
fitting for binary and categorical variables, and without the need for
computing residuals or for curtailing imputed values to be in the range of
actual data. Predictive mean matching is especially attractive when
the variable being imputed is also being transformed automatically.
See Details below for more information about the algorithm.
Typically, fit.mult.impute
will be called after aregImpute
.
Usage
aregImpute(formula, n.impute=5, method=c('ace','avas'), defaultLinear=FALSE, pr=TRUE, x=FALSE, data, subset)
Arguments
formula |
an S model formula. You can specify restrictions for transformations
of variables. The function automatically determines which variables
are categorical (i.e., factor , category , or character vectors).
Binary variables are automatically restricted to be linear. Force
linear transformations of continuous variables by enclosing variables
by the identify function (I() ), and specify monotonicity by using
monotone(variable) .
|
n.impute |
number of multiple imputations. n.impute=5 is frequently
recommended but 10 or more doesn't hurt.
|
method |
method ("ace" , the default, or "avas" ) for modeling a variable to
be imputed. As avas does not allow the response variable to be
categorical, "ace" is always used for such variables.
|
defaultLinear |
set to TRUE to force all continuous variables to be linear in any
model. This is recommended when the sample size is small.
|
pr |
set to FALSE to suppress printing of iteration messages
|
x |
set to TRUE to save the data matrix containing the final (number
n.impute ) imputations in the result. This
is needed if you want to later do out-of-sample imputation.
|
data |
|
subset |
These may be also be specified. You may not specify na.action as
na.retain is always used.
|
Details
The sequence of steps used by the aregImpute
algorithm is the
following.
(1) For each variable containing m NA
s where m > 0, initialize the
NA
s to values from a random sample (without replacement if
a sufficient number of non-missing values exist) of size m from the
non-missing values.
(2) For 3+n.impute
iterations do the following steps. The first 3
iterations provide a burn-in, and imputations are saved only from the last
n.impute
iterations.
(3) For each variable containing any NA
s, draw a sample with
replacement from the observations in the entire dataset in which the
current variable being imputed is non-missing. Fit a flexible
additive model to predict this target variable while finding the
optimum transformation of it (unless the identity
transformation is forced). Use this fitted semiparametric model to
predict the target variable in all of the original observations.
Impute each missing value of the target variable with the observed
value whose predicted transformed value is closest to the predicted
transformed value of the missing value.
(4) After these imputations are computed, use these random draw
imputations the next time the curent target variable is used as a
predictor of other sometimes-missing variables.
Predictive mean matching does not work well when fewer than 3 variables are used to predict the target variable, because many of the multiple imputations for an observation will be identical.
Value
a list of class "aregImpute"
containing the following elements:
call |
the function call expression |
formula |
the formula specified to aregImpute
|
method |
the method argument
|
n |
total number of observations in input dataset |
p |
number of variables |
na |
list of subscripts of observations for which values were originally missing |
nna |
named vector containing the numbers of missing values in the data |
linear |
vector of names of variables restricted to be linear |
categorical |
vector of names of categorical variables |
monotone |
vector of names of variables restricted to be monotonic |
cat.levels |
list containing character vectors specifying the levels of
categorical variables
|
n.impute |
number of multiple imputations per missing value |
imputed |
a list containing matrices of imputed values in the same format as
those created by transcan . Categorical variables are coded using
their integer codes. Variables having no missing values will have
NULL matrices in the list.
|
x |
if x=TRUE , a numeric matrix containing the raw data used to derive
the imputations, filled in with the last (n.impute ) imputations used.
Categorical variables are coded as integers in this matrix.
|
Author(s)
Frank Harrell
Division of Biostatistics and Epidemiology
University of Virginia
[EMAIL PROTECTED]
See Also
fit.mult.impute
, transcan
, ace
, naclus
, naplot
, mice
Examples
# Multiple imputation and estimation of variances and covariances of # regression coefficient estimates accounting for imputation # Example 1: large sample size, much missing data, no overlap in # NAs across variables set.seed(3) x1 <- factor(sample(c('a','b','c'),1000,T)) x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2) x3 <- rnorm(1000) y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2) orig.x1 <- x1[1:250] orig.x2 <- x2[251:350] x1[1:250] <- NA x2[251:350] <- NA # Complete case analysis # Use 100 imputations to better check against individual true values f <- aregImpute(~y + x1 + x2 + x3, n.impute=100) hist(f$imputed$x2) modecat <- function(u) { tab <- table(u) as.numeric(names(tab)[tab==max(tab)][1]) } table(orig.x1,apply(f$imputed$x1, 1, modecat)) plot(orig.x2, apply(f$imputed$x2, 1, mean)) fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f, data="data.frame(x1,x2,y))" sqrt(diag(Varcov(fmi))) fcc <- lm(y ~ x1 + x2 + x3) summary(fcc) # SEs are larger than from mult. imputation # Example 2: Very discriminating imputation models, # x1 and x2 have some NAs on the same rows, smaller n set.seed(5) x1 <- factor(sample(c('a','b','c'),100,T)) x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4) x3 <- rnorm(100) y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4) orig.x1 <- x1[1:20] orig.x2 <- x2[18:23] x1[1:20] <- NA x2[18:23] <- NA #x2[21:25] <- NA n <- naclus(data.frame(x1,x2,y)) plot(n); naplot(n) # Show patterns of NAs # 100 imputations to study them; normally use 5 or 10 f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, defaultLinear=T) table(f$imputed$x1) hist(f$imputed$x2) table(orig.x1,apply(f$imputed$x1, 1, modecat)) plot(orig.x2, apply(f$imputed$x2, 1, mean)) r <- range(f$imputed$x2,orig.x2) par(mfrow=c(2,3)) for(i in 1:6) { plot(1:100, f$imputed$x2[i,], ylim=r, ylab=paste("Imputations for Obs.",i)) abline(h=orig.x2[i],lty=2) } fmi <- fit.mult.impute(y ~ x1 + x2, lm, f, data="data.frame(x1,x2,y))" sqrt(diag(Varcov(fmi))) fcc <- lm(y ~ x1 + x2) summary(fcc) # SEs are larger than from mult. imputation