I have done something similar to what Stef describes, for the imputation of
panel data for the Survey of Consumer Finances.  In our case, it is
important to be able to specify each column separately, both because of the
sparseness issue implicit in Stef's point and because there are prior
constraints on outcomes that would be very difficult to specify otherwise.
Best wishes,
Arthur

Arthur B. Kennickell
Assistant Director, Division of Research and Statistics
Mail Stop 153
Board of Governors of the Federal Reserve System
Washington, DC  20551
v: 202-452-2247
f: 202-728-5838
e: [email protected]
SCF website: http://www.federalreserve.gov/pubs/oss/oss2/scfindex.html

From:   "Buuren, S. (Stef) van" <[email protected]>
To:     <[email protected]>
Date:   12/07/2012 08:35 AM
Subject:        Re: Choosing an imputation model
Sent by:        Impute -- Imputations in Data Analysis
            <[email protected]>



Jonathan,
This is a problem that occurs in many social science and medical
applications. My approach is to build a separate imputation model for each
incomplete column, which requires far fewer predictors per sub-model (say
10-15). You can find an example using mice and R in Section 9.1 of the book
Flexible Imputation of Missing Data.
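
In outline, the per-column specification looks something like the sketch
below, using the predictorMatrix argument of mice; the data frame dat and
all variable names are hypothetical.

library(mice)

# Hypothetical data frame 'dat' with many columns; suppose only y1 and
# y2 are incomplete. Row i of the predictor matrix selects the columns
# used to impute variable i (1 = use as predictor, 0 = ignore).
pred <- matrix(0, ncol(dat), ncol(dat),
               dimnames = list(names(dat), names(dat)))
pred["y1", c("x1", "x2", "x3")] <- 1
pred["y2", c("x2", "x4", "x5")] <- 1

imp <- mice(dat, predictorMatrix = pred, m = 5, maxit = 20, seed = 1)

Each incomplete column then gets its own compact conditional model.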
Best wishes,
Stef

From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Jonathan Mohr
Sent: Wednesday, December 05, 2012 4:45 PM
To: [email protected]
Subject: Choosing an imputation model

Hi folks,
I'm writing with a question about how to develop an imputation model when
(a) there are many potential variables to include and (b) the number of
iterations required for the MCMC chain to stabilize is very high (~3000)
when a large number of variables are included in the imputation model. I'll
do my best to describe our situation briefly:

THE STUDY
Data from 48 people were collected at six time points and include over
2,000 variables. Each of the research questions requires running a multiple
regression in which 2-3 variables assessed at earlier time points predict a
variable assessed at the last time point. All data are available for the
outcome variable, but there are missing data for all of the predictors
(ranging from 5% to 31% missing).
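
(For context, if the imputations are generated in R with mice, the analysis
step for each research question would look roughly like the sketch below;
the imputation object imp and all variable names are hypothetical.)

library(mice)

# Fit the substantive regression in each imputed data set and pool
# the estimates across data sets with Rubin's rules.
fit <- with(imp, lm(outcome_t6 ~ pred1_t2 + pred2_t4))
summary(pool(fit))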

DEVELOPING THE IMPUTATION MODEL
We have tried two basic approaches to developing the imputation model. One
is simply to include in the imputation model all of the variables that will
appear in any of the analyses. This imputation model consists of around 35
variables. The other approach was to select a much larger pool of potential
variables to consider for inclusion in the imputation model. We identified
all variables that we believed would be associated with our main variables
of interest. We then conducted a series of stepwise regressions as a
shortcut to attempt to identify a smaller set of variables that uniquely
predicted each of the main variables for which data were missing. This
smaller set contained 18 variables, which--when added to the main
variables--led to an imputation model of 53 variables.
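
(If the imputation is run in R with mice, a less labor-intensive
alternative to the stepwise regressions is the quickpred() helper, which
screens predictors for each incomplete column on simple correlation
criteria; the threshold and variable names below are illustrative.)

library(mice)

# For each incomplete column, keep candidate predictors whose correlation
# with the target, or with the target's missingness indicator, exceeds
# 0.1, and force the main analysis variables into every sub-model.
pred <- quickpred(dat, mincor = 0.1,
                  include = c("outcome_t6", "pred1_t2", "pred2_t4"))
imp  <- mice(dat, predictorMatrix = pred, m = 5, seed = 1)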

QUESTION
When we generate imputed data sets with the smaller imputation model, the
chain stabilizes relatively quickly (a little over 100 iterations are
needed). In contrast, over 3000 iterations are needed with the larger
imputation model. Should we use the smaller imputation model, even if it
doesn't include variables that we know are uniquely predictive of variables
for which there are missing data?
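
(As a concrete illustration of this kind of convergence check, in R with
mice one would inspect trace plots of the chains; the call below is a
minimal sketch with an assumed data frame dat.)

library(mice)

# Run the sampler, then plot the mean and SD of the imputed values per
# variable, iteration, and chain; flat, well-mixed lines with no trends
# suggest the chains have stabilized.
imp <- mice(dat, m = 5, maxit = 100, seed = 1)
plot(imp)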

Thanks in advance for your thoughts!!
Jon

--
***Please note change of email to [email protected]***

Jonathan Mohr
Assistant Professor
Department of Psychology
Biology-Psychology Building
University of Maryland
College Park, MD 20742-4411

Office phone: 301-405-5907
Fax: 301-314-5966
Email: [email protected]

