I have done something similar to what Stef describes, for the imputation of panel data for the Survey of Consumer Finances. In our case, it is important to be able to specify each column separately, both because of the sparseness issue implicit in Stef's point and because there are prior constraints on outcomes that would be very difficult to specify otherwise.

Best wishes,
Arthur
Arthur B. Kennickell
Assistant Director, Division of Research and Statistics
Mail Stop 153
Board of Governors of the Federal Reserve System
Washington, DC 20551
v: 202-452-2247  f: 202-728-5838  e: [email protected]
SCF website: http://www.federalreserve.gov/pubs/oss/oss2/scfindex.html

From: "Buuren, S. (Stef) van" <[email protected]>
To: <[email protected]>
Date: 12/07/2012 08:35 AM
Subject: Re: Choosing an imputation model
Sent by: Impute -- Imputations in Data Analysis <[email protected]>

Jonathan,

This is a problem that occurs in many social science and medical applications. My approach is to build a separate imputation model for each incomplete column, which requires far fewer predictors per sub-model (say 10-15). You can find an example using mice and R in Section 9.1 of the book Flexible Imputation of Missing Data.

Best wishes,
Stef

From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Jonathan Mohr
Sent: Wednesday, December 05, 2012 4:45 PM
To: [email protected]
Subject: Choosing an imputation model

Hi folks,

I'm writing with a question about how to develop an imputation model when (a) there are many potential variables to include and (b) the number of iterations required for the MCMC chain to stabilize is very high (~3,000) when a large number of variables are included in the imputation model. I'll do my best to describe our situation briefly.

THE STUDY

Data from 48 people were collected at six time points and include over 2,000 variables. Each of the research questions requires running a multiple regression in which 2-3 variables assessed at earlier time points predict a variable assessed at the last time point. All data are available for the outcome variable, but there are missing data for all of the predictors (ranging from 5% to 31% missing).

DEVELOPING THE IMPUTATION MODEL

We have tried two basic approaches to developing the imputation model. One is simply to include in the imputation model all of the variables that will appear in any of the analyses; this imputation model consists of around 35 variables. The other approach was to select a much larger pool of potential variables to consider for inclusion in the imputation model. We identified all variables that we believed would be associated with our main variables of interest. We then conducted a series of stepwise regressions as a shortcut to identify a smaller set of variables that uniquely predicted each of the main variables for which data were missing. This smaller set contained 18 variables, which, when added to the main variables, led to an imputation model of 53 variables.

QUESTION

When we generate imputed data sets with the smaller imputation model, the chain stabilizes relatively quickly (a little over 100 iterations are needed). In contrast, over 3,000 iterations are needed with the larger imputation model. Should we use the smaller imputation model, even if it doesn't include variables that we know are uniquely predictive of variables for which there are missing data?

Thanks in advance for your thoughts!

Jon

--
***Please note change of email to [email protected]***
Jonathan Mohr
Assistant Professor
Department of Psychology
Biology-Psychology Building
University of Maryland
College Park, MD 20742-4411
Office phone: 301-405-5907
Fax: 301-314-5966
Email: [email protected]
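For readers following up on Stef's per-column suggestion above (and Arthur's point about specifying each column separately), a minimal sketch in R with the mice package is below. The data frame `dat`, the threshold values, and the column names in the commented-out line are hypothetical, not from the thread; quickpred() builds a binary predictor matrix that keeps only a small set of predictors for each incomplete column, and that matrix can be edited by hand to impose further constraints.

```r
library(mice)

## Hypothetical data frame with many columns and scattered missing values
## dat <- your_data

## quickpred() returns a predictor matrix: row i marks which columns are
## used to impute variable i. The mincor and minpuc thresholds drop weakly
## related or mostly-missing predictors, so each sub-model stays small
## (on the order of the 10-15 predictors Stef mentions, data permitting).
pred <- quickpred(dat, mincor = 0.30, minpuc = 0.25)

## How many predictors each column ends up with
rowSums(pred)

## The matrix can also be edited directly, e.g. to exclude a predictor that
## conflicts with a prior constraint (hypothetical column names)
## pred["outcome_t6", "admin_id"] <- 0

## Impute with the trimmed, column-specific models
imp <- mice(dat, predictorMatrix = pred, m = 20, seed = 123)
```

The threshold values are only starting points; a common workflow is to inspect rowSums(pred) and adjust mincor/minpuc until every incomplete column has a manageable number of predictors.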
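On the convergence question Jon raises, a short sketch follows, assuming a mids object `imp` produced by mice() as above; it shows one way to inspect whether the chains have stabilized and to extend them without restarting.

```r
## Trace plots of the mean and SD of the imputed values, per chain and
## iteration; well-mixed, trend-free lines suggest the sampler has settled.
plot(imp)

## If the traces still show drift, continue the same chains for additional
## iterations rather than restarting from scratch.
imp <- mice.mids(imp, maxit = 50)
plot(imp)
```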
