Thanks much to Dave, Juned, Stef, and Arthur for your sage advice!! I am looking forward to giving these strategies a go.

Jon
On Fri, Dec 7, 2012 at 11:08 AM, Arthur Kennickell <[email protected]> wrote:

> I have done something similar to what Stef describes, for the imputation of
> panel data for the Survey of Consumer Finances. In our case, it is
> important to be able to specify each column separately, both because of the
> sparseness issue implicit in Stef's point and because there are prior
> constraints on outcomes that would be very difficult to specify otherwise.
>
> Best wishes,
> Arthur
>
> Arthur B. Kennickell
> Assistant Director, Division of Research and Statistics
> Mail Stop 153
> Board of Governors of the Federal Reserve System
> Washington, DC 20551
> v: 202-452-2247
> f: 202-728-5838
> e: [email protected]
> SCF website: http://www.federalreserve.gov/pubs/oss/oss2/scfindex.html
>
> From: "Buuren, S. (Stef) van" <[email protected]>
> To: <[email protected]>
> Date: 12/07/2012 08:35 AM
> Subject: Re: Choosing an imputation model
> Sent by: Impute -- Imputations in Data Analysis <[email protected]>
>
> Jonathan,
>
> This is a problem that occurs in many social science and medical
> applications. My approach is to build a separate imputation model for each
> incomplete column, which requires far fewer predictors per submodel (say
> 10-15). You can find an example using mice and R in Section 9.1 of the book
> Flexible Imputation of Missing Data.
>
> Best wishes,
> Stef
>
> From: Impute -- Imputations in Data Analysis
> [mailto:[email protected]] On Behalf Of Jonathan Mohr
> Sent: Wednesday, December 05, 2012 4:45 PM
> To: [email protected]
> Subject: Choosing an imputation model
>
> Hi folks,
>
> I'm writing with a question about how to develop an imputation model when
> (a) there are many potential variables to include and (b) the number of
> iterations required for the MCMC chain to stabilize is very high (~3000)
> when a large number of variables are included in the imputation model. I'll
> do my best to describe our situation briefly:
>
> THE STUDY
> Data from 48 people were collected at six time points, and include over
> 2,000 variables. Each of the research questions requires running a multiple
> regression in which 2-3 variables assessed at earlier time points predict a
> variable assessed at the last time point. All data are available for the
> outcome variable, but there are missing data for all of the predictors
> (ranging from 5% to 31% missing).
>
> DEVELOPING THE IMPUTATION MODEL
> We have tried two basic approaches to developing the imputation model. One
> is simply to include in the imputation model all of the variables that will
> appear in any of the analyses. This imputation model consists of around 35
> variables. The other approach was to select a much larger pool of potential
> variables to consider for inclusion in the imputation model. We identified
> all variables that we believed would be associated with our main variables
> of interest. We then conducted a series of stepwise regressions as a
> shortcut to attempt to identify a smaller set of variables that uniquely
> predicted each of the main variables for which data were missing. This
> smaller set contained 18 variables, which--when added to the main
> variables--led to an imputation model of 53 variables.
>
> QUESTION
> When we generate imputed data sets with the smaller imputation model, the
> chain stabilizes relatively quickly (a little over 100 iterations are
> needed). In contrast, over 3000 iterations are needed with the larger
> imputation model. Should we use the smaller imputation model, even if it
> doesn't include variables that we know are uniquely predictive of variables
> for which there are missing data?
>
> Thanks in advance for your thoughts!!
>
> Jon
>
> --
> ***Please note change of email to [email protected]***
>
> Jonathan Mohr
> Assistant Professor
> Department of Psychology
> Biology-Psychology Building
> University of Maryland
> College Park, MD 20742-4411
>
> Office phone: 301-405-5907
> Fax: 301-314-5966
> Email: [email protected]
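
For anyone reading this thread later, here is a rough sketch of the idea Stef describes above: give each incomplete column its own short list of predictors instead of one large joint model. It is plain Python rather than mice, and it is not taken from the book. The function names (`pairwise_corr`, `choose_predictors`), the 0.1 correlation cutoff, and the cap of 15 predictors are all assumptions made for illustration. Selecting by absolute pairwise correlation is one simple screening rule among several, swapped in here for the stepwise regressions mentioned in the original question.

```python
import math


def pairwise_corr(x, y):
    """Pearson correlation using only rows where both values are observed.

    Missing values are represented as None. Returns 0.0 when fewer than
    three complete pairs exist or when either variable is constant.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def choose_predictors(data, always_include=(), max_predictors=15,
                      min_abs_corr=0.1):
    """Pick a separate predictor list for each incomplete column.

    data: dict mapping column name -> list of values (None = missing).
    always_include: columns forced into every list (for example, the
        variables of the analysis model); these are kept even if they
        exceed max_predictors.
    Returns a dict mapping each incomplete column to its predictor names.
    """
    plan = {}
    for target, values in data.items():
        if all(v is not None for v in values):
            continue  # complete columns need no imputation model
        chosen = [c for c in always_include if c != target and c in data]
        scored = []
        for cand, cand_values in data.items():
            if cand == target or cand in chosen:
                continue
            r = abs(pairwise_corr(values, cand_values))
            if r >= min_abs_corr:
                scored.append((r, cand))
        scored.sort(reverse=True)  # strongest absolute correlation first
        for _, cand in scored:
            if len(chosen) >= max_predictors:
                break
            chosen.append(cand)
        plan[target] = chosen
    return plan
```

The resulting plan (one predictor list per incomplete column) would then need to be translated into whatever per-variable specification your imputation software accepts. With only 48 cases, correlations computed on the observed pairs are noisy, so the cutoff and the cap deserve some sensitivity checking.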
