I don't see your simulation results as odd at all. When only the dependent variable has missing data and there are no "auxiliary" variables, listwise deletion is optimal (it's the ML estimate), and certainly superior to multiple imputation. The imputation process introduces additional random variation into the estimates and, as you discovered in your simulation, that additional variation can be substantial.
----------------------------------------------------------------
Paul D. Allison, Professor & Chair
Department of Sociology
University of Pennsylvania
3718 Locust Walk
Philadelphia, PA 19104-6299
voice: 215-898-6717 or 215-898-6712
fax: 215-573-2081
[email protected]
http://www.ssc.upenn.edu/~allison

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Paul von Hippel
Sent: Thursday, February 19, 2004 12:31 PM
To: [email protected]
Subject: IMPUTE: Re: number imputations recommended by Hershberger and Fisher

Rubin (1987, Table 4.2) shows that even with .9 missing information, confidence intervals using as few as 5 imputations will have close to their nominal level of coverage. But increasing M beyond 5 has benefits nonetheless: it increases df, narrowing confidence intervals while maintaining their coverage levels.

A while back, I simulated 10,000 observations where X1 and X2 were complete and independent, and half the Y values were missing as a function of X1. Because of the high missingness, the regression parameters had only 11-16 df, even when I used M=20 imputations. This struck me as odd, since when only Y is missing, and missing at random, maximum likelihood regression estimates are the same as those obtained from listwise deletion. The listwise estimates would have ~5000 df, and it seems strange that the MI df would be so much lower.

Best wishes,
Paul von Hippel

At 11:31 AM 2/19/2004, Paul Allison wrote:
>Some further thoughts:
>
>1. The arguments I've seen for using around five imputations are based
>on efficiency calculations for the parameter estimates. But what about
>standard errors and p-values? I've found them to be rather unstable
>for moderate to large fractions of missing information.
>
>2. Joe Schafer told me several months ago that he had a dissertation
>student whose work showed that substantially larger numbers of
>imputations were often required for good inference. But I don't know
>any of the details.
>
>3. For these reasons, I've adopted the following rule of thumb: do a
>sufficient number of imputations to get the estimated df over 100 for
>all parameters of interest. I'd love to know what others think of
>this.
>
>----------------------------------------------------------------
>Paul D. Allison, Professor & Chair
>Department of Sociology
>University of Pennsylvania
>3718 Locust Walk
>Philadelphia, PA 19104-6299
>voice: 215-898-6717 or 215-898-6712
>fax: 215-573-2081
>[email protected]
>http://www.ssc.upenn.edu/~allison
>
>
>I'm baffled too on both counts. Modest numbers of imputations work
>fine unless the fractions of missing information are very high (> 50%),
>and then I wouldn't think of those situations as missing data problems
>except in a formal sense. And the number of them is a random
>variable??? I guess we'll have to read what they wrote...
>
>
>On Thu, 19 Feb 2004, Howells, William wrote:
>
> > I came across a note from Hershberger and Fisher on the number of
> > imputations (citation below), where they conclude that a much larger
> > number of imputations is required (over 500 in some cases) than the
> > usual rule of thumb that a relatively small number of imputations is
> > needed (say 5 to 20; Rubin 1987, Schafer 1997). They argue that
> > the traditional rules of thumb are based on simulations rather than
> > sampling theory. Their calculations assume that the number of
> > imputations is a random variable from a uniform distribution and use
> > a formula from Levy and Lemeshow (1999), n >= (z**2)(V**2)/(e**2),
> > where n is the number of imputations, z is a standard normal variable,
> > V**2 is the squared coefficient of variation (~1.33), and e is the
> > "amount of error, or the degree to which the predicted number of
> > imputations differs from the optimal or 'true' number of imputations".
> > For example, with z=1.96 and e=.10, n=511 imputations are required.
> >
> > I'm having difficulty conceiving of the number of imputations as a
> > random variable. What does "true" number of imputations mean? Is
> > this argument legitimate? Should I be using 500 imputations instead
> > of 5?
> >
> > Bill Howells, MS
> > Behavioral Medicine Center
> > Washington University School of Medicine
> > St Louis, MO
> >
> > Hershberger SL, Fisher DG (2003), Note on determining the number of
> > imputations for missing data, Structural Equation Modeling, 10(4):
> > 648-650.
> >
> > http://www.leaonline.com/loi/sem
>
>--
>Donald B. Rubin
>John L. Loeb Professor of Statistics
>Chairman, Department of Statistics
>Harvard University
>Cambridge MA 02138
>Tel: 617-495-5498  Fax: 617-496-8057

Paul von Hippel
Department of Sociology / Initiative in Population Research
Ohio State University
300 Bricker Hall
190 N. Oval Mall
Columbus OH 43210
614 688-3768
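[Editor's note] The two formulas debated in the thread are easy to check numerically: the Levy and Lemeshow sample-size formula n >= z²V²/e² quoted by Howells, and Rubin's (1987) degrees-of-freedom formula nu = (M-1)(1 + 1/r)², whose collapse toward M-1 under high fractions of missing information is what von Hippel observed. A minimal sketch in Python; the within/between variance inputs to rubin_df are illustrative values, not taken from von Hippel's simulation:

```python
import math

def levy_lemeshow_n(z, v_squared, e):
    """Sample-size formula as quoted by Howells: n >= z^2 * V^2 / e^2."""
    return math.ceil(z**2 * v_squared / e**2)

def rubin_df(m, within_var, between_var):
    """Rubin (1987) degrees of freedom for an MI estimate with m imputations.

    r is the relative increase in variance due to nonresponse; when r is
    large (high missing information), df is driven down toward m - 1.
    """
    r = (1 + 1/m) * between_var / within_var
    return (m - 1) * (1 + 1/r)**2

# The worked example from the thread: z = 1.96, V^2 ~ 1.33, e = .10
print(levy_lemeshow_n(1.96, 1.33, 0.10))  # 511, matching the note

# Illustration with made-up variances: even with m = 20 imputations,
# large between-imputation variance leaves df around 41, far below
# the ~5000 df a complete-data (or listwise) analysis would have.
print(round(rubin_df(20, within_var=1.0, between_var=2.0), 1))  # 41.4
```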
