Dear Adriaan,

I redid your small simulation study, and I think it's rather easy to see (and explain) where the small-sample bias enters your problem.
Summing up your study: you have two variables X and Y, some values of X are MAR (depending on the values of Y), then you impute the missing Xs by (Bayesian) regression imputation, then you estimate the regression of Y on X with the imputed data, and you get a downward bias for the regression coefficient of X (say, b).

Of course, b in this situation is the covariance of X and Y divided by the variance of X. If you include these two statistics in your simulation study, you will notice that you don't get biased estimates of the covariance, but you do get upward-biased estimates of the variance of X. So the downward-biased regression coefficient really stems from the biased variance estimate of X. This means: if you have bivariate data X and Y, delete some values of X MAR, and then re-impute X by regression imputation, the variance of X in the data will be too large. Why?

To see what's happening here, look at the following, slightly different data model: Y normally distributed, X = b * Y + sigma * u (u standard normally distributed, independent of Y). Note that the variance of X is b^2 * Var(Y) + sigma^2.

Now suppose some values of X are MAR, and you use regression imputation to impute them, i.e. the imputed values will be created by X_imp = b_imp * Y + sigma_imp * u, with suitably chosen b_imp and sigma_imp (which will be different for different imputation steps in the context of multiple imputation, but this doesn't matter here). The variance of X_imp in an imputed data set will be b_imp^2 * Var(Y) + sigma_imp^2.

The usual regression imputation algorithms (like the Bayesian version implemented in ice) are constructed so that b_imp and sigma_imp^2 are unbiased estimates of b and sigma^2 (from a frequentist point of view). But then, of course, the *square* of b_imp is *not* an unbiased estimate of the square of b, and the variance of X_imp depends on the square of b_imp. Since E(b_imp^2) equals the square of E(b_imp) (which is b^2, given that b_imp is unbiased) *plus* the variance of b_imp, the square of b_imp is upward biased, and so is the variance of X in the imputed data set. (Since the variance of b_imp goes to zero for large samples, the bias of the variance of X_imp quickly vanishes as n gets larger.)

Thus, this is not a special problem of multiple imputation or ML estimation; it's the simple fact that the square of an unbiased estimator of some parameter is an upward-biased estimator of the square of that parameter. (A small simulation sketch illustrating this point is appended at the end of this thread.)

Does this sound sensible to you?

Best regards,
-Hans-

---
Hans Kiesl
Institute for Employment Research
Regensburger Str. 104
90478 Nuremberg
Germany
Tel.: +49 911 179 1358
Fax: +49 911 179 3297

-----Original Message-----
From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Hoogendoorn, Adriaan
Sent: Wednesday, November 18, 2009 16:27
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two variables

Dear Paul (and Alan and Rogier in previous posts),

Thank you for pointing out the issue of small-sample bias to me. The example you gave me (the ML estimator with denominator N versus the unbiased estimator with denominator N-1) made me vaguely remember the issue of bias in ML estimators, but I had somehow associated this topic with hair-splitting mathematicians and never realized that this theoretical phenomenon could have such large practical consequences (although here the practical consequences show up in my theoretical simulations). Thanks again!
Kind regards,
Adriaan Hoogendoorn

________________________________________
From: Impute -- Imputations in Data Analysis [[email protected]] On Behalf Of Paul Allison [[email protected]]
Sent: Wednesday, November 18, 2009 3:41 PM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two variables

Adriaan:

I don't know of any specific references for small-sample bias with multiple imputation. But the theoretical justification of MI is only asymptotic, i.e., based on the large-sample distribution. It's well known that maximum likelihood is potentially vulnerable to small-sample bias in many situations, and I would expect the same for MI. I suppose the simplest and best-known case of small-sample bias is that of the sample variance: the ML estimator (with a denominator of N) is downwardly biased as an estimator of the population variance.

-----------------------------------------------------------------
Paul D. Allison, Professor
Department of Sociology
University of Pennsylvania
581 McNeil Building
3718 Locust Walk
Philadelphia, PA 19104-6299
215-898-6717
215-573-2081 (fax)
http://www.pauldallison.com

-----Original Message-----
From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Hoogendoorn, Adriaan
Sent: Wednesday, November 18, 2009 9:09 AM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two variables

Dear Paul,

Thank you for your contribution. This really helped! I redid my simulations with a larger number of observations, and they confirmed your remark. I was not aware of the problem of "small-sample bias"; I only expected more variance from having just 30 observations. Can you give me a hint (or a reference?) to get some intuition for this type of problem?

Kind regards,
Adriaan Hoogendoorn

-----Original Message-----
From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Paul Allison
Sent: Wednesday, November 18, 2009 2:34 PM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two variables

I think this is a problem of small-sample bias. When I redid the simulation with 10,000 observations, I got pretty close to the true value of .70. But when I dropped down to 100, I got results similar to yours. Remember that with 70 percent missing, there are only 30 cases with complete data in each of your samples.

-----------------------------------------------------------------
Paul D. Allison, Professor
Department of Sociology
University of Pennsylvania
581 McNeil Building
3718 Locust Walk
Philadelphia, PA 19104-6299
215-898-6717
215-573-2081 (fax)
http://www.pauldallison.com

From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Hoogendoorn, Adriaan
Sent: Wednesday, November 18, 2009 5:56 AM
To: [email protected]
Subject: Multiple Imputation in a very simple situation: just two variables

Dear Listserv,

I would like to know in which situations the Multiple Imputation method works well when I have just two variables. I did the following simulation study: I generated (X, Y) as 100 draws from the bivariate normal distribution with standard normal margins and a correlation coefficient of .7. Next I created missing values in four different ways:

1. missing X's, depending on the value of Y (MAR)
2. missing Y's, depending on the value of X (MAR)
3. missing Y's, depending on the value of Y (MNAR)
4. missing X's, depending on the value of X (MNAR)

Here, I was motivated by the (in my view very nice) blood pressure example that Schafer and Graham (2002) use to illustrate the differences between MCAR, MAR and MNAR. As far as I understood, the first two missing data mechanisms are MAR and the latter two are MNAR. As Schafer and Graham did, I used a very rigorous method of creating missing values, by chopping off a part of the bivariate normal distribution. In more detail: I created missing values whenever X (or Y) had a value below 0.5. This resulted in about 70% missing values, as could be expected from the standard normal distribution.

Note that the scatter diagrams of the complete cases in mechanisms 1 and 3 are identical and show the top slice of the bivariate normal distribution. Likewise, the scatter diagrams of the complete cases in mechanisms 2 and 4 are identical and show the right-hand slice of the bivariate normal distribution. The scatter diagrams suggest that regressing Y on X using complete case analysis will fail in cases 1 and 3: the top slice of the bivariate normal tilts the regression line and results in a biased estimate of the regression coefficient. They also suggest that regressing Y on X using complete case analysis may work well in cases 2 and 4, where missingness depends on X.

These suggestions were confirmed by the simulation study. In cases 1 and 3, i.e. when missingness depends on Y, the mean regression coefficient (over 2000 simulations) came out at .29, a serious bias from the true value of .7. Case 1 illustrates Allison's claim that "... if the data are not MCAR, but only MAR, listwise deletion can yield biased estimates" (Allison (2001), page 6). When missingness depends on X, the mean regression coefficient came out at .70 and is unbiased. Again this confirms one of Allison's claims: "... if the probability of missing data on any of the independent variables does not depend on the values of the dependent variable, then regression estimates using listwise deletion will be unbiased ..." (Allison (2001), pages 6-7).

Now comes the interesting part, where I use Multiple Imputation (detail: I used Stata's "ice" procedure and was able to replicate the results using Stata's "mi impute mvn"). I found the following results (each averaged over 2000 simulations):

1. b = .59
2. b = .70
3. b = .29
4. b = .89

My point is: case 1 shows a bias! Although substantially smaller than under complete case analysis (where b = .29), I still obtain a bias of .11. Since case 1 is a case of MAR, I would have expected Multiple Imputation to provide an unbiased estimate. Do you have any clues why this happens? I modified the simulation study by replacing the cut-off value in the missing data mechanism with a stochastic selection mechanism depending on X (or Y), but found similar results.

Kind regards,
Adriaan Hoogendoorn
GGZ inGeest, Amsterdam

References:
Schafer, J.L. & Graham, J.W. (2002), Missing Data: Our View of the State of the Art, Psychological Methods, 7, 147-177.
Allison, P.D. (2001), Missing Data, Sage Publications, Thousand Oaks, CA.
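---
A minimal Python/NumPy sketch of the setup discussed in this thread, for readers who want to reproduce the effect. This is not the Stata code Adriaan actually ran, and the single Bayesian-style regression-imputation draw per replication is only a rough stand-in for what procedures like ice or mi impute do for a continuous variable. It generates 100 bivariate normal (X, Y) pairs with correlation .7, deletes X whenever Y < 0.5 (mechanism 1, MAR), imputes the missing X's by proper regression imputation on Y, and averages var(X), cov(X, Y) and b = cov(X, Y) / var(X) over 2000 replications. The pattern to look for is the one Hans describes at the top of the thread: cov(X, Y) stays near .7, var(X) comes out above 1, and b lands below .7.

import numpy as np

# Sketch only: one proper (Bayesian-style) regression-imputation draw per
# replication, meant to mimic -- not reproduce -- what ice does for a
# continuous variable. Sample sizes follow the thread (n = 100, rho = .7,
# 2000 replications, X missing whenever Y < 0.5); the seed is arbitrary.
rng = np.random.default_rng(2009)
n, rho, n_sim = 100, 0.7, 2000
var_x, cov_xy, b_hat = [], [], []

for _ in range(n_sim):
    # bivariate normal with standard normal margins and correlation rho
    y = rng.standard_normal(n)
    x = rho * y + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    miss = y < 0.5                      # mechanism 1: X is MAR given Y (about 70% missing)

    # fit the imputation model X ~ Y on the observed cases
    yo, xo = y[~miss], x[~miss]
    Z = np.column_stack([np.ones(yo.size), yo])
    beta_ols = np.linalg.lstsq(Z, xo, rcond=None)[0]
    resid = xo - Z @ beta_ols
    df = yo.size - 2

    # one proper imputation draw: sigma^2 and beta drawn from their approximate posterior
    sigma2_imp = resid @ resid / rng.chisquare(df)
    beta_imp = rng.multivariate_normal(beta_ols, sigma2_imp * np.linalg.inv(Z.T @ Z))
    x_imp = x.copy()
    x_imp[miss] = (beta_imp[0] + beta_imp[1] * y[miss]
                   + np.sqrt(sigma2_imp) * rng.standard_normal(miss.sum()))

    var_x.append(x_imp.var(ddof=1))
    cov_xy.append(np.cov(x_imp, y, ddof=1)[0, 1])
    b_hat.append(cov_xy[-1] / var_x[-1])    # slope from regressing Y on X

print("mean var(X_imp):", np.mean(var_x))   # above 1 (true value 1): the inflated variance
print("mean cov(X, Y): ", np.mean(cov_xy))  # close to .7: the covariance is roughly unbiased
print("mean b:         ", np.mean(b_hat))   # below .7: the downward bias discussed above

Drawing several imputations per replication and pooling them, as multiple imputation does, does not change this picture: as Hans notes, the argument applies to each imputed data set separately, so the bias shrinks only as the number of observed cases grows.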
