Dear Adriaan,

I redid your small simulation study, and I think it's rather easy to see (and 
explain) where the small-sample bias enters your problem.

Summing up your study: you have two variables X and Y; some values of X are MAR 
(depending on the values of Y); you impute the missing Xs by (Bayesian) 
regression imputation; then you estimate the regression of Y on X with the 
imputed data, and you get a downward bias in the regression coefficient of X 
(say, b).

Of course, b in this situation is calculated as the covariance of X and Y 
divided by the variance of X. If you include these two statistics in your 
simulation study, you will notice that you don't get biased estimates of the 
covariance, but you do get upward biased estimates of the variance of X. So the 
problem with the downward biased regression coefficient really stems from the 
biased variance estimate of X.
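
To see it in numbers, here is a minimal sketch (Python, as a stand-in for your 
Stata setup; the selection rule, the sample size, and the standard Bayesian 
regression draws are illustrative assumptions, not your exact code):

    import numpy as np

    rng = np.random.default_rng(0)

    def completed_moments(n=100, rho=0.7):
        """One replication: generate (X, Y), delete X where Y < 0.5 (MAR),
        re-impute X by Bayesian regression imputation, and return the
        completed-data Cov(X, Y) and Var(X)."""
        y = rng.standard_normal(n)
        x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n)
        obs = y >= 0.5                  # X is observed only where Y >= 0.5

        # regression of X on Y in the complete cases
        Z = np.column_stack([np.ones(obs.sum()), y[obs]])
        beta_hat, *_ = np.linalg.lstsq(Z, x[obs], rcond=None)
        resid = x[obs] - Z @ beta_hat
        df = obs.sum() - 2

        # Bayesian draws of sigma^2 and (intercept, slope)
        sigma2 = resid @ resid / rng.chisquare(df)
        beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(Z.T @ Z))

        # impute the missing Xs, adding residual noise
        m = ~obs
        x_imp = x.copy()
        x_imp[m] = beta[0] + beta[1] * y[m] + np.sqrt(sigma2) * rng.standard_normal(m.sum())
        return np.cov(x_imp, y)[0, 1], x_imp.var(ddof=1)

    covs, vxs = zip(*(completed_moments() for _ in range(5000)))
    print(np.mean(covs))   # close to the true covariance of .7
    print(np.mean(vxs))    # clearly above the true variance of 1

The covariance comes out right, while the variance of X is inflated.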

This means: if you take bivariate data X and Y, delete some values of X MAR, 
and then re-impute X by regression imputation, the variance of X in the 
completed data will be too large. Why?

To see what's happening here, look at the following, slightly different data 
model:

Y normally distributed, 

X = b * Y  + sigma * u  (u standard normally distributed, independent of Y)

Note that the variance of X is b^2 * Var(Y) + sigma^2.

Now suppose some values of X are MAR, and you use regression imputation to 
impute them, i.e. the imputed values will be created by

X_imp =  b_imp * Y + sigma_imp * u   with suitably chosen b_imp and sigma_imp 
(which will differ across imputations in the context of multiple imputation, 
but this doesn't matter here).

The variance of X_imp in an imputed data set will be   b_imp^2 * Var(Y) + 
sigma_imp^2.

Usual regression imputation algorithms (like the Bayesian version implemented 
in ice) are constructed in such a way that b_imp and sigma_imp^2 are unbiased 
estimates of b and sigma^2 (from a frequentist point of view). But then, of 
course, the *square* of b_imp is *not* an unbiased estimate of the square of b. 
And the variance of X_imp depends on the square of b_imp.

Since E(b_imp^2) equals the square of E(b_imp) (which is b^2, given that b_imp 
is unbiased) *plus* the variance of b_imp, the square of b_imp is upward 
biased, and so is the variance of X in the imputed data set. (Since the 
variance of b_imp goes to zero for large samples, the bias of the variance of 
X_imp quickly vanishes as n gets larger.)
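
In symbols: E(b_imp^2) = [E(b_imp)]^2 + Var(b_imp) = b^2 + Var(b_imp) > b^2. 
A tiny numerical check (Python; the value of b and the sampling SD of b_imp 
are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    b, se = 0.7, 0.2                      # true b and sampling SD of b_imp (illustrative)
    b_imp = rng.normal(b, se, 1_000_000)  # unbiased draws of b_imp
    print(b_imp.mean())                   # ~0.700: b_imp is unbiased for b
    print((b_imp ** 2).mean())            # ~0.530 = b^2 + se^2, not b^2 = 0.49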

Thus, this is not a special problem of multiple imputation or ML estimation; 
it's the simple fact that the square of an unbiased estimator for some 
parameter is an upward biased estimator for the square of the parameter.


Does this sound sensible to you?

Best regards,

-Hans-



---

Hans Kiesl
Institute for Employment Research
Regensburger Str. 104
90478 Nuremberg
Germany
Tel.: +49 911 179 1358
Fax:  +49 911 179 3297




-----Original Message-----
From: Impute -- Imputations in Data Analysis 
[mailto:[email protected]] On Behalf Of Hoogendoorn, Adriaan
Sent: Wednesday, November 18, 2009 16:27
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two variables


Dear Paul (and Alan and Rogier in previous posts),

Thank you for pointing out the issue of small-sample bias to me. 
The example you gave me (the ML estimator with a denominator of N versus the 
unbiased estimator with a denominator of N-1) made me vaguely remember the 
issue of bias in ML estimators, but I somehow associated this topic with 
hair-splitting mathematicians and never realized that this theoretical 
phenomenon could have such large practical consequences (although they are 
practical consequences in my theoretical simulations).

Thanks again!
Kind regards, Adriaan Hoogendoorn

________________________________________
From: Impute -- Imputations in Data Analysis 
[[email protected]] On Behalf Of Paul Allison 
[[email protected]]
Sent: Wednesday, November 18, 2009 3:41 PM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two variables

Adriaan:

I don't know of any specific references for small-sample bias with multiple 
imputation. But the theoretical justification of MI is only asymptotic, i.e., 
based on the large-sample distribution.

It's well known that maximum likelihood is potentially vulnerable to 
small-sample bias in many situations, and I would expect the same for MI. I 
suppose that the simplest and best-known case of small-sample bias is that of 
the sample variance: the ML estimator (with a denominator of N) is downward 
biased as an estimator of the population variance.
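
A quick numerical check of that fact (a Python sketch; the sample size of 30 
echoes your complete cases but is otherwise arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 30, 200_000
    x = rng.standard_normal((reps, n))    # data with true variance 1
    print(x.var(axis=1, ddof=0).mean())   # ~0.967 = (n-1)/n: ML estimator, biased downward
    print(x.var(axis=1, ddof=1).mean())   # ~1.000: the N-1 estimator is unbiased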

-----------------------------------------------------------------
Paul D. Allison, Professor
Department of Sociology
University of Pennsylvania
581 McNeil Building
3718 Locust Walk
Philadelphia, PA  19104-6299
215-898-6717
215-573-2081 (fax)
http://www.pauldallison.com


-----Original Message-----
From: Impute -- Imputations in Data Analysis 
[mailto:[email protected]] On Behalf Of Hoogendoorn, Adriaan
Sent: Wednesday, November 18, 2009 9:09 AM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two variables

Dear Paul,

Thank you for your contribution. This really helped! I redid my simulations 
with a larger number of observations, and they confirmed your remark.

I was not aware of the problem of "small-sample bias". I only expected more 
variance from having just 30 observations. Can you give me a hint (or a 
reference) to get some intuition for this type of problem?

Kind regards, Adriaan Hoogendoorn

-----Original Message-----
From: Impute -- Imputations in Data Analysis 
[mailto:[email protected]] On Behalf Of Paul Allison
Sent: Wednesday, November 18, 2009 2:34 PM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two 
variables


I think this is a problem of small-sample bias. When I redid the simulation 
with 10000 observations, I got pretty close to the true value of .70. But when 
I dropped down to 100, I got results similar to yours. Remember that with 70 
percent missing, there are only 30 cases with complete data in each of your 
samples.

-----------------------------------------------------------------
Paul D. Allison, Professor
Department of Sociology
University of Pennsylvania
581 McNeil Building
3718 Locust Walk
Philadelphia, PA  19104-6299
215-898-6717
215-573-2081 (fax)
http://www.pauldallison.com



From: Impute -- Imputations in Data Analysis 
[mailto:[email protected]] On Behalf Of Hoogendoorn, Adriaan
Sent: Wednesday, November 18, 2009 5:56 AM
To: [email protected]
Subject: Multiple Imputation in a very simple situation: just two variables

Dear Listserv,

I would like to know in which situations the Multiple Imputation method works 
well when I have just two variables.

I did the following simulation study: I generated (X,Y) as 100 draws from the 
bivariate normal distribution with standard normal margins and a correlation 
coefficient of .7. Next I created missing values in four different ways:

1. missing X's, depending on the value of Y (MAR)
2. missing Y's, depending on the value of X (MAR)
3. missing Y's, depending on the value of Y (MNAR)
4. missing X's, depending on the value of X (MNAR)

Here, I was motivated by the (in my view very nice) blood pressure example 
that Schafer and Graham (2002) use to illustrate the differences between MCAR, 
MAR and MNAR. As far as I understand, the first two missing data mechanisms 
are MAR and the latter two are MNAR. As Schafer and Graham did, I used a very 
rigorous method of creating missing values by chopping off a part of the 
bivariate normal distribution. In more detail: I created missing values if X 
(or Y) had a value below 0.5. This resulted in about 70% missing values, which 
is what one would expect from the standard normal distribution. (The sketch 
below shows this setup in code.)
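
For concreteness (an illustrative Python sketch of what I did in Stata; the 
names and seed are my own, not the original code):

    import numpy as np

    rng = np.random.default_rng(3)
    n, rho = 100, 0.7

    # (X, Y) bivariate normal with standard normal margins and correlation .7
    y = rng.standard_normal(n)
    x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n)

    # the four missingness mechanisms (True = value is deleted)
    miss_x_mar  = y < 0.5    # 1. X missing depending on Y (MAR)
    miss_y_mar  = x < 0.5    # 2. Y missing depending on X (MAR)
    miss_y_mnar = y < 0.5    # 3. Y missing depending on Y (MNAR)
    miss_x_mnar = x < 0.5    # 4. X missing depending on X (MNAR)

    print(miss_x_mar.mean())  # about .69 = P(N(0,1) < 0.5)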

Note that for the COMPLETE CASES, the scatter diagrams of 1 and 3 are identical 
and show the top slice of the bivariate normal distribution. Likewise, the 
scatter diagrams of 2 and 4 are identical and show the right-end slice of the 
bivariate normal distribution. The scatter diagrams suggest that regressing Y 
on X using complete case analysis will fail in cases 1 and 3: the top slice of 
the bivariate normal tilts the regression line and results in a biased 
regression coefficient estimate. The scatter diagrams also suggest that 
regressing Y on X using complete case analysis may work well in cases 2 and 4, 
where missingness depends on X. These suggestions were confirmed by the 
simulation study: the mean regression coefficient (over 2000 simulations) came 
out to be .29 in cases 1 and 3, i.e. when missingness depends on Y, showing a 
serious bias from the true value of .7. Case 1 illustrates Allison's claim 
that "... if the data are not MCAR, but only MAR, listwise deletion can yield 
biased estimates" (see Allison (2001), page 6). When missingness depends on X, 
the mean regression coefficient came out to be .70 and is unbiased. Again this 
confirms one of Allison's claims: "... if the probability of missing data on 
any of the independent variables does not depend on the values of the 
dependent variable, then regression estimates using listwise deletion will be 
unbiased ..." (see Allison (2001), pages 6-7).


Now comes the interesting part, where I use Multiple Imputation. (Detail: I 
used Stata's "ice" procedure and was able to replicate the results using 
Stata's "mi impute mvn".) I found the following results:

1. b = .59 (averaged over 2000 simulations)
2. b = .70
3. b = .29
4. b = .89

My point is: case 1 shows a bias!
Although substantially smaller than under complete case analysis (where 
b = .29), I still obtain a bias of .11. Since case 1 is a case of MAR, I would 
have expected Multiple Imputation to provide an unbiased estimate.

Do you have any clues why this happens?

I modified the simulation study by replacing the cut-off value in the missing 
data mechanism with a stochastic selection mechanism depending on X (or Y), 
but found similar results (for instance, along the lines of the sketch below).
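
A stochastic version of mechanism 1 could look like this (illustrative Python; 
the logistic selection curve is made up):

    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.standard_normal(100)

    # P(X missing) rises smoothly as Y falls, replacing the hard cut-off at 0.5
    p_miss = 1 / (1 + np.exp(3 * (y - 0.5)))
    miss_x = rng.uniform(size=y.size) < p_miss   # True = X missing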

Kind regards,
Adriaan Hoogendoorn
GGZ inGeest
Amsterdam


References:
Schafer, J.L. & Graham, J.W. (2002). Missing data: Our view of the state of 
the art. Psychological Methods, 7, 147-177.
Allison, P.D. (2001). Missing Data. Sage Publications, Thousand Oaks, CA.
