I think this is a problem of small-sample bias. When I redid the simulation
with 10000 observations, I got pretty close to the true value of .70. But
when I dropped down to 100, I got results similar to yours. Remember that
with 70 percent missing, there are only about 30 cases with complete data in
each of your samples.
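
For what it's worth, a rough sketch of that kind of check in Stata (the seed
and the 20 imputations per data set are arbitrary choices, not necessarily
what I used):

  * same design as in the study below, but with 10000 observations instead of 100
  clear
  set seed 4321
  set obs 10000
  matrix C = (1, .7 \ .7, 1)
  drawnorm x y, corr(C)
  replace x = . if y < 0.5          // case 1: X missing when Y is low (MAR)
  mi set mlong
  mi register imputed x
  mi impute mvn x = y, add(20)
  mi estimate: regress y x          // slope should come out close to .7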

 

-----------------------------------------------------------------
Paul D. Allison, Professor
Department of Sociology
University of Pennsylvania
581 McNeil Building
3718 Locust Walk
Philadelphia, PA  19104-6299
215-898-6717
215-573-2081 (fax)
http://www.pauldallison.com

  _____  

From: Impute -- Imputations in Data Analysis
[mailto:[email protected]] On Behalf Of Hoogendoorn,
Adriaan
Sent: Wednesday, November 18, 2009 5:56 AM
To: [email protected]
Subject: Multiple Imputation in a very simple situation: just two variables

 

Dear Listserv,

 

I would like to know in which situations the Multiple Imputation method
works well when I have just two variables.

 

I did the following simulation study: I generated (X, Y) as 100 draws from
the bivariate normal distribution with standard normal margins and a
correlation coefficient of .7.
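
As a sketch, that data-generating step in Stata (the seed and the variable
names are illustrative):

  * 100 draws from a bivariate normal with standard normal margins and correlation .7
  clear
  set seed 1234
  set obs 100
  matrix C = (1, .7 \ .7, 1)
  drawnorm x y, corr(C)    // drawnorm defaults to means 0 and standard deviations 1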

Next, I created missing values in four different ways:

 

1. missing X's, depending on the value of Y (MAR)

2. missing Y's, depending on the value of X (MAR)

3. missing Y's, depending on the value of Y (MNAR)

4. missing X's, depending on the value of X (MNAR)

 

Here, I was motivated by the (in my view very nice) blood pressure example
that Schafer and Graham (2002) use to illustrate differences between MCAR,
MAR and MNAR.

As far as I understood, the first two missing data mechanisms are MAR and
the latter two are MNAR. 

As Schafer and Graham did, I used a very rigorous method of creating missing
values by chopping off part of the bivariate normal distribution.

In more detail: I created missing values whenever X (or Y) had a value below
0.5. This resulted in about 70% missing values, which is what one would
expect from the standard normal distribution (P(Z < 0.5) is about .69).
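
As a sketch, the four deletion rules can be written as new variables
alongside the complete x and y (names are illustrative):

  gen x1 = cond(y < 0.5, ., x)    // 1. X missing when Y is low  (MAR)
  gen y2 = cond(x < 0.5, ., y)    // 2. Y missing when X is low  (MAR)
  gen y3 = cond(y < 0.5, ., y)    // 3. Y missing when Y is low  (MNAR)
  gen x4 = cond(x < 0.5, ., x)    // 4. X missing when X is low  (MNAR)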

 

Note that for the COMPLETE CASES, the scatter diagrams of cases 1 and 3 are
identical and show the top slice of the bivariate normal distribution.

Also, the scatter diagrams of cases 2 and 4 are identical and show the
right-hand slice of the bivariate normal distribution.

The scatter diagrams suggest that regressing y on x using complete case
analysis will fail in cases 1 and 3: the top slice of the bivariate normal
tilts the regression line and results in a biased regression coefficient
estimate. The scatter diagrams also suggest that regressing y on x using
complete case analysis may work well in cases 2 and 4, where missingness
depends on X. 

These suggestions were confirmed by the simulation study: 

In cases 1 and 3, i.e. when missingness depends on Y, the mean regression
coefficient (over 2000 simulations) came out to be .29, showing a serious
bias from the true value of .7.

Case 1 illustrates Allison's claim that "... if the data are not MCAR, but
only MAR, listwise deletion can yield biased estimates" (see Allison (2001),
page 6).

When missingness depends on X, the mean regression coefficient came out to
be .70 and is unbiased. Again this confirms one of Allison's claims: "... if
the probability of missing data on any of the independent variables does
not depend on the values of the dependent variable, then regression
estimates using listwise deletion will be unbiased ..." (see Allison (2001),
pages 6-7).

 

 

Now comes the interesting part, where I use Multiple Imputation (detail: I
used Stata's "ice" procedure and was able to replicate the results using
Stata's "mi impute mvn").

I found the following results:

 

1. b = .59 (averaged over 2000 simulations)

2. b = .70

3. b = .29

4. b = .89

 

My point is: Case 1 shows a bias!

Although substantially smaller than under complete case analysis (where b =
.29), I still obtain a bias of .11.

I would have expected, since case 1 is a case of MAR, that Multiple
Imputation would provide an unbiased estimate.

 

Do you have any clues why this happens? 

 

I modified the simulation study by replacing the cut-off value in the
missing-data mechanism with a stochastic selection mechanism depending on X
(or Y), but found similar results.
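
As a sketch, the stochastic variant for case 1 looks something like this
(the particular probability function is only an illustration, not
necessarily the one I used):

  gen pmiss = normal(0.5 - y)              // P(X missing) decreases with Y
  replace x = . if runiform() < pmiss      // realize missingness at random with that probability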

 

Kind regards,

Adriaan Hoogendoorn

GGZ inGeest

Amsterdam

 

 

References:

Schafer, J.L. & Graham, J.W. (2002), Missing Data: Our View of the State of
the Art, Psychological Methods, 7, 147-177.

Allison, P.D. (2001), Missing Data, Sage Publications, Thousand Oaks, CA.
