Dear Paul (and Alan and Rogier in previous posts),

Thank you for pointing out the issue of small-sample bias to me.
The example you gave (the ML estimator with a denominator of N versus the unbiased
estimator with a denominator of N-1) made me vaguely remember the issue of
bias in ML estimators, but I had somehow associated the topic with hair-splitting
mathematicians and never realized that this theoretical phenomenon could have
such large practical consequences (even though they are practical consequences in
my theoretical simulations).

Thanks again!
Kind regards, Adriaan Hoogendoorn

________________________________________
From: Impute -- Imputations in Data Analysis 
[[email protected]] On Behalf Of Paul Allison 
[[email protected]]
Sent: Wednesday, November 18, 2009 3:41 PM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two variables

Adriaan:

I don't know of any specific references for small-sample bias with multiple
imputation. But the theoretical justification of MI is only asymptotic,
i.e., based on the large-sample distribution.

It's well known that maximum likelihood is potentially vulnerable to small
sample bias in many situations, and I would expect the same for MI. I
suppose that the simplest and best-known case of small-sample bias is that
of the sample variance: the ML estimator (with the denominator of N) is
downwardly biased as an estimator of the population variance.
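
A quick way to see it (a minimal simulation sketch, written here in Python/numpy
purely for illustration): with N = 30, the ML estimator averages roughly
(N-1)/N of the true variance.

import numpy as np

rng = np.random.default_rng(0)
N, reps = 30, 20000

s2_ml = np.empty(reps)
s2_unbiased = np.empty(reps)
for i in range(reps):
    x = rng.normal(size=N)              # true variance is 1
    s2_ml[i] = np.var(x)                # denominator N   (ML estimator)
    s2_unbiased[i] = np.var(x, ddof=1)  # denominator N-1 (unbiased)

print(s2_ml.mean())        # roughly (N-1)/N = 0.967: downwardly biased
print(s2_unbiased.mean())  # roughly 1.000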

-----------------------------------------------------------------
Paul D. Allison, Professor
Department of Sociology
University of Pennsylvania
581 McNeil Building
3718 Locust Walk
Philadelphia, PA  19104-6299
215-898-6717
215-573-2081 (fax)
http://www.pauldallison.com


-----Original Message-----
From: Impute -- Imputations in Data Analysis
[mailto:[email protected]] On Behalf Of Hoogendoorn,
Adriaan
Sent: Wednesday, November 18, 2009 9:09 AM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two
variables

Dear Paul,

Thank you for your contribution. This really helped! I redid my simulations
with a larger number of observations and they confirmed your remark.

I was not aware of the problem of "small-sample bias". I only expected more
variance from having just 30 observations.
Can you give me a hint (or reference?) to get some intuition for this type
of problem?

Kind regards, Adriaan Hoogendoorn

-----Original Message-----
From: Impute -- Imputations in Data Analysis
[mailto:[email protected]] On Behalf Of Paul Allison
Sent: Wednesday, November 18, 2009 2:34 PM
To: [email protected]
Subject: Re: Multiple Imputation in a very simple situation: just two
variables


I think this is a problem of small-sample bias.  When I redid the simulation
with 10000 observations, I got pretty close to the true value of .70.  But
when I dropped down to 100, I got results similar to yours. Remember that
with 70 percent missing, there are only 30 cases with complete data in each
of your samples.

-----------------------------------------------------------------
Paul D. Allison, Professor
Department of Sociology
University of Pennsylvania
581 McNeil Building
3718 Locust Walk
Philadelphia, PA  19104-6299
215-898-6717
215-573-2081 (fax)
http://www.pauldallison.com



From: Impute -- Imputations in Data Analysis
[mailto:[email protected]] On Behalf Of Hoogendoorn,
Adriaan
Sent: Wednesday, November 18, 2009 5:56 AM
To: [email protected]
Subject: Multiple Imputation in a very simple situation: just two variables

Dear Listserv,

I would like to know in which situations the Multiple Imputation method
works well when I have just two variables.

I did the following simulation study: I generated (X, Y) as 100 draws from
a bivariate normal distribution with standard normal margins and a
correlation coefficient of .7.
Next I created missing values in four different ways:

1. missing X's, depending on the value of Y (MAR)
2. missing Y's, depending on the value of X (MAR)
3. missing Y's, depending on the value of Y (MNAR)
4. missing X's, depending on the value of X (MNAR)

Here I was motivated by the (in my view very nice) blood pressure example
that Schafer and Graham (2002) use to illustrate the differences between MCAR,
MAR and MNAR.
As far as I understood, the first two missing-data mechanisms are MAR and
the latter two are MNAR.
As Schafer and Graham did, I used a very drastic method of creating missing
values: chopping off a part of the bivariate normal distribution.
In more detail: I created a missing value whenever X (or Y) had a value below 0.5.
This resulted in about 70% missing values, which is what one would expect from
the standard normal distribution.
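
(For concreteness, a rough sketch of this data-generating step and the four
mechanisms, written here in Python/numpy purely as an illustration, with
variable names of my own choosing:)

import numpy as np

rng = np.random.default_rng(1)
n, rho = 100, 0.7

# (X, Y): 100 draws from a bivariate normal with standard normal margins, corr .7
x, y = rng.multivariate_normal([0.0, 0.0], [[1, rho], [rho, 1]], size=n).T

cut = 0.5                            # chop off everything below 0.5 (about 69% missing)
x1 = np.where(y < cut, np.nan, x)    # case 1: X missing, depending on Y  (MAR)
y2 = np.where(x < cut, np.nan, y)    # case 2: Y missing, depending on X  (MAR)
y3 = np.where(y < cut, np.nan, y)    # case 3: Y missing, depending on Y  (MNAR)
x4 = np.where(x < cut, np.nan, x)    # case 4: X missing, depending on X  (MNAR)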

Note that, restricted to the COMPLETE CASES, the scatter diagrams of cases 1 and 3
are identical and show the top slice of the bivariate normal distribution.
Likewise, the scatter diagrams of cases 2 and 4 are identical and show the
right-hand slice of the bivariate normal distribution.
The scatter diagrams suggest that regressing y on x using complete-case
analysis will fail in cases 1 and 3: the top slice of the bivariate normal
tilts the regression line and results in a biased estimate of the regression
coefficient. They also suggest that regressing y on x using complete-case
analysis may work well in cases 2 and 4, where missingness depends on X.
These suggestions were confirmed by the simulation study.
In cases 1 and 3, i.e. when missingness depends on Y, the mean regression
coefficient (over 2000 simulations) came out at .29, a serious bias relative
to the true value of .7.
Case 1 illustrates Allison's claim that "... if the data are not MCAR, but
only MAR, listwise deletion can yield biased estimates" (see Allison (2001),
page 6).
When missingness depends on X, the mean regression coefficient came out at
.70 and is unbiased. Again this confirms one of Allison's claims: "... if
the probability of missing data on any of the independent variables does
not depend on the values of the dependent variable, then regression
estimates using listwise deletion will be unbiased ..." (see Allison (2001),
pages 6-7).
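
(Continuing the illustrative sketch above, the complete-case step looks roughly
as follows. A single sample of 100 is of course noisy; the .29 versus .70
pattern only shows up cleanly when averaging over many replications.)

def cc_slope(x, y):
    # OLS slope of y on x, using the complete cases only (listwise deletion)
    keep = ~np.isnan(x) & ~np.isnan(y)
    xs, ys = x[keep], y[keep]
    return np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)

print(cc_slope(x1, y))   # case 1: missingness depends on Y -> biased towards 0
print(cc_slope(x, y2))   # case 2: missingness depends on X -> close to .70 on average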


Now comes the interesting part, where I use Multiple Imputation (detail: I
used Stata's "ice" procedure and was able to replicate the results using
Stata's "mi impute mvn").
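
(For concreteness only: below is a bare-bones, hand-rolled sketch of proper
multiple imputation under a normal model for case 1, with the point estimates
simply averaged over the M imputations. It is not the actual "ice" or
"mi impute mvn" code; variable names follow the earlier illustrative sketch.)

def impute_x_given_y(x_obs, y_obs, y_mis, rng):
    # Draw the imputation-model parameters (regression of X on Y) from their
    # posterior under a flat prior, then draw the missing X's ("proper" imputation).
    n, k = len(y_obs), 2
    Z = np.column_stack([np.ones(n), y_obs])
    beta_hat = np.linalg.lstsq(Z, x_obs, rcond=None)[0]
    resid = x_obs - Z @ beta_hat
    sigma2 = (resid @ resid) / rng.chisquare(n - k)                 # draw sigma^2
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(Z.T @ Z))
    return beta[0] + beta[1] * y_mis + rng.normal(scale=np.sqrt(sigma2), size=len(y_mis))

M = 20
miss = np.isnan(x1)
slopes = []
for _ in range(M):
    x_filled = x1.copy()
    x_filled[miss] = impute_x_given_y(x1[~miss], y[~miss], y[miss], rng)
    slopes.append(np.cov(x_filled, y)[0, 1] / np.var(x_filled, ddof=1))
print(np.mean(slopes))   # MI point estimate for case 1: the average of the M slopes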
I found the following results:

1. b = .59 (averaged over 2000 simulations)
2. b = .70
3. b = .29
4. b = .89

My point is: case 1 shows a bias!
Although substantially smaller than under complete-case analysis (where b = .29), I
still obtain a bias of .11.
Since case 1 is a case of MAR, I would have expected Multiple Imputation to
provide an unbiased estimate.

Do you have any clues why this happens?

I modified the simulation study by replacing the cut-off rule in the
missing-data mechanism with a stochastic selection mechanism depending on X
(or Y), but found similar results.
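
(One possible stochastic version, in the same illustrative style; the logistic
form and its slope below are arbitrary choices made only for this sketch.)

# Replace the hard cut-off by a selection probability that decreases in Y:
p_miss = 1.0 / (1.0 + np.exp(2.0 * (y - 0.5)))                      # P(X missing | Y)
x1_stochastic = np.where(rng.uniform(size=n) < p_miss, np.nan, x)   # still MAR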

Kind regards,
Adriaan Hoogendoorn
GGZ inGeest
Amsterdam


References:
Schafer, J.L. & Graham, J.W. (2002), Missing Data: Our View of the State of
the Art, Psychological Methods, 7(2), 147-177.
Allison, P.D. (2001), Missing Data, Sage Publications, Thousand Oaks, CA.
