I beg to differ.  With multiple imputation, Y should in general be included in
the imputation model.  The R-squared will not be inflated, because the imputed
values use connections between X and Y that are only as strong as those
observed in the non-missing data.  If Y is not used, regression coefficients
can be biased towards zero, and the estimated standard errors of the final
regression coefficients will be biased as well.
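
A small simulation can make the bias towards zero concrete.  The sketch below
is only an illustration under stated assumptions: the data-generating model,
the coefficient values, the 50% missingness rate, and all function names are
made up for the example.  Missingness is completely at random (a special case
of MAR), and a single stochastic regression imputation on a large sample
stands in for proper multiple imputation.

```python
import numpy as np

# Hypothetical setup: Y = 1.0*X + 1.0*Z + noise, with Z correlated with X.
TRUE_BETA_Z = 1.0

def simulate(n=20000, miss_rate=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = 0.5 * x + rng.normal(size=n)
    y = 1.0 * x + TRUE_BETA_Z * z + rng.normal(size=n)
    # Z is missing completely at random (assumed; a special case of MAR).
    missing = rng.random(n) < miss_rate
    return x, z, y, missing, rng

def ols(design, target):
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def impute_z(predictors, z, missing, rng):
    """Stochastic regression imputation: prediction plus residual noise."""
    observed = ~missing
    design = np.column_stack([np.ones(len(z)), *predictors])
    coef = ols(design[observed], z[observed])
    resid = z[observed] - design[observed] @ coef
    resid_sd = np.std(resid, ddof=design.shape[1])
    z_completed = z.copy()
    z_completed[missing] = design[missing] @ coef + rng.normal(
        scale=resid_sd, size=int(missing.sum())
    )
    return z_completed

def fitted_beta_z(x, z_completed, y):
    """Coefficient of Z in the final regression of Y on X and Z."""
    design = np.column_stack([np.ones(len(y)), x, z_completed])
    return ols(design, y)[2]

x, z, y, missing, rng = simulate()

# Imputation model without Y: uses X only.
beta_without_y = fitted_beta_z(x, impute_z([x], z, missing, rng), y)

# Imputation model with Y: uses X and Y.
beta_with_y = fitted_beta_z(x, impute_z([x, y], z, missing, rng), y)

print(f"true coefficient of Z:       {TRUE_BETA_Z:.2f}")
print(f"imputing Z from X only:      {beta_without_y:.2f}")
print(f"imputing Z from X and Y:     {beta_with_y:.2f}")
```

Under these assumptions the imputed values of Z drawn without Y carry no
information about Y beyond what X supplies, so with half of Z missing the
fitted coefficient should be pulled roughly halfway towards zero, while
including Y in the imputation model should recover a value close to the true
coefficient.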

For binary logistic regression, Werner Vach and Michael Schemper have found that 
single imputation for categorical predictors using estimated cell probabilities 
(but not using Y) is about as good as multiple imputation.  I doubt that this 
holds for continuous Y.

Frank Harrell

On Thu, 23 May 2002 13:42:31 -0700
"Craig D. Newgard" <[email protected]> wrote:

> Constantine,
>       Generally, I agree with your colleague.  The danger in using an outcome
> variable in the imputation process is in creating associations between
> predictors and the outcome that would not be present otherwise.  If this
> occurs (essentially creating confounding variables), your result could be
> biased.  It is difficult to tell whether or not this occurs when the outcome
> is included in the imputation process, as differences in results between the
> non-imputed data and the imputed data could reflect a correction of the bias
> in using data restricted to non-missing values (the non-imputed dataset) or
> bias created from the imputation process, or both.  The simplest way to deal
> with this is to leave out the outcome variable from the imputation process,
> as a benefit in precision may be offset by a loss of validity.
> 
> Craig
> 
> Craig D. Newgard, MD, MPH
> Research Fellow
> Department of Emergency Medicine
> Harbor-UCLA Medical Center
> 1000 West Carson Street, Box 21
> Torrance, CA 90509
> (310)222-3666 (Office)
> (310)782-1763 (Fax)
> [email protected]
> 
> 
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]on
> Behalf Of Constantine Daskalakis
> Sent: Wednesday, May 22, 2002 1:18 PM
> To: [email protected]
> Cc: Constantine Daskalakis
> Subject: IMPUTE: imputing covariates
> 
> 
> Hi.
> 
> I have a regression of Y on a bunch of Xs (always observed) and on Z
> (sometimes missing).
> 
> The X's will be used to impute Z. But should Y also be used in imputing Z?
> 
> My reading of the literature suggests that's not a problem and can often be
> a good thing in terms of gaining precision. A colleague argues that using
> the outcome to impute the predictor will bias the estimated effect of that
> predictor in the main regression model. She argues that, by using Y,
> "you're stacking the deck, so to speak", ie, the imputation determines what
> you'll find out in the main regression model.
> 
> Is there a heuristic response to that concern?
> (Or, if I'm wrong, please someone correct me!)
> 
> Thanks,
> cd
> 
> PS  Always assuming MAR of Z (ie, missingness of Z does not depend on the
> unobserved Z itself).
> 
> 
> 
> ________________________________________________________________
> 
> Constantine Daskalakis, ScD
> Assistant Professor,
> Biostatistics Section, Thomas Jefferson University,
> 125 S. 9th St. #402, Philadelphia, PA 19107
>     Tel: 215-955-5695
>     Fax: 215-503-3804
>     Email: [email protected]
>     Webpage: http://www.kcc.tju.edu/Science/SharedFacilities/Biostatistics
> 
> 
> 
> 


-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat
