David Judkins wrote:
> Frank,
>
> It depends on the fineness of grain in the predictions generated by the
> model, but in the extreme case where there is a single nearest match for each
> missing case, then drawing that nearest case five times will result in five
> identical imputations, leading, of course, to zero between-imputation
> variance for any marginal statistic of the variable.  I am not certain who
> was first to make this point, but you can find it among other places on page
> 500 of J.N.K. Rao's article in the June 1996 JASA trio of competing paradigms
> by Rubin, Fay, and Rao.  Rao references Särndal 1992.
>
> A workable approach and references to other workable approaches for
> predictive mean matching (aka nearest neighbor imputation) are given in Kim
> 2002:
>
> Kim, Jae Kwang (2002)
> Variance estimation for nearest neighbor imputation with application to
> census long form data
> ASA Proceedings of the Joint Statistical Meetings, 1857-1862
> American Statistical Association (Alexandria, VA)
> Keywords: Fractional imputation; Jackknife; Section on Survey Research
> Methods; JSM
>
>
> --David Judkins
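
[Editorial illustration, not part of the original exchange.] The extreme case described in the quoted text can be sketched in a few lines of Python. This is a toy under stated assumptions: the donor values, the predicted values, and the helper names `nearest_donor` and `between_imputation_variance` are all hypothetical, and the predictions are held fixed across imputations (no posterior draw of model parameters), which is exactly the situation in which every repetition picks the same donor.

```python
import statistics

def nearest_donor(pred_missing, donor_preds):
    """Return the index of the donor whose predicted value is closest."""
    return min(range(len(donor_preds)),
               key=lambda i: abs(donor_preds[i] - pred_missing))

def between_imputation_variance(estimates):
    """Sample variance of the M completed-data estimates (Rubin's B)."""
    return statistics.variance(estimates)

# Hypothetical toy data: predicted and observed values for complete cases,
# and predicted values for the cases whose outcome is missing.
donor_preds = [10.0, 20.0, 30.0, 40.0, 50.0]
donor_obs = [12.0, 18.0, 33.0, 41.0, 47.0]
missing_preds = [14.0, 36.0]

M = 5  # number of imputations
means = []
for _ in range(M):
    # Deterministic nearest-match imputation: the same donor every time.
    imputed = [donor_obs[nearest_donor(p, donor_preds)] for p in missing_preds]
    means.append(statistics.mean(donor_obs + imputed))

B = between_imputation_variance(means)
print(means)  # five identical completed-data means
print(B)      # zero between-imputation variance
```

Because the five completed data sets are identical, the between-imputation component of the combined variance vanishes, and the total variance collapses to the within-imputation part alone.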
Great point David - thanks.  For that reason, my R/S-Plus aregImpute
function does weighted sampling using Tukey's tricube function with a
sharp peak at the closest match in predicted value.  I'm getting much
better distributions of imputed values when I do that.

Frank

>
> -----Original Message-----
> From: Frank E Harrell Jr [mailto:[email protected]]
> Sent: Wednesday, March 29, 2006 3:42 PM
> To: David Judkins
> Cc: Paul T. Shattuck; [email protected]
> Subject: Re: [Impute] range of imputed values for income
>
> David Judkins wrote:
>
>>I am not aware of the capabilities of IVEware, but the general question
>>of person-level mean squared prediction error is a function of both the
>>covariates and the imputation procedure.  As Dr. Rubin has pointed out,
>>minimizing person-level MSPE is not typically a primary goal in the
>>analysis of surveys and experiments although it might be important in an
>>activity like fraud detection.  Nonetheless, reduced person-level MSPE
>>should also translate into both lower variances on estimated population
>>and superpopulation marginal parameters and reduced bias on regression
>>coefficients.  So you want to use as rich a set of covariates in the
>>imputation as are available to you and to use the model-based
>>predictions in your imputation to at least some extent.  Unfortunately,
>>the stronger the usage you make, the more difficult it becomes to
>>estimate the post-imputation variance.  For example, a predictive-mean
>>matching approach to imputation defeats multiple imputation as a
>>variance-estimation technique.  For normally distributed outcomes,
>
>
> David - It's not clear to me why PMM would invalidate using the Rubin
> variance estimator for regression coefficient variances.  But maybe you
> are saying that PMM doesn't work if you are primarily interested in
> estimating a variance parameter (what kind?).
> -Frank Harrell
>
>
>>really good methods that both utilize covariate information and allow
>>post-imputation variance estimation are pretty much Bayesian and involve
>>Gibbs sampling to fit complex models and make reasonable posterior
>>draws.  (See Schafer's book.)  Even they do not cope well with the
>>natural heaping in income where people round to the nearest thousand
>>dollars or even worse.  I have some papers on how to impute non-normal
>>outcomes using covariates that are subject to missing values themselves,
>>but I have not yet been able to develop and validate good
>>post-imputation variance estimators to go with them.
>>
>>Your person-level MSPE seems so large that I suspect your software is
>>not using any covariates.  While that makes post-imputation variance
>>estimation easy, it seems like you could do better.
>>
>>The preservation of the marginal first and second order moments of
>>income seems to support the idea that you are not using any covariates.
>>The robustness of the model coefficients is harder to reconcile.  I
>>think this can only happen with a simple imputation procedure if the
>>missing data rate is negligible or if the model isn't very good to begin
>>with.  If substantial numbers of subjects were being thrown back and
>>forth between $3,000 and $100,000 per year, the coefficients in good
>>models would certainly be attenuated.  Maybe you just don't have any
>>variables that are strongly related to income?
>>
>>David Judkins
>>Senior Statistician
>>Westat
>>1650 Research Boulevard
>>Rockville, MD 20850
>>(301) 315-5970
>>[email protected]
>>
>>
>>-----Original Message-----
>>From: [email protected]
>>[mailto:[email protected]] On Behalf Of Paul T.
>>Shattuck
>>Sent: Wednesday, March 29, 2006 11:43 AM
>>To: [email protected]
>>Subject: [Impute] range of imputed values for income
>>
>>Hello,
>>
>>I am using IVEware for multiple imputation for the first time on a large
>>national health survey.
>>One of the variables imputed is income and I'm
>>finding that imputed values can vary dramatically within subjects across
>>multiply imputed datasets.  For instance, in some cases Person A might
>>have an imputed income of $3,000 in one imputation, and then $100,000
>>in another imputation.  This within-person variability far exceeds what
>>I'm seeing with other variables in the survey.  The distributions,
>>means, and standard deviations of the imputed vs. non-imputed values are
>>comparable.  And multivariate regression results using the multiply
>>imputed datasets and the original dataset with missing values are
>>reasonably robust, with the same substantive conclusions and very close
>>coefficient estimates.  So, I'm wondering if this degree of
>>within-subject variability across imputations is something to worry
>>about, and potentially an indicator of a mis-specified imputation
>>model... or whether this kind of within-subject variability across
>>imputed datasets is typical.
>>
>>Thanks,
>>
>>Paul Shattuck
>>
>
>
>

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
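
[Editorial illustration, not part of the original exchange.] The weighted sampling mentioned at the top of this message (donors drawn with Tukey's tricube weights on the distance in predicted value) might look roughly like the sketch below. It is not the aregImpute source: the function names `tricube` and `draw_donor`, the `scale` bandwidth, the nearest-donor fallback, and the toy predicted values are all assumptions made for illustration.

```python
import random

def tricube(u):
    """Tukey's tricube kernel: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    a = abs(u)
    return (1.0 - a ** 3) ** 3 if a < 1.0 else 0.0

def draw_donor(pred_missing, donor_preds, scale, rng):
    """Sample one donor index, weighting by tricube of the scaled distance
    between the donor's predicted value and the missing case's prediction."""
    weights = [tricube((p - pred_missing) / scale) for p in donor_preds]
    if sum(weights) == 0.0:
        # Assumed fallback: no donor within `scale`, so take the nearest one.
        return min(range(len(donor_preds)),
                   key=lambda i: abs(donor_preds[i] - pred_missing))
    return rng.choices(range(len(donor_preds)), weights=weights, k=1)[0]

rng = random.Random(1)                        # seeded for reproducibility
donor_preds = [10.0, 12.0, 15.0, 20.0, 40.0]  # hypothetical predicted values
draws = [draw_donor(13.0, donor_preds, scale=8.0, rng=rng) for _ in range(200)]

# Donors near the target prediction (13.0) are drawn often and in varying
# order; the donor at 40.0 lies outside the bandwidth and has zero weight.
print(sorted(set(draws)))
```

Unlike the pure nearest-match rule, repeated draws here can select different donors for the same missing case, so imputations differ across repetitions and the between-imputation variance is no longer forced to zero. A smaller `scale` concentrates the draws on the closest matches.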
