David Judkins wrote:
> Frank,
>
> Well, I am glad that I conditioned my statement to refer to software
> known to me.
>
> This past summer, some co-workers and I presented some testing on a
> really pathological joint distribution. Would you be interested in
> testing your aregImpute function on it?
>
> --Dave
Yes. Thanks Dave.

Frank

> -----Original Message-----
> From: Frank E Harrell Jr [mailto:[email protected]]
> Sent: Wednesday, January 02, 2008 11:48 AM
> To: David Judkins
> Cc: Alan Zaslavsky; [email protected];
>     [email protected]
> Subject: Re: [Impute] Rounding option on PROC MI and choosing a final
>     MI dataset
>
> David Judkins wrote:
>> Raquel,
>>
>> Your problem is typical of the class of problems that I have been
>> working on for about 15 years now. You can look up my imputation
>> papers in the CIS. None of the currently available (free or marketed)
>> software solutions known to me are designed to preserve the structure
>> of general multivariate data. The ones that build models of
>> multivariate relationships are mostly designed for either normal or
>> binary data. Programs designed for general data usually impute a
>> single variable at a time and generally fail to preserve multivariate
>> structure. If you have the luxury of a large programming budget, you
>> could program the algorithms that some of us here at Westat have
>> developed and published.
>
> David,
>
> In theory you are correct, but I think your note slightly misses the
> point. It is amazing how well the chained equations approach of MICE
> and my aregImpute function work, given that they were not designed to
> preserve the multivariate structure. And they make fewer assumptions.
> I am particularly dubious about any method that assumes linearity and
> multivariate normality.
>
> aregImpute uses Fisher's optimum scoring algorithm to impute nominal
> variables. If predictive mean matching is used with aregImpute (a more
> nonparametric approach not available with your multivariate approach),
> the distribution of imputed categories is quite sensible.
>
> Frank Harrell
>
>> As Alan replied, however, given that all your individual item
>> nonresponse rates are low, perhaps one of the available solutions
>> would work reasonably well for you.
>>
>> It sounds as if you don't have any skip patterns. If so, you could
>> just impute the mode for each variable. A second solution that is
>> only a little more complicated would be to impute each variable
>> independently by a simple hotdeck. Either way, you end up with 100%
>> complete vectors. You don't have to do any rounding, and all
>> variables have permissible values. You will get better marginal
>> distributions with independent hotdecks than with imputed modes.
>>
>> But neither solution protects multivariate structure. Here is a
>> slightly more complicated solution that tries to do so yet is still
>> fairly simple:
>>
>> Pick a single variable as the most important for your analyses; call
>> it Y. Let S be the maximal set of variables with zero item
>> nonresponse. Build the best model for Y in terms of S that you can
>> (it doesn't have to be a linear model). Output predicted values of Y
>> for the whole sample; call them Ypred. Let O be the maximal set of
>> cases with zero nonresponse on all variables. For each case with one
>> or more missing values, find its nearest neighbor in O in terms of
>> Ypred; you then have a donor case and a recipient case. Let
>> X1i,...,Xpi be the set of variables on recipient case i with missing
>> values, and let X1j,...,Xpj be the corresponding set of variables on
>> donor case j. Impute Xki = Xkj for k = 1,...,p.
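A minimal R sketch of the single-Y nearest-neighbor hotdeck just
described. Everything here is a hypothetical stand-in: the data frame
dat, the key variable y (Y above), and the fully observed variables s1
and s2 (the set S). Donors are matched on Ypred, with ties broken
arbitrarily:

  ## Single-Y nearest-neighbor hotdeck (hypothetical data and names).
  set.seed(1)

  ## Step 1: model Y in terms of S (need not be linear) and compute
  ## predicted values Ypred for the whole sample.
  fit   <- lm(y ~ s1 + s2, data = dat)
  ypred <- predict(fit, newdata = dat)

  ## Step 2: O = cases with zero nonresponse; the rest are recipients.
  donors     <- which(complete.cases(dat))
  recipients <- which(!complete.cases(dat))

  ## Step 3: for each recipient, take all missing values from the donor
  ## whose Ypred is closest. (An independent hotdeck would instead draw
  ## a random donor separately for each variable.)
  for (i in recipients) {
    j    <- donors[which.min(abs(ypred[donors] - ypred[i]))]
    miss <- which(is.na(dat[i, ]))
    dat[i, miss] <- dat[j, miss]
  }

Because all of a recipient's missing values come from a single donor,
within-donor relationships among the imputed variables are carried over
intact, which is what gives the slight covariance preservation noted
below.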
>>
>> To the extent that the variables in S are good predictors of Y, and
>> to the extent that the other variables are related to Y, you should
>> get slightly better preservation of covariances than with independent
>> hotdecks. There are many variants on this theme. You will still have
>> some fading of multivariate structure, however, and you will
>> underestimate post-imputation variances.
>>
>> For combining hotdecks with multiple imputation, see the exciting new
>> papers by Siddique and Belin and by Little, Yosef, Cain, Nan, and
>> Harlow, both in the first issue of volume 27 of Statistics in
>> Medicine.
>>
>> --Dave
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Alan
>> Zaslavsky
>> Sent: Wednesday, January 02, 2008 10:07 AM
>> To: [email protected]; [email protected]
>> Subject: [Impute] Rounding option on PROC MI and choosing a final MI
>> dataset
>>
>>> From: "Raquel Hampton" <[email protected]>
>>> Subject: [Impute] Rounding option on PROC MI and choosing a final MI
>>> dataset
>>>
>>> My first question is: there is a round option for PROC MI, but I
>>> read in an article (Horton, N.J., Lipsitz, S.P., & Parzen, M.
>>> (2003). A potential for bias when rounding in multiple imputation.
>>> The American Statistician, 57(4), 229-232) that using the round
>>> option for categorical data (the items have nominal responses,
>>> ranging from 1 to 5) produces biased, though logical, estimates. So
>>> what can be done? I only have access to SAS and Stata, but I am not
>>> very familiar with Stata. Will this be less of a problem since the
>>> proportion of missing values for each individual item is small?
>>
>> Do you really mean nominal (unordered categories, like French,
>> German, English, or chocolate, vanilla, strawberry) or ordinal (like
>> poor, fair, good, excellent)? If nominal, you won't get anything
>> sensible by fitting a normal model and rounding. If ordinal and well
>> distributed across the categories, the bias from using rounded data
>> will be less than with the binomial data primarily considered by the
>> Horton et al. article.
>>
>> You might also consider whether it is necessary to round at all --
>> that depends on how the data will be used in further analyses.
>>
>> With only a couple of percent missing on each item, all of the issues
>> about imputation become less crucial, although as noted in a previous
>> response you should definitely run the proper MI analysis to verify
>> that the between-imputation contribution to variance is small. In
>> practice, any modeling exercise is a compromise that concentrates
>> effort on the most important aspects of the modeling, and in this
>> case that might not require doing the most methodologically advanced
>> things with the imputation.
>>
>> _______________________________________________
>> Impute mailing list
>> [email protected]
>> http://lists.utsouthwestern.edu/mailman/listinfo/impute

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
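A toy R simulation of the rounding bias that the Horton et al. article
describes, for the binary case it primarily considers. The numbers are
made up for illustration, and for simplicity every value is treated as
imputed:

  ## Impute a binary item from a fitted normal model, then round to 0/1.
  ## The rounded estimate of the proportion is biased; the unrounded
  ## mean of the normal draws is not.
  set.seed(3)
  p    <- 0.2                                           # true proportion
  yimp <- rnorm(1e6, mean = p, sd = sqrt(p * (1 - p)))  # normal-model draws
  mean(round(pmin(pmax(yimp, 0), 1)))   # about 0.227: biased upward
  mean(yimp)                            # about 0.200: unbiased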
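For anyone who wants to try the chained-equations approach Frank
describes, here is a minimal sketch using aregImpute from the R Hmisc
package with predictive mean matching. The data frame dat and the
variables y, x1, x2, x3 are hypothetical stand-ins, and defaults may
differ across Hmisc versions:

  ## Multiple imputation by chained equations with predictive mean
  ## matching, then an analysis model combined across imputations.
  library(Hmisc)

  set.seed(2)
  imp <- aregImpute(~ y + x1 + x2 + x3, data = dat,
                    n.impute = 5,    # number of completed data sets
                    type = 'pmm')    # predictive mean matching

  fit <- fit.mult.impute(y ~ x1 + x2 + x3, lm, imp, data = dat)
  summary(fit)

The mice package takes a similar chained-equations approach (for
example, mice(dat, m = 5)), with predictive mean matching as its
default for numeric variables.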
