Scot: you might take a look at my review paper on missing data in regression in JASA (ref below). An improvement of your method would be to compute conditional probabilities of being in the categories given the other covariates and Y, and draw the category using these conditional probabilities. This could be done multiply to incorporate imputation error (see the section on multiple imputation in the above reference).
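To make the suggestion concrete, here is a minimal sketch in Python, not a definitive implementation. It assumes a single categorical covariate coded 0..K-1 with -1 marking a missing value, conditions on Y only (the other covariates are left out for brevity), and takes Y to be normal within each category with a common variance. The function names `category_probs` and `multiply_impute` are made up for illustration, and the bootstrap of the complete cases is only a rough stand-in for drawing the model parameters from their posterior.

```python
import numpy as np

def category_probs(y_obs, cat_obs, y_mis, n_levels):
    """P(category = k | Y) for each incomplete case, by Bayes' rule.

    Assumption: Y is normal within each category with a common variance.
    """
    counts = np.bincount(cat_obs, minlength=n_levels)
    # Lightly smoothed category proportions, so an empty level keeps mass.
    prior = (counts + 0.5) / (counts.sum() + 0.5 * n_levels)
    # Outcome mean per category; fall back to the overall mean if empty.
    means = np.array([y_obs[cat_obs == k].mean() if counts[k] else y_obs.mean()
                      for k in range(n_levels)])
    # Pooled within-category variance.
    var = ((y_obs - means[cat_obs]) ** 2).sum() / max(y_obs.size - n_levels, 1)
    logp = np.log(prior) - 0.5 * (y_mis[:, None] - means) ** 2 / var
    logp -= logp.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def multiply_impute(y, cat, n_levels, m=5, seed=0):
    """Return m completed copies of cat, each with missing entries drawn at random."""
    rng = np.random.default_rng(seed)
    obs = cat >= 0
    y_obs, cat_obs = y[obs], cat[obs]
    completed = []
    for _ in range(m):
        # Bootstrap the complete cases so parameter uncertainty carries
        # through to the imputations (an approximation).
        idx = rng.integers(0, y_obs.size, y_obs.size)
        probs = category_probs(y_obs[idx], cat_obs[idx], y[~obs], n_levels)
        draw = cat.copy()
        # Draw a category from the conditional probabilities rather than
        # assigning the single most likely one.
        draw[~obs] = [rng.choice(n_levels, p=p) for p in probs]
        completed.append(draw)
    return completed

# Simulated data for illustration only: a four-level covariate, 17% missing.
sim = np.random.default_rng(1)
true_cat = sim.integers(0, 4, 200)
y = true_cat + sim.normal(size=200)
cat = true_cat.copy()
cat[sim.random(200) < 0.17] = -1
completed = multiply_impute(y, cat, n_levels=4, m=5)
```

Each completed data set would then be analyzed in the usual way and the results combined across the m imputations, as in the multiple imputation section of the reference.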
This method assumes the missing data are missing at random, which may not be appropriate in your setting. The simple approach of deleting the incomplete cases (complete-case analysis) is valid if missingness depends on the value of the covariates but not the outcome, which may be plausible here; one might consider both multiple imputation (as outlined above) and complete-case analysis and see if the substantive findings differ, as a form of sensitivity analysis.

Best, Rod Little.

Reference: Little, R.J.A. (1992). Regression with missing X's: a review. Journal of the American Statistical Association, 87, 1227-1237.

On Mon, 26 Feb 2001, Scot W McNary wrote:

>
> Hi,
>
> I'm working with three ANCOVAs with categorical covariates. The variables
> of interest are continuous as are the DVs and all of these variables are
> completely observed. The missing data exist for the categorical
> predictors. There are three of them:
>
> 1) four level predictor, 17% missing data
> 2) four level predictor, 7% missing data
> 3) two level predictor, 3% missing data
>
> The investigators I'm working for have good reason to believe that these
> data are unavailable vs. not applicable. They are items which ask about
> different mutually exclusive/exhaustive aspects of abuse experienced by
> individuals. It's reasonable to expect that there is some (unobserved)
> response to these items since individuals were selected into the study
> based on their exposure to abuse. It's likely that some individuals
> refused to answer these items. Unfortunately, the original data coders
> are not available to ask about the proportion of refused vs. don't know
> responses in each of these cases.
>
> My simple minded approach was to collapse these individuals into one of
> the existing categories. To do this I found the outcome means for each
> level of the predictors and collapsed the missing value cases into the
> category with the most similar outcome means.
>
> I understand these missing data to be non-ignorable. But since the
> function of imputation for this analysis is to maximize the N for the
> covariates and not the primary focus of the study, I initially thought
> that a simple-minded, ad hoc approach would suffice. However, a
> reviewer's question has caused me to rethink that.
>
> The reviewer believes that we have used an illegitimate method that has
> overly favored our hypotheses by "imputing" data in this way. I disagree
> that we have biased our data in favor of our hypotheses, first, since we
> had no hypotheses about the covariates per se, and second only one of the
> 21 contrasts implied by the levels of the covariates was significant.
>
> The reviewer's general point that my ad hoc method is not standard has
> caused me to consider asking for advice from others with more experience.
> Should I be engaging in a more formal imputation procedure (e.g., multiple
> imputation), for these covariates? Are there problems with my approach I
> haven't foreseen? Any suggestions welcomed.
>
> Thanks in advance,
>
> Scot McNary
>
>
> --
> Scot W. McNary email:[email protected]
>
>
>

___________________________________________________________________________________
Roderick Little
Chair, Department of Biostatistics    (734) 936-1003
U-M School of Public Health           Fax: (734) 763-2215
M4208 SPH II                          [email protected]
1420 Washington Hgts                  http://www.sph.umich.edu/~rlittle/
Ann Arbor, MI 48109-2029
