Scot - To add a bit to Rod's excellent advice, the S-Plus transcan function can use Fisher's optimum scoring algorithm to score categorical variables. The approximate Bayesian bootstrap is then used to sample residuals from these scores, and the randomly drawn predicted scores are back-computed to the "closest" categories. See http://hesweb1.med.virginia.edu/biostat/s/help/transcan.html
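As a rough illustration of the idea above (this is not transcan's code, just a minimal Python sketch), the approximate Bayesian bootstrap can be written as a two-stage resampling of observed residuals, followed by a nearest-score lookup. The function names, the numeric category scores, and the assumption that a drawn score is a predicted score plus a sampled residual are all hypothetical choices made for the example.

```python
import random

def abb_draw(observed_residuals, n_missing, rng):
    """Approximate Bayesian bootstrap: two-stage resampling of residuals."""
    n = len(observed_residuals)
    # Stage 1: bootstrap a donor pool of size n from the observed residuals.
    pool = [rng.choice(observed_residuals) for _ in range(n)]
    # Stage 2: draw one residual per missing case from that pool.
    return [rng.choice(pool) for _ in range(n_missing)]

def closest_category(score, category_scores):
    """Return the category whose numeric score is nearest to `score`."""
    return min(category_scores, key=lambda c: abs(category_scores[c] - score))

def impute_categories(predicted_scores, observed_residuals, category_scores, rng):
    """Add an ABB-drawn residual to each predicted score, then back-compute
    the category with the nearest score (assumed scheme, for illustration)."""
    residuals = abb_draw(observed_residuals, len(predicted_scores), rng)
    return [closest_category(p + r, category_scores)
            for p, r in zip(predicted_scores, residuals)]

# Hypothetical scores for a four-level predictor and made-up residuals.
scores = {"A": -1.2, "B": -0.3, "C": 0.4, "D": 1.5}
residuals = [0.25, -0.40, 0.10, -0.05, 0.30]
print(impute_categories([0.2, -0.9, 1.1], residuals, scores, random.Random(1)))
```

Calling impute_categories repeatedly with different random states would give the multiple completed data sets needed for multiple imputation.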
transcan can also use recursive partitioning to impute categorical predictors but at present this only works for single "best guess" imputation. I need to implement Rod's advice to add random sampling from the estimated multinomial distribution.
-Frank Harrell

Rod Little wrote:
>
> Scot: you might take a look at my review paper on missing data in
> regression in JASA (ref below). An improvement of your method would be to
> compute conditional probabilities of being in the categories given the
> other covariates and Y, and draw the category using these conditional
> probabilities. This could be done multiply to incorporate imputation error
> (see the section on multiple imputation in the above reference).
>
> This method assumes the missing data are missing at random, which may
> not be appropriate in your setting. The simple approach of deleting the
> incomplete cases (complete-case analysis) is valid if missingness depends
> on the value of the covariates but not the outcome, which may be plausible
> here; one might consider both multiple imputation (as outlined above) and
> complete-case analysis and see if the substantive findings differ, as a
> form of sensitivity analysis. Best, Rod Little.
>
> Reference: Little, R.J.A. (1992). Regression with missing X's: a
> review. Journal of the American Statistical Association, 87, 1227-1237.
>
> On Mon, 26 Feb 2001, Scot W McNary wrote:
>
> >
> > Hi,
> >
> > I'm working with three ANCOVAs with categorical covariates. The variables
> > of interest are continuous as are the DVs and all of these variables are
> > completely observed. The missing data exist for the categorical
> > predictors. There are three of them:
> >
> > 1) four level predictor, 17% missing data
> > 2) four level predictor, 7% missing data
> > 3) two level predictor, 3% missing data
> >
> > The investigators I'm working for have good reason to believe that these
> > data are unavailable vs. not applicable. They are items which ask about
> > different mutually exclusive/exhaustive aspects of abuse experienced by
> > individuals. It's reasonable to expect that there is some (unobserved)
> > response to these items since individuals were selected into the study
> > based on their exposure to abuse. It's likely that some individuals
> > refused to answer these items. Unfortunately, the original data coders
> > are not available to ask about the proportion of refused vs. don't know
> > responses in each of these cases.
> >
> > My simple minded approach was to collapse these individuals into one of
> > the existing categories. To do this I found the outcome means for each
> > level of the predictors and collapsed the missing value cases into the
> > category with the most similar outcome means.
> >
> > I understand these missing data to be non-ignorable. But since the
> > function of imputation for this analysis is to maximize the N for the
> > covariates and not the primary focus of the study, I initially thought
> > that a simple-minded, ad hoc approach would suffice. However, a
> > reviewer's question has caused me to rethink that.
> >
> > The reviewer believes that we have used an illegitimate method that has
> > overly favored our hypotheses by "imputing" data in this way. I disagree
> > that we have biased our data in favor of our hypotheses, first, since we
> > had no hypotheses about the covariates per se, and second only one of the
> > 21 contrasts implied by the levels of the covariates was significant.
> >
> > The reviewer's general point that my ad hoc method is not standard has
> > caused me to consider asking for advice from others with more experience.
> > Should I be engaging in a more formal imputation procedure (e.g., multiple
> > imputation), for these covariates? Are there problems with my approach I
> > haven't foreseen? Any suggestions welcomed.
> >
> > Thanks in advance,
> >
> > Scot McNary
> >
> >
> > --
> > Scot W. McNary email:[email protected]
> >
> >
> >
>
> ___________________________________________________________________________________
> Roderick Little
> Chair, Department of Biostatistics (734) 936-1003
> U-M School of Public Health Fax: (734) 763-2215
> M4208 SPH II [email protected]
> 1420 Washington Hgts http://www.sph.umich.edu/~rlittle/
> Ann Arbor, MI 48109-2029

--
Frank E Harrell Jr
Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem.
Dept. of Health Evaluation Sciences
U. Virginia School of Medicine
http://hesweb1.med.virginia.edu/biostat
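
Rod's suggestion quoted above (estimate the conditional probability of each category given the other covariates and Y, then draw the missing category from those probabilities, several times over) might be sketched in Python as follows. This is only an illustration under simplifying assumptions: the "stratum" key that summarizes the other covariates and the outcome is hypothetical, the probabilities are plain relative frequencies among complete cases rather than a fitted model, and uncertainty in the estimated probabilities themselves is ignored, which a full multiple-imputation procedure would also account for.

```python
import random
from collections import Counter, defaultdict

def conditional_probs(complete_cases):
    """Estimate P(category | stratum) by relative frequency.

    complete_cases: iterable of (stratum, category) pairs, where `stratum`
    is a hashable summary of the other covariates and the outcome Y
    (for example, a tuple of covariate levels and an outcome tertile).
    """
    counts = defaultdict(Counter)
    for stratum, category in complete_cases:
        counts[stratum][category] += 1
    return {s: {c: k / sum(cnt.values()) for c, k in cnt.items()}
            for s, cnt in counts.items()}

def draw_category(probs, rng):
    """Draw one category from a {category: probability} mapping."""
    cats = list(probs)
    return rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

def multiply_impute(missing_strata, probs_by_stratum, m, rng):
    """Return m imputed vectors, one category per incomplete case.

    Raises KeyError if an incomplete case falls in a stratum with no
    complete cases; a real analysis would need a model to handle that.
    """
    return [[draw_category(probs_by_stratum[s], rng) for s in missing_strata]
            for _ in range(m)]

# Made-up complete cases: (stratum, observed category).
complete = [("low", "A"), ("low", "A"), ("low", "B"),
            ("high", "C"), ("high", "D"), ("high", "D")]
probs = conditional_probs(complete)
for imputed in multiply_impute(["low", "high", "low"], probs, 3, random.Random(2001)):
    print(imputed)
```

Each of the m imputed vectors would be merged into its own completed data set, the ANCOVA run on each, and the results combined, which can then be compared with the complete-case analysis as the sensitivity check Rod describes.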
