Scot - To add a bit to Rod's excellent advice, the S-Plus transcan
function can use Fisher's optimum scoring algorithm to score
categorical variables.  The approximate Bayesian bootstrap is then used
to sample residuals from these scores, and the randomly drawn predicted
scores are back-computed to the categories "closest" to them.  See
http://hesweb1.med.virginia.edu/biostat/s/help/transcan.html
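
As a rough illustration of the idea only (this is not transcan's code,
and it is written in Python rather than S-Plus), the sketch below draws
residuals by an approximate Bayesian bootstrap and maps each predicted
score plus residual to the nearest category score.  The function names,
the dict of category scores and the example numbers are all hypothetical.

```python
import random

def abb_draw_residuals(observed_residuals, n_missing, rng):
    # ABB step 1: resample the observed residuals with replacement
    # to form a donor pool of the same size.
    pool = [rng.choice(observed_residuals) for _ in observed_residuals]
    # ABB step 2: draw one residual per missing case from that pool,
    # again with replacement.
    return [rng.choice(pool) for _ in range(n_missing)]

def closest_category(score, category_scores):
    # category_scores maps a category label to its numeric score.
    return min(category_scores, key=lambda c: abs(category_scores[c] - score))

def impute_categories(predicted_scores, observed_residuals,
                      category_scores, seed=0):
    # One imputation: predicted score + randomly drawn residual,
    # then back-compute to the nearest category.
    rng = random.Random(seed)
    residuals = abb_draw_residuals(observed_residuals,
                                   len(predicted_scores), rng)
    return [closest_category(p + r, category_scores)
            for p, r in zip(predicted_scores, residuals)]

# Hypothetical scores for a three-level predictor.
scores = {"a": 0.0, "b": 1.0, "c": 2.0}
imputed = impute_categories([0.1, 1.9, 1.2], [-0.4, 0.0, 0.3, 0.6], scores)
```

Repeating the call with different seeds gives multiple imputations.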

transcan can also use recursive partitioning to impute categorical
predictors but at present this only works for single "best guess"
imputation.  I need to implement Rod's advice to add random sampling
from the estimated multinomial distribution.
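
A minimal sketch of that sampling step, in Python rather than S-Plus
and not taken from transcan: it assumes each incomplete case already
has estimated conditional probabilities for its categories (from
whatever model is used), and draws a category from them m times to get
m imputed data sets.  The names and the example probabilities are
hypothetical.

```python
import random

def draw_category(probs, rng):
    # probs maps a category label to its estimated conditional probability.
    labels = list(probs)
    weights = [probs[k] for k in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

def multiply_impute(case_probs, m=5, seed=0):
    # case_probs: one probability dict per incomplete case.
    # Returns m lists of drawn categories, one list per imputation.
    rng = random.Random(seed)
    return [[draw_category(p, rng) for p in case_probs] for _ in range(m)]

# Hypothetical estimated probabilities for two incomplete cases.
case_probs = [
    {"1": 0.1, "2": 0.2, "3": 0.3, "4": 0.4},
    {"1": 0.7, "2": 0.1, "3": 0.1, "4": 0.1},
]
imputations = multiply_impute(case_probs, m=5)
```

Each of the m completed data sets would then be analyzed separately
and the results combined, as in the multiple imputation section of
Rod's paper.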

-Frank Harrell


Rod Little wrote:
> 
> Scot: you might take a look at my review paper on missing data in
> regression in JASA (ref below). An improvement of your method would be to
> compute conditional probabilities of being in the categories given the
> other covariates and Y, and draw the category using these conditional
> probabilities. This could be done multiply to incorporate imputation error
> (see the section on multiple imputation in the reference below).
> 
> This method assumes the missing data are missing at random, which may
> not be appropriate in your setting. The simple approach of deleting the
> incomplete cases (complete-case analysis) is valid if missingness depends
> on the value of the covariates but not the outcome, which may be plausible
> here; one might consider both multiple imputation (as outlined above) and
> complete-case analysis and see if the substantive findings differ, as a
> form of sensitivity analysis. Best, Rod Little.
> 
> Reference: Little, R.J.A. (1992). Regression with missing X's: a
> review. Journal of the American Statistical Association, 87, 1227-1237.
> 
> On Mon, 26 Feb 2001, Scot W McNary wrote:
> 
> >
> > Hi,
> >
> > I'm working with three ANCOVAs with categorical covariates.  The variables
> > of interest are continuous as are the DVs and all of these variables are
> > completely observed.  The missing data exist for the categorical
> > predictors.  There are three of them:
> >
> > 1) four level predictor, 17% missing data
> > 2) four level predictor, 7% missing data
> > 3) two level predictor, 3% missing data
> >
> > The investigators I'm working for have good reason to believe that these
> > data are unavailable rather than not applicable.  They are items which ask about
> > different mutually exclusive/exhaustive aspects of abuse experienced by
> > individuals.  It's reasonable to expect that there is some (unobserved)
> > response to these items since individuals were selected into the study
> > based on their exposure to abuse.  It's likely that some individuals
> > refused to answer these items.  Unfortunately, the original data coders
> > are not available to ask about the proportion of refused vs. don't know
> > responses in each of these cases.
> >
> > My simple minded approach was to collapse these individuals into one of
> > the existing categories.  To do this I found the outcome means for each
> > level of the predictors and collapsed the missing value cases into the
> > category with the most similar outcome means.
> >
> > I understand these missing data to be non-ignorable.  But since the
> > function of imputation for this analysis is to maximize the N for the
> > covariates and not the primary focus of the study, I initially thought
> > that a simple-minded, ad hoc approach would suffice.  However, a
> > reviewer's question has caused me to rethink that.
> >
> > The reviewer believes that we have used an illegitimate method that has
> > overly favored our hypotheses by "imputing" data in this way.  I disagree
> > that we have biased our data in favor of our hypotheses, first, since we
> > had no hypotheses about the covariates per se, and second only one of the
> > 21 contrasts implied by the levels of the covariates was significant.
> >
> > The reviewer's general point that my ad hoc method is not standard has
> > caused me to consider asking for advice from others with more experience.
> > Should I be engaging in a more formal imputation procedure (e.g., multiple
> > imputation), for these covariates?  Are there problems with my approach I
> > haven't foreseen?  Any suggestions welcomed.
> >
> > Thanks in advance,
> >
> > Scot McNary
> >
> >
> > --
> >   Scot W. McNary  email:[email protected]
> >
> >
> >
> 
> ___________________________________________________________________________________
> Roderick Little
> Chair, Department of Biostatistics                    (734) 936-1003
> U-M School of Public Health                     Fax:  (734) 763-2215
> M4208 SPH II                                       [email protected]
> 1420 Washington Hgts               http://www.sph.umich.edu/~rlittle/
> Ann Arbor, MI 48109-2029

-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat
