On 1 Dec 2001, jenny wrote:

> What should I do with the missing values in my data.  I need to 
> perform a t test of two samples to test the mean difference between 
> them. 
> How should I handle them in S-Plus or SAS?

1.  What do S-Plus and/or SAS do with missing values by default?  
    (All packages have defaults, and sometimes they're even sensible 
ones.  If your package(s) do what you want done, or at least do something 
you can live with, that's probably the most comfortable resolution of 
your question.)

2.  Why are there missing values?  And what do these reasons imply (if 
anything) about the values themselves? 
 There are essentially two choices available: 
  (a) treat the values as missing, that is, discard each of the cases for
which the variable in question is missing for the duration of the analysis
of that variable, and retrieve those cases again when dealing with some 
other variable for which their value is not missing.  This is the default 
in MINITAB and SPSS, although for some analyses (in both packages) the 
missing cases are deleted listwise (in multiple regression, for example, 
if any of the variables in the model is missing, the whole case is 
deleted from the analysis) and for some the missing cases are deleted 
pairwise (in reporting a correlation matrix, for example, a case is 
deleted from the computation of a correlation coefficient if either of 
the two variables is missing, but is retained for other correlation 
coefficients for which both variables are non-missing in this case).
  (b) impute some value to the missing variable for this case.  There is 
a great variety of imputation schemes, all of them (so far as I know) 
suffering from the logical defect that one must assume something about 
the missing value, and the assumption may not only be untrue, it may be 
wildly in error.  One approach is to substitute the mean of this variable 
for the missing value;  but if the _reason_ the value is missing implies 
that the actual value is likely to be extremely high or extremely low, 
this is evidently not a good strategy.  Another approach is to use some 
variant of multiple regression to predict the missing value from the 
existing values of other variables;  again, this assumes that the missing 
value would be close to the regression line (or surface), and if the 
_reason_ implies an extreme value or outlier, this is not particularly 
likely to yield a realistic value.
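
To make the listwise/pairwise distinction in (a) concrete, here is a 
minimal sketch in Python (not S-Plus or SAS, and not what either package 
does internally).  The data, the variable names x, y, z, and the helper 
functions listwise() and pairwise() are all invented for illustration; 
None stands in for a missing value.

```python
# Hypothetical toy data: three variables, None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 5.0, "z": None},
    {"x": 4.0, "y": 4.0, "z": 6.0},
    {"x": None, "y": 7.0, "z": 8.0},
]


def listwise(rows, variables):
    """Keep only cases with no missing value in ANY of the variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def pairwise(rows, a, b):
    """Keep cases where both a and b are present; other variables are ignored."""
    return [r for r in rows if r[a] is not None and r[b] is not None]


# Listwise: one set of complete cases shared by the whole analysis.
listwise_n = len(listwise(rows, ["x", "y", "z"]))

# Pairwise: a separate set of usable cases for each pair of variables.
pairwise_n = {
    (a, b): len(pairwise(rows, a, b))
    for a, b in [("x", "y"), ("x", "z"), ("y", "z")]
}

print("listwise cases:", listwise_n)
print("pairwise cases:", pairwise_n)
```

With these five made-up cases, listwise deletion leaves 2 cases for 
everything, while each pairwise computation gets 3, and not the same 3 
each time.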

This is of course a simplified account (some might say oversimplified) of 
the problem of missing-ness, but may suggest some useful ideas.  

Personally, most of the time I prefer to acknowledge that I don't know 
the value that's missing, and let the case be temporarily discarded, at 
least for a first run at an analysis (or series of analyses).
  And if I chose to use a method of imputation, I'd usually want to 
report results both of analyses in which the missing data are honestly 
missing, and analyses in which imputed values are used, so that I (and 
my readers) could see the effect(s) of the imputation.

And since you want to test for differences between means, you almost 
certainly should NOT substitute a _mean_ for any missing value.  If you 
substitute the overall mean, you will tend to diminish the real 
difference, if any, between the two sample means, and if there's a lot of 
missing data you could end up not finding differences where they would 
have been evident if you'd permitted the missing cases to be discarded. 
If you substitute the mean of this subgroup, you will not change the 
apparent difference between the means, but you WILL reduce the 
within-group (pooled or not) variance, so that you will have spuriously 
high sensitivity to differences between the means.
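
Both effects can be seen in a small numerical sketch.  This is Python 
rather than S-Plus or SAS, the two samples are invented, and welch_t(), 
observed() and fill() are hypothetical helpers written for this example; 
the t statistic computed is the unpooled (Welch) form.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic with unpooled (Welch) variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def observed(xs):
    """Discard missing values (None)."""
    return [x for x in xs if x is not None]


def fill(xs, value):
    """Replace each missing value (None) with the given value."""
    return [value if x is None else x for x in xs]


# Hypothetical samples; None marks a missing value.
g1 = [10.0, 12.0, 11.0, 13.0, None, None, None]
g2 = [14.0, 15.0, 13.0, 16.0, None, None, None]

o1, o2 = observed(g1), observed(g2)
overall = mean(o1 + o2)  # overall mean of all observed values

# Difference between sample means under each treatment of the missing data.
diff_deleted = mean(o1) - mean(o2)
diff_overall = mean(fill(g1, overall)) - mean(fill(g2, overall))
diff_group = mean(fill(g1, mean(o1))) - mean(fill(g2, mean(o2)))

# t statistics: missing cases discarded vs. subgroup-mean substitution.
t_deleted = welch_t(o1, o2)
t_group = welch_t(fill(g1, mean(o1)), fill(g2, mean(o2)))

print("difference, cases deleted:    ", diff_deleted)
print("difference, overall mean used:", diff_overall)
print("difference, group means used: ", diff_group)
print("t, cases deleted:   ", t_deleted)
print("t, group means used:", t_group)
```

In this made-up example the overall-mean substitution shrinks the 
difference between the sample means, while the subgroup-mean substitution 
leaves the difference alone but inflates the magnitude of t, because the 
filled-in values add cases without adding any variability.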

Whether there is an argument that would support any other method of 
imputation in your case, I cannot tell.  I'm inclined to doubt it, but 
that may be merely a reflection of my usual skepticism (or, perhaps, 

