Jenny -- here's a way to impute continuous variables using SAS:

Regress the continuous variable on the candidate predictors and keep only the predictors that turn out to be significant.  The parameter estimates from that final regression are then used to predict the missing observations: each imputed value is the intercept plus the sum, over the retained predictors, of the observation's value on that predictor times its parameter estimate.  See the example below:


Step 1:  (VarX=continuous variable with missing values)

 
Data DATAI;
        SET DATA;
RUN;

(indvar1-indvar3 = candidate independent predictor variables)

PROC REG DATA=DATAI;
model VarX=  indvar1 indvar2 indvar3;
RUN;
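
Optional: if you would rather have the p-values in a data set than read them off the listing, ODS can capture the parameter-estimates table.  This is just a sketch, and PE is an arbitrary data set name:

ods output ParameterEstimates=PE;
proc reg data=DATAI;
   model VarX = indvar1 indvar2 indvar3;
run;
quit;

proc print data=PE;   /* inspect the p-values to pick the predictors for Step 2 */
run;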

Step 2:
Rerun the regression with the independent variables that were significant in the first run.

DATA DATAII;
   SET DATA;  /* using the original data set (called DATA) for the regression */
RUN;

proc reg data=DATAII;
model VarX=  indvar1  indvar3;
RUN;
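
Optional: adding OUTEST= to this final run writes the intercept and coefficients to a data set, so they can be copied into Step 3 without transcribing them from the listing.  A sketch, with EST as an arbitrary name:

proc reg data=DATAII outest=EST;
   model VarX = indvar1 indvar3;
run;
quit;

proc print data=EST;   /* one row per model, with the Intercept and each coefficient */
run;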

Step 3:
Create a new data set and impute, using the intercept and parameter estimates from the final regression.

VarXIMP = continuous variable with imputed missing values

To impute VarXIMP:

DATA IMPUTE;
     
SET DATA;  /* original data set */
 
IF VarX=. THEN VarXIMP = /*INTERCEPT*/ 7.55820 + (indvar1*-6.20558) + (indvar3*-14.15744);
ELSE VarXIMP = VarX;
Run;
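
An alternative that skips the hand-typed coefficients entirely: PROC REG's OUTPUT statement computes a predicted value for every observation whose predictors are non-missing, even when VarX itself is missing (those rows are simply not used in the fit).  A sketch, with PRED, VarXhat, and IMPUTE2 as made-up names:

proc reg data=DATAII;
   model VarX = indvar1 indvar3;
   output out=PRED p=VarXhat;   /* predicted VarX for every scoreable row */
run;
quit;

data IMPUTE2;
   set PRED;
   if VarX = . then VarXIMP = VarXhat;
   else VarXIMP = VarX;
run;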




At 06:06 PM 12/2/2001 -0500, Donald Burrill wrote:
>On 1 Dec 2001, jenny wrote:
>
>> What should I do with the missing values in my data.  I need to
>> perform a t test of two samples to test the mean difference between
>> them.
>> How should I handle them in S-Plus or SAS?
>
>1.  What do S-Plus and/or SAS do with missing values by default? 
>    (All packages have defaults, and sometimes they're even sensible
>ones.  If your package(s) do what you want done, or at least do something
>you can live with, that's probably the most comfortable resolution of
>your question.)
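
A SAS note on point 1: for the two-sample t test, the default is choice (a) below -- PROC TTEST simply drops any observation whose analysis variable is missing.  A minimal sketch, where "group" is a made-up name for the two-level grouping variable:

proc ttest data=DATA;
   class group;   /* two-level grouping variable (made-up name) */
   var VarX;      /* observations with VarX = . are excluded    */
run;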
>
>2.  Why are there missing values?  And what do these reasons imply (if
>anything) about the values themselves?
> There are essentially two choices available:
>  (a) treat the values as missing, that is, discard each of the cases for
>which the variable in question is missing for the duration of the analysis
>of that variable, and retrieve those cases again when dealing with some
>other variable for which their value is not missing.  This is the default
>in MINITAB and SPSS, although for some analyses (in both packages) the
>missing cases are deleted listwise (in multiple regression, for example,
>if any of the variables in the model be missing, the whole case is
>deleted from the analysis) and for some the missing cases are deleted
>pairwise (in reporting a correlation matrix, for example, a case is
>deleted from the computation of a correlation coefficient if either of
>the two variables is missing, but is retained for other correlation
>coefficients for which both variables are non-missing in this case).
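
The same distinction exists in SAS: PROC CORR, for example, deletes pairwise by default, and its NOMISS option switches to listwise deletion.  A sketch:

proc corr data=DATA;           /* pairwise deletion (the default) */
   var VarX indvar1 indvar2;
run;

proc corr data=DATA nomiss;    /* listwise deletion */
   var VarX indvar1 indvar2;
run;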
>  (b) Impute some value to the missing variable for this case.  There are
>a great variety of imputation schemes, all of them (so far as I know)
>suffering from the logical defect that one must assume something about
>the missing value, and the assumption may not only be untrue, it may be
>wildly in error.  One approach is to substitute the mean of this variable
>for the missing value;  but if the _reason_ the value is missing implies
>that the actual value is likely to be extremely high or extremely low,
>this is evidently not a good strategy.  Another approach is to use some
>variant of multiple regression to predict the missing value from the
>existing values of other variables;  again, this assumes that the missing
>value would be close to the regression line (or surface), and if the
>_reason_ implies an extreme value or outlier, this is not particularly
>likely to yield a realistic value.
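
For the mechanics only (and note the caution further down about doing this before a t test): in SAS, mean substitution can be done with PROC STDIZE, whose REPONLY option replaces just the missing values with the variable's mean.  A sketch, with MEANSUB as a made-up data set name:

proc stdize data=DATA out=MEANSUB reponly method=mean;
   var VarX;
run;

Regression imputation is what the steps at the top of this message sketch out.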
>
>This is of course a simplified account (some might say oversimplified) of
>the problem of missing-ness, but may suggest some useful ideas. 
>
>Personally, I generally prefer to acknowledge that I don't know the value
>that's missing, and let the case be temporarily discarded, at least for a
>first run at an analysis (or series of analyses);  most of the time.
>  And if I chose to use a method of imputation, I'd usually want to
>report results both of analyses in which the missing data are honestly
>missing, and analyses in which imputed values are used, so that I (and
>my readers) could see the effect(s) of the imputation.
>
>And since you want to test for differences between means, you almost
>certainly should NOT substitute a _mean_ for any missing value.  If you
>substitute the overall mean, you will tend to diminish the real
>difference, if any, between the two sample means, and if there's a lot of
>missing data you could end up not finding differences where they would
>have been evident if you'd permitted the missing cases to be discarded.
>If you substitute the mean of this subgroup, you will not change the
>apparent difference between the means, but you WILL reduce the
>within-group (pooled or not) variance, so that you will have spuriously
>high sensitivity to differences between the means.
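
A tiny worked example of that last point: suppose one group has the values 2 and 4 plus one missing observation.  The group mean is 3 and the sample variance is ((2-3)^2 + (4-3)^2)/(2-1) = 2.  Substitute the group mean for the missing value and the data become 2, 3, 4: the mean is unchanged at 3, but the sample variance falls to ((2-3)^2 + 0 + (4-3)^2)/(3-1) = 1, while n increases -- exactly the spurious gain in sensitivity described above.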
>
>Whether there is an argument that would support any other method of
>imputation in your case, I cannot tell.  I'm inclined to doubt it, but
>that may be merely a reflection of my usual skepticism (or, perhaps,
>curmudgeonliness).
>
> ------------------------------------------------------------------------
> Donald F. Burrill                                 [EMAIL PROTECTED]
> 184 Nashua Road, Bedford, NH 03110                          603-471-7128
>
>
>
