It's even more complicated than that. Suppose your variable X is sometimes missing (let's say MCAR to keep things simple), but you have another variable Z that is fairly well correlated with X and is fully observed. If you are interested in the regression of a third variable Y on X (or the correlation of Y and X), your imputation of X based on Z (and Y) might be quite beneficial. However, if what you are interested in is the coefficient of X in the regression of Y on X and Z (or the partial correlation of X and Y given Z), your imputations won't do you much good. Your imputations of X are only telling you about the Y,Z relationship, which helps (since it mediates the X,Y association) but gives you no direct information about the partial association X,Y|Z.
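To make the contrast concrete, here is a rough simulation sketch in Python with NumPy. Everything in it is an illustrative assumption rather than anything from a real data set: the data-generating model, the sample size, the 50% MCAR rate, and the function names `ols`, `one_replicate`, and `sd_ratios`. It also uses simple stochastic regression imputation with the imputation-model parameters held at their estimates, which is a simplification of proper multiple imputation (no parameter draws). It compares the sampling standard deviation of the X coefficient under complete-case analysis and under imputation, once for the regression of Y on X and once for the regression of Y on X and Z.

```python
import numpy as np


def ols(y, *cols):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def one_replicate(rng, n=400, rho=0.9, p_miss=0.5, m=10):
    """Return (cc_marginal, cc_partial, imp_marginal, imp_partial) X slopes."""
    # Hypothetical data-generating model; all constants are assumptions.
    z = rng.standard_normal(n)
    x = rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    y = x + z + 2.0 * rng.standard_normal(n)
    miss = rng.random(n) < p_miss  # X missing completely at random
    obs = ~miss

    # Complete-case estimates of the X slope in Y ~ X and in Y ~ X + Z.
    cc_marg = ols(y[obs], x[obs])[1]
    cc_part = ols(y[obs], x[obs], z[obs])[1]

    # Imputation model: regress X on (Y, Z) using the complete cases.
    a = ols(x[obs], y[obs], z[obs])
    resid = x[obs] - (a[0] + a[1] * y[obs] + a[2] * z[obs])
    sd = np.sqrt(resid @ resid / (obs.sum() - 3))
    pred = a[0] + a[1] * y[miss] + a[2] * z[miss]

    # Stochastic regression imputation, repeated m times and averaged.
    marg, part = [], []
    for _ in range(m):
        xi = x.copy()
        xi[miss] = pred + sd * rng.standard_normal(miss.sum())
        marg.append(ols(y, xi)[1])
        part.append(ols(y, xi, z)[1])
    return cc_marg, cc_part, np.mean(marg), np.mean(part)


def sd_ratios(n_rep=500, seed=0):
    """SD(imputed) / SD(complete-case) for the marginal and partial X slopes."""
    rng = np.random.default_rng(seed)
    est = np.array([one_replicate(rng) for _ in range(n_rep)])
    sds = est.std(axis=0, ddof=1)
    return sds[2] / sds[0], sds[3] / sds[1]


if __name__ == "__main__":
    r_marg, r_part = sd_ratios()
    print(f"SD ratio, slope of Y on X:           {r_marg:.2f}")
    print(f"SD ratio, X coefficient in Y on X,Z: {r_part:.2f}")
```

A ratio well below 1 means imputation is buying precision relative to complete-case analysis. Under these assumptions I would expect the first ratio to be clearly below 1 and the second to sit close to 1, but the sizes depend heavily on the chosen correlation, residual variance, and missingness rate.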
Three points here: (1) To fully understand what imputation can contribute, you might need to understand how the observed-data inference works (as suggested above, although my explanation was only heuristic). (2) The contribution of an auxiliary variable to the value of imputation can be highly analysis-specific. (3) An auxiliary variable external to the analysis might add more value to the imputation than the variables already in the analysis.

A concrete example to fix the abstractions of the first paragraph: suppose you are interested in associations of family income (X) with educational outcomes (Y), but X is often missing. Then block-group median income (Z) might be an excellent auxiliary variable. However, if your objective is to assess the distinct contributions of family and neighborhood income to prediction of Y, imputing family income from neighborhood income won't do much good. You might do better imputing based on something completely different, like rent paid or the value of the family automobile.

________________________________
From: Impute -- Imputations in Data Analysis [[email protected]] On Behalf Of Hunsicker, Lawrence [[email protected]]
Sent: Tuesday, April 16, 2013 8:51 AM
To: [email protected]
Subject: FW: "Accessory" variables in imputation

. . .

There seem to me to be two ways to think about whether “the auxiliary variable helps a little or a lot.” If one’s metric is the accuracy of the imputation, it seems pretty likely to me that the accuracy of the imputation will be improved by including a strongly correlated auxiliary variable. One could check this by checking the correlation of the values predicted by the imputation model with the actually observed values. But the real issue is how much inclusion of the auxiliary variables in the imputation model improves the results of the actual final analysis.
This has to be a function of how strongly the covariate with missing values correlates with (predicts) the final outcome variable, and of the amount of missing data. The analysis that I posed in my original post is not a particularly good one for asking this question, as the amount of missing data for the current PRA is only about 3%, and the impact of current PRA on graft survival is not particularly strong. It was a convenient straw man to permit me to ask the question coherently.
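
The accuracy check mentioned above (correlating the imputation model's predictions with the actually observed values) could be sketched roughly as follows in Python with NumPy. The helper name `imputation_fit_correlation` and the synthetic variables are hypothetical, the imputation model is assumed to be a plain linear regression, and the correlation is computed in-sample on the observed cases. As the paragraph above notes, a high value here says nothing by itself about how much the final analysis improves.

```python
import numpy as np


def imputation_fit_correlation(x, predictors):
    """Correlation between linear-model predictions of x and the observed x.

    x          : 1-D float array, with np.nan marking missing values
    predictors : 2-D array (one row per case) of fully observed variables
    """
    obs = ~np.isnan(x)
    design = np.column_stack([np.ones(obs.sum()), predictors[obs]])
    coef, *_ = np.linalg.lstsq(design, x[obs], rcond=None)
    return np.corrcoef(design @ coef, x[obs])[0, 1]


# Hypothetical synthetic data: one strong and one weak auxiliary variable.
rng = np.random.default_rng(1)
n = 1000
z_strong = rng.standard_normal(n)
z_weak = rng.standard_normal(n)
x = 0.9 * z_strong + 0.1 * z_weak + np.sqrt(0.18) * rng.standard_normal(n)
x[rng.random(n) < 0.3] = np.nan  # about 30% of x set missing, MCAR

r_strong = imputation_fit_correlation(x, z_strong[:, None])
r_weak = imputation_fit_correlation(x, z_weak[:, None])
print(f"strong auxiliary: {r_strong:.2f}, weak auxiliary: {r_weak:.2f}")
```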
