> X-Original-To: [EMAIL PROTECTED]
> Date: Fri, 9 Apr 2004 11:21:47 -0700 (PDT)
> From: Thomas Lumley <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
>
> On Fri, 9 Apr 2004, Marek Ancukiewicz wrote:
>
> > Dear Thomas,
> >
> > The question becomes: how do we rank missing values?
>
> That's one of the questions. It's not the only question. Suppose x has
> no missing values but y has a missing value. Should the ranks for x be
> based on the whole vector or just on the values where y isn't missing?
>
> -thomas

I see what you mean. One could give an argument in favour of each of these approaches. If we treat the data primarily as pairs of values (or, more generally, cases), then we should discard incomplete pairs (records) first and rank afterwards. If we consider x and y primarily as separate from each other (especially with regard to how the missing values arise), then a more natural approach would be to rank before dropping incomplete pairs.

In the latter approach we use more information in the data; in the former approach we ignore information which might be spurious, especially when missing y values tend to coincide with high (or low) x values. Dropping NAs first and ranking later seems to be the conservative approach; with the other approach one should probably always check whether NAs in one variable are correlated with other variables.

My understanding is that cor() in 1.9.0 will do the ranking independently, before dropping missing pairs/cases. It would be good to have this documented in help(); it might also be good to add a warning on the perils of analysis with missing values when occurrences of NAs in one variable are correlated with other variables.

Marek

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
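
The difference between the two orderings discussed in this message can be seen on a small made-up example. The following is only a sketch in Python, not R, and it does not claim to reproduce what cor() does in any R version. Missing values are represented by None, the data are invented for illustration, and the helper names (average_ranks, complete_pairs, drop_then_rank, rank_then_drop) are hypothetical. Ranks are computed over the non-missing values of each vector, with ties given their average rank, and the rank correlation is taken as the Pearson correlation of the ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank the non-missing values (ties share the average rank); None stays None."""
    present = sorted((v, i) for i, v in enumerate(values) if v is not None)
    ranks = [None] * len(values)
    start = 0
    while start < len(present):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(present) and present[end + 1][0] == present[start][0]:
            end += 1
        avg = (start + end) / 2 + 1  # 1-based average rank of the tie group
        for _, i in present[start:end + 1]:
            ranks[i] = avg
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / sqrt(saa * sbb)


def complete_pairs(a, b):
    """Keep only the positions where both entries are present."""
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def drop_then_rank(x, y):
    """Discard incomplete pairs first, then rank what is left."""
    cx, cy = complete_pairs(x, y)
    return pearson(average_ranks(cx), average_ranks(cy))


def rank_then_drop(x, y):
    """Rank each vector on its own, then discard incomplete pairs."""
    rx, ry = complete_pairs(average_ranks(x), average_ranks(y))
    return pearson(rx, ry)


# Invented data: x is complete, y has one missing value.
x = [10, 20, 30, 40, 50]
y = [1.0, 3.0, None, 2.0, 4.0]

print(drop_then_rank(x, y))  # 0.8
print(rank_then_drop(x, y))  # about 0.707
```

In this example the y ranks are the same under both orderings, since y is ranked among its own non-missing values either way. Only the x ranks change: ranking first keeps the gap left by the dropped case (ranks 1, 2, 4, 5), while dropping first closes it (ranks 1, 2, 3, 4), and that alone is enough to change the coefficient.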