Answering my own question (for the sake of future archive users); more to come soon:
On Thursday, 23 April 2009, at 00:31 +0200, Emmanuel Charpentier wrote:

> Dear list,
>
> I'd like to use multiple imputation to try and save a somewhat badly
> mangled dataset (lousy data collection, worse than lousy monitoring, you
> know the drill... especially when I am consulted for the first time
> about one year *after* data collection).
>
> My dataset has 231 observations of 53 variables, of which only a very
> few have no missing data. Most variables have 5-10% missing data, but
> the whole dataset has only 15% complete cases (40% when dropping the 3
> worst cases, which might be regained by other means).

[ Big snip... ]

It turns out that my problems were caused by... the dataset. Two very important variables (i.e. with a strong influence on the outcomes and proxies) are ill-distributed:

- one is a modus operandi (two classes);
- the second is center (23 classes, alas...).

My data are quite unevenly spread: some centers have contributed a large number of observations, others very few. Furthermore, while a few variables are quite badly known, the "missingness pattern" is such that:

- some centers have no directly usable information (i.e. complete cases) under one of the modi operandi;
- some others have no complete cases at all...

Therefore, any model-based prediction method using the whole dataset (recommended for multiple imputation, since one should not use a richer set of data for inference than what was imputed; I have seen this statement in a lot of references) fails miserably.

Remembering some fascinating readings (including V&R) and an early excursion into AI twenty years ago (yes, I did that, and didn't even get the T-shirt...), I have attempted, with some success, to use recursive partitioning for prediction. This (non-)model has some very interesting advantages in my case:

- model-free;
- distribution-free (quite important here: you should see my density curves... and I won't even mention the outliers!);
- handles missing data gracefully (almost automagically);
- automatic selection and ranking of the pertinent variables;
- the current implementation in R has some very nice features, such as surrogate splits when a value is missing, auto-pruning by cross-validation, etc.

It also has some drawbacks:

- no (easy) method for inference;
- not easy to summarize (you can't just publish an ANOVA table and a couple of p-values...);
- not "well-established" (i.e. acknowledged by journal reviewers) => difficult to publish.

These latter points do not bother me in my case. So I attempted to use this for imputation.

Since mice is based on a "chained equations" approach and allows end-users to write their own imputation functions, I wrote a set of such imputers to be called within the framework of the Gibbs sampler. They proceed as follows:

- build a regression or classification tree for the relevant variable, using the rest of the dataset;
- predict the relevant variable for *all* of the dataset;
- compute "residuals" from the known values of the relevant variable and their predictions;
- impute each missing value as prediction + a random residual.

It works. It is a tad slower than prediction using normal/logistic/multinomial modelling (about a factor of 3, but for my first trial I attempted to err on the side of excessive precision ==> deeper trees). It does not exhibit any "obvious" statistical misfeatures. But I have questions:

1) What is known about such imputation by regression/classification trees (aka recursive partitioning)? A quick search didn't turn up very much: the idea has been evoked here and there, but I am not aware of any published solution. In particular, I have no knowledge of any theoretical (i.e. probability) work on its properties.

2) Where could I find published datasets that have been used to validate other imputation methods?

3) Do you think that these functions should be published?

Sincerely yours,

Emmanuel Charpentier

PS:

> Could someone offer an explanation ?
> Or must I resort to sacrificing a black goat at midnight next new moon ?
>
> Any hint will be appreciated (and by the way, next new moon is in about
> 3 days...).

The goat has endured the fright of her life, but is still alive... she will probably start worshipping the Moon, however.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
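P.P.S., for the archive: the four imputation steps described above can be sketched as a user-supplied mice method. This is a rough illustration, not my actual code; mice looks up any function named `mice.impute.<method>(y, ry, x, ...)` (where `y` is the target with its NAs, `ry` flags the observed entries, and `x` holds the predictors) and expects one imputed value per missing entry. The method name `rpart.resid` is purely illustrative.

```r
## Hedged sketch of a tree-based imputer for mice (illustrative only).
library(rpart)

mice.impute.rpart.resid <- function(y, ry, x, ...) {
  dat <- cbind.data.frame(y = y, x)
  ## 1) grow a regression tree on the observed cases only
  fit <- rpart(y ~ ., data = dat[ry, , drop = FALSE])
  ## 2) predict y for *all* rows (rpart's surrogate splits cope with
  ##    missing predictor values at prediction time)
  pred <- predict(fit, newdata = dat)
  ## 3) "residuals" on the observed part
  res <- y[ry] - pred[ry]
  ## 4) imputation = prediction + a randomly drawn observed residual
  pred[!ry] + sample(res, size = sum(!ry), replace = TRUE)
}
```

One would then select it per column via `mice(data, method = ...)`; a classification analogue would draw from the tree's predicted class probabilities instead of adding residuals.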