Emmanuel,

Friedman's (Annals of Statistics 1991) MARS program implements recursive partitioning in a regression context - a version of it written by Trevor Hastie was available in R, but I don't know what package it's now in - I only have base stuff available (long story).

MARS, like recursive partitioning, is a data exploration tool that builds up an approximation to a nonlinear regression function using piecewise regression splines. Each spline is split and replaced by a pair, and the GCV (generalised cross-validation) score is computed - if the split reduces the GCV then the split is accepted - in this way the method is adaptive. MARS is very powerful and was used for time series research by Lewis & Bonnie Ray (JASA 1991) - Bonnie has later papers as well.

The main flaw with MARS, and I suppose a key reason why it doesn't feature more, is that there is no underlying physical/biological model that the researcher is trying to make sense of - MARS just finds the best curve. Interpretation of the result can therefore be a problem. However, MARS does provide an "anova" type decomposition of the curve and this can certainly help in making sense of the underlying relationships.

Whether to use it (or related methods such as Generalised Additive Models, GAMs) for imputation is then a question of taste. If you're happy that the regression curve is sufficient explanation then MARS is worth looking at - if you want to know more about the physical model, well ...

Finally, MARS will treat all missing data as missing at random, so if there are specific conditional effects these have to be included as categorical predictors a priori. As MARS is based on least squares it's only optimal for Gaussian errors - it can be used on categorical data as well - another variation called PolyMARS also implements MARS for categorical/multinomial data.

Hope this is of interest!

Gerard


From: Emmanuel Charpentier <charp...@bacbuc.dyndns.org>
Sent by: r-help-boun...@r-project.org
Date: 27/04/2009 20:49
To: r-h...@stat.math.ethz.ch
Subject: Re: [R] Multiple imputations : wicked dataset. Need advice for follow-up to a possible solution.

Answering myself (for future archive users' sake), more to come (soon) :

On Thursday 23 April 2009 at 00:31 +0200, Emmanuel Charpentier wrote:
> Dear list,
>
> I'd like to use multiple imputations to try and save a somewhat badly
> mangled dataset (lousy data collection, worse than lousy monitoring, you
> know that drill... especially when I am consulted for the first time
> about one year *after* data collection).
>
> My dataset has 231 observations of 53 variables, of which only a very
> few have no missing data. Most variables have 5-10% of missing data, but
> the whole dataset has only 15% complete cases (40% when dropping the 3
> worst cases, which might be regained by other means).

[ Big snip ... ]

It turns out that my problems were caused by ... the dataset.

Two very important variables (i. e. of strong influence on the outcomes and proxies) are ill-distributed :
- one is a modus operandi (two classes)
- the second is center (23 classes, alas...)

My data are quite ill-distributed : some centers have contributed a large number of observations, some others very few. Furthermore, while a few variables are quite badly known, the "missingness pattern" is such that :
- some centers have no directly usable information (= complete cases) under one of the modi operandi
- some others have no complete case at all...

Therefore, any model-based prediction method using the whole dataset (recommended for multiple imputations, since one should not use for inference a richer set of data than what was imputed (seen this statement in a lot of references)) fails miserably.

Remembering some fascinating readings (incl. V&R) and an early (20 years ago) excursion in AI (yes, did that, didn't even get the T-shirt...), I have attempted (with some success) to use recursive partitioning for prediction. This (non-)model has some very interesting advantages in my case :
- model-free
- distribution-free (quite important here : you should see my density curves... and I won't speak about the outliers !)
- handles missing data gracefully (almost automagically)
- automatic selection and ranking of the pertinent variables
- current implementation in R has some very nice features, such as surrogate splits if a value is missing, auto-pruning by cross-validation, etc ...

It also has some drawbacks :
- no (easy) method for inference
- not easy to abstract (you can't just publish an ANOVA table and a couple of p-values...)
- not "well-established" (i. e. acknowledged by journal reviewers) => difficult to publish

These latter points do not bother me in my case. So I attempted to use this for imputation.

Since mice is based on a "chained equations" approach and allows end-users to write their own imputation functions, I wrote a set of such imputers to be called within the framework of the Gibbs sampler. They proceed as follows :
- build a regression or classification tree of the relevant variable using the rest of the dataset
- predict the relevant variable for *all* the dataset,
- compute "residuals" from known values of the relevant variable and their prediction
- impute values to missing data as prediction + a random residual.

It works. It's a tad slower than prediction using normal/logistic/multinomial modelling (about a factor of 3, but for my first trial, I attempted to err on the side of excessive precision ==> deeper trees). It does not exhibit any "obvious" statistical misfeatures.

But I have questions :

1) What is known of such imputation by regression/classification trees (aka recursive partitioning) ? A quick search didn't turn up very much : the idea has been evoked here and there, but I am not aware of any published solution. In particular, I have no knowledge of any theoretical (i. e. probability) work on their properties.

2) Where could I find published datasets having been used to validate other imputation methods ?

3) Do you think that these functions should be published ?

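The four imputation steps listed above (fit a tree, predict for every row, take residuals on the observed rows, add a randomly drawn residual to each prediction) can be sketched roughly as follows. This is only an illustration in plain Python, not the R/mice imputers described in the message, which are not shown in this thread. It covers a numeric target only, assumes the predictors themselves are complete (so no surrogate splits), and uses a small hand-rolled regression tree without pruning. The names grow_tree, predict, impute_by_tree and the min_leaf/max_depth parameters are all hypothetical.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, targets, min_leaf=5, max_depth=4, depth=0):
    """Grow a small regression tree by recursive partitioning.

    rows: list of numeric feature lists; targets: list of floats.
    Returns a nested dict; leaves hold only the mean of their targets.
    """
    node = {"value": mean(targets)}
    if depth >= max_depth or len(targets) < 2 * min_leaf:
        return node
    best = None  # (score, feature index, threshold)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[j] <= t]
            right = [y for r, y in zip(rows, targets) if r[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    # Stop when no admissible split reduces the squared error.
    if best is None or best[0] >= sse(targets):
        return node
    _, j, t = best
    left_idx = [i for i, r in enumerate(rows) if r[j] <= t]
    right_idx = [i for i, r in enumerate(rows) if r[j] > t]
    node["feature"], node["threshold"] = j, t
    node["left"] = grow_tree([rows[i] for i in left_idx],
                             [targets[i] for i in left_idx],
                             min_leaf, max_depth, depth + 1)
    node["right"] = grow_tree([rows[i] for i in right_idx],
                              [targets[i] for i in right_idx],
                              min_leaf, max_depth, depth + 1)
    return node


def predict(node, row):
    """Walk down the tree until a leaf is reached; return its mean."""
    while "feature" in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]


def impute_by_tree(x, y, rng, min_leaf=5, max_depth=4):
    """Fill the None entries of y using a tree fitted on the observed rows.

    x: complete predictor rows (assumption: no missing predictors);
    y: list of floats with None marking missing values;
    rng: a random.Random instance used to draw residuals.
    """
    observed = [i for i, v in enumerate(y) if v is not None]
    # Step 1: build the tree from the rows where y is known.
    tree = grow_tree([x[i] for i in observed], [y[i] for i in observed],
                     min_leaf, max_depth)
    # Step 2: predict for *all* rows.
    preds = [predict(tree, row) for row in x]
    # Step 3: residuals on the rows where y is known.
    residuals = [y[i] - preds[i] for i in observed]
    # Step 4: missing value = prediction + a randomly drawn residual.
    return [v if v is not None else preds[i] + rng.choice(residuals)
            for i, v in enumerate(y)]


if __name__ == "__main__":
    rng = random.Random(1)
    x = [[float(i)] for i in range(40)]
    y = [(0.0 if i < 20 else 10.0) + rng.gauss(0, 1) for i in range(40)]
    y[5] = y[30] = None
    completed = impute_by_tree(x, y, rng)
    print(completed[5], completed[30])
```

Calling impute_by_tree several times, each with fresh random draws, would give the several completed datasets that multiple imputation needs. Within mice the same logic would instead be supplied as a user-written imputation function, as the message describes.
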
Sincerely yours,

Emmanuel Charpentier

PS :

> Could someone offer an explanation ? Or must I resort to sacrificing a
> black goat at midnight next new moon ?
>
> Any hint will be appreciated (and by the way, next new moon is in about
> 3 days...).

The goat has endured the fright of her life, but is still alive... She will probably start worshipping the Moon, however.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

**********************************************************************************
The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.

It is the policy of the Department of Justice, Equality and Law Reform and the Agencies and Offices using its IT services to disallow the sending of offensive material. Should you consider that the material contained in this message is offensive you should contact the sender immediately and also mailminder[at]justice.ie.
**********************************************************************************