Ramon,

> From: Ramon Diaz-Uriarte [mailto:[EMAIL PROTECTED]
>
> Dear All,
>
> I have been using the randomForest package for a couple of difficult
> prediction problems (which also share p >> n). The performance is good,
> but since all the variables in the data set are used, interpretation of
> what is going on is not easy, even after looking at variable importance
> as produced by the randomForest run.
>
> I have tried a simple "variable selection" scheme, and it does seem to
> perform well (as judged by leave-one-out) but I am not sure if it makes
> any sense. The idea is, in a kind of backwards elimination, to eliminate
> one by one the variables with smallest importance (or all the ones with
> negative importance in one go) until the out-of-bag estimate of
> classification error becomes larger than that of the previous model (or
> of the initial model). So nothing really new. But I haven't been able to
> find any comments in the literature about "simplification" of random
> forests.

This is quite a hazardous game. We've been burned by this ourselves. I'll send you, off-line, a paper we submitted on variable selection for random forest. (Those who are interested, let me know.)

The basic problem is that when you select important variables by RF and then re-run RF with those variables, the OOB error rate becomes biased downward. As you iterate more times, the "overfitting" becomes more and more severe (in the sense that the OOB error rate keeps decreasing while the error rate on an independent test set stays flat or increases). I was naïve enough to ask Breiman about this, and his reply was something like "any competent statistician would know that you need something like cross-validation to do that"...

In the upcoming version 5 of Breiman's Fortran code, he offers an option to run RF twice: the first time with all variables, and the second with the k (selected by user) most important variables from the first run. The OOB error rate from the second run is no longer unbiased, but the bias is probably not too severe with only one iteration.

Best,
Andy

> Any suggestions/comments?
>
> Best,
>
> Ramón
>
> --
> Ramón Díaz-Uriarte
> Bioinformatics Unit
> Centro Nacional de Investigaciones Oncológicas (CNIO)
> (Spanish National Cancer Center)
> Melchor Fernández Almagro, 3
> 28029 Madrid (Spain)
> Fax: +-34-91-224-6972
> Phone: +-34-91-224-6900
> http://bioinfo.cnio.es/~rdiaz

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
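
For anyone who wants to see the selection bias for themselves, here is a rough sketch in plain Python (standard library only). It is not the randomForest package and not the method from the paper mentioned above. To keep it self-contained it makes several assumptions: a nearest-centroid classifier stands in for the forest, the absolute difference of class means stands in for RF variable importance, leave-one-out stands in for the OOB estimate, and the data are pure noise with p >> n. All function names are made up for illustration. The only point is the contrast between selecting variables once on all the data and redoing the selection inside each held-out fold, which is the "something like cross-validation" referred to above.

```python
import random


def make_noise_data(n, p, rng):
    """n samples, p pure-noise features; labels alternate and carry no signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


def class_means(X, y, rows, cols):
    """Per-class mean of each column in cols, using only the given rows."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members) for j in cols]
    return means


def top_k_features(X, y, rows, k):
    """Rank features by |mean difference| (a stand-in for RF importance)."""
    cols = list(range(len(X[0])))
    m = class_means(X, y, rows, cols)
    score = [abs(m[0][j] - m[1][j]) for j in cols]
    return sorted(cols, key=lambda j: score[j], reverse=True)[:k]


def predict(X, y, train_rows, cols, i):
    """Nearest-centroid prediction for sample i (a stand-in for the forest)."""
    m = class_means(X, y, train_rows, cols)
    dist = {c: sum((X[i][j] - mu) ** 2 for j, mu in zip(cols, m[c])) for c in (0, 1)}
    return min(dist, key=dist.get)


def loo_error(X, y, k, select_inside):
    """Leave-one-out error with the k top-ranked features.

    select_inside=False: features are chosen once, using every label
    (the hazardous version). select_inside=True: features are re-chosen
    in each fold without the held-out sample.
    """
    all_rows = list(range(len(X)))
    fixed = top_k_features(X, y, all_rows, k)
    errors = 0
    for i in all_rows:
        train = [r for r in all_rows if r != i]
        cols = top_k_features(X, y, train, k) if select_inside else fixed
        errors += predict(X, y, train, cols, i) != y[i]
    return errors / len(X)


rng = random.Random(0)
X, y = make_noise_data(n=20, p=500, rng=rng)
biased = loo_error(X, y, k=10, select_inside=False)
honest = loo_error(X, y, k=10, select_inside=True)
print("selection outside the loop:", biased)
print("selection inside the loop: ", honest)
```

Since the labels are unrelated to the features, no honest estimate should do much better than chance. One would expect the first number to look far too good and the second to sit near 50%, though the exact values depend on the seed and on n, p and k.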
