[R] Error on random forest variable importance estimates

Pierre Dubath Fri, 06 Aug 2010 07:35:14 -0700

Hello,

I am using the R randomForest package to classify variable stars. I have a training set of 1755 stars described by (too) many variables. Some of these variables are highly correlated.

I believe that I understand how randomForest works and how the variable importances are evaluated (through variable permutations). Here are my questions.

1) Variable importance error? Is there any way to estimate the error on the "MeanDecreaseAccuracy"? In other words, I would like to know how significant the differences in "MeanDecreaseAccuracy" are (and to display horizontal error bars in the varImpPlot output).

I have noticed that even with a relatively large number of trees, the importance values vary from one run to the next. Could this variation serve as a measure of the errors/uncertainties?
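To make the run-to-run idea concrete, here is a rough sketch of what I have in mind. It uses the built-in iris data as a stand-in for my star catalogue, and the number of repeated runs (n_runs) and the seeds are arbitrary choices. Note that the spread obtained this way only reflects the randomness of the forest on a fixed training set, not the uncertainty due to the training sample itself.

```r
library(randomForest)

# Stand-in data (assumption): replace x and y with the real training set.
x <- iris[, 1:4]
y <- iris$Species

n_runs <- 20  # number of repeated forests (arbitrary choice)

# One column of unscaled MeanDecreaseAccuracy values per run.
imp <- sapply(seq_len(n_runs), function(i) {
  set.seed(i)
  rf <- randomForest(x, y, ntree = 500, importance = TRUE)
  importance(rf, type = 1, scale = FALSE)[, 1]
})

# imp has variables in rows and runs in columns.
imp_mean <- rowMeans(imp)
imp_sd   <- apply(imp, 1, sd)

# Dot chart with horizontal bars of +/- one standard deviation between runs.
ord <- order(imp_mean)
dotchart(imp_mean[ord],
         xlim = range(c(imp_mean - imp_sd, imp_mean + imp_sd)),
         xlab = "MeanDecreaseAccuracy (unscaled)")
segments(imp_mean[ord] - imp_sd[ord], seq_along(ord),
         imp_mean[ord] + imp_sd[ord], seq_along(ord))
```
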

2) How to deal with variable correlation? So far, I am iterating: selecting the most important variable first, removing all other variables that have a high correlation with it (say higher than 80%), taking the second most important variable left, removing variables that are highly correlated with either of the first two variables, and so on... (also using some astronomical insight as to which variables are the most important!)

Is there a better way to deal with correlation in randomForest? (I suppose that using many correlated variables should not be a problem for randomForest, but it is for my understanding of the data and for other algorithms.)
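The iterative selection described under 2) amounts to roughly the following base-R sketch. The function name greedy_uncorrelated, the toy data and the importance values are made up for illustration only; the 0.8 cutoff is the 80% threshold mentioned above, applied here to the absolute Pearson correlation (an assumption).

```r
# Greedy filter: go through the variables from most to least important and
# keep one only if its absolute correlation with every variable already kept
# does not exceed the cutoff.
greedy_uncorrelated <- function(x, importance, cutoff = 0.8) {
  ranked <- names(sort(importance, decreasing = TRUE))
  cm <- abs(cor(x[, ranked, drop = FALSE]))
  kept <- character(0)
  for (v in ranked) {
    if (length(kept) == 0 || all(cm[v, kept] <= cutoff)) {
      kept <- c(kept, v)
    }
  }
  kept
}

# Toy example (hypothetical data): b is almost a copy of a, c is independent.
set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))
toy_imp <- c(a = 0.30, b = 0.25, c = 0.10)  # made-up importance values

selected <- greedy_uncorrelated(toy, toy_imp)
print(selected)
```

In the toy example b should be dropped because it nearly duplicates the more important a.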

3) How many variables should eventually be used? I have made successive runs, adding one variable at a time from the most to the least important (not-too-correlated) variables. I then plot the error rate (err.rate) as a function of the number of variables used. As this number increases, the error first decreases sharply, but relatively soon it reaches a plateau. I assume that the point of inflexion can be used to derive the minimum number of variables to be used. Is that a sensible approach? Are there any other suggestions? A measure of the error on "err.rate" would also really help here. Is there any idea how to estimate this? From the variation between runs, or with the help of "importanceSD" somehow?
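As an illustration of the "variation between runs" option for err.rate, a sketch could look like this. Again iris stands in for my data, the variable ranking is simply assumed rather than computed, and n_runs and the seeds are arbitrary. The OOB error of each forest is read from the last row of its err.rate matrix.

```r
library(randomForest)

# Stand-in data (assumption): replace with the real training set.
x <- iris[, 1:4]
y <- iris$Species

# Assumed ranking from most to least important (hypothetical, for illustration).
ranked <- c("Petal.Width", "Petal.Length", "Sepal.Length", "Sepal.Width")

n_runs <- 10  # repeated forests per number of variables (arbitrary choice)

# Final OOB error rate for each run (rows) and each number of variables (columns).
oob <- sapply(seq_along(ranked), function(k) {
  sapply(seq_len(n_runs), function(i) {
    set.seed(100 * k + i)
    rf <- randomForest(x[, ranked[1:k], drop = FALSE], y, ntree = 500)
    rf$err.rate[rf$ntree, "OOB"]
  })
})

oob_mean <- colMeans(oob)
oob_sd   <- apply(oob, 2, sd)

# Error rate against number of variables, with +/- one standard deviation bars.
k_vals <- seq_along(ranked)
plot(k_vals, oob_mean, type = "b",
     ylim = range(c(oob_mean - oob_sd, oob_mean + oob_sd)),
     xlab = "Number of variables used", ylab = "OOB error rate")
segments(k_vals, oob_mean - oob_sd, k_vals, oob_mean + oob_sd)
```
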


Thanks very much in advance for any help.

Pierre Dubath

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.