Re: [R] Error on random forest variable importance estimates
Hello Andy, Thank you for your quick and helpful reply. I will try to follow your suggestions. Also, thank you for the R implementation of random forest. It is very useful for our work. Best, Pierre Liaw, Andy wrote: From: Pierre Dubath Hello, I am using the R randomForest package to classify variable stars. I have a training set of 1755 stars described by (too) many variables. Some of these variables are highly correlated. I believe that I understand how randomForest works and how the variable importance are evaluated (through variable permutations). Here are my questions. 1) variable importance error? Is there any ways to estimate the error on the MeanDecreaseAccuracy? In other words, I would like to know how significant are MeanDecreaseAccuracy differences (and display horizontal error bars in the VarImpPlot output). If you really want to do it, one possibility is to do permutation test: Permute your response, say, 1000 or 2000 times, run RF on each of these permuted response, and use the importance measures as samples from the null distribution. I have notice that even with relatively large number of trees, I have variation in the importance values from one run to the next. Could this serve as a measure of the errors/uncertainties? Yes. 2) how to deal with variable correlation? so far, I am iterating, selecting the most important variable first, removing all other variable that have a high correlation (say higher than 80%), taking the second most important variable left, removing variables with high-correlation with any of the first two variables, and so on... (also using some astronomical insight as to which variables are the most important!) Is there a better way to deal with correlation in randomForest? (I suppose that using many correlated variables should not be a problem for randomForest, but it is for my understanding of the data and for other algorithms). That depends a lot on what you're trying to do. 
RF can tolerate problematic data, but that doesn't mean it will magically give you good answers. Trying to draw conclusions about effects when there are highly correlated (and worse, important) variables is a tricky business. 3) How many variables should eventually be used? I have made successive runs, adding one variable at a time from the most to the least important (not-too-correlated) variables. I then plot the error rate (err.rate) as a function of the number of variable used. As this number increase, the error first sharply decrease, but relatively soon it reaches a plateau . I assume that the point of inflexion can be use to derive the minimum number of variable to be used. Is that a sensible approach? Is there any other suggestion? A measure of the error on err.rate would also here really help. Is there any idea how to estimate this? From the variation between runs or with the help of importanceSD somehow? One approach is described in the following paper (in the Proceedings of MCS 2004): http://www.springerlink.com/content/9n61mquugf9tungl/ Best, Andy Thanks very much in advance for any help. Pierre Dubath __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Notice: This e-mail message, together with any attach...{{dropped:13}} __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
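Pierre's idea of treating the run-to-run variation of the importance values as a rough uncertainty estimate (which Andy confirms above) can be sketched as follows. This is only an illustrative sketch: the iris data stand in for the star catalogue, and the number of repeats and trees are arbitrary choices.

```r
library(randomForest)

## Illustrative data; replace x (predictors) and y (classes) with your own.
data(iris)
x <- iris[, 1:4]
y <- iris$Species

## Fit the same forest several times and look at the spread of the
## permutation importance (MeanDecreaseAccuracy) across runs.
imp.runs <- replicate(25, {
  rf <- randomForest(x, y, ntree = 500, importance = TRUE)
  importance(rf, type = 1)[, 1]
})

## Mean and standard deviation of the importance of each variable;
## the sd column could feed horizontal error bars on a dot chart.
cbind(mean = rowMeans(imp.runs), sd = apply(imp.runs, 1, sd))
```

Note this only captures the Monte Carlo variability of the forest-building itself, not the sampling variability of the training set.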
[R] Error on random forest variable importance estimates
Hello,

I am using the R randomForest package to classify variable stars. I have a training set of 1755 stars described by (too) many variables. Some of these variables are highly correlated. I believe that I understand how randomForest works and how the variable importances are evaluated (through variable permutations). Here are my questions.

1) Variable importance error? Is there any way to estimate the error on the MeanDecreaseAccuracy? In other words, I would like to know how significant the MeanDecreaseAccuracy differences are (and display horizontal error bars in the varImpPlot output). I have noticed that even with a relatively large number of trees, the importance values vary from one run to the next. Could this serve as a measure of the errors/uncertainties?

2) How to deal with variable correlation? So far I am iterating: selecting the most important variable first, removing all other variables that are highly correlated with it (say, above 80%), taking the second most important variable left, removing variables highly correlated with either of the first two, and so on (also using some astronomical insight as to which variables are the most important!). Is there a better way to deal with correlation in randomForest? (I suppose that using many correlated variables should not be a problem for randomForest, but it is for my understanding of the data and for other algorithms.)

3) How many variables should eventually be used? I have made successive runs, adding one variable at a time from the most to the least important (not-too-correlated) variables. I then plot the error rate (err.rate) as a function of the number of variables used. As this number increases, the error first decreases sharply, but it soon reaches a plateau. I assume the inflection point can be used to derive the minimum number of variables to use. Is that a sensible approach? Is there any other suggestion? A measure of the error on err.rate would also really help here. Is there any idea how to estimate this? From the variation between runs, or with the help of importanceSD somehow?

Thanks very much in advance for any help.

Pierre Dubath
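The iterative correlation-filtering scheme described in question 2 (keep the most important variable, drop everything highly correlated with it, repeat) can be sketched in a few lines. The function name, the 0.8 threshold, and the inputs are illustrative: `x` is a numeric predictor data frame and `imp` a named importance vector.

```r
## Greedy filter: walk through the variables from most to least important,
## keeping a variable only if its absolute correlation with every
## already-kept variable stays below the threshold.
select.uncorrelated <- function(x, imp, threshold = 0.8) {
  ord <- names(sort(imp, decreasing = TRUE))  # variables by importance
  cors <- abs(cor(x))                          # needs numeric columns
  kept <- character(0)
  for (v in ord) {
    if (all(cors[v, kept] < threshold)) kept <- c(kept, v)
  }
  kept
}
```

Here `imp` would typically be something like `importance(rf, type = 1)[, 1]`, with names matching `colnames(x)`.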
Re: [R] Error on random forest variable importance estimates
From: Pierre Dubath
> Hello, I am using the R randomForest package to classify variable stars. I have a training set of 1755 stars described by (too) many variables. Some of these variables are highly correlated. I believe that I understand how randomForest works and how the variable importances are evaluated (through variable permutations). Here are my questions.
>
> 1) Variable importance error? Is there any way to estimate the error on the MeanDecreaseAccuracy? In other words, I would like to know how significant the MeanDecreaseAccuracy differences are (and display horizontal error bars in the varImpPlot output).

If you really want to do it, one possibility is a permutation test: permute your response, say, 1000 or 2000 times, run RF on each of these permuted responses, and use the importance measures as samples from the null distribution.

> I have noticed that even with a relatively large number of trees, the importance values vary from one run to the next. Could this serve as a measure of the errors/uncertainties?

Yes.

> 2) How to deal with variable correlation? So far I am iterating: selecting the most important variable first, removing all other variables that are highly correlated with it (say, above 80%), taking the second most important variable left, removing variables highly correlated with either of the first two, and so on (also using some astronomical insight as to which variables are the most important!). Is there a better way to deal with correlation in randomForest? (I suppose that using many correlated variables should not be a problem for randomForest, but it is for my understanding of the data and for other algorithms.)

That depends a lot on what you're trying to do. RF can tolerate problematic data, but that doesn't mean it will magically give you good answers. Trying to draw conclusions about effects when there are highly correlated (and worse, important) variables is a tricky business.

> 3) How many variables should eventually be used? I have made successive runs, adding one variable at a time from the most to the least important (not-too-correlated) variables. I then plot the error rate (err.rate) as a function of the number of variables used. As this number increases, the error first decreases sharply, but it soon reaches a plateau. I assume the inflection point can be used to derive the minimum number of variables to use. Is that a sensible approach? Is there any other suggestion? A measure of the error on err.rate would also really help here. Is there any idea how to estimate this? From the variation between runs, or with the help of importanceSD somehow?

One approach is described in the following paper (in the Proceedings of MCS 2004): http://www.springerlink.com/content/9n61mquugf9tungl/

Best,
Andy

> Thanks very much in advance for any help.
> Pierre Dubath
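Andy's permutation-test recipe (permute the response many times, refit, and treat the resulting importances as draws from the null distribution) might look like the sketch below. The data objects are illustrative, and the number of permutations is reduced here for speed; Andy suggests 1000-2000 in practice.

```r
library(randomForest)

## Illustrative data; replace with your own predictors and classes.
data(iris)
x <- iris[, 1:4]
y <- iris$Species

## Null distribution: break the x-y link by permuting the response,
## then record the permutation importance of each variable.
n.perm <- 100
null.imp <- replicate(n.perm, {
  rf <- randomForest(x, sample(y), ntree = 200, importance = TRUE)
  importance(rf, type = 1)[, 1]
})

## Observed importance from the unpermuted data:
obs <- importance(randomForest(x, y, ntree = 500, importance = TRUE),
                  type = 1)[, 1]

## One-sided p-value per variable: the fraction of null draws that
## reach or exceed the observed importance.
p.val <- rowMeans(null.imp >= obs)
```

With ~1755 cases and many variables this is computationally heavy, so reducing ntree for the null fits (as above) is a pragmatic compromise.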
Re: [R] error in random forest
I've had the same problem and solved it by removing the cases with the new levels; they need to be handled some other way, either by building a new model or by reassigning the factor level to one in the training set.

Nagu wrote:
> Hi, I get the following error when I try to predict the probabilities of a test sample:
>
> Error in predict.randomForest(fit.EBA.OM.rf.50, x.OM, type = prob) :
>   New factor levels not present in the training data
>
> I have about 630 predictor variables in the dataset x.OM (25 factor variables and the remaining are continuous variables). Any ideas on how to trace it?
>
> Thank you,
> Nagu
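The workaround described above (drop the test cases whose factor values never occur in the training data) can be sketched as follows; `train` and `test` are hypothetical data frames with the same columns, and the function name is made up for illustration.

```r
## Keep only the test rows whose factor values all appear among the
## training-set levels, then drop the now-unused levels.
drop.new.levels <- function(train, test) {
  ok <- rep(TRUE, nrow(test))
  for (v in names(test)) {
    if (is.factor(test[[v]])) {
      ok <- ok & (as.character(test[[v]]) %in% levels(train[[v]]))
    }
  }
  droplevels(test[ok, , drop = FALSE])
}
```

The rows removed this way still need predictions by some other route, e.g. a separate model or a deliberate recoding of the offending levels.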
[R] error in random forest
Hi,

I get the following error when I try to predict the probabilities of a test sample:

Error in predict.randomForest(fit.EBA.OM.rf.50, x.OM, type = prob) :
  New factor levels not present in the training data

I have about 630 predictor variables in the dataset x.OM (25 factor variables and the remaining are continuous variables). Any ideas on how to trace it?

Thank you,
Nagu
Re: [R] error in random forest
The error message is pretty clear, really. To spell it out a bit more, what you have done is as follows.

Your training set has factor variables in it. Suppose one of them is f. In the training set it has, say, 5 levels. Your test set also has a factor f, as it must, but it appears that in the test set f has 6 levels, or more, or levels that do not agree with those for f in the training set. This mismatch means that the predict method for randomForest cannot use this test set.

What you have to do is make sure that the factor levels agree for every factor in both the test and training sets. One way to do this is to put the test and training sets together with rbind(...), say, and then separate them again. But even this will still leave you with a problem, because your training set will have some factor levels empty that are not empty in the test set, and the resulting error will most likely be more subtle.

You really need to sort this out yourself. It is not particularly an R problem, but a confusion over data. To be useful, your training set needs to cover the field for all levels of every factor. Think about it.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nagu
Sent: Saturday, 8 March 2008 5:37 AM
To: r-help@r-project.org; [EMAIL PROTECTED]
Subject: [R] error in random forest

Hi, I get the following error when I try to predict the probabilities of a test sample:

Error in predict.randomForest(fit.EBA.OM.rf.50, x.OM, type = prob) :
  New factor levels not present in the training data

I have about 630 predictor variables in the dataset x.OM (25 factor variables and the remaining are continuous variables). Any ideas on how to trace it?

Thank you, Nagu
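The rbind() trick mentioned in the reply above (combine the two sets so R computes the union of the factor levels, then split them apart again) can be sketched like this; `train` and `test` are hypothetical data frames with identical columns, and the function name is illustrative.

```r
## Stack the two sets so every factor carries the union of the levels,
## then split them apart again: both halves now share identical level
## sets, which is what predict.randomForest requires.
align.levels <- function(train, test) {
  combined <- rbind(train, test)
  n <- nrow(train)
  list(train = combined[seq_len(n), , drop = FALSE],
       test  = combined[-seq_len(n), , drop = FALSE])
}
```

As the reply warns, this only fixes the bookkeeping: levels that occur solely in the test set remain empty in the training data, so the model has learned nothing about them.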
Re: [R] error in random forest
Thank you very much. I'll jump into the data and verify the consistency between the training and testing variables and their levels.

On Fri, Mar 7, 2008 at 5:14 PM, [EMAIL PROTECTED] wrote:
> The error message is pretty clear, really. To spell it out a bit more, what you have done is as follows.
>
> Your training set has factor variables in it. Suppose one of them is f. In the training set it has, say, 5 levels. Your test set also has a factor f, as it must, but it appears that in the test set f has 6 levels, or more, or levels that do not agree with those for f in the training set. This mismatch means that the predict method for randomForest cannot use this test set.
>
> What you have to do is make sure that the factor levels agree for every factor in both the test and training sets. One way to do this is to put the test and training sets together with rbind(...), say, and then separate them again. But even this will still leave you with a problem, because your training set will have some factor levels empty that are not empty in the test set, and the resulting error will most likely be more subtle.
>
> You really need to sort this out yourself. It is not particularly an R problem, but a confusion over data. To be useful, your training set needs to cover the field for all levels of every factor. Think about it.
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nagu
> Sent: Saturday, 8 March 2008 5:37 AM
> To: r-help@r-project.org; [EMAIL PROTECTED]
> Subject: [R] error in random forest
>
> Hi, I get the following error when I try to predict the probabilities of a test sample:
>
> Error in predict.randomForest(fit.EBA.OM.rf.50, x.OM, type = prob) :
>   New factor levels not present in the training data
>
> I have about 630 predictor variables in the dataset x.OM (25 factor variables and the remaining are continuous variables). Any ideas on how to trace it?
>
> Thank you, Nagu