Re: [R] Error on random forest variable importance estimates

2010-08-08 Thread Pierre Dubath

Hello Andy,

Thank you for your quick and helpful reply. I will try to follow your 
suggestions.


Also, thank you for the R implementation of random forest. It is very 
useful for our work.


Best,

Pierre


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




[R] Error on random forest variable importance estimates

2010-08-06 Thread Pierre Dubath

Hello,

I am using the R randomForest package to classify variable stars. I have 
a training set of 1755 stars described by (too) many variables. Some of 
these variables are highly correlated.


I believe that I understand how randomForest works and how the variable 
importances are evaluated (through variable permutations). Here are my 
questions.


1) Variable importance error? Is there any way to estimate the error on 
the MeanDecreaseAccuracy? In other words, I would like to know how 
significant the MeanDecreaseAccuracy differences are (and display 
horizontal error bars in the varImpPlot output).


I have noticed that even with a relatively large number of trees, the 
importance values vary from one run to the next. Could this 
serve as a measure of the errors/uncertainties?


2) How to deal with variable correlation? So far, I am iterating: 
selecting the most important variable first, removing all other variables 
that have a high correlation with it (say higher than 80%), taking the 
second most important variable left, removing variables highly correlated 
with either of the first two variables, and so on... (also using some 
astronomical insight as to which variables are the most important!)


Is there a better way to deal with correlation in randomForest? (I 
suppose that using many correlated variables should not be a problem for 
randomForest, but it is for my understanding of the data and for other 
algorithms).


3) How many variables should eventually be used? I have made successive 
runs, adding one variable at a time from the most to the least important 
(not-too-correlated) variables. I then plot the error rate (err.rate) as 
a function of the number of variables used. As this number increases, the 
error first decreases sharply, but relatively soon it reaches a plateau.
I assume that the point of inflexion can be used to derive the minimum 
number of variables to be used. Is that a sensible approach? Are there any 
other suggestions? A measure of the error on err.rate would also really 
help here. Is there any idea how to estimate this? From the variation 
between runs, or with the help of importanceSD somehow?


Thanks very much in advance for any help.

Pierre Dubath



Re: [R] Error on random forest variable importance estimates

2010-08-06 Thread Liaw, Andy
From: Pierre Dubath
 
> Hello,
>
> I am using the R randomForest package to classify variable stars. I
> have a training set of 1755 stars described by (too) many variables.
> Some of these variables are highly correlated.
>
> I believe that I understand how randomForest works and how the
> variable importances are evaluated (through variable permutations).
> Here are my questions.
>
> 1) Variable importance error? Is there any way to estimate the error
> on the MeanDecreaseAccuracy? In other words, I would like to know how
> significant the MeanDecreaseAccuracy differences are (and display
> horizontal error bars in the varImpPlot output).

If you really want to do it, one possibility is to do a permutation test:
permute your response, say, 1000 or 2000 times, run RF on each of these
permuted responses, and use the importance measures as samples from the
null distribution.
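
A minimal sketch of the permutation test described above, not the poster's own code. It assumes the randomForest package is installed, uses the built-in iris data as a stand-in for the star catalogue, and keeps the number of permutations small so it runs quickly; the helper name imp_of is hypothetical.

```r
library(randomForest)

set.seed(1)
x <- iris[, 1:4]   # stand-in predictors (assumption: not the star data)
y <- iris$Species  # stand-in class labels

# Unscaled permutation importance; type = 1 is MeanDecreaseAccuracy.
imp_of <- function(x, y, ntree = 200) {
  rf <- randomForest(x, y, ntree = ntree, importance = TRUE)
  importance(rf, type = 1, scale = FALSE)[, 1]
}

observed <- imp_of(x, y)

# The reply suggests 1000 to 2000 permutations; 50 keeps the sketch fast.
n_perm <- 50
# Shuffling y breaks any link to x, so each column is one draw from the
# null distribution of the importances (rows = variables).
null_imp <- replicate(n_perm, imp_of(x, sample(y)))

# One-sided p-value per variable, with the usual add-one correction.
p_value <- (rowSums(null_imp >= observed) + 1) / (n_perm + 1)
print(cbind(observed, p_value))
```

With only 50 permutations the smallest attainable p-value is 1/51, so the larger counts suggested in the reply are needed for finer resolution.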
 
> I have noticed that even with a relatively large number of trees, the
> importance values vary from one run to the next. Could this serve as
> a measure of the errors/uncertainties?

Yes.
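
One way to turn the run-to-run variation into error bars could look like the sketch below. It assumes the randomForest package and uses iris as stand-in data; the number of runs is arbitrary. Note that this spread reflects only the randomness of the algorithm on a fixed training set, not the sampling variability of the training set itself.

```r
library(randomForest)

set.seed(2)
x <- iris[, 1:4]   # stand-in predictors (assumption)
y <- iris$Species  # stand-in class labels

# Refit the forest several times and collect the unscaled
# MeanDecreaseAccuracy of each variable (rows = variables, cols = runs).
n_runs <- 20
imp_runs <- replicate(n_runs, {
  rf <- randomForest(x, y, ntree = 500, importance = TRUE)
  importance(rf, type = 1, scale = FALSE)[, 1]
})

imp_mean <- rowMeans(imp_runs)
imp_sd   <- apply(imp_runs, 1, sd)

# Dot chart in the style of varImpPlot, with +/- 1 SD horizontal bars.
ord <- order(imp_mean)
dotchart(imp_mean[ord],
         xlim = range(c(imp_mean - imp_sd, imp_mean + imp_sd)),
         xlab = "MeanDecreaseAccuracy (unscaled)")
segments(imp_mean[ord] - imp_sd[ord], seq_along(ord),
         imp_mean[ord] + imp_sd[ord], seq_along(ord))
```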
 
> 2) How to deal with variable correlation? So far, I am iterating:
> selecting the most important variable first, removing all other
> variables that have a high correlation with it (say higher than 80%),
> taking the second most important variable left, removing variables
> highly correlated with either of the first two variables, and so
> on... (also using some astronomical insight as to which variables are
> the most important!)
>
> Is there a better way to deal with correlation in randomForest? (I
> suppose that using many correlated variables should not be a problem
> for randomForest, but it is for my understanding of the data and for
> other algorithms).

That depends a lot on what you're trying to do.  RF can tolerate
problematic data, but that doesn't mean it will magically give you good
answers.  Trying to draw conclusions about effects when there are highly
correlated (and worse, important) variables is a tricky business.
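
For reference, the greedy filter described in the question can be sketched in base R as follows. This only illustrates the procedure as stated and is not a recommendation; the function name greedy_filter, the toy data, and the importance values are all hypothetical.

```r
# Keep variables in decreasing order of importance, skipping any variable
# whose absolute correlation with an already kept variable exceeds cutoff.
greedy_filter <- function(x, imp, cutoff = 0.8) {
  cors <- abs(cor(x))
  candidates <- names(sort(imp, decreasing = TRUE))
  kept <- character(0)
  for (v in candidates) {
    if (length(kept) == 0 || all(cors[v, kept] <= cutoff)) {
      kept <- c(kept, v)
    }
  }
  kept
}

# Toy data: b is almost a copy of a, c is independent of both.
set.seed(3)
a <- rnorm(100)
x <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

# Hypothetical importance values, e.g. MeanDecreaseAccuracy from a fit.
imp <- c(a = 0.30, b = 0.25, c = 0.10)

selected <- greedy_filter(x, imp)
print(selected)
```

The result depends on the importance ranking, which itself is unstable among correlated variables, so the caveat in the reply applies to the output of this filter as well.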
 
> 3) How many variables should eventually be used? I have made
> successive runs, adding one variable at a time from the most to the
> least important (not-too-correlated) variables. I then plot the error
> rate (err.rate) as a function of the number of variables used. As
> this number increases, the error first decreases sharply, but
> relatively soon it reaches a plateau. I assume that the point of
> inflexion can be used to derive the minimum number of variables to be
> used. Is that a sensible approach? Are there any other suggestions? A
> measure of the error on err.rate would also really help here. Is
> there any idea how to estimate this? From the variation between runs,
> or with the help of importanceSD somehow?

One approach is described in the following paper (in the Proceedings of
MCS 2004):
http://www.springerlink.com/content/9n61mquugf9tungl/
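
As a hedged illustration: the randomForest package provides rfcv(), which reports cross-validated error for models with a sequentially reduced number of predictors ranked by importance. Whether this matches the linked paper exactly is an assumption here. Repeating the cross-validation gives a spread on the error for each number of variables; iris stands in for the real data.

```r
library(randomForest)

set.seed(4)
x <- iris[, 1:4]   # stand-in predictors (assumption)
y <- iris$Species  # stand-in class labels

# Repeat 5-fold cross-validation; each run returns the CV error rate for
# a decreasing number of predictors (halved at each step with step = 0.5).
n_rep <- 10
cv_err <- replicate(n_rep, rfcv(x, y, cv.fold = 5, step = 0.5)$error.cv)

# Mean and spread of the CV error for each number of variables.
summary_tab <- data.frame(
  n_var    = as.integer(rownames(cv_err)),
  mean_err = rowMeans(cv_err),
  sd_err   = apply(cv_err, 1, sd)
)
print(summary_tab)
```

Comparing the drop in mean_err between successive rows with sd_err gives a rough sense of whether adding variables still helps or the curve has reached its plateau.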

Best,
Andy
 
> Thanks very much in advance for any help.
>
> Pierre Dubath
 


Re: [R] error in random forest

2008-03-08 Thread David Katz

I've had the same problem and solved it by removing the cases with the new
levels - they need to be handled some other way, either by building a new
model or reassigning the factor level to one in the training set.
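
A minimal base-R sketch of that workaround, under the assumption that the training and test sets are data frames with the same column names; the toy data frames are hypothetical.

```r
# Toy data: level "d" appears only in the test set.
train <- data.frame(f = factor(c("a", "b", "c", "a")), z = c(1, 2, 3, 4))
test  <- data.frame(f = factor(c("a", "d", "b")),      z = c(5, 6, 7))

factor_cols <- names(train)[sapply(train, is.factor)]

# TRUE for test rows whose factor values were all seen in training.
# Rows with NA in a factor column are treated as unseen here.
seen <- rep(TRUE, nrow(test))
for (col in factor_cols) {
  seen <- seen & (as.character(test[[col]]) %in% levels(train[[col]]))
}

test_ok  <- test[seen, , drop = FALSE]   # safe to pass to predict()
test_new <- test[!seen, , drop = FALSE]  # needs separate handling

# Re-encode the kept rows with exactly the training levels.
for (col in factor_cols) {
  test_ok[[col]] <- factor(test_ok[[col]], levels = levels(train[[col]]))
}
print(test_new)
```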



 
 



[R] error in random forest

2008-03-07 Thread Nagu
Hi,

I get the following error when I try to predict the probabilities of a
test sample:

Error in predict.randomForest(fit.EBA.OM.rf.50, x.OM, type = "prob") :
  New factor levels not present in the training data

I have about 630 predictor variables in the dataset x.OM (25 factor
variables and the remaining are continuous variables). Any ideas on
how to trace it?

Thank you,
Nagu



Re: [R] error in random forest

2008-03-07 Thread Bill.Venables
The error message is pretty clear, really.  To spell it out a bit more,
what you have done is as follows.

Your training set has factor variables in it.  Suppose one of them is
f.  In the training set it has 5 levels, say.

Your test set also has a factor f, as it must, but it appears that in
the test set it has 6 levels, or more, or levels that do not agree with
those for f in the training set.

This mismatch means that the predict method for randomForest cannot use
this test set.
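
To trace which factors are responsible, something along the following lines may help. It is a sketch in base R, assuming both sets are data frames with matching column names; the helper name new_levels and the toy data are hypothetical.

```r
# For every factor column of the training set, list the values that occur
# in the test set but are not among the training levels.
new_levels <- function(train, test) {
  factor_cols <- names(train)[sapply(train, is.factor)]
  out <- lapply(factor_cols, function(col) {
    setdiff(unique(as.character(test[[col]])),
            c(levels(train[[col]]), NA))
  })
  names(out) <- factor_cols
  out[lengths(out) > 0]  # keep only the offending columns
}

# Toy data: level "d" of f appears only in the test set.
train <- data.frame(f = factor(c("a", "b", "c", "a")),
                    g = factor(c("u", "v", "u", "v")))
test  <- data.frame(f = factor(c("a", "d", "b")),
                    g = factor(c("u", "u", "v")))

offenders <- new_levels(train, test)
print(offenders)
```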

What you have to do is make sure that the factor levels agree for every
factor in both test and training set. One way to do this is to put the
test and training set together with rbind(...) say, and then separate
them again.  But even this will still have a problem for you, because
your training set will have some factor levels empty, which are not empty
in the test set.  The error will most likely be more subtle, though.
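
The rbind route and its caveat can be seen on a toy example. This is a base-R sketch with hypothetical data frames, not the poster's data.

```r
# Toy data: level "d" appears only in the test set.
train <- data.frame(f = factor(c("a", "b", "c", "a")), z = 1:4)
test  <- data.frame(f = factor(c("a", "d", "b")),      z = 5:7)

# rbind on data frames merges the factor levels of both sets.
both   <- rbind(train, test)
train2 <- both[seq_len(nrow(train)), ]
test2  <- both[-seq_len(nrow(train)), ]

# The level sets now agree ...
levels_agree <- identical(levels(train2$f), levels(test2$f))
print(levels_agree)

# ... but the training set has no cases at level "d", so a model fitted
# to train2 has learned nothing about that level.
print(table(train2$f))
```

The levels match after the round trip, yet the count for "d" in the training part is zero, which is the empty-level problem described above.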

You really need to sort this out yourself.  It is not particularly an R
problem, but a confusion over data.  To be useful, your training set
needs to cover the field for all levels of every factor.  Think about it.




Re: [R] error in random forest

2008-03-07 Thread Nagu
Thank you very much. I'll jump in to the data and verify the
consistency between the training and testing variables and their
levels.




