[R] random forest and vegetation data

2008-02-01 Thread ahelmore
Hi there,

I am an environmental studies masters student trying to get my thesis out the 
door.  I am also a newbie at trees in general, but I like what I see in the 
literature about the random forest algorithm.  I think I get the general gist 
of things, but even after reading stuff I’m unclear about how I could be 
getting the results I’m seeing.  I obviously am missing something about how the 
split points in the final tree are decided.

I’ve been using random forests in image classification by entering split values 
into decision tree classifiers, and that has seemed to work very well.  The map 
output appears legitimate and withheld data gives confusion matrices similar to 
the predictive errors from the random forest.  This leads me to assume that the 
split points are effective.

However, now that I’ve turned to the ecological portion of my analysis, with a 
data set that contains few variable levels and lots of zeros, suddenly the 
splitting node information is not making sense.

Here is my situation.  I have a matrix of study plots that each belong to one 
of three elevation classes and which each have percent cover class data for 15 
plant species associated with them.  

plot  elev  sp1  sp2  sp3  ...  sp15
1     3     0    2    6    ...  5
2     0     0    0    1    ...  0
etc.

The species data are ordered factors from 0-9.  When I run the algorithm using 
species cover values to predict elevation class, two species alone come up as 
the best predictors.  That makes ecological sense in this setting, given the 
species ranges in question.
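
In case it’s useful, here is a minimal made-up version of what I’m running 
(I’m using the randomForest package; the data below are random, not my real 
plots):

library(randomForest)

## fake data shaped like mine: 60 plots, 3 elevation classes, and 15
## species with cover classes 0-9 and lots of zeros
set.seed(1)
n <- 60
elev <- factor(sample(1:3, n, replace = TRUE))
spp <- as.data.frame(replicate(15, sample(0:9, n, replace = TRUE,
                               prob = c(0.5, rep(0.5/9, 9)))))
names(spp) <- paste("sp", 1:15, sep = "")
spp[] <- lapply(spp, ordered, levels = 0:9)  ## cover as ordered factors

rf <- randomForest(x = spp, y = elev, importance = TRUE)
print(rf)        ## OOB error rate and confusion matrix
importance(rf)   ## which species come out as the best predictors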

Here’s my difficulty though.  The split point values can’t be interpreted, as 
far as I can tell.  I’m getting split points of, say, 1.5 and 2.5 for a species 
whose cover is either 0 (absent) or 4 and above.  So obviously the split points 
in the final tree are being generated in some way I don’t understand.  
Averaged?  
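
For what it’s worth, I’ve been pulling the split values out with getTree() 
(shown here on the rf object from the sketch above), in case that’s part of 
my problem:

## splits of the first tree in the forest; the "split point" column
## is where values like 1.5 and 2.5 show up
t1 <- getTree(rf, k = 1, labelVar = TRUE)
head(t1)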

I’ve tried running the tree using the data as factors, using the data as 
ordered factors, and using the data as numerical variables, just to see if I 
could gain insight into what’s going on, but I’m coming up clueless.  My 
literature hunt reveals repeated instances of folks saying that the final tree 
can’t be interpreted the way other trees are, but I’m not getting a lot on just 
why that might be.  
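
Concretely, the three codings I tried look like this (using sp1 from the 
made-up data above):

f <- factor(as.character(spp$sp1))      ## unordered factor
o <- spp$sp1                            ## ordered factor, as constructed
x <- as.numeric(as.character(spp$sp1))  ## plain numeric cover class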

Some folks talk about the final tree being “averaged,” others say that “mode” 
is employed (which doesn’t make sense to me if I’m getting 1.5 and 2.5 split 
values).  If the trees are only good as black box predictors (which is of 
course a very useful thing in itself), should I even be using the node 
information in my image classifications?  

As you see, I’m missing some rather important point or other here.  Can you 
enlighten?

Thanks,
A
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random forest and vegetation data

2008-02-01 Thread Liaw, Andy
First things first:  There's no final tree in random forests.  You get
a set of trees (i.e., a forest).  Secondly, a forest cannot be
interpreted because of the complexity, not because the splits can't
possibly make sense.  You can try to interpret the trees, as long as you
understand the potential pitfalls of doing that.
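
For instance, with a fitted randomForest object rf you can pull out
individual trees and see that each one uses different variables and
split points:

## first few rows of two different trees from the same forest
getTree(rf, k = 1, labelVar = TRUE)[1:5, ]
getTree(rf, k = 2, labelVar = TRUE)[1:5, ]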

Here's how ordered factors are handled in randomForest.  Since the tree
algorithm only makes use of the ranks, there's basically no difference
between numerical and ordinal variables.  Thus ordered factors are
simply treated as integers 1 through K, where K is the number of levels,
and the underlying algorithm is told that this is just a numeric
variable.  This is how a split point of something like 3.5 appears.
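
A small illustration of what the algorithm actually sees (this is just
an R-level paraphrase, not the actual implementation):

## an ordered factor with levels 0-9 where cover is only 0 or >= 4
x <- ordered(c(0, 4, 6, 0, 9), levels = 0:9)
as.integer(x)
## [1]  1  5  7  1 10   <- ranks 1..10, not cover values 0..9

On that internal 1..10 scale, a split at 1.5 separates rank 1 (cover
class 0, i.e. absent) from everything else, and 2.5 separates ranks 1
and 2 (cover classes 0 and 1) from the rest; with data that are only
0 or 4 and above, both amount to presence/absence splits.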

Hope that makes some sense.

Best,
Andy 

