First things first: there is no final tree in a random forest. You get
a set of trees (i.e., a forest), and a prediction comes from all of
them together. Second, a forest is hard to interpret because of its
complexity, not because the individual splits can't make sense. You can
try to interpret the trees, as long as you understand the potential
pitfalls of doing that.
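To make the "no final tree" point concrete, here is a minimal Python sketch of how a classification forest predicts: every tree votes and the majority class wins. This is not the randomForest code. The three toy trees, their thresholds, and the class names are made up for illustration only.

```python
from collections import Counter

# Three hypothetical "trees", each a single rule on made-up cover codes.
# Real trees are much deeper; these exist only to show the voting step.
def tree_a(x):
    return "low" if x["sp1"] < 1.5 else "high"

def tree_b(x):
    return "low" if x["sp2"] < 2.5 else "high"

def tree_c(x):
    return "high" if x["sp1"] > 2.5 else "low"

FOREST = [tree_a, tree_b, tree_c]

def forest_predict(x):
    """Return the majority class and the per-class vote counts."""
    votes = Counter(tree(x) for tree in FOREST)
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)

# One hypothetical plot; tree_a votes "high", the other two vote "low".
plot = {"sp1": 2, "sp2": 1}
prediction, votes = forest_predict(plot)
```

Each tree keeps its own split points. Nothing is averaged into a single tree, so a split value read off one tree describes that tree only.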
Here's how ordered factors are handled in randomForest. Since the tree
algorithm only makes use of the ranks, there's basically no difference
between numerical and ordinal variables. Thus ordered factors are
simply treated as integers 1 through K, where K is the number of levels,
and the underlying algorithm is told that this is just a numeric
variable. A split point therefore refers to these integer codes, not to
the original level labels, which is how a split point of something like
3.5 appears.
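As a rough illustration of that coding, the Python sketch below maps ordered levels to the integers 1 through K and lists the midpoints between adjacent codes. It is not the randomForest implementation. It assumes that the factor's levels are only the observed cover values (0, 4, 5, 6 here, a made-up example) and that split points fall midway between adjacent codes. The function names are hypothetical.

```python
def integer_codes(levels):
    """Map ordered levels to integer codes 1..K, K = number of levels."""
    return {level: code for code, level in enumerate(levels, start=1)}

def candidate_splits(codes_present):
    """Midpoints between consecutive distinct codes seen in the data."""
    distinct = sorted(set(codes_present))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

# Assumption: the factor was built from observed values only, so its
# levels are 0, 4, 5, 6 rather than the full 0-9 scale.
levels = [0, 4, 5, 6]
codes = integer_codes(levels)

# Hypothetical cover values for one species across seven plots.
cover = [0, 0, 4, 5, 0, 6, 4]
coded = [codes[v] for v in cover]
splits = candidate_splits(coded)
```

Under these assumptions a split at 1.5 separates cover 0 from cover 4 and above, and a split at 2.5 separates covers 0 and 4 from cover 5 and above. If the factor instead carried all ten levels 0-9, the codes would run from 1 to 10 and cover 4 would be code 5.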
Hope that makes some sense.
Best,
Andy
From: [EMAIL PROTECTED]
Hi there,
I am an environmental studies masters student trying to get
my thesis out the door. I am also a newbie at trees in
general, but I like what I see in the literature about the
random forest algorithm. I think I get the general gist of
things, but even after reading stuff I'm unclear about how I
could be getting the results I'm seeing. I obviously am
missing something about how the split points in the final
tree are decided.
I've been using random forests in image classification by
entering split values into decision tree classifiers, and
that has seemed to work very well. The map output appears
legitimate and withheld data gives confusion matrices similar
to the predictive errors from the random forest. This leads
me to assume that the split points are effective.
However now that I've turned to the ecological portion of my
analysis, with a data set that contains few variable levels
and lots of zeros, suddenly the splitting node information is
not making sense.
Here is my situation. I have a matrix of study plots that
each belong to one of three elevation classes and which each
have percent cover class data for 15 plant species associated
with them.
plot  elev  sp1  sp2  sp3  ...  sp15
1     3     0    2    6    ...  5
2     0     0    0    1    ...  0
etc.
The species data are ordered factors from 0-9. When I run
the algorithm using species cover values to predict elevation
class, two species alone come up as the best predictors.
That makes ecological sense in this setting, given the
species ranges in question.
Here's my difficulty though. The split point values can't be
interpreted, as far as I can tell. I'm getting split points
of, say, 1.5 and 2.5 for a species whose cover is either 0
(absent) or 4 and above. So obviously the split points in
the final tree are being generated in some way I don't
understand. Averaged?
I've tried running the tree using the data as factors, using
the data as ordered factors, and using the data as numerical
variables, just to see if I could gain insight into what's
going on, but I'm coming up clueless. My literature hunt
reveals repeated instances of folks saying that the final
tree can't be interpreted the way other trees are, but I'm
not getting a lot on just why that might be.
Some folks talk about the final tree being averaged, others
say that the mode is employed (which doesn't make sense to me
if I'm getting 1.5 and 2.5 split values). If the trees are
only good as black box predictors (which is of course a very
useful thing in itself), should I even be using the node
information in my image classifications?
As you see, I'm missing some rather important point or other
here. Can you enlighten?
Thanks,
A
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.