I would like to build a forest of regression trees to see how well some
covariates predict a response variable and to examine the importance of the
covariates. I have a small number of covariates (8) and a large number of
records (27368). The response and all of the covariates are continuous
variables.

A cursory examination of the covariates does not suggest they are correlated in a simple fashion (e.g. the variance inflation factors are all fairly low) but common sense suggests there should be some relationship: one of them is the day of the year and some of the others are environmental parameters such as water temperature. For this reason I would like to follow the advice of Strobl et al. (2008) and try the authors' conditional variable importance
measure. This is implemented in the party package by calling varimp(...,
conditional=TRUE). Unfortunately, when I call that on my forest I receive
the error:

varimp(myforest, conditional=TRUE)
Error in model.matrix.default(as.formula(f), data = blocks) :
 term 1 would require 9e+12 columns

Does anyone know what is wrong?


Hi Jason,

This particular feature doesn't scale well in its current implementation. Thanks for checking the previous reports closely. I can have a look at your data if you send them along with the code to reproduce the problem.

Best,

Torsten

I noticed a post in June 2011 where a user reported this message and the
ultimate problem was that the importance measure was being conditioned on too many variables (47). I have only a small number of variables here so I
guessed that was not the problem.
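
The column count in the error can become very large even with few conditioning variables. Here is a minimal base-R sketch of why, assuming (as the call to model.matrix on `blocks` in the error message suggests, though nothing in this thread confirms it) that each conditioning covariate is binned into a factor and the factors are crossed in one interaction term. The factors a, b, c and their level counts are made up for illustration.

```r
# Hypothetical binned covariates; names and level counts are illustrative only.
set.seed(1)
n <- 200
blocks <- data.frame(
  a = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
  b = factor(sample(1:4, n, replace = TRUE), levels = 1:4),
  c = factor(sample(1:5, n, replace = TRUE), levels = 1:5)
)

# A single interaction term without intercept yields one column
# per combination of factor levels.
des <- model.matrix(~ -1 + a:b:c, data = blocks)
n_cols <- ncol(des)

# The same count, computed directly as the product of the level counts.
n_expected <- prod(sapply(blocks, nlevels))

# Illustration only: 7 crossed factors with about 70 bins each already give
# a product of the same order as the figure in the error message.
n_large <- 70^7

print(c(n_cols = n_cols, n_expected = n_expected))
print(n_large)
```

The point is that the count is a product, not a sum, so the number of bins per covariate matters as much as the number of covariates.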

Another suggestion was that there could be a factor with too many levels. In my case, all of the variables are continuous. Term 1 (x1 below) is the day of the year, which does happen to take the integer values 1 ... 366. But the variable is of class numeric, not integer, so I don't believe cforest would treat it as a factor, although I do not know how to tell whether cforest is treating something as continuous or as a factor.
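
One way to check how the columns are stored before they reach cforest is to inspect their classes in base R. This is only a sketch: df_example is a hypothetical stand-in for the real data frame, and the check shows how R stores each column, not what cforest does with it internally.

```r
# Hypothetical stand-in for the data frame; values are made up.
df_example <- data.frame(
  y  = c(0.0, 1.3, 5.4, 2.1),
  x1 = c(1, 105, 169, 366),       # day of year stored as numeric
  x2 = c(0.00, 0.24, 0.43, 1.00)
)

# Storage class of every column; a factor would be reported as "factor".
col_classes <- sapply(df_example, class)
print(col_classes)

# Logical flag per column: TRUE only for factor columns.
is_fac <- sapply(df_example, is.factor)
print(is_fac)

# Number of distinct values per column, handy for spotting
# integer-like covariates such as a day-of-year variable.
n_unique <- sapply(df_example, function(col) length(unique(col)))
print(n_unique)
```

If every column reports "numeric" and none is flagged as a factor, the data frame itself contains no factors, whatever binning happens later inside varimp.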

Thank you for any help you can provide. I am running R 2.13.1 with party
0.9-99994. You can download the data from
http://www.duke.edu/~jjr8/data.rdata (512 KB). Here is the complete code:

load("\\Temp\\data.rdata")
nrow(df)
[1] 27368
summary(df)
              y     x1      x2      x3    x4         x5     x6         x7        x8
Min.      0.000    1.0  0.0000    1.00    52   0.008184  16.71  0.0000000   0.02727
1st Qu.   0.000  105.0  0.0000   30.00  1290   6.747035  23.92  0.0000000   0.11850
Median    1.282  169.0  0.2353   38.00  1857  11.310277  26.35  0.0001569   0.14625
Mean      5.651  178.7  0.2555   55.03  1907  12.889021  26.31  0.0162043   0.20684
3rd Qu.   5.353  262.0  0.4315   47.00  2594  18.427410  28.95  0.0144660   0.20095
Max.    195.238  366.0  1.0000  400.00  3832  29.492380  31.73  0.3157486  11.76877
library(HH)
<output deleted>
vif(y ~ ., data=df)
     x1       x2       x3       x4       x5       x6       x7       x8
1.374583 1.252250 1.021672 1.218801 1.015124 1.439868 1.075546 1.060580
library(party)
<output deleted>
mycontrols <- cforest_unbiased(ntree=50, mtry=3)  # Small forest, but requires a few minutes
myforest <- cforest(y ~ ., data=df, controls=mycontrols)
varimp(myforest)
        x1         x2         x3         x4         x5         x6         x7         x8
 11.924498 103.180195  16.228864  30.658946   5.053500  12.820551   2.113394   6.911377
varimp(myforest, conditional=TRUE)
Error in model.matrix.default(as.formula(f), data = blocks) :
 term 1 would require 9e+12 columns

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

