I've been experimenting with the randomForest R package (v. 4.6-12) and am 
getting an unexpected difference between rpart and randomForest results that 
may have something to do with using x's that are factors.  

The same model (see code below) is used to predict a binary variable called 
"resp" that is treated as a factor.  Four of the five x's are converted to 
factors.

The rpart predicted probabilities average to the same value as mean(resp) when 
used on the full dataset.  This seems OK.  
The average of the randomForest predicted probabilities is quite a bit 
different from mean(resp).  This seems unexpected since random forests amount 
to repeatedly doing variations of what rpart does.

Has anyone seen anything like this, or can anyone see what I am doing wrong?

(I did the same comparison using the kyphosis dataset in rpart with all 
continuous predictors and found consistent average predicted probabilities 
between rpart and randomForest.)
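
A minimal sketch of that kyphosis check is given here for reference. It assumes default rpart and randomForest settings apart from ntree, and the object names (t.kyph, rf.kyph) and the seed are illustrative only, not necessarily what was used in the original comparison.

```r
library(rpart)         # also provides the kyphosis data set
library(randomForest)

set.seed(1)  # arbitrary seed, only so the forest is reproducible

# Single classification tree on the three continuous predictors
t.kyph <- rpart(Kyphosis ~ Age + Number + Start,
                method = "class", data = kyphosis)
rpart.kyph <- predict(t.kyph, newdata = kyphosis, type = "prob")

# Random forest with the same predictors
rf.kyph <- randomForest(Kyphosis ~ Age + Number + Start,
                        data = kyphosis, ntree = 1000)
rf.kyph.prob <- predict(rf.kyph, newdata = kyphosis, type = "prob")

# Average predicted probabilities next to the observed class shares
rbind(observed = c(prop.table(table(kyphosis$Kyphosis))),
      rpart    = colMeans(rpart.kyph),
      rf       = colMeans(rf.kyph.prob))
```

The three rows can then be compared directly; each row gives the shares for the "absent" and "present" classes.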

Here's the code ... 

require(PracTools)      # R package with dataset used
require(rpart)
require(randomForest)

data(nhis)  # dataset in PracTools
table(nhis$resp)/nrow(nhis)
#        0         1
#0.3098952 0.6901048

t1 <- rpart(resp ~ age + as.factor(hisp) + as.factor(race) +
              as.factor(parents_r) + as.factor(educ_r),
            method = "class",
            control = rpart.control(minbucket = 50, cp = 0),
            data = nhis)
rpart.prob <- predict(object = t1, newdata = nhis, type = "prob")
apply(rpart.prob,2,mean)
#        0         1
#0.3098952 0.6901048    mean of rpart predictions same as mean(resp)

# cycled through mtry = 1,...,5; the lower mtry is, the worse are the predicted probs
rf.nhis <- randomForest(as.factor(resp) ~ age + as.factor(hisp) + as.factor(race) +
                          as.factor(parents_r) + as.factor(educ_r),
                        importance = TRUE, na.action = na.omit, mtry = 5,
                        ntree = 1000, classwt = c(0.31, 0.69),
                        data = nhis)
rfnhis.prob <- predict(object = rf.nhis, newdata = nhis, type = "prob")
apply(rfnhis.prob,2,mean)
#        0         1
#0.2485541 0.7514459    not too close to mean(resp)

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
randomForest_4.6-12

Thanks for any help,
Richard Valliant
Universities of Maryland and Michigan
