Re: [R] different randomForest performance for same data

Uwe Ligges Sun, 13 Dec 2009 11:28:54 -0800


Häring wrote:

Hello,

I came across a problem when building a randomForest model. Maybe someone can 
help me.
I have a training and a test dataset with a discrete response and ten 
predictors (numeric and factor variables). The two datasets match in the 
number of predictors, the variable names and the variable types 
(factor, numeric), except that one predictor has 20 levels in the 
training dataset but only 19 levels in the test dataset.
I found that the model performance differs between training and testing a 
model with the unchanged datasets and doing the same after assigning the 
levels of the training dataset to the test dataset. I only assign the levels 
and do not change the data itself, yet the models perform differently.
Why?

Here is my code:

library(randomForest)
load("datasets.RData")  # import traindat and testdat
nlevels(traindat$predictor1)

[1] 20

nlevels(testdat$predictor1)

[1] 19

nrow(traindat)

[1] 9838

nrow(testdat)

[1] 3841

set.seed(10)
rf_orig <- randomForest(x=traindat[,-1], y=traindat[,1], xtest=testdat[,-1], 
ytest=testdat[,1],ntree=100)
data.frame(rf_orig$test$err.rate)[100,1]      # Error on test-dataset

[1] 0.3082531

# assign the levels of the training dataset to the test dataset for predictor 1

levels(testdat$predictor1) <- levels(traindat$predictor1)
nlevels(traindat$predictor1)

[1] 20

nlevels(testdat$predictor1)

[1] 20

nrow(traindat)

[1] 9838

nrow(testdat)

[1] 3841

set.seed(10)
rf_mod <- randomForest(x=traindat[,-1], y=traindat[,1], xtest=testdat[,-1], 
ytest=testdat[,1],ntree=100)
data.frame(rf_mod$test$err.rate)[100,1]       # Error on test-dataset

[1] 0.4808644  # is different

Say testdat has 19 levels called L2, ..., L20 and traindat has 20 levels called L1, ..., L20.


After your call
 levels(testdat$predictor1) <- levels(traindat$predictor1)

you renamed L2 -> L1, L3 -> L2, ..., L20 -> L19 and invented a new level L20 that is unused. Hence you confused all levels completely, and given your training is perfect, you will get an error rate of 100% in the end, because you renamed the levels in the test data so that they no longer match the training data.
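
To make the renaming concrete, here is a minimal sketch on toy factors. The names train_f and test_f are hypothetical stand-ins for traindat$predictor1 and testdat$predictor1, and the three-level data is invented for illustration. It shows that `levels<-` relabels by position, and one common way to give the test factor the full level set while keeping each observation's label, assuming every test label also occurs in the training levels:

```r
# Hypothetical toy factors standing in for traindat$predictor1 / testdat$predictor1
train_f <- factor(c("L1", "L2", "L3", "L2"))   # levels: L1, L2, L3
test_f  <- factor(c("L2", "L3", "L3"))         # levels: L2, L3 (L1 never occurs)

# Assigning levels relabels by POSITION: the first test level ("L2") becomes "L1",
# the second ("L3") becomes "L2", and "L3" is added as an unused level.
renamed <- test_f
levels(renamed) <- levels(train_f)
as.character(renamed)

# Re-encoding by LABEL keeps every observation's value and only extends the level set.
aligned <- factor(as.character(test_f), levels = levels(train_f))
as.character(aligned)
nlevels(aligned)
```

Comparing as.character(renamed) with as.character(test_f) shows that the observations themselves changed, even though no data column was edited directly.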


Uwe Ligges

Cheers,
TIM

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

