Hello list,

I'm trying to build classifiers for a certain task using several methods, one of them being decision trees. My doubts arise when I try to estimate the cross-validation error of the resulting tree:
tree <- rpart(y ~ ., data = data.frame(xsel, y), cp = 0.00001)
ptree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])
ptree$cptable

            CP nsplit rel error xerror       xstd
1  0.33120000      0    1.0000 1.0000 0.02856022
2  0.08640000      1    0.6688 0.6704 0.02683544
3  0.02986667      2    0.5824 0.5856 0.02584564
4  0.02880000      5    0.4928 0.5760 0.02571738
5  0.01920000      6    0.4640 0.5168 0.02484761
6  0.01440000      8    0.4256 0.5056 0.02466708
7  0.00960000     12    0.3552 0.5024 0.02461452
8  0.00880000     15    0.3264 0.4944 0.02448120
9  0.00800000     17    0.3088 0.4768 0.02417800
10 0.00480000     25    0.2448 0.4672 0.02400673

If I understand it correctly, "xerror" is the cross-validation error (10-fold by default), and here it is quite high (0.4672 out of 1). However, if I do something similar with tune() from e1071, I get a much lower error:

treetune <- tune(rpart, y ~ ., data = data.frame(xsel, y),
                 predict.func = treeClassPrediction, cp = 0.0048)
> treetune$best.performance[1]
0.2243049

I'm also assuming that the performance returned by tune() is the cross-validation error (also 10-fold by default). So where does this enormous difference come from? What am I missing?

Also, is "rel error" the relative error on the training set? The documentation is not very descriptive:

    cptable: the table of optimal prunings based on a complexity parameter.

Thanks, and happy new year in advance,

--
israel

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
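P.S. Since the posting guide asks for reproducible code and my xsel/y can't be shared, here is a minimal self-contained sketch of the rpart side using the kyphosis data that ships with rpart. It only illustrates the cptable mechanics (exact xerror values depend on the random CV folds). The last lines show my guess at what might matter: printcp() reports a "Root node error" line, and if the "rel error"/"xerror" columns are scaled by that root-node error, the absolute cross-validated rate would be xerror times that quantity, which could explain part of the gap with tune():

```r
library(rpart)

set.seed(1)  # the 10-fold CV behind "xerror" uses random folds

# Grow a deliberately large tree, as in my real code
tree <- rpart(Kyphosis ~ ., data = kyphosis, cp = 0.00001)

# Pick the cp with the smallest cross-validated error and prune to it
best  <- which.min(tree$cptable[, "xerror"])
ptree <- prune(tree, cp = tree$cptable[best, "CP"])

# Root-node misclassification rate: the share of the non-majority class
# (this is what printcp() prints as "Root node error")
root_err <- 1 - max(table(kyphosis$Kyphosis)) / nrow(kyphosis)

# If xerror is relative to the root node, this would be the absolute
# cross-validated misclassification rate (my assumption, to be confirmed):
abs_xerror <- tree$cptable[best, "xerror"] * root_err
print(abs_xerror)
```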