> I'm using rpart function for creating regression trees.
> now how to measure the fitness of regression tree???
>
> thanks n Regards,
> Vibha

I read R-help as a digest, so I often come late to a discussion. Let me start by being the first to directly answer the question:

> fit <- rpart(time ~ age + ph.ecog, lung)
> summary(fit)
Call:
rpart(formula = time ~ age + ph.ecog, data = lung)
  n= 228

          CP nsplit rel error   xerror      xstd
1 0.03516666      0 1.0000000 1.009949 0.1137819
2 0.01459053      1 0.9648333 1.049636 0.1282259
3 0.01324335      3 0.9356523 1.090562 0.1301632
4 0.01000000      7 0.8810284 1.063609 0.1298557

Node number 1: 228 observations,    complexity param=0.03516666
  mean=305.2325, MSE=44176.93
  left son=2 (51 obs) right son=3 (177 obs)
  Primary splits:
      ...

The relative error (rel error) and cross-validated relative error (xerror) columns above, for a regression tree, are equal to 1 - R^2. In this case none of the splits are useful; even the naive non-cross-validated improvement for the first split isn't much (R^2 < .04).

Now to the larger debate. I do not find trees as useless as Frank does (does anyone?). I like to use them for initial data exploration, in the same fashion as a scatterplot. But I fight the same battle that he does with some colleagues and customers: trees are so very easy to interpret that the results are often severely over-interpreted, sometimes to the point that the tree did more harm than good.

All forward stepwise procedures are unstable. Particularly with rich data sets, such as I see each day in the medical field, there are multiple overlapping/correlated predictors. Small changes in the data will completely change the order of a forward stepwise regression. Anyone who puts faith in the ORDER of inclusion as a measure of worth is like a flag in a fitful breeze.

A bigger problem with rpart is that users consistently ignore the xerror column above, and print out (and believe) bigger trees than they should. Once the xerror bottoms out you are almost certainly looking at random noise.
Since the xerror curve often has a long flat bottom, the 1SE rule is better than taking the strict minimum: anything within 1 SE of the minimum is a tie, and you use the smallest of a set of tied models.

Terry Therneau
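
For readers who want to try this, here is a minimal sketch (not part of the original reply) of reading R^2 off the CP table and applying the 1SE rule. It assumes the rpart and survival packages are installed, with survival supplying the lung data. The variable names (cp_tab, best, cutoff, pick, pruned) and the seed are illustrative choices. Because xerror comes from random cross-validation folds, the numbers will differ somewhat from the table quoted above.

```r
library(rpart)
library(survival)   # assumed source of the `lung` data set

set.seed(1)         # arbitrary seed; xerror depends on the random CV folds
fit <- rpart(time ~ age + ph.ecog, data = lung)

# CP table as a matrix with columns CP, nsplit, rel error, xerror, xstd
cp_tab <- fit$cptable

# For a regression tree, 1 - rel error is the apparent R^2 and
# 1 - xerror is its cross-validated counterpart (which can be negative).
r2_apparent <- 1 - cp_tab[, "rel error"]
r2_xval     <- 1 - cp_tab[, "xerror"]
print(cbind(nsplit = cp_tab[, "nsplit"], r2_apparent, r2_xval))

# 1SE rule: every row whose xerror lies within one xstd of the minimum
# counts as a tie; rows are ordered by tree size, so take the first one.
best   <- which.min(cp_tab[, "xerror"])
cutoff <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
pick   <- min(which(cp_tab[, "xerror"] <= cutoff))

# Prune back to the complexity parameter of the chosen row
pruned <- prune(fit, cp = cp_tab[pick, "CP"])
print(pruned)
```

Applied to the table quoted in the post, the minimum xerror (1.009949) is already in the first row, so the rule selects the tree with no splits at all, which matches the remark that none of the splits are useful.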