Dear all,

I'm trying to work with the user-defined split functions in rpart (file rpart\tests\usersplits.R in http://cran.r-project.org/src/contrib/rpart_3.1-34.tar.gz; see the bottom of this email).

Suppose we have the following data.frame (note that the values of x are already sorted):

> D
   y     x
1  7 0.428
2  3 0.876
3  1 1.467
4  6 1.492
5  3 1.703
6  4 2.406
7  8 2.628
8  6 2.879
9  5 3.025
10 3 3.494
11 2 3.496
12 6 4.623
13 4 4.824
14 6 4.847
15 2 6.234
16 7 7.041
17 2 8.600
18 4 9.225
19 5 9.381
20 8 9.986
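For reproducibility, the printout above can be turned back into an R object. This is only a minimal sketch: the values are copied from the table as printed, and the checks at the end are an addition for convenience, not part of the rpart test file.

```r
# Rebuild the data frame D from the printed values (x is already sorted)
D <- data.frame(
  y = c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8),
  x = c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
        3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
)

# Sanity checks: 20 rows, and x in non-decreasing order
stopifnot(nrow(D) == 20, !is.unsorted(D$x))

# Mean of y and total sum of squares of y around that mean
y.mean <- mean(D$y)
y.dev  <- sum((D$y - y.mean)^2)
print(c(mean = y.mean, deviance = y.dev))
```

The two printed numbers should agree with the yval and deviance that rpart reports for the root node, which is a quick way to confirm the data were typed in correctly.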
Running rpart with minbucket=1 and maxdepth=1, we get the following tree (which, by default, uses deviance):

> rpart(D$y~D$x,control=rpart.control(minbucket=1,maxdepth=1))
n= 20

node), split, n, deviance, yval
      * denotes terminal node

1) root 20 84.80000 4.600000
  2) D$x< 9.6835 19 72.63158 4.421053 *
  3) D$x>=9.6835  1  0.00000 8.000000 *

This means that the first 19 observations have been sent to the left side of the tree and one observation to the right. This agrees with the goodness vector, whose maximum is its last element.

What I really don't understand is the direction vector:

#      direction= -1 = send "y< cutpoint" to the left side of the tree
#                  1 = send "y< cutpoint" to the right

What does it mean? In the example considered here we have

> sign(lmean)
 [1]  1  1 -1 -1 -1 -1 -1  1  1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Which criterion is used? In my opinion all the values should be equal to -1, given that the observations have to be sent to the left side of the tree.

Can someone help me?

Thank you

#######################################################
# The split function, where most of the work occurs.
#   Called once per split variable per node.
# If continuous=T (the case considered here)
#   The actual x variable is ordered
#   y is supplied in the sort order of x, with no missings,
#   return two vectors of length (n-1):
#      goodness = goodness of the split, larger numbers are better.
#                 0 = couldn't find any worthwhile split
#        the ith value of goodness evaluates splitting obs 1:i vs (i+1):n
#      direction= -1 = send "y< cutpoint" to the left side of the tree
#                  1 = send "y< cutpoint" to the right
#         this is not a big deal, but making larger "mean y's" move towards
#         the right of the tree, as we do here, seems to make it easier to
#         read
# If continuous=F, x is a set of integers defining the groups for an
#   unordered predictor.  In this case:
#       direction = a vector of length m= "# groups".
#           It asserts that the best split can be found by lining the groups
#           up in this order and going from left to right, so that only m-1
#           splits need to be evaluated rather than 2^(m-1)
#       goodness = m-1 values, as before.
#
# The reason for returning a vector of goodness is that the C routine
#   enforces the "minbucket" constraint. It selects the best return value
#   that is not too close to an edge.

The vector wt of weights in our case is:

> wt
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

temp2 <- function(y, wt, x, parms, continuous) {
    # Center y
    n <- length(y)
    y <- y - sum(y*wt)/sum(wt)

    if (continuous) {
        # continuous x variable
        temp <- cumsum(y*wt)[-n]

        left.wt  <- cumsum(wt)[-n]
        right.wt <- sum(wt) - left.wt
        lmean <- temp/left.wt
        rmean <- -temp/right.wt
        goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
        list(goodness= goodness, direction=sign(lmean))
    }
}

Paolo Radaelli
Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali
Facoltà di Economia
Università degli Studi di Milano-Bicocca
P.zza dell'Ateneo Nuovo, 1
20126 Milano
Italy