Hi, I am using the rpart package to construct regression trees and, for the purposes of simulation, would like the tree to be split completely: each leaf should contain exactly one observation.
However, I have observed that even with minsplit = 2, i.e.,

```
control <- rpart.control(
  minsplit = 2, cp = -1, xval = 0,
  maxcompete = 0, usesurrogate = 0, maxdepth = 30
)
model <- rpart(...., control = control)
```

the model still has leaf nodes with more than one observation. In fact, when I take the subset of my dataset that falls into the same terminal leaf and run rpart on that subset alone, further splits occur.

Any advice on why this is occurring? Thanks!

Best regards,
Kevin

P.S. A snippet to showcase the behavior above:

```
library(rpart)
library(data.table)

mu <- function(x, y, z) sin(10 * pi * x + 2 * y) - cos(10 * pi * y) + exp(z)

control <- rpart.control(
  minsplit = 2, cp = -1, xval = 0,
  maxcompete = 0, usesurrogate = 0, maxdepth = 30
)

gen.data <- function(n, sd = 0.5) {
  X <- matrix(runif(3 * n), ncol = 3)
  colnames(X) <- c('x', 'y', 'z')
  e <- rnorm(n, sd = sd)
  X <- data.table(X)
  X[, mu := mu(x, y, z)]
  X[, A := mu + e]
  return(X[])
}

# Run rpart on the simulated dataset ...
set.seed(12321)
X <- gen.data(30000, sd = 0.1)
X[, i := .I]
mod <- rpart(A ~ x + y + z, X, control = control)

frame <- as.data.table(mod$frame, keep.rownames = TRUE)
frame[, rn := as.integer(rn)][, i := .I]
setnames(frame, "rn", "id")

splits <- as.data.table(mod$splits, keep.rownames = TRUE)
setnames(splits, "rn", "var")
splits[, var := factor(var)]

where <- data.table(i = seq(1, X[, .N]), where = mod$where)

# m = 7191 is the row of the leaf that contains the most observations,
# in this case 11.
m <- frame[var == "<leaf>"][order(-n)][1, i]
obs <- where[where == m, i]

# Collect those 11 observations in another data table
X2 <- X[i %in% obs]

# Observe that rpart will split on that subset again, why?
mod2 <- rpart(A ~ x + y + z, X2, control = control)
```
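A smaller check that might help narrow this down, offered as a minimal sketch rather than a diagnosis: tabulate the leaf sizes and see at what depth any leaves with more than one observation sit. It relies on the row names of the `frame` component being rpart's node numbers (root 1, children of node k numbered 2k and 2k + 1), so a node's depth is floor(log2(node number)). The names `d`, `ctrl`, `fit`, `leaves`, `node` and `depth` are illustrative only, and the toy data are much smaller than in the snippet above, so they may not reproduce the behavior.

```r
library(rpart)

set.seed(1)
n <- 200
d <- data.frame(x = runif(n), y = runif(n))
d$A <- sin(10 * pi * d$x) + rnorm(n, sd = 0.1)

# minbucket is set explicitly here; its default, round(minsplit / 3),
# also evaluates to 1 when minsplit = 2.
ctrl <- rpart.control(minsplit = 2, minbucket = 1, cp = -1, xval = 0,
                      maxcompete = 0, usesurrogate = 0, maxdepth = 30)
fit <- rpart(A ~ x + y, data = d, control = ctrl)

# Keep only the terminal nodes of the fitted tree
leaves <- fit$frame[fit$frame$var == "<leaf>", ]

# How many leaves hold 1, 2, ... observations
print(table(leaves$n))

# Depth of each leaf, derived from its node number (root has depth 0)
node <- as.numeric(rownames(leaves))
depth <- floor(log2(node))

# Cross-tabulate depth against whether the leaf holds more than one observation
print(table(depth, oversized = leaves$n > 1))
```

If the oversized leaves all sat at depth 30, that would point at the maxdepth limit; if they appear at shallower depths, some other stopping rule would have to be involved.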