Hi,

I am using the rpart package to construct regression trees and, for the
purposes of simulation, would like the tree to be completely split: each
leaf should contain exactly one observation.

However, I have observed that even with minsplit = 2, i.e.,

```
control <- rpart.control(
    minsplit = 2,
    cp = -1,
    xval = 0,
    maxcompete = 0,
    usesurrogate = 0,
    maxdepth = 30
)

model <- rpart(...., control = control)
```

the model still has leaf nodes with more than one observation.
In fact, when I take the subset of my dataset that falls into one such
terminal leaf and run rpart on that subset alone, further splits occur.
Any advice on why this is occurring? Thanks!


Best regards,
Kevin


P.S. A snippet to showcase the behavior above:

--

library(rpart)
library(data.table)

mu <- function(x, y, z) sin(10 * pi * x + 2 * y) - cos(10 * pi * y) + exp(z)

control <- rpart.control(
    minsplit = 2,
    cp = -1,
    xval = 0,
    maxcompete = 0,
    usesurrogate = 0,
    maxdepth = 30
)

gen.data <- function(n, sd = 0.5) {
    X <- matrix(runif(3 * n), ncol=3)
    colnames(X) <- c('x', 'y', 'z')
    e <- rnorm(n, sd = sd)

    X <- data.table(X)
    X[, mu := mu(x, y, z)]
    X[, A := mu + e]
    return(X[])
}

# Run rpart on the simulated dataset ...
set.seed(12321)
X <- gen.data(30000, sd = 0.1)
X[, i := .I]
mod <- rpart(A ~ x + y + z, X, control = control)

frame <- as.data.table(mod$frame, keep.rownames=TRUE)
frame[, rn := as.integer(rn)][, i := .I]
setnames(frame, "rn", "id")

splits <- as.data.table(mod$splits, keep.rownames=TRUE)
setnames(splits, "rn", "var")
splits[, var := factor(var)]

where <- data.table(i = seq(1, X[,.N]), where=mod$where)


# m = 7191 is the row of the leaf that contains the most observations,
# in this case 11.
m <- frame[var == "<leaf>"][order(-n)][1, i]
obs <- where[where == m, i]

# Collect those 11 observations into another data.table
X2 <- X[i %in% obs]

# Observe that rpart splits that subset further. Why?
mod2 <- rpart(A ~ x + y + z, X2, control=control)
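
For reference, here is a minimal sketch of how the leaf sizes and leaf depths of a fitted tree could be inspected. It uses a small made-up dataset rather than the one above, and the names `d`, `ctrl`, `fit`, `leaves` and `leaf_depth` are illustrative only. It relies on rpart numbering its nodes so that the children of node k are 2k and 2k + 1, which puts node k at depth floor(log2(k)).

```r
library(rpart)

# Small illustrative dataset (an assumption for this sketch, not the data above)
set.seed(1)
d <- data.frame(x = runif(200), y = runif(200))
d$A <- sin(10 * pi * d$x) + d$y + rnorm(200, sd = 0.1)

ctrl <- rpart.control(minsplit = 2, minbucket = 1, cp = -1, xval = 0,
                      maxcompete = 0, usesurrogate = 0, maxdepth = 30)
fit <- rpart(A ~ x + y, data = d, control = ctrl)

# Terminal nodes are the rows of fit$frame with var == "<leaf>"
leaves <- fit$frame[fit$frame$var == "<leaf>", ]

# Distribution of leaf sizes; a fully split tree has only leaves of size 1
print(table(leaves$n))

# Row names of the frame are node numbers; node k sits at depth floor(log2(k))
leaf_depth <- floor(log2(as.numeric(rownames(leaves))))
print(max(leaf_depth))
```

Running the same two checks on `mod` and `mod2` would show how many leaves hold more than one observation and how deep those leaves sit relative to `maxdepth`.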
