I am wondering about the best way to control for input variables that
have a large number of levels in 'rpart' models. I understand the
algorithm searches through all possible binary splits (2^(k-1) - 1 for a
factor with k levels), so variables with more levels are more likely to
be chosen as splitters. I'm looking for ways to compensate for this.
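To put numbers on that, here is a small sketch in base R. The helper name n_splits is made up for illustration; it counts the distinct two-way partitions of an unordered factor with k levels, excluding the trivial split that leaves one side empty.

```r
# Number of distinct binary partitions of an unordered factor with k levels.
# (n_splits is a hypothetical helper, not part of rpart.)
n_splits <- function(k) 2^(k - 1) - 1

n_splits(c(2, 13))
# [1]    1 4095
```

So a 13-level factor offers thousands of candidate splits where a 2-level factor offers only one.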
For example, if two variables produce comparable splits in the data but
one has 2 levels and the other has 13 levels, then I would like the
algorithm to choose the 'simpler' split.
Is this best done with the 'cost' argument to rpart()? It defaults to
one for all variables, so would it make sense to scale it by nlevels of
each variable, or by sqrt(nlevels), or something similar?
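A minimal sketch of the idea, on made-up data: the data frame d and the columns y, x2 and x13 are hypothetical, and sqrt(nlevels) is just one candidate scaling, not a recommendation. As described in ?rpart, the improvement for a split on a variable is divided by that variable's cost, so a larger cost makes the variable less attractive as a splitter.

```r
library(rpart)

# Hypothetical toy data: a 2-level and a 13-level factor, both pure noise.
set.seed(1)
n <- 200
d <- data.frame(
  y   = factor(sample(c("a", "b"), n, replace = TRUE)),
  x2  = factor(sample(letters[1:2],  n, replace = TRUE), levels = letters[1:2]),
  x13 = factor(sample(letters[1:13], n, replace = TRUE), levels = letters[1:13])
)

# One cost per predictor, in the same order as the terms in the formula.
predictors <- c("x2", "x13")
cost_nlev  <- sapply(d[predictors], nlevels)  # 2 and 13
cost_sqrt  <- sqrt(cost_nlev)                 # a milder penalty (an assumption)

# Default: every variable has cost 1.
fit_default <- rpart(y ~ x2 + x13, data = d, method = "class")

# Penalized: split improvements are divided by each variable's cost.
fit_scaled <- rpart(y ~ x2 + x13, data = d, method = "class",
                    cost = cost_sqrt)

print(fit_default)
print(fit_scaled)
```

Comparing the two printed trees shows whether the penalty changes which variable is picked at each node. Whether nlevels, sqrt(nlevels), or some other scaling is appropriate is exactly the open question.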
Thanks,
Landon