On Mon, Jul 04, 2011 at 09:22:23AM -0400, Katerine Goyer wrote:
> Hello,
>
> I am using the rpart function (from the rpart package) to do a regression
> tree that would describe the behaviour of a fish species according to
> several environmental variables. For each fish (sampling unit), I have
> repeated observations of the response variable, which means that the data
> are not independent. Normally, in this case, V-fold cross-validation needs
> to be modified to prevent over-optimistic predictions of error rates by
> cross-validation and overestimation of the tree size. A way to overcome
> this problem is by selecting only whole sampling units in our subsets of
> cross-validation. My problem is that I don't know how to perform this
> modification of the cross-validation process in the rpart function.
>
> Is there a way to do this modification in rpart or is there any other
> function I could use that would consider interdependence in the response
> variable?
>
> Here is an example of the code I am using ("Y" being the response variable
> and "data.env" being a data frame of the environmental variables):
>
> Tree = rpart(Y ~ X1 + X2 + X3,xval=100,data=data.env)

Hello.

One option is to program the cross-validation at the R level using package
tree, whose tree() function does not run cross-validation as part of the fit.
An example is as follows.

  library(tree)

  # Simulated data standing in for the real observations
  X1 <- rnorm(200)
  X2 <- rnorm(200)
  X3 <- rnorm(200)
  Y <- ifelse(X1 > 0, X2, X3)
  data.env <- data.frame(X1, X2, X3, Y)

  # Group label for each row: 7 groups (fish) with unequal numbers of rows
  ind <- rep(1:7, times=c(20, 30, 35, 30, 30, 25, 30))
  # length(ind) == nrow(data.env)

  # Hold out one whole group at a time and predict it from the others
  pred <- rep(NA, times=nrow(data.env))
  for (i in unique(ind)) {
      Tree <- tree(Y ~ X1 + X2 + X3, data=data.env[ind != i, ])
      PrunedTree <- prune.tree(Tree, best = 10)
      pred[ind == i] <- predict(PrunedTree, newdata=data.env[ind == i, ])
  }

  # Observed against cross-validated predicted values
  plot(data.env$Y, pred, asp=1)

The vector ind should be prepared so that all occurrences of the same fish
have the same value. Each fish is then left out as a whole, so no fish
contributes to both the fitting and the prediction in the same step. See
?tree and ?prune.tree for further parameters.

Consider also the randomForest package, which may be more accurate, although
it does not provide a comprehensible model.

Hope this helps.

Petr Savicky.
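
In case it is useful, here is one way the vector ind might be built from fish identifiers when there are more fish than desired folds. This is only a sketch in base R: the vector fish_id, its values, and the number of folds K are assumptions made for illustration and are not part of the original data. The idea is to assign folds to fish rather than to rows, and then copy each fish's fold to all of its rows.

```r
# Hypothetical identifiers: 'fish_id' names the fish behind each row (assumed).
set.seed(1)
fish_id <- rep(paste0("fish", 1:12),
               times = c(20, 30, 35, 30, 30, 25, 30, 10, 15, 20, 25, 30))

K <- 4  # number of folds (assumed); must not exceed the number of fish

# Assign each fish (not each row) to a fold at random,
# keeping the number of fish per fold as equal as possible.
fish <- unique(fish_id)
fold_of_fish <- sample(rep(seq_len(K), length.out = length(fish)))
names(fold_of_fish) <- fish

# Copy the fold of each fish to all of its rows.
ind <- unname(fold_of_fish[fish_id])

# Check: every fish falls in exactly one fold.
folds_per_fish <- tapply(ind, fish_id, function(f) length(unique(f)))
stopifnot(all(folds_per_fish == 1))

# Number of rows in each fold
table(ind)
```

The resulting ind has one entry per row and can be used in the loop above in place of the rep(1:7, ...) construction, provided its length equals nrow(data.env).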