Hello R Experts, I want to make sure I understand how the strata, sampsize and replace arguments of randomForest() work so I can confidently downsample a dataset I'm working with.
My main question: when the documentation describes how each of these parameters (strata, sampsize, replace) works, is it all per tree? Below is my understanding. Can you tell me if I have this correct?

library(randomForest)

table(iris$Species)
#     setosa versicolor  virginica
#         50         50         50

# The default of replace is TRUE.
# EACH tree uses a sample of 150. For a given tree, since sampling with
# replacement is used, it is possible that only one class is represented,
# such as setosa, e.g. each setosa observation is represented 3x.
randomForest(Species ~ ., data = iris)

# EACH tree uses a sample of 30: 10 from each class.
# Observations from each class may be repeated.
randomForest(Species ~ ., data = iris,
             sampsize = c(setosa = 10, versicolor = 10, virginica = 10),
             strata = iris$Species)

# EACH tree uses a sample of 60: 10 from the 1st class, 20 from the 2nd
# and 30 from the 3rd. Observations from each class may be repeated.
randomForest(Species ~ ., data = iris,
             sampsize = c(setosa = 10, versicolor = 20, virginica = 30),
             strata = iris$Species)

Dan
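P.S. One way I thought of to check the "per tree" reading empirically is to keep the in-bag matrix and count what each tree actually saw. This is only a sketch under some assumptions: it uses keep.inbag = TRUE, and it sets replace = FALSE (unlike my examples) so that every in-bag observation is drawn at most once and a simple in/out indicator is enough to count the sample. The seed, ntree = 25, and the names in_bag, per_tree_total and per_tree_by_class are arbitrary choices of mine.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1)  # arbitrary seed, only for reproducibility

rf <- randomForest(Species ~ ., data = iris,
                   sampsize = c(setosa = 10, versicolor = 20, virginica = 30),
                   strata = iris$Species,
                   replace = FALSE,    # differs from my examples: no repeats
                   keep.inbag = TRUE,  # keep track of in-bag rows per tree
                   ntree = 25)         # small forest, enough for a check

# rf$inbag has one row per observation and one column per tree;
# a nonzero entry means that observation was in-bag for that tree.
in_bag <- rf$inbag > 0

# Number of in-bag observations for each tree.
per_tree_total <- colSums(in_bag)

# Number of in-bag observations for each class (rows) and tree (columns).
per_tree_by_class <- rowsum(in_bag * 1L, iris$Species)
```

If sampsize and strata really are applied per tree, I would expect per_tree_total to be 60 for every tree and the rows of per_tree_by_class to be 10, 20 and 30 throughout.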