Hello again,

The reason I thought the order in which rows are passed to randomForest affects the error rate is that I get different results for different ways of splitting my positive/negative data.
First get the data (attached with this email):

pos.df <- read.table("C:/Program Files/R/rw2011/pos.df", header = TRUE)
neg.df <- read.table("C:/Program Files/R/rw2011/neg.df", header = TRUE)
library(randomForest)

# The first 2 columns are explanatory variables (which, incidentally, are not
# discriminative at all if one looks at their distributions); the 3rd is the
# class (pos or neg).
train2test.ratio <- 8/10
min_len     <- min(nrow(pos.df), nrow(neg.df))
class_index <- which(names(pos.df) == "class")  # the same for neg.df
train_size  <- as.integer(min_len * train2test.ratio)

############ Way 1: random sample of row indices ############
train.indicesP <- sample(nrow(pos.df), size = train_size, replace = FALSE)
train.indicesN <- sample(nrow(neg.df), size = train_size, replace = FALSE)
trainP <- pos.df[train.indicesP, ]
trainN <- neg.df[train.indicesN, ]
testP  <- pos.df[-train.indicesP, ]
testN  <- neg.df[-train.indicesN, ]
mydata.rf <- randomForest(x = rbind(trainP, trainN)[, -class_index],
                          y = rbind(trainP, trainN)[, class_index],
                          xtest = rbind(testP, testN)[, -class_index],
                          ytest = rbind(testP, testN)[, class_index],
                          importance = TRUE, proximity = FALSE,
                          keep.forest = FALSE)
mydata.rf$test$confusion

############ Way 2: random train/test assignment per row ############
ind <- sample(2, min_len, replace = TRUE,
              prob = c(train2test.ratio, 1 - train2test.ratio))
trainP <- pos.df[ind == 1, ]
trainN <- neg.df[ind == 1, ]
testP  <- pos.df[ind == 2, ]
testN  <- neg.df[ind == 2, ]
mydata.rf <- randomForest(x = rbind(trainP, trainN)[, -class_index],
                          y = rbind(trainP, trainN)[, class_index],
                          xtest = rbind(testP, testN)[, -class_index],
                          ytest = rbind(testP, testN)[, class_index],
                          importance = TRUE, proximity = FALSE,
                          keep.forest = FALSE)
mydata.rf$test$confusion

############ Way 3: sequential block of rows ############
subset_start <- 1
subset_end   <- subset_start + train_size - 1
train_index  <- subset_start:subset_end  # a contiguous block of train_size rows
trainP <- pos.df[train_index, ]
trainN <- neg.df[train_index, ]
testP  <- pos.df[-train_index, ]
testN  <- neg.df[-train_index, ]
mydata.rf <- randomForest(x = rbind(trainP, trainN)[, -class_index],
                          y = rbind(trainP, trainN)[, class_index],
                          xtest = rbind(testP, testN)[, -class_index],
                          ytest = rbind(testP, testN)[, class_index],
                          importance = TRUE, proximity = FALSE,
                          keep.forest = FALSE)
mydata.rf$test$confusion
############ end ############

The first two methods give me an abnormally low error rate (compared with what I get from a naiveBayes method on the same data), while the last one seems more realistic, but the difference in error rates is very significant. I need to use the last method so that I can cross-validate subsets of my data sequentially (the first two methods draw random rows from throughout the data), unless there is a better way to do it(?). Something must be very different between the first two methods and the last, but which one is correct? I would greatly appreciate any suggestions on this!

Many thanks,
Eleni Rapsomaniki

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code
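P.S. To make the "sequential cross-validation" idea concrete, here is a minimal sketch of splitting rows into contiguous folds (the helper name `seq_cv_folds` and the example sizes are illustrative, not part of the analysis above):

```r
# Split rows 1..n into k contiguous blocks; fold i uses block i as the
# test set and all remaining rows as the training set.
seq_cv_folds <- function(n, k) {
  fold_id <- cut(seq_len(n), breaks = k, labels = FALSE)
  lapply(seq_len(k), function(i) {
    list(train = which(fold_id != i),
         test  = which(fold_id == i))
  })
}

folds <- seq_cv_folds(10, 5)
folds[[1]]$test   # rows 1 and 2 form the first test block
```

Each element of `folds` could then be used in place of `train_index` in Way 3, fitting randomForest on the `train` rows and evaluating on the `test` rows.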