I would expect this regression-toward-the-mean behavior on a new or hold-out dataset, not on the training data. In random forest terms, predict(rf.rf, swiss) returns the in-bag estimate, in which every tree votes on every row, including the trees that were grown on that row; the out-of-bag estimate is the one you want as an honest measure of predictive performance. In Joshua's example, rf.rf$predicted holds the out-of-bag estimate, but because newdata is supplied, the result appears to be the in-bag estimate, which still needs an adjustment like Joshua's (and a more complex one might be needed in some cases). This is a bit confusing, since predict() usually matches what is in model$fitted.values. I imagine that is why the author used "predicted" as the component name instead of the standard "fitted.values".
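A minimal sketch of the difference, reusing the swiss data and the rf.rf name from Joshua's example. The seed, the variable names oob and inbag, and the helper r2() are my own choices for illustration, not part of the package:

```r
library(randomForest)
data(swiss)

set.seed(1)  # arbitrary seed (an assumption), only so the run is repeatable
rf.rf <- randomForest(Infant.Mortality ~ ., data = swiss)

actual <- swiss$Infant.Mortality
oob    <- predict(rf.rf)         # no newdata: out-of-bag predictions
inbag  <- predict(rf.rf, swiss)  # newdata given: every tree votes on every row

# predict() without newdata hands back the stored out-of-bag component
stopifnot(isTRUE(all.equal(unname(oob), unname(rf.rf$predicted))))

# r2() is a hypothetical helper: coefficient of determination
r2 <- function(obs, est) 1 - sum((obs - est)^2) / sum((obs - mean(obs))^2)

cat("R^2, in-bag     :", r2(actual, inbag), "\n")
cat("R^2, out-of-bag :", r2(actual, oob), "\n")

# Slope of actual on predicted; a slope above 1 means the predictions
# are less extreme than the actual values
cat("slope, in-bag     :", coef(lm(actual ~ inbag))[2], "\n")
cat("slope, out-of-bag :", coef(lm(actual ~ oob))[2], "\n")
```

I would expect the in-bag R^2 to be the larger of the two, since those predictions come from trees that have already seen the rows they are scoring, but the exact numbers depend on the seed.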
The documentation for predict.randomForest explains:

"newdata - a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)"


Patrick Burns wrote:
>
> What I see is the predictions being less extreme than the
> actual values -- predictions for large actual values are smaller
> than the actual, and predictions for small actual values are
> larger than the actual.  That makes sense to me.  The object
> is to maximize out-of-sample predictive power, not in-sample
> predictive power.
>
> Or am I missing something in what you are saying?
>
>
> Patrick Burns
> [EMAIL PROTECTED]
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
>
> Joshua Knowles wrote:
>
>>Hi all,
>>
>>I have observed that when using the randomForest package to do regression, the
>>predicted values of the dependent variable given by a trained forest are not
>>centred and have the wrong slope when plotted against the true values.
>>
>>This means that the R^2 value obtained by squaring the Pearson correlation are
>>better than those obtained by computing the coefficient of determination
>>directly. The R^2 value obtained by squaring the Pearson can, however, be
>>exactly reproduced by the coeff. of det. if the predicted values are first
>>linearly transformed (using lm() to find the required intercept and slope).
>>
>>Does anyone know why the randomForest behaves in this way - producing offset
>>predictions? Does anyone know a fix for the problem?
>>
>>(By the way, the feature is there even if the original dependent variable
>>values are initially transformed to have zero mean and unit variance.)
>>
>>As an example, here is some simple R code that uses the available swiss
>>dataset to show the effect I am observing.
>>
>>Thanks for any help.
>>
>>--
>>#### EXAMPLE OF RANDOM FOREST REGRESSION
>>
>>library(randomForest)
>>data(swiss)
>>swiss
>>
>>#Build the random forest to predict Infant Mortality
>>rf.rf<-randomForest(Infant.Mortality ~ ., data=swiss)
>>
>>#And predict the training set again
>>pred<-c(predict(rf.rf,swiss))
>>actual<-swiss$Infant.Mortality
>>
>>#Plotting predicted against actual values shows the effect (uncomment to see this)
>>#plot(pred,actual)
>>#abline(0,1)
>>
>># calculate R^2 as pearson coefficient squared
>>R2one<-cor(pred,actual)^2
>>
>># calculate R^2 value as fraction of variance explained
>>residOpt<-(actual-pred)
>>residnone<-(actual-mean(actual))
>>R2two<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
>>
>># now fit a line through the predicted and true values and
>># use this to normalize the data before calculating R^2
>>
>>fit<-lm(actual ~ pred)
>>coef(fit)
>>pred2<-pred*coef(fit)[2]+coef(fit)[1]
>>residOpt<-(actual-pred2)
>>R2three<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
>>
>>cat("Pearson squared = ",R2one,"\n")
>>cat("Coeff of determination = ", R2two, "\n")
>>cat("Coeff of determination after linear fitting = ", R2three, "\n")
>>
>>## END
>>
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
View this message in context: http://www.nabble.com/randomForest%28%29-for-regression-produces-offset-predictions-tp14415517p14447468.html
Sent from the R help mailing list archive at Nabble.com.