Since Frank has made this somewhat cryptic remark (sample size > 20K) several times now, perhaps I can add a few words of (what I hope is) further clarification.
Despite any claims to the contrary, **all** statistical (i.e. empirical) modeling procedures are just data interpolators: that is, all that they can claim to do is produce reasonable predictions of what may be expected within the extent of the data. The quality of the model is judged by the goodness of fit/prediction over this extent. Ergo the standard textbook caveats about the dangers of extrapolation when using fitted models for prediction. Note, btw, the contrast to "mechanistic" models, which typically **are** assessed by how well they **extrapolate** beyond current data -- Newton's apple to the planets, for example. They are often "validated" by their ability to "work" in circumstances (or at scales) much different from those from which they were derived.

So statistical models are just fancy "prediction engines." In particular, there is no guarantee that they provide any meaningful assessment of variable importance, i.e. of how predictors causally relate to the response. Obviously, empirical modeling can often be useful for this purpose, especially in well-designed studies and experiments, but there's no guarantee: it's an "accidental" byproduct of effective prediction. This is particularly true for happenstance (un-designed) data and non-parametric models like regression/classification trees.

Typically, there are many alternative models (trees) that give essentially the same quality of prediction. You can see this empirically by removing a modest random subset of the data and re-fitting (the first R sketch below shows how). You should not be surprised to see the fitted model -- the tree topology -- change quite radically. HOWEVER, the predictions of the models within the extent of the data will be quite similar to the original results. Frank's point is that unless the data set is quite large and the predictive relationships quite strong -- which usually implies parsimony -- this is exactly what one should expect. Thus it is critical not to over-interpret the particular model one gets, i.e. not to infer causality from the model (tree) structure.

Incidentally, there is nothing new or radical in this; indeed, John Tukey, Leo Breiman, George Box, and others wrote eloquently about it decades ago. And Breiman's random forest modeling procedure explicitly abandoned efforts to build simply interpretable models (from which one might infer causality) in favor of building better interpolators, although its assessment of "variable importance" does try to recover some of that interpretability (however, no guarantees are given; the second sketch below illustrates).
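To make the refit-on-a-subset experiment concrete, here is a minimal sketch in R. The simulated data, the variable names, and the 10% removal fraction are all mine, chosen purely for illustration; substitute your own data frame and formula.

library(rpart)

set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + d$x2 + rnorm(n)    # x3 is pure noise

fit.full <- rpart(y ~ x1 + x2 + x3, data = d)

# remove a modest (10%) random subset and refit
keep <- sample(n, round(0.9 * n))
fit.sub <- rpart(y ~ x1 + x2 + x3, data = d[keep, ])

# the tree topologies may differ quite noticeably ...
print(fit.full)
print(fit.sub)

# ... but the predictions within the extent of the data will not
cor(predict(fit.full, newdata = d), predict(fit.sub, newdata = d))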
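And in the same spirit, a short sketch of the random forest "variable importance" idea, reusing the simulated data frame d from the sketch above. It assumes the randomForest package (Liaw and Wiener's R port of Breiman and Cutler's code); and to repeat, a high importance score is no guarantee of causal meaning.

library(randomForest)

set.seed(2)
rf <- randomForest(y ~ x1 + x2 + x3, data = d, importance = TRUE)

importance(rf)    # permutation (%IncMSE) and node-purity measures
varImpPlot(rf)    # x1 and x2 should rank high; the pure-noise x3 low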
HTH. And contrary views welcome, as always.

Cheers to all,

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Frank E Harrell Jr
Sent: Thursday, April 01, 2010 5:02 AM
To: vibha patel
Cc: r-help@r-project.org
Subject: Re: [R] fitness of regression tree: how to measure???

vibha patel wrote:
> Hello,
>
> I'm using rpart function for creating regression trees.
> now how to measure the fitness of regression tree???
>
> thanks n Regards,
> Vibha

If the sample size is less than 20,000, assume that the tree is a somewhat arbitrary representation of the relationships in the data and that the form of the tree will not replicate in future datasets.

Frank

--
Frank E Harrell Jr   Professor and Chairman   School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.