Re: [R] How to estimate whether overfitting?

Frank E Harrell Jr Mon, 10 May 2010 05:59:56 -0700

On 05/10/2010 12:32 AM, bbslover wrote:


Many thanks. I can try using a test set with 100 samples.

Another question: how can I rationally split my data into a training set
and a test set (training set with 108 samples, test set with 100 samples)?

As far as I know, the test set should have the same distribution as the
training set. What method can produce such a rational split?

And which R packages can handle splitting data into training and test sets
rationally?


If the split is random, it seems to need many repeated splits, with the
averaged results taken as the final results.

However, I want to use some method that performs the split and yields a
fixed training set and test set instead of a random split.

The training set and test set should look like this: ideally, the division
must be performed such that points representing both the training and test
sets are distributed within the whole feature space occupied by the entire
dataset, and each point of the test set is close to at least one point of
the training set. This approach ensures that the similarity principle can be
employed for the output prediction of the test set. Certainly, this
condition cannot always be satisfied.

So, generally, which algorithms are commonly used for splitting, and which
are more rational? Papers often say that they split the data set randomly;
what does "randomly" mean? Just random selection, or is there some defined
method, e.g. ordering by output? I would really like to know which package
can split data rationally.
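
For reference, a plain random split in base R usually means drawing row
indices with sample() after fixing a seed. The sketch below is only an
illustration under assumptions: the data frame `d`, its columns, and the seed
are made up; the 108/100 sizes are the ones mentioned earlier in this message.

```r
set.seed(2010)                       # arbitrary seed so the split can be reproduced

# Hypothetical data: 208 samples, 3 features, and an output y
d <- data.frame(matrix(rnorm(208 * 3), 208, 3), y = rnorm(208))

train_idx <- sample(nrow(d), 108)    # 108 row numbers drawn without replacement
train <- d[train_idx, ]
test  <- d[-train_idx, ]             # the remaining 100 rows

dim(train)
dim(test)
```

Because the seed is fixed before the draw, the same split is obtained on every
run, which makes it harder to quietly re-draw the test set until the results
look good.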

Also, if one wants to get better results, some "tricks" can be played, e.g.
selecting the test set again and again, using the test set with the best
results as the final test set, and saying that the test set was selected
randomly. But that is not truly random; it is false.

Thank you, and sorry for so many questions, but this has always puzzled me.
Up to now, I have no good method for rationally splitting my data into a
training set and a test set.

Finally, splitting into training and test sets should be done before
modeling, and it seems that this can be done from the features alone (SOM),
from the features and the output (the SPXY algorithm; paper: "A method for
calibration and validation subset partitioning"), or from the output alone
(output ordering)?

But often there are many features to calculate, and some features are zero
or have a low standard deviation (sd < 0.5). Should we delete these features
before splitting the whole data set?

And then use the remaining features to split the data, using only the
training set to build the regression model, to perform feature selection,
and to do cross-validation, with the independent test set used only to test
the built model. Is that right?

Maybe my thinking about the whole modeling process is not clear, but I think
it goes like this:
1) get samples
2) calculate features
3) preprocess the calculated features (e.g. remove zero features)
4) rationally split the data into training and test sets (this always
puzzles me: how on earth should I split?)
5) build the model and at the same time tune the model's parameters using
resampling methods on the training set only, and get the final model.
6) test the model's performance using the independent test set (unseen
samples).
7) evaluate the model. Good or bad? Overfitting? (Generally, what counts as
overfitting? Can you give me an example? As far as I know, a model is
overfitting when it fits the training set well but performs badly on the
independent test set. But what is good, and what is bad? With r2=0.94 in
the training set and r2=0.70 in the test set, is the model overfitting? Can
the model be accepted? And generally, what kind of model is acceptable?)
8) conclusion: how good is the model?

That is my thinking, and many questions are waiting for answers.

thanks

kevin

Kevin: I'm sorry I don't have time to deal with such a long note, but
briefly, data splitting is not a good idea no matter how you do it unless
N > perhaps 20,000. I suggest resampling, e.g., either the bootstrap
with 300 resamples or 50-fold repeats of 10-fold cross-validation.
Among other places these are implemented in my rms package.
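
As a rough illustration of the bootstrap suggestion, the base-R sketch below
estimates the optimism in R^2 of a linear model using 300 resamples. It is a
sketch under assumptions, not the rms implementation: the simulated data, the
lm() model, and the names `d`, `r2`, and `optimism` are all made up for
illustration. The rms package offers validate() for this kind of procedure; it
is not used here so that the example runs without extra packages.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only

# Simulated data (an assumption): 108 samples, 10 features, 2 of them informative
n <- 108
p <- 10
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
d <- data.frame(y = X[, 1] - 0.5 * X[, 2] + rnorm(n), X)

# R^2 of a fitted model evaluated on a given data set
r2 <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

fit_all  <- lm(y ~ ., data = d)
apparent <- r2(fit_all, d)       # optimistic: same data used to fit and evaluate

B <- 300
optimism <- replicate(B, {
  idx   <- sample(n, replace = TRUE)   # bootstrap resample of the rows
  boot  <- d[idx, ]
  fit_b <- lm(y ~ ., data = boot)      # refit the model on the resample
  r2(fit_b, boot) - r2(fit_b, d)       # resample R^2 minus original-data R^2
})

corrected <- apparent - mean(optimism)
cat("apparent R2:", round(apparent, 3), "\n")
cat("optimism-corrected R2:", round(corrected, 3), "\n")
```

A large gap between the apparent and the corrected value points to
overfitting. If feature selection or parameter tuning is part of the modeling,
those steps would need to be repeated inside the loop for each resample;
otherwise the corrected estimate remains too optimistic.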


Frank

--
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
