Gustaf,

Summarizing the things I don't understand:

- Honestly, I was thinking I could use the bootstrap to obtain a better estimate of a mean, if that is what I want. So I can't? (See the sketch just below this list.)
- If I can't obtain reliable estimates of the CI and variance from a small dataset, but I can do so with the bootstrap, isn't that a "virtual increase" in the size of the dataset? OK, these are just words; I won't fight over them.
- I don't understand why a procedure works for 26 models but doesn't work for one... Intuitively this makes no sense to me.
- I don't understand why resampling *cannot* improve things, when it appears that it does. I know the proof is going to be hard to follow, but let me try! (The proof of the opposite is in the paper.)
- I truly don't understand what it is that I don't understand about what I am doing. This is getting too convoluted for me...
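To make the first two points concrete, here is a minimal sketch in R of what I thought the bootstrap provides for a plain mean. The numbers are made up; this is only an illustration, not anything from our study:

set.seed(1)
x <- rnorm(20, mean = 5, sd = 2)            ## a small, made-up sample
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
mean(x)                                     ## the point estimate itself
sd(boot_means)                              ## bootstrap SE of the estimator
quantile(boot_means, c(0.025, 0.975))       ## percentile bootstrap CI

If I follow you correctly, everything after the mean(x) line describes the spread of the *estimator*, and the estimate itself is still just the plain sample mean.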
And a remark about something I don't agree with, Gustaf: the text below, quoted from Pawinski et al. ("Twenty-six..."), is missing an important piece of information, namely that they repeated that step 50 times, each time with a "randomly selected subset". Excuse my ignorance again, but this looks like bootstrap (re-sampling), doesn't it? Although I won't argue about names.

I want to assure everyone here that I did *exactly* what they did. I work in the same lab that this paper came from, and I simply had their SPSS procedure translated to SAS. Moreover, the translation was done with the help of a _trustworthy biostatistician_; I was not good enough with SAS at the time to do it myself. The biostatistician wrote the randomization and regression subroutines. I later improved them using macros (less code) and added a validation part. It was then approved by that biostatistician.

OK, I did not do exactly the same thing, because I repeated the step 100 times, for 34 *pre-defined* models, and on a different dataset. But that is about all the difference. I hope this settles everyone's dilemma about whether I did what is described in Pawinski's paper or not.

This discussion, though, started with my question about how to do it in R instead of SAS, and with logistic (not linear) regression. Thank you, Gustaf, for the code; this was the help I needed.
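For the archives, the overall shape of what I am doing, translated to R with logistic regression, is roughly the following. This is only a sketch: the data, variable names and model formulas are made up for illustration, and it is not the actual study code.

## Made-up data: a binary outcome and three hypothetical predictors
set.seed(123)
n   <- 60
dat <- data.frame(y  = rbinom(n, 1, 0.5),
                  c1 = rnorm(n), c2 = rnorm(n), c3 = rnorm(n))

## Pre-defined candidate models (toy formulas, not the real ones)
models <- list(m1 = y ~ c1,
               m2 = y ~ c1 + c2,
               m3 = y ~ c1 + c2 + c3)

n_rep <- 100
results <- sapply(models, function(fml) {
  replicate(n_rep, {
    train <- sample(seq_len(n), size = floor(0.7 * n))    ## random subset
    fit   <- glm(fml, data = dat[train, ], family = binomial)
    pred  <- predict(fit, newdata = dat[-train, ], type = "response")
    mean((dat$y[-train] - pred)^2)                        ## validation error (Brier score)
  })
})
colMeans(results)   ## average validation error per pre-defined model

Each pre-defined model is re-fitted on a random subset and judged on the profiles left out, and this is repeated n_rep times.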
" For example, in here, the statistical estimator is the sample mean. Using bootstrap sampling, you can do beyond your statistical estimators. You can now get even the distribution of your estimator and the statistics (such as confidence interval, variance) of your estimator." Again you are misinterpreting text. The phrase about "doing beyond your statistical estimators", is explained in the next sentence, where he says that using bootstrap gives you information about the mean *estimator* (and not more information about the population mean). And since you're not interested in this information, in your case bootstrap/resampling is not useful at all. As another example of misinterpretation: In your email from a week ago, it sounds like you believe that the authors of the original paper are trying to improve on a fixed model Figurski: "Regarding the "multiple stepwise regression" - according to the cited SPSS manual, there are 5 options to select from. I don't think they used 'stepwise selection' option, because their models were already pre-defined. Variables were pre-selected based on knowledge of pharmacokinetics of this drug and other factors. I think this part I understand pretty well." This paragraph is wrong. Sorry, no way around it. Quoting from the paper Pawinski etal: " *__Twenty-six____(!)* 1-, 2-, or 3-sample estimation models were fit (r2 0.341� 0.862) to a randomly selected subset of the profiles using linear regression and were used to estimate AUC0�12h for the profiles not included in the regression fit, comparing those estimates with the corresponding AUC0�12h values, calculated with the linear trapezoidal rule, including all 12 timed MPA concentrations. The 3-sample models were constrained to include no samples past 2 h." (emph. mine) They clearly state that they are choosing among 26 different models by using their bootstrap-like procedure, not improving on a single, predefined model. This procedure is statistically sound (more or less at least), and not controversial. However, (again) what you are wanting to do is *not* what they did in their paper! resampling can not improve on the performance of a pre-specified model. This is intuitively obvious, but moreover its mathematically provable! That's why we're so certain of our standpoint. If you really wish, I (or someone else) could write out a proof, but I'm unsure if you would be able to follow. In the end, it doesn't really matter. What you are doing amounts to doing a regression 50 times, when once would suffice. No big harm done, just a bit of unnecessary work. And proof to a statistically competent reviewer that you don't really understand what you're doing. The better option would be to either study some more statistics yourself, or find a statistician that can do your analysis for you, and trust him to do it right. Anyhow, good luck with your research. Best regards, Gustaf