Gustaf,

Summarizing things I don't understand:
 - Honestly, I thought I could use the bootstrap to obtain a better
estimate of a mean - provided that's what I wanted. So I can't?
 - If I can't obtain reliable estimates of the CI and variance from a
small dataset, but I can do it with the bootstrap - isn't that a
"virtual increase" of the dataset's size? OK, these are just words; I
won't fight over that.
 - I don't understand why a procedure works for 26 models and doesn't
work for one... Intuitively this doesn't make sense...
 - I don't understand why resampling *cannot* improve anything - while
it does? I know the proof is going to be hard to follow, but let me
try! (The proof of the opposite is in the paper.)
 - I truly don't understand what I don't understand about what I am
doing. This is getting too convoluted for me...

And a remark on a point where I disagree with Gustaf:

The text below, quoted from Pawinski et al ("Twenty-six..."), omits an
important piece of information - that they repeated that step 50
times, each time with a "randomly selected subset". Excuse my
ignorance again, but this looks like bootstrap (re-sampling), doesn't
it? Although I won't argue over names.

I want to assure everyone here that I did *exactly* what they did. I
work in the same lab this paper came from, and I simply had their SPSS
procedure translated to SAS. Moreover, the translation was done with
the help of a _trustworthy biostatistician_ - I was not good enough
with SAS at the time to do it myself. The biostatistician wrote the
randomization and regression subroutines. I later improved them using
macros (less code) and added a validation part. It was then approved
by that biostatistician.
OK, I did not do exactly the same thing, because I repeated the step
100 times for 34 *pre-defined* models and on a different dataset. But
that's about all the difference.

I hope this settles everyone's dilemma about whether or not I did what
is described in Pawinski's paper.

This discussion, though, started with my question about how to do this
in R instead of SAS, and with logistic (not linear) regression. Thank
you, Gustaf, for the code - this was the help I needed.
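
For the record, the kind of loop I have in mind looks roughly like
this in R - a sketch only, with a made-up data frame and placeholder
variable names, not my actual dataset or model:

## Sketch: repeat a random train/test split, fit the pre-defined
## logistic model on the training part, and check predictions on the
## hold-out part. Everything below is placeholder data.
set.seed(1)
mydata <- data.frame(c0 = rnorm(60), c1 = rnorm(60), c2 = rnorm(60))
mydata$outcome <- rbinom(60, 1, plogis(-0.5 + mydata$c0 + 0.5 * mydata$c1))

n_reps     <- 100      # number of random splits
train_frac <- 0.75     # fraction of rows used for fitting

results <- vector("list", n_reps)
for (i in seq_len(n_reps)) {
  idx   <- sample(nrow(mydata), floor(train_frac * nrow(mydata)))
  train <- mydata[idx, ]
  test  <- mydata[-idx, ]

  fit  <- glm(outcome ~ c0 + c1 + c2, family = binomial, data = train)
  pred <- predict(fit, newdata = test, type = "response")

  results[[i]] <- list(coef     = coef(fit),
                       misclass = mean((pred > 0.5) != test$outcome))
}
mean(sapply(results, function(r) r$misclass))   # average hold-out error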

--
Michal J. Figurski


Gustaf Rydevik wrote:

" For example, in here, the statistical estimator is  the sample mean.
Using bootstrap sampling, you can do beyond your statistical
estimators. You can now get even the distribution of your estimator
and the statistics (such as confidence interval, variance) of your
estimator."

Again you are misinterpreting the text. The phrase about "doing beyond
your statistical estimators" is explained in the next sentence, where
he says that the bootstrap gives you information about the mean
*estimator* (not more information about the population mean). And
since you are not interested in that information, in your case
bootstrap/resampling is not useful at all.
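
To make the distinction concrete anyway, here is a small illustration
in R (made-up data, just a sketch): the bootstrap gives you the
sampling distribution of the mean *estimator* - its standard error and
a confidence interval - but the estimate of the mean itself is still
just the ordinary sample mean.

## Bootstrap the sample mean of a small, simulated sample.
set.seed(1)
x <- rnorm(30, mean = 10, sd = 2)

boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

mean(x)                                # the estimate itself - unchanged
mean(boot_means)                       # essentially the same value
sd(boot_means)                         # bootstrap standard error of the mean
quantile(boot_means, c(0.025, 0.975))  # percentile confidence interval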

As another example of misinterpretation: in your email from a week
ago, it sounds like you believe that the authors of the original paper
are trying to improve on a fixed model.
Figurski:
"Regarding the "multiple stepwise regression" - according to the cited
SPSS manual, there are 5 options to select from. I don't think they used
'stepwise selection' option, because their models were already
pre-defined. Variables were pre-selected based on knowledge of
pharmacokinetics of this drug and other factors. I think this part I
understand pretty well."

This paragraph is wrong. Sorry, no way around it.

Quoting from the Pawinski et al. paper:
"  *__Twenty-six____(!)*     1-, 2-, or 3-sample estimation
models were fit (r2  0.341� 0.862) to a randomly
selected subset of the profiles using linear regression
and were used to estimate AUC0�12h for the profiles not
included in the regression fit, comparing those estimates
with the corresponding AUC0�12h values, calculated
with the linear trapezoidal rule, including all 12
timed MPA concentrations. The 3-sample models were
constrained to include no samples past 2 h."
(emph. mine)

They clearly state that they are choosing among 26 different models by
using their bootstrap-like procedure, not improving on a single,
pre-defined model.
That procedure is statistically sound (more or less, at least) and not
controversial.
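
In R terms, their procedure is roughly the following - a sketch with
an invented data frame and invented model formulas, not their actual
code: fit each candidate model to a random subset, predict the
held-out profiles, repeat, and compare the candidates on their average
hold-out error.

## Compare several pre-specified candidate models on repeated random
## splits; the data frame and formulas below are placeholders.
set.seed(1)
profiles <- data.frame(c0 = rnorm(50), c1 = rnorm(50), c2 = rnorm(50))
profiles$auc <- 30 + 5 * profiles$c0 + 3 * profiles$c1 + rnorm(50, sd = 2)

candidates <- list(m1 = auc ~ c0,
                   m2 = auc ~ c0 + c1,
                   m3 = auc ~ c0 + c1 + c2)

n_reps <- 50
rmse   <- matrix(NA, n_reps, length(candidates),
                 dimnames = list(NULL, names(candidates)))

for (i in seq_len(n_reps)) {
  idx   <- sample(nrow(profiles), floor(0.7 * nrow(profiles)))
  train <- profiles[idx, ]
  test  <- profiles[-idx, ]
  for (m in names(candidates)) {
    fit        <- lm(candidates[[m]], data = train)
    rmse[i, m] <- sqrt(mean((predict(fit, newdata = test) - test$auc)^2))
  }
}
colMeans(rmse)   # average hold-out RMSE per candidate; keep the best one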

However (again), what you want to do is *not* what they did in their
paper!

Resampling cannot improve on the performance of a pre-specified model.
This is intuitively obvious, but moreover it is mathematically
provable! That's why we are so certain of our standpoint. If you
really wish, I (or someone else) could write out a proof, but I'm
unsure whether you would be able to follow it.
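
If you would rather see it empirically than as a proof, here is a
small sketch with simulated data: fitting a fixed, pre-specified model
on many resamples and averaging the results just recovers (a noisier
version of) the single fit on all the data.

## A fixed model fit on resamples, averaged, versus one fit on all data.
set.seed(1)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

coef(lm(y ~ x, data = d))    # the pre-specified model fit once, on all data

boot_coefs <- replicate(1000, {
  dd <- d[sample(n, replace = TRUE), ]
  coef(lm(y ~ x, data = dd))
})
rowMeans(boot_coefs)         # averaging the resampled fits gives ~ the same values
apply(boot_coefs, 1, sd)     # the resamples only add between-fit variability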

In the end, it doesn't really matter. What you are doing amounts to
running a regression 50 times when once would suffice. No great harm
done, just a bit of unnecessary work - and, to a statistically
competent reviewer, proof that you don't really understand what you
are doing. The better option would be either to study some more
statistics yourself, or to find a statistician who can do your
analysis for you, and trust him to do it right.

Anyhow, good luck with your research.

Best regards,

Gustaf
