Re: [R] calculating AUCs for each of the 1000 boot strap samples

2011-03-18 Thread taby gathoni
Brian,

Thanks for the insights. I did manage to play around with the sample sizes and
I have got some better results with both ROCR and pROC.

Thanks also to Andrija for providing the main code and insights.

Thanks a lot,
Taby

--- On Thu, 3/17/11, Brian Diggs  wrote:

From: Brian Diggs 
Subject: Re: calculating AUCs for each of the 1000 boot strap samples
To: "tab...@yahoo.com" , R-help@r-project.org
Date: Thursday, March 17, 2011, 5:43 PM

Taby,

First, it is better to reply to the whole list (which I have included on 
this reply); there is a better chance of someone helping you.  Just 
because I could help with one aspect does not mean I necessarily can (or 
have the time to) help with more.

Further comments are inline below.

On 3/16/2011 10:45 PM, taby gathoni wrote:
> Hi Brian,
>
> Thanks for this comment; I will act on it. Thanks also for the
> comment; Andrija also advised the same thing and it worked like magic.
>
> My next course of action was to get the confidence intervals for the
> AUC values.
>
> I computed the confidence intervals manually. For the 99% CI I cut out
> the first 5 and last 5 values after ranking the AUCs, while for the
> 95% CI I cut out the first 25 and last 25.

In general, this is right.  The middle 95% excludes the 2.5% on the 
ends, so for 1000 samples that is excluding the 25 most extreme values.
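
As a rough illustration of the trimming described above, the sketch below computes percentile intervals from 1000 bootstrap AUC values in base R. The vector `boot_auc` is simulated stand-in data (an assumption, not the poster's results). Note that `quantile()` interpolates between order statistics by default, so its bounds can differ slightly from the manually trimmed ones.

```r
# Percentile bootstrap intervals from a vector of bootstrap AUCs.
# `boot_auc` is simulated stand-in data (an assumption), not real results.
set.seed(1)
boot_auc <- rbeta(1000, 20, 10)   # 1000 hypothetical AUC values in (0, 1)

sorted <- sort(boot_auc)

# 95%: drop the 25 smallest and 25 largest of the 1000 values.
ci95_manual <- c(lower = sorted[26], upper = sorted[975])
# 99%: drop the 5 smallest and 5 largest.
ci99_manual <- c(lower = sorted[6], upper = sorted[995])

# The same idea via quantile(), which interpolates by default (type = 7).
ci95_quant <- quantile(boot_auc, c(0.025, 0.975))

print(ci95_manual)
print(ci99_manual)
print(ci95_quant)
```

The 99% interval is always at least as wide as the 95% one, because it trims fewer values from each end.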

> and this is my output:
>
>              Upper bound   Lower bound
> at 99% CI    0.8175        0.50125
> at 95% CI    0.7775        0.50375
>
> From my understanding, because these are small samples of 20 GOOD and
> 20 BAD, the variation in the upper and lower bounds should be minimal
> across the 1000 samples.

I don't know why you would necessarily expect the variance to be 
minimal.  It is what it is.  Also, I don't know why you took 20 of each 
rather than just a random sub-sample.

> If you get time, would you be in a position to help me find out
> why I have such huge variations? Thank you for taking the time to respond.

Maybe pull out 10 of your bootstrap samples and look at the ROC curves 
themselves and their associated AUC.  That might give you a sense as to 
the variability that is possible (which is reflected in the confidence 
interval).
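
One way to try the suggestion above is sketched here in base R: draw a handful of 20 + 20 resamples and print their AUCs side by side. The scores are simulated (an assumption standing in for main_samp$SCORE, which is not available), and `auc_rank` and `one_auc` are hypothetical helper names. The AUC is computed from ranks (the Mann-Whitney form) rather than with colAUC.

```r
# AUC as the Mann-Whitney statistic: the probability that a randomly chosen
# GOOD score exceeds a randomly chosen BAD score (ties count one half).
auc_rank <- function(score, label, positive = "GOOD") {
  pos <- label == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)                      # average ranks handle ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(42)
# Simulated stand-in for main_samp$SCORE (assumption): 42 BAD, 165 GOOD.
dat <- data.frame(
  score = c(rnorm(42, mean = 700, sd = 90), rnorm(165, mean = 760, sd = 90)),
  x     = rep(c("BAD", "GOOD"), c(42, 165))
)

# Draw 20 BAD and 20 GOOD rows with replacement, then compute the AUC.
one_auc <- function(d, n_each = 20) {
  idx <- unlist(lapply(split(seq_len(nrow(d)), d$x),
                       function(i) sample(i, n_each, replace = TRUE)))
  auc_rank(d$score[idx], d$x[idx])
}

aucs <- replicate(10, one_auc(dat))
print(round(sort(aucs), 3))
print(range(aucs))
```

With only 40 observations per resample, the printed values will typically spread over a noticeable range, which is the variability the interval reflects.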

As a final note, you are reinventing the wheel.  There are several 
packages that deal with ROC curves.  Two I like in particular are ROCR 
and pROC.  The latter even has built in routines for computing 
confidence intervals for the AUC using bootstrap replication.
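
For reference, a minimal sketch of the pROC route, assuming the pROC package is installed and using simulated scores in place of main_samp$SCORE (an assumption). It uses `roc()`, `auc()` and `ci.auc()` from pROC; the argument values shown are illustrative choices, not the poster's settings.

```r
library(pROC)

set.seed(42)
# Simulated stand-in data (assumption): 42 BAD and 165 GOOD scores.
score <- c(rnorm(42, 700, 90), rnorm(165, 760, 90))
x <- factor(rep(c("BAD", "GOOD"), c(42, 165)), levels = c("BAD", "GOOD"))

# levels = c(control, case); direction = "<" says controls tend to score lower.
roc_obj <- roc(response = x, predictor = score,
               levels = c("BAD", "GOOD"), direction = "<")
print(auc(roc_obj))

# Stratified bootstrap interval for the AUC: lower bound, median, upper bound.
ci_boot <- ci.auc(roc_obj, method = "bootstrap", boot.n = 1000,
                  boot.stratified = TRUE, conf.level = 0.95)
print(ci_boot)
```

Here the bootstrap resamples the full data set (stratified by class), rather than drawing a fixed 20 per class.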

> Kind regards,
> Taby
>
>
> --- On Wed, 3/16/11, Brian Diggs  wrote:
>
> From: Brian Diggs
> Subject: Re: calculating AUCs for each of the 1000 boot strap samples
> To: tab...@yahoo.com
> Cc: "R help"
> Date: Wednesday, March 16, 2011, 10:42 PM
>
> On 3/16/2011 8:04 AM, taby gathoni wrote:
>> data<-data.frame(id=1:(165+42),main_samp$SCORE,
>>   x=rep(c("BAD","GOOD"),c(42,165)))
>> f<-function(x) {
>> + str.sample<-list()
>> + for (i in 1:length(levels(x$x)))
>> + {
>> + str.sample[[i]]<-x[x$x==levels(x$x)[i],][sample(tapply(x$x,x$x,length)[i],20,rep=T),]
>> + }
>> + strat.sample<-do.call("rbind",str.sample)
>> + return(strat.sample$main_samp.SCORE)
>> + }
>> f(data)
>>  [1] 706 633 443 843 756 743 730 843 706 730 606 743 768 768 743 763 608 730
>>      743 743 530 813 813 831 793 900 793 693 900 738 706 831
>> [33] 818 758 718 831 768 638 770 738
>> repl<-list()
>> auc<-list()
>> for(i in 1:1000)
>> + {
>> + repl[[i]]<-f(data)
>> + auc[[i]]<-colAUC(repl[[i]],rep(c("BAD","GOOD")),plotROC=FALSE,alg="ROC")
>> + }
>> Error in
>>   colAUC(repl[[i]], rep(c("BAD", "GOOD")), plotROC = FALSE, alg = "ROC") :
>>   colAUC: length(y) and nrow(X) must be the same
>>
>> Thanks a lot,
>> Taby
>
> I think (though I can't check because the example is not reproducible without 
> main_samp$SCORE), that the problem is that the second argument to colAUC 
> should be
> rep(c("BAD", "GOOD"), c(20,20))
> The error is that repl[[i]] is length 40 while rep(c("BAD", "GOOD")) is 
> length 2.
>
> P.S. When giving an example, it is better to not include the prompts and 
> continuation prompts.  Copy it from the script rather than the output. 
> Relevant output can then be included as script comments (prefixed with #).  
> That makes cutting-and-pasting to test easier.
>
> -- Brian S. Diggs, PhD
> Senior Research Associate, Department of Surgery
> Oregon Health & Science University
>

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University



  

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] calculating AUCs for each of the 1000 boot strap samples

2011-03-17 Thread Frank Harrell
Taby,

At the end of your note, are you referring to the bootstrap confidence
intervals in the "external validation" case, i.e., not corrected for
overfitting?  If so, you can get that without the bootstrap (e.g., the Hmisc
package's rcorr.cens function).

You can get bootstrap overfitting-corrected ROC areas ("C-index") easily
using the rms package's validate function, though not confidence intervals.

Frank



-
Frank Harrell
Department of Bi

Re: [R] calculating AUCs for each of the 1000 boot strap samples

2011-03-17 Thread Brian Diggs

Taby,

First, it is better to reply to the whole list (which I have included on 
this reply); there is a better chance of someone helping you.  Just 
because I could help with one aspect does not mean I necessarily can (or 
have the time to) help with more.


Further comments are inline below.

On 3/16/2011 10:45 PM, taby gathoni wrote:

> Hi Brian,
>
> Thanks for this comment; I will act on it. Thanks also for the
> comment; Andrija also advised the same thing and it worked like magic.
>
> My next course of action was to get the confidence intervals for the
> AUC values.
>
> I computed the confidence intervals manually. For the 99% CI I cut out
> the first 5 and last 5 values after ranking the AUCs, while for the
> 95% CI I cut out the first 25 and last 25.


In general, this is right.  The middle 95% excludes the 2.5% on the 
ends, so for 1000 samples that is excluding the 25 most extreme values.



> and this is my output:
>
>              Upper bound   Lower bound
> at 99% CI    0.8175        0.50125
> at 95% CI    0.7775        0.50375
>
> From my understanding, because these are small samples of 20 GOOD and
> 20 BAD, the variation in the upper and lower bounds should be minimal
> across the 1000 samples.


I don't know why you would necessarily expect the variance to be 
minimal.  It is what it is.  Also, I don't know why you took 20 of each 
rather than just a random sub-sample.



> If you get time, would you be in a position to help me find out
> why I have such huge variations? Thank you for taking the time to respond.


Maybe pull out 10 of your bootstrap samples and look at the ROC curves 
themselves and their associated AUC.  That might give you a sense as to 
the variability that is possible (which is reflected in the confidence 
interval).


As a final note, you are reinventing the wheel.  There are several 
packages that deal with ROC curves.  Two I like in particular are ROCR 
and pROC.  The latter even has built in routines for computing 
confidence intervals for the AUC using bootstrap replication.




--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University



Re: [R] calculating AUCs for each of the 1000 boot strap samples

2011-03-16 Thread Brian Diggs

On 3/16/2011 8:04 AM, taby gathoni wrote:

> data<-data.frame(id=1:(165+42),main_samp$SCORE,
>   x=rep(c("BAD","GOOD"),c(42,165)))
> f<-function(x) {
> + str.sample<-list()
> + for (i in 1:length(levels(x$x)))
> + {
> + str.sample[[i]]<-x[x$x==levels(x$x)[i],][sample(tapply(x$x,x$x,length)[i],20,rep=T),]
> + }
> + strat.sample<-do.call("rbind",str.sample)
> + return(strat.sample$main_samp.SCORE)
> + }
> f(data)
>  [1] 706 633 443 843 756 743 730 843 706 730 606 743 768 768 743 763 608 730
>      743 743 530 813 813 831 793 900 793 693 900 738 706 831
> [33] 818 758 718 831 768 638 770 738
> repl<-list()
> auc<-list()
> for(i in 1:1000)
> + {
> + repl[[i]]<-f(data)
> + auc[[i]]<-colAUC(repl[[i]],rep(c("BAD","GOOD")),plotROC=FALSE,alg="ROC")
> + }
> Error in
>   colAUC(repl[[i]], rep(c("BAD", "GOOD")), plotROC = FALSE, alg = "ROC") :
>   colAUC: length(y) and nrow(X) must be the same
>
> Thanks a lot,
> Taby


I think (though I can't check because the example is not reproducible 
without main_samp$SCORE), that the problem is that the second argument 
to colAUC should be

rep(c("BAD", "GOOD"), c(20,20))
The error is that repl[[i]] is length 40 while rep(c("BAD", "GOOD")) is 
length 2.
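
The length mismatch can be checked directly in base R, as in the short sketch below. The final part, which calls colAUC with the corrected labels, only runs if the caTools package is installed and uses simulated scores in place of f(data) (an assumption, since main_samp is not available).

```r
# rep() without a times argument just returns its input unchanged.
length(rep(c("BAD", "GOOD")))              # 2

# With a times vector, each element is repeated that many times.
labels <- rep(c("BAD", "GOOD"), c(20, 20))
length(labels)                             # 40
print(table(labels))

# Corrected call, guarded so the sketch still runs without caTools.
if (requireNamespace("caTools", quietly = TRUE)) {
  set.seed(1)
  # Stand-in for f(data): 20 BAD scores followed by 20 GOOD scores.
  scores <- c(rnorm(20, 700, 90), rnorm(20, 760, 90))
  auc_val <- caTools::colAUC(scores, labels, plotROC = FALSE, alg = "ROC")
  print(auc_val)
}
```

The label order matters: f() stacks the groups in the order of levels(x$x), which is BAD then GOOD when x is a factor with alphabetical levels, so the labels must follow the same order.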


P.S. When giving an example, it is better to not include the prompts and 
continuation prompts.  Copy it from the script rather than the output. 
Relevant output can then be included as script comments (prefixed with 
#).  That makes cutting-and-pasting to test easier.


--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University



[R] calculating AUCs for each of the 1000 boot strap samples

2011-03-16 Thread taby gathoni
Hello,

I modified code given by Andrija, a contributor to the list, to achieve two
objectives:

1. Create 1000 samples from a list of 207 observations, with each sample
   containing 20 good and 20 bad. This I have achieved.
2. Calculate the AUC for each of the 1000 samples. Here I get an error.

Please see the code below and assist me.


> data<-data.frame(id=1:(165+42),main_samp$SCORE,
+   x=rep(c("BAD","GOOD"),c(42,165)))
> f<-function(x) {
+ str.sample<-list()
+ for (i in 1:length(levels(x$x)))
+ {
+ str.sample[[i]]<-x[x$x==levels(x$x)[i],][sample(tapply(x$x,x$x,length)[i],20,rep=T),]
+ }
+ strat.sample<-do.call("rbind",str.sample)
+ return(strat.sample$main_samp.SCORE)
+ }
> f(data)
 [1] 706 633 443 843 756 743 730 843 706 730 606 743 768 768 743 763 608 730
     743 743 530 813 813 831 793 900 793 693 900 738 706 831
[33] 818 758 718 831 768 638 770 738
> repl<-list()
> auc<-list()
> for(i in 1:1000)
+ {
+ repl[[i]]<-f(data)
+ auc[[i]]<-colAUC(repl[[i]],rep(c("BAD","GOOD")),plotROC=FALSE,alg="ROC")
+ }
Error in
 colAUC(repl[[i]], rep(c("BAD", "GOOD")), plotROC = FALSE, alg = "ROC") :
  colAUC: length(y) and nrow(X) must be the same

Thanks a lot,
Taby