to cut a long story short, how about that: bins<-quantile(x,0:200/200) #maybe replacing x by c(x,y) ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b)))) colnames(ctch)<-c("bin","tp","fp")
Cheers. Am 13.07.2011 18:49, schrieb Eik Vettorazzi: > Hello Lyndon, > I'm still puzzled what x and y actually are and how they are related. > Your "bins" are calculated from x only, so you might missing some range > of y and since you use a fixed number of bins of equal size, some bins > might be empty (and therefore reproducing the result from the bin > before), and some might contain more than one distinct value (and so are > ignoring an intermediate step). But this might be intentional and the > reason for you to build up your own function. > If not, have a look at pROC, there is a clever way to calculate the > required bins, in particular look at > > getAnywhere("roc.utils.thresholds") > > > Some tweaking of your code is possible anyway: > >> for(i in 1:length(threshold)) { >> tp[i] <- length(x2) - length(x2[x2 <= bins[i]]) >> fp[i] <- length(y2) - length(y2[y2 <= bins[i]]) >> } > > calculates length(x2) and length(y2) in every single loop, but this is a > constant, so it is wasting time. Anyway, you get away even without > calculating this once, using > > tp[i]<-sum(x2>bins[i]) #indexing is not necessary, > #and 1-x2[j]<=tr == x2[j]>tr > > and on top one that, the fanciest or at least R'ish way, by common > understanding, is avoiding loops and using *apply, but that's a matter > of taste, when it comes to that ;) > > tp<-sapply(bins,function(b)sum(x>b)) > fp<-sapply(bins,function(b)sum(y>b)) > > #or all at once > ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b)))) > > A last minor remark: sorting x and y is not necessary in the new code > fragment. > > hth. > > > > Am 13.07.2011 16:46, schrieb Lyndon Estes: >> Hello Eik, >> >> Thanks very much for your response and for directing me to a useful >> explanation. >> >> To make sure I am grasping your answer correctly, the two problems I >> was experiencing are related to the fact that the floating point >> numbers I was calculating and using in subsequent indices were not >> entirely equivalent, even if they were calculated using the same >> function (quantile). >> >> I confirmed this as follows: >> >> j <- 0.055 # bins value for 5.5% threshold >> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE))) >> bin.ct >> #430.29 >> bin.ct2 <- quantile(x, j, na.rm = TRUE) >> bin.ct2 >> # 5.5% >> #430.29 >> bin.ct - bin.ct2 >> # 5.5% >> #5.684342e-14 >> length(x) - length(x[x <= bin.ct]) >> length(x) - length(x[x <= bin.ct]) # 2875 >> length(x) - length(x[x <= bin.ct2]) # 3029 >> >> Testing the unname() option does not fix the result, I should however note: >> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE))) >> bin.ct2 <- unname(quantile(x, j, na.rm = TRUE)) >> bin.ct - bin.ct2 # 5.684342e-14 >> >> But rounding, as an alternative approach, works: >> bin.ct2 <- round(quantile(x, j, na.rm = TRUE), 2) >> bin.ct - bin.ct2 >> #5.5% >> # 0 >> length(x) - length(x[x <= bin.ct]) # 2875 >> length(x) - length(x[x <= bin.ct2]) # 2875 >> >> As to my code, it is part of a custom ROC function I created a while >> back and started using again recently (the data are rainfall >> values). I can't remember why I did this rather than using one of the >> existing ROC functions, but I thought (probably incorrectly) that I >> had some compelling reason. In >> any case, it is quite unwieldy, so I will explore those other >> packages, or try revise this to be more efficient >> >> (e.g. maybe this is a better approach, although the answers are fairly >> different?). >> >> x2 <- x[order(x)] >> y2 <- y[order(y)] >> bins <- round(seq(min(x2), max(x2), by = diff(range(x2)) / 200), 2) >> threshold <- seq(0, 100, by = 0.5) >> tp <- rep(0, length(bins)) >> fp <- rep(0, length(bins)) >> for(i in 1:length(threshold)) { >> tp[i] <- length(x2) - length(x2[x2 <= bins[i]]) >> fp[i] <- length(y2) - length(y2[y2 <= bins[i]]) >> } >> ctch <- cbind(threshold, bins, tp, fp) >> ctch[1:20, ] >> >> Thanks again for your help. >> >> Cheers, Lyndon >> >> >> On Tue, Jul 12, 2011 at 5:09 AM, Eik Vettorazzi >> <e.vettora...@uke.uni-hamburg.de> wrote: >>> Hi, >>> >>> Am 11.07.2011 22:57, schrieb Lyndon Estes: >>>> ctch[ctch$threshold == 3.5, ] >>>> # [1] threshold val tp fp tn fn tpr >>>> fpr tnr fnr >>>> #<0 rows> (or 0-length row.names) >>> >>> this is the very effective FAQ 7.31 trap. >>> http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f >>> >>> Welcome to the first circle of Patrick Burns' R Inferno! >>> >>> Also, unname() is a more intuitive way of removing names. >>> >>> And I think your code is quite inefficient, because you calculate >>> quantiles many times, which involves repeated ordering of x, and you may >>> use a inefficient size of bin (either to small and therefore calculating >>> the same split many times or to large and then missing some splits). >>> I'm a bit puzzled what is x and y in your code, so any further advise is >>> vague but you might have a look at any package that calculates >>> ROC-curves such as ROCR or pROC (and many more). >>> >>> Hth >>> >>> -- >>> Eik Vettorazzi >>> >>> Department of Medical Biometry and Epidemiology >>> University Medical Center Hamburg-Eppendorf >>> >>> Martinistr. 52 >>> 20246 Hamburg >>> >>> T ++49/40/7410-58243 >>> F ++49/40/7410-57790 >>> >> >> >> > -- Eik Vettorazzi Institut für Medizinische Biometrie und Epidemiologie Universitätsklinikum Hamburg-Eppendorf Martinistr. 52 20246 Hamburg T ++49/40/7410-58243 F ++49/40/7410-57790 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.