Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution
Please allow one (hopefully ;) last question: Do you think the code I adopted from Hänggi is valid in selecting a contour which encloses e.g. 68% of the scatter-plot data? - I am still not completely sure... still looking for the reason for the different results of Hänggi's and Forester's code. They should be equivalent in theory. Felix On 03.03.12 22:13, peter dalgaard wrote: > > On Mar 3, 2012, at 20:25 , drflxms wrote: > >> "Once you go into two dimensions, SD loses all meaning, and adding >> nonparametric density estimation into the mix doesn't help, so just stop >> thinking in those terms!" >> >> This makes me really think a lot! Is plotting the 0.68 confidence >> interval in 2D as equivalent to +-1 SD really nonsense!? > > > No no... a 68% CI (or rather, prediction region) is a 68% region, no problem > in that. And choosing that particular value because it is traditional in 1D is > quite defensible. It's just that there is nothing (at least not anything > obvious) that it corresponds to 1 SD of.
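One way to check this empirically (a sketch, assuming den and confidence.border as defined in the comparison code quoted later in this digest): draw a fresh sample from the same distribution, look up each fresh point's estimated density on the kde2d grid, and see what fraction lies above the contour level.

## Empirical coverage check (sketch; assumes den and confidence.border
## from the comparison code below in this digest)
set.seed(1)
x2 <- rnorm(10000); y2 <- rnorm(10000)
ok <- x2 > min(den$x) & x2 < max(den$x) &
      y2 > min(den$y) & y2 < max(den$y)
x2 <- x2[ok]; y2 <- y2[ok]                 # keep only points on the grid
z2 <- numeric(length(x2))
for (i in seq_along(x2)) {
  z2[i] <- den$z[max(which(den$x < x2[i])), max(which(den$y < y2[i]))]
}
mean(z2 >= confidence.border)              # should be near the nominal coverage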
Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution
Phew, I am really happy to read that the idea was not completely senseless. That would have badly shaken my already shaky view of the world of statistics ;) In any case I need to get to know more about bivariate normal distributions! Any literature recommendations? Felix On 03.03.12 22:13, peter dalgaard wrote: > > On Mar 3, 2012, at 20:25 , drflxms wrote: > >> "Once you go into two dimensions, SD loses all meaning, and adding >> nonparametric density estimation into the mix doesn't help, so just stop >> thinking in those terms!" >> >> This makes me really think a lot! Is plotting the 0.68 confidence >> interval in 2D as equivalent to +-1 SD really nonsense!? > > > No no... a 68% CI (or rather, prediction region) is a 68% region, no problem > in that. And choosing that particular value because it is traditional in 1D is > quite defensible. It's just that there is nothing (at least not anything > obvious) that it corresponds to 1 SD of.
Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution
Greg, extremely cool thoughts! Thank you for delving into this so deeply. As I mentioned, I am a neurologist with unfortunately poor statistical training. You are professional statisticians. So I'd like to apologize beforehand for any unprofessional nomenclature and strange thoughts. As my previous postings show, I think a lot in pictures. That is why I found the interactive demo(bivar) in the rgl package so impressive. I believe everything you are writing is completely true for theoretical/parametric normal distributions. When you plot the density of such a probability distribution it will have the form of a sugar loaf (skewed or unskewed doesn't matter in this case). The projection of the standard deviations of all possible slices onto the x-y-plane below the sugar loaf will certainly be an ellipse. Correspondingly, you would get nested ellipses when you apply contour() to the 2d scatter plot of data from that bivariate distribution. But the samples generated via rnorm() are random. demo(bivar) shows both for such an rnorm() sample: 1.) the "sugar loaf" parametric theoretical 3d normal density, and 2.) the crumpled non-parametric "real" 3d density (looks like a mountain), which is derived from the data of the 2d scatter plot. I picked up the terms parametric and non-parametric from the code of demo(bivar), by the way. In this case contour() will not draw nested ellipses but contour lines similar to those on a mountain map. As far as I understand, this is closer to reality because fewer assumptions are made compared to the parametric approach. That's why I preferred contours over ellipses. Hopefully I have made my thoughts clear and they are not completely ridiculous. Thank you again for your detailed explanations, which I am going to study in detail right now... :) Just wanted to answer before that more detailed study, as it will certainly take some time... Best, Felix On 03.03.12 22:14, Greg Snow wrote: > To further explain: if you want contours of a bivariate normal, then > you want ellipses. The density of a bivariate normal (with 0 > correlation to keep things simple, but the theory extends to > correlated cases) is proportional to exp( -1/2 * (x1^2/v1 + x2^2/v2) ), > so a contour of the distribution will be all points such that x1^2/v1 > + x2^2/v2 = c for some constant c (each c gives a different > contour), but that is the definition of an ellipse (well, divide both > sides by c so that the right side is 1 to get the canonical form). > The ellipse function in the ellipse package chooses c from the chi-squared > distribution (since if x1 and x2 are normally distributed with > mean 0 (or have the mean subtracted), then x1^2/v1 + x2^2/v2 is chi-squared > distributed with 2 degrees of freedom). > > So if you really want to, you can try to approximate the contours in > some other way, but any decent approach will just converge to the > ellipse. > > On Sat, Mar 3, 2012 at 1:26 PM, drflxms wrote: >> Wow, David, >> >> thank you for these sources, which I just screened. bagplot looks most >> promising to me. I found it in the package ‘aplpack’ as well as in the R >> Graph Gallery >> http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=112 >> >> Ellipses are not exactly what I am heading for. I am looking for a 2D >> equivalent of plotting 1D standard deviation boundaries. In other words: >> how to plot SD boundaries on a 2D scatter plot. >> So I started searching for contour/boundaries etc.
instead of ellipse, >> leading me to Pascal Hänggi: >> http://www.mail-archive.com/r-help@r-project.org/msg24013.html >> >> To describe it in an image: I want to cut the density mountain above the >> scatter plot (see demo(bivar) in the rgl package) in a way that the >> part of the mountain that covers 68% of the data on the x-y-plane below >> it (+-1 SD) is removed. Then I'd like to project the edge that results >> from the cut onto the x-y-plane below the mountain. This should be the 2d >> equivalent of 1 SD boundaries. >> >> I think this might be achieved by Hänggi's code as well as by Forester's >> function. Unfortunately they result in slightly different >> boundaries, which shouldn't be the case. And I did not figure out which >> one is correct, if one is correct at all (!?). >> >> Can anyone explain the difference? >> >> I compared them with this code: >> >> # parameters: >> n<-100 >> >> # generate samples: >> set.seed(138813) >> x<-rnorm(n); y<-rnorm(n) >> a<-list("x"=x,"y"=y) # input for Forester's function which is appended at >> the very end >> >> # estimate non-parametric density surface via kernel smoothing
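As a visual cross-check of Greg's chi-square argument, a minimal parametric sketch (assuming the ellipse package is installed): draw the exact normal-theory 68% ellipse over the same simulated data, so it can be laid next to the kernel-density contours discussed above.

## Parametric 68% region via the ellipse package (sketch)
library(ellipse)
set.seed(138813)
x <- rnorm(100); y <- rnorm(100)
plot(x, y)
lines(ellipse(cor(x, y), scale = c(sd(x), sd(y)),
              centre = c(mean(x), mean(y)), level = 0.6827),
      col = "blue")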
Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution
Wow, David, thank you for these sources, which I just screened. bagplot looks most promising to me. I found it in the package ‘aplpack’ as well as in the R Graph Gallery http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=112 Ellipses are not exactly what I am heading for. I am looking for a 2D equivalent of plotting 1D standard deviation boundaries. In other words: how to plot SD boundaries on a 2D scatter plot. So I started searching for contour/boundaries etc. instead of ellipse, leading me to Pascal Hänggi: http://www.mail-archive.com/r-help@r-project.org/msg24013.html To describe it in an image: I want to cut the density mountain above the scatter plot (see demo(bivar) in the rgl package) in a way that the part of the mountain that covers 68% of the data on the x-y-plane below it (+-1 SD) is removed. Then I'd like to project the edge that results from the cut onto the x-y-plane below the mountain. This should be the 2d equivalent of 1 SD boundaries. I think this might be achieved by Hänggi's code as well as by Forester's function. Unfortunately they result in slightly different boundaries, which shouldn't be the case. And I did not figure out which one is correct, if one is correct at all (!?). Can anyone explain the difference? I compared them with this code: # parameters: n<-100 # generate samples: set.seed(138813) x<-rnorm(n); y<-rnorm(n) a<-list("x"=x,"y"=y) # input for Forester's function which is appended at the very end # estimate non-parametric density surface via kernel smoothing library(MASS) den<-kde2d(x, y, n=n) z <- array() for (i in 1:n){ z.x <- max(which(den$x < x[i])) z.y <- max(which(den$y < y[i])) z[i] <- den$z[z.x, z.y] } # store class/level borders of confidence interval in variables confidence.border <- quantile(z, probs=0.05, na.rm = TRUE) # 0.05 corresponds to 0.95 in draw.contour plot(x,y) draw.contour(a, alpha=0.95) par(new=TRUE) contour(den, levels=confidence.border, col = "red", add = TRUE) ### ## drawcontour.R ## Written by J.D. Forester, 17 March 2008 ### ## This function draws an approximate density contour based on empirical, bivariate data. ## change testit to FALSE if sourcing the file testit=TRUE draw.contour<-function(a,alpha=0.95,plot.dens=FALSE, line.width=2, line.type=1, limits=NULL, density.res=800,spline.smooth=-1,...){ ## a is a list or matrix of x and y coordinates (e.g., a=list("x"=rnorm(100),"y"=rnorm(100))) ## if a is a list or dataframe, the components must be labeled "x" and "y" ## if a is a matrix, the first column is assumed to be x, the second y ## alpha is the contour level desired ## if plot.dens==TRUE, then the joint density of x and y are plotted, ## otherwise the contour is added to the current plot. ## density.res controls the resolution of the density plot ## A key assumption of this function is that very little probability mass lies outside the limits of ## the x and y values in "a". This is likely reasonable if the number of observations in a is large. require(MASS) require(ks) if(length(line.width)!=length(alpha)){ line.width <- rep(line.width[1],length(alpha)) } if(length(line.type)!=length(alpha)){ line.type <- rep(line.type[1],length(alpha)) } if(is.matrix(a)){ a=list("x"=a[,1],"y"=a[,2]) } ## generate approximate density values if(is.null(limits)){ limits=c(range(a$x),range(a$y)) } f1<-kde2d(a$x,a$y,n=density.res,lims=limits) ## plot empirical density if(plot.dens) image(f1,...)
if(is.null(dev.list())){ ##ensure that there is a window in which to draw the contour plot(a,type="n",xlab="X",ylab="Y") } ##estimate critical contour value ## assume that density outside of plot is very small zdens <- rev(sort(f1$z)) Czdens <- cumsum(zdens) Czdens <- (Czdens/Czdens[length(zdens)]) for(cont.level in 1:length(alpha)){ ##This loop allows for multiple contour levels crit.val <- zdens[max(which(Czdens<=alpha[cont.level]))] ##determine coordinates of critical contour b.full=contourLines(f1,levels=crit.val) for(c in 1:length(b.full)){ ##This loop is used in case the density is multimodal or if the desired contour ## extends outside the plotting region b=list("x"=as.vector(unlist(b.full[[c]][2])),"y"=as.vector(unlist(b.full[[c]][3]))) ##plot desired contour line.dat<-xspline(b,shape=spline.smooth,open=TRUE,draw=FALSE) lines(line.dat,lty=line.type[cont.level],lwd=line.width[cont.level]) } } } ## ##Test the function ## ##generate data if(testit){ n=1 a<-list("x"=rnorm(n,400
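A plausible explanation for the discrepancy, distilled from the two listings above (a sketch, no new functionality): Hänggi picks the level so that the desired share of the sample points has estimated density above it, whereas Forester picks the level so that the grid cells above it hold the desired share of the summed density mass. These are two different estimators of the same contour and need not agree exactly for n = 100.

## The two contour-level rules side by side (sketch; den and z as above)
## Hänggi: 68% of the sample points have estimated density above the level
level.haenggi  <- quantile(z, probs = 1 - 0.6827, na.rm = TRUE)
## Forester: the grid cells above the level hold ~68% of the density mass
zdens          <- rev(sort(den$z))
level.forester <- zdens[max(which(cumsum(zdens)/sum(zdens) <= 0.6827))]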
Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution
Thank you very much for your thoughts! Exactly what you mention is what I have been thinking about during the last hours: What is the relation between the den$z distribution and the z distribution? That's why I asked for ecdf(distribution)(value)->percentile earlier today (thank you again for your quick and insightful answer on that!). I used it to compare certain values in both distributions by their percentile. I really think you are completely right: I urgently need some lessons in bivariate/multivariate normal distributions. (I am a neurologist and unfortunately did not learn too much about statistics in university :-() I'll take your statement as a starter: "Once you go into two dimensions, SD loses all meaning, and adding nonparametric density estimation into the mix doesn't help, so just stop thinking in those terms!" This makes me really think a lot! Is plotting the 0.68 confidence interval in 2D as equivalent to +-1 SD really nonsense!? By the way, it all started quite harmlessly. I was asked to draw an example of the well known target analogy for accuracy and precision based on "real" (=simulated) data (see e.g. http://en.wikipedia.org/wiki/Accuracy_and_precision for a simple hand-made 2d graphic). Well, I did, by set.seed(138813) x<-rnorm(n); y<-rnorm(n) plot(x,y) I was asked whether it might be possible to add a histogram with superimposed normal curve to the drawing: no problem. "And where is the standard deviation?" - well, abline(v=sd(... OK. Then I realized that this is of course only true for one of the distributions (x) and only in one "slice" of the scatterplot of x and y. The real thing is a 3d density map above the scatterplot. A very nice example of this is demo(bivar) in the rgl package (for a picture see e.g. http://rgl.neoscientists.org/gallery.shtml, upper right corner). Great! But how to correctly draw the standard deviation boundaries for the "shots on the target" (the scatterplot of x and y)... I'd be grateful for hints on what to read on that matter (book, website etc.) Greetings from Munich, Felix. On 03.03.12 19:22, peter dalgaard wrote: > > On Mar 3, 2012, at 17:01 , drflxms wrote: > >> # this is the critical block, which I still do not comprehend in detail >> z <- array() >> for (i in 1:n){ >> z.x <- max(which(den$x < x[i])) >> z.y <- max(which(den$y < y[i])) >> z[i] <- den$z[z.x, z.y] >> } > > As far as I can tell, the point is to get at density values corresponding to > the values of (x,y) that you actually have in your sample, as opposed to > den$z which is for an extended grid of all possible (x_i, y_j) combinations. > > It's unclear to me what happens if you look at quantiles for the entire > den$z. I kind of suspect that it is some sort of approximate numerical > integration, but maybe not of the right thing. > > Re SD: Once you go into two dimensions, SD loses all meaning, and adding > nonparametric density estimation into the mix doesn't help, so just stop > thinking in those terms!
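For readers puzzling over the same block, here it is once more with line-by-line comments (identical logic, only annotated):

## The "critical block", annotated (same logic as quoted above)
z <- array()
for (i in 1:n) {
  z.x  <- max(which(den$x < x[i]))  # grid column just left of x[i]
  z.y  <- max(which(den$y < y[i]))  # grid row just below y[i]
  z[i] <- den$z[z.x, z.y]           # estimated density at sample point i
}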
Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution
Thanks a lot, Greg, for the hint and for not leaving me alone with this! I tried ellipse and it works well. But I am looking for something more precise. The ellipse approximates the real border by the nearest possible ellipse; I want the "real" contour, if possible. Meanwhile I found an interesting function named draw.contour http://quantitative-ecology.blogspot.com/2008/03/plotting-contours.html provided by "Forester", an "Assistant Professor in the Department of Fisheries". - I love how statistics brings people from completely different specialties together; I am a neurologist. - This function does not produce exactly the same curve as the code I posted last, but it is indeed very close (parallel). So I think both solutions are probably valid and differ only because of rounding and method (!?). Still, I am not completely happy with 1. the numeric solution of setting 68% to get 1 SD - I'd prefer a symbolic, formula-driven solution instead - and 2. me not comprehending the code completely. While 2. will hopefully be solved soon by my delving into the code, 1. remains... Best, Felix On 03.03.12 17:28, Greg Snow wrote: > Look at the ellipse package (and the ellipse function in the package) > for a simple way of showing a confidence region for bivariate data on > a plot (a 68% confidence interval is about 1 SD if you just want to > show 1 SD). > > On Sat, Mar 3, 2012 at 7:54 AM, drflxms wrote: >> Dear all, >> >> I created a bivariate normal distribution: >> >> set.seed(138813) >> n<-100 >> x<-rnorm(n); y<-rnorm(n) >> >> and plotted a scatterplot of it: >> >> plot(x,y) >> >> Now I'd like to add the 2D standard deviation. >> >> I found a thread regarding plotting arbitrary confidence boundaries from >> Pascal Hänggi >> http://www.mail-archive.com/r-help@r-project.org/msg24013.html >> which cites the even older thread >> http://tolstoy.newcastle.edu.au/R/help/03b/5384.html >> >> As I am unfortunately only a very poor R programmer, the code of Pascal >> Hänggi is a mystery to me and I am not sure whether I was able to translate >> the recommendation of Brian Ripley in the later thread (which provides >> no code) into the correct R code. Brian wrote: >> >> You need a 2D density estimate (e.g. kde2d in MASS) then compute the >> density values at the points and draw the contour of the density which >> includes 95% of the points (at a level computed from the sorted values >> via quantile()). [95% confidence interval was desired in thread instead >> of standard deviation...] >> >> So I tried this... >> >> den<-kde2d(x, y, n=n) # as I chose n to be the same as during creating >> the distributions x and y (see above), a z-value is assigned to every >> combination of x and y. >> >> # create a sorted vector of z-values (instead of the matrix stored >> inside the den object) >> den.z <-sort(den$z) >> >> # set desired confidence border to draw and store it in variable >> confidence.border <- quantile(den.z, probs=0.6827, na.rm = TRUE) >> >> # draw a line representing confidence.border on the existing scatterplot >> par(new=TRUE) >> contour(den, levels=confidence.border, col = "red", add = TRUE) >> >> Unfortunately I doubt very much this is correct :( In fact I am sure >> this is wrong, because the border for probs=0.05 is drawn outside the >> values. So please help and check.
>> Pascal Hänggi's code seems to work, but I don't understand the magic he >> does with >> >> pp <- array() >> for (i in 1:1000){ >> z.x <- max(which(den$x < x[i])) >> z.y <- max(which(den$y < y[i])) >> pp[i] <- den$z[z.x, z.y] >> } >> >> before doing the very same as I did above: >> >> confidencebound <- quantile(pp, 0.05, na.rm = TRUE) >> >> plot(x, y) >> contour(den, levels = confidencebound, col = "red", add = TRUE) >> >> >> My problems: >> >> 1.) setting probs=0.6827 is somehow a dirty trick which I can only use >> by simply knowing that this is the percentage of values inside +-1 SD >> when a distribution is normal. Is there a way to do this with the "native" >> sd function? >> sd(den.z) is not correct, as den.z, in contrast to x and y, is not normal >> any more. So ecdf(den.z)(sd(den.z)) results in a percentile of 0.5644 in >> this example instead of the desired 0.6827. >> >> 2.) I would like to have code that works with any desired confidence. >> Unfortunately setting probs to the desired confidence would probably be >> wrong (?!) as it relates to den.z instead of x and y, which are the >> underlying distributions I am interested in.
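Regarding problem 1.), the hard-coded 0.6827 can be replaced by a formula using only base R (a sketch): the mass within +/- k standard deviations of a univariate normal is 2*pnorm(k) - 1.

## Formula instead of the magic constant (sketch)
k     <- 1                  # number of standard deviations
probs <- 2 * pnorm(k) - 1   # 0.6826895 for k = 1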
Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution
OK, the following seems to work; I still do not understand exactly why... library(MASS) # parameters: n<-100 # generate samples: set.seed(138813) #seed <- .Random.seed x<-rnorm(n); y<-rnorm(n) # estimate non-parametric density surface via kernel smoothing den<-kde2d(x, y, n=n) # store z values of density estimate in an extra variable den.z <-den$z # this is the critical block, which I still do not comprehend in detail z <- array() for (i in 1:n){ z.x <- max(which(den$x < x[i])) z.y <- max(which(den$y < y[i])) z[i] <- den$z[z.x, z.y] } # store class/level borders of confidence interval in variables confidence.border <- quantile(z, probs=1-0.6827, na.rm = TRUE) # +-1 SD plot(x,y) par(new=TRUE) contour(den, levels=confidence.border, col = "red", add = TRUE) The results look plausible, at least. I do not know whether they are really true. Still hoping for some replies ;), Felix On 03.03.12 15:54, drflxms wrote: > Dear all, > > I created a bivariate normal distribution: > > set.seed(138813) > n<-100 > x<-rnorm(n); y<-rnorm(n) > > and plotted a scatterplot of it: > > plot(x,y) > > Now I'd like to add the 2D standard deviation. > > I found a thread regarding plotting arbitrary confidence boundaries from > Pascal Hänggi > http://www.mail-archive.com/r-help@r-project.org/msg24013.html > which cites the even older thread > http://tolstoy.newcastle.edu.au/R/help/03b/5384.html > > As I am unfortunately only a very poor R programmer, the code of Pascal > Hänggi is a mystery to me and I am not sure whether I was able to translate > the recommendation of Brian Ripley in the later thread (which provides > no code) into the correct R code. Brian wrote: > > You need a 2D density estimate (e.g. kde2d in MASS) then compute the > density values at the points and draw the contour of the density which > includes 95% of the points (at a level computed from the sorted values > via quantile()). [95% confidence interval was desired in thread instead > of standard deviation...] > > So I tried this... > > den<-kde2d(x, y, n=n) # as I chose n to be the same as during creating > the distributions x and y (see above), a z-value is assigned to every > combination of x and y. > > # create a sorted vector of z-values (instead of the matrix stored > inside the den object) > den.z <-sort(den$z) > > # set desired confidence border to draw and store it in variable > confidence.border <- quantile(den.z, probs=0.6827, na.rm = TRUE) > > # draw a line representing confidence.border on the existing scatterplot > par(new=TRUE) > contour(den, levels=confidence.border, col = "red", add = TRUE) > > Unfortunately I doubt very much this is correct :( In fact I am sure > this is wrong, because the border for probs=0.05 is drawn outside the > values. So please help and check. > Pascal Hänggi's code seems to work, but I don't understand the magic he > does with > > pp <- array() > for (i in 1:1000){ > z.x <- max(which(den$x < x[i])) > z.y <- max(which(den$y < y[i])) > pp[i] <- den$z[z.x, z.y] > } > > before doing the very same as I did above: > > confidencebound <- quantile(pp, 0.05, na.rm = TRUE) > > plot(x, y) > contour(den, levels = confidencebound, col = "red", add = TRUE) > > > My problems: > > 1.) setting probs=0.6827 is somehow a dirty trick which I can only use > by simply knowing that this is the percentage of values inside +-1 SD > when a distribution is normal. Is there a way to do this with the "native" > sd function? > sd(den.z) is not correct, as den.z, in contrast to x and y, is not normal
> any more. So ecdf(den.z)(sd(den.z)) results in a percentile of 0.5644 in > this example instead of the desired 0.6827. > > 2.) I would like to have code that works with any desired confidence. > Unfortunately setting probs to the desired confidence would probably be > wrong (?!) as it relates to den.z instead of x and y, which are the > underlying distributions I am interested in. In short, I want the > confidence of x/y and not of den.z. > > > I am really completely stuck. Please help me out of this! Felix
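One refinement of the lookup loop (my suggestion, assuming the fields package is installed; not part of the original thread): fields::interp.surface() bilinearly interpolates a kde2d-style grid at arbitrary points instead of snapping each point to the nearest lower grid node.

## Bilinear instead of nearest-lower-grid-node lookup (sketch; assumes
## the fields package; den, x, y as above)
library(fields)
z <- interp.surface(den, cbind(x, y))
confidence.border <- quantile(z, probs = 1 - 0.6827, na.rm = TRUE)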
[R] contour for plotting confidence interval on scatter plot of bivariate normal distribution
Dear all, I created a bivariate normal distribution: set.seed(138813) n<-100 x<-rnorm(n); y<-rnorm(n) and plotted a scatterplot of it: plot(x,y) Now I'd like to add the 2D standard deviation. I found a thread regarding plotting arbitrary confidence boundaries from Pascal Hänggi http://www.mail-archive.com/r-help@r-project.org/msg24013.html which cites the even older thread http://tolstoy.newcastle.edu.au/R/help/03b/5384.html As I am unfortunately only a very poor R programmer, the code of Pascal Hänggi is a mystery to me and I am not sure whether I was able to translate the recommendation of Brian Ripley in the later thread (which provides no code) into the correct R code. Brian wrote: You need a 2D density estimate (e.g. kde2d in MASS) then compute the density values at the points and draw the contour of the density which includes 95% of the points (at a level computed from the sorted values via quantile()). [95% confidence interval was desired in thread instead of standard deviation...] So I tried this... den<-kde2d(x, y, n=n) # as I chose n to be the same as during creating the distributions x and y (see above), a z-value is assigned to every combination of x and y. # create a sorted vector of z-values (instead of the matrix stored inside the den object) den.z <-sort(den$z) # set desired confidence border to draw and store it in variable confidence.border <- quantile(den.z, probs=0.6827, na.rm = TRUE) # draw a line representing confidence.border on the existing scatterplot par(new=TRUE) contour(den, levels=confidence.border, col = "red", add = TRUE) Unfortunately I doubt very much this is correct :( In fact I am sure this is wrong, because the border for probs=0.05 is drawn outside the values. So please help and check. Pascal Hänggi's code seems to work, but I don't understand the magic he does with pp <- array() for (i in 1:1000){ z.x <- max(which(den$x < x[i])) z.y <- max(which(den$y < y[i])) pp[i] <- den$z[z.x, z.y] } before doing the very same as I did above: confidencebound <- quantile(pp, 0.05, na.rm = TRUE) plot(x, y) contour(den, levels = confidencebound, col = "red", add = TRUE) My problems: 1.) setting probs=0.6827 is somehow a dirty trick which I can only use by simply knowing that this is the percentage of values inside +-1 SD when a distribution is normal. Is there a way to do this with the "native" sd function? sd(den.z) is not correct, as den.z, in contrast to x and y, is not normal any more. So ecdf(den.z)(sd(den.z)) results in a percentile of 0.5644 in this example instead of the desired 0.6827. 2.) I would like to have code that works with any desired confidence. Unfortunately setting probs to the desired confidence would probably be wrong (?!) as it relates to den.z instead of x and y, which are the underlying distributions I am interested in. In short, I want the confidence of x/y and not of den.z. I am really completely stuck. Please help me out of this! Felix
Re: [R] percentile of a given value: is there a "reverse" quantile function?
Thank you a lot, Peter, Stefan and Pascal, for your quick and inspiring answers. ecdf(distribution)(value)->percentile was exactly what I was looking for, as it is in my eyes somehow the equivalent of quantile(distribution, percentile)->value, isn't it? Greetings from sunny Munich, Felix On 03.03.12 14:33, peter dalgaard wrote: > > On Mar 3, 2012, at 13:37 , drflxms wrote: > >> Dear all, >> >> I am familiar with obtaining the value corresponding to a chosen >> probability via the quantile function. >> Now I am facing the opposite problem: I have a value and want to know its >> corresponding percentile in the distribution. So is there a function for >> this as well? > > For a single value, mean(x <= a) will do. Otherwise check ecdf(). > >> x <- rnorm(100) >> mean(x <= 2) > [1] 0.97 >> ecdf(x)(2) > [1] 0.97 >> ecdf(x)(-3:3) > [1] 0.00 0.01 0.14 0.48 0.80 0.97 1.00 > > If you need values for your original data points, rank(x)/length(x) should do > (bar missing value issues). > >> >> Thank you for your support in advance, Felix
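To illustrate the equivalence Felix describes (a sketch): ecdf() returns a function, so it can be applied directly to a value, and quantile() approximately inverts it.

## ecdf() as the "reverse" of quantile() (sketch)
x <- rnorm(100)
p <- ecdf(x)(0.5)   # percentile of the value 0.5
quantile(x, p)      # recovers (approximately) the value 0.5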
[R] percentile of a given value: is there a "reverse" quantile function?
Dear all, I am familiar with obtaining the value corresponding to a chosen probability via the quantile function. Now I am facing the opposite problem: I have a value and want to know its corresponding percentile in the distribution. So is there a function for this as well? Thank you for your support in advance, Felix
[R] plotting standard deviation of multivariate normal distribution (preferred in rgl package)
Dear R colleagues, for a statistics tutorial I want to develop a nice 3d graphic of the well known target comparison/analogy of accuracy and precision (see e.g. http://en.wikipedia.org/wiki/Accuracy_and_precision for a simple hand-made 2d graphic). The code for a really beautiful graphic is already provided as demo(bivar) in the rgl package (for a picture see e.g. http://rgl.neoscientists.org/gallery.shtml, upper right corner). Now I'd like to add the standard deviation to the 3d plot. Unfortunately I couldn't figure out how to do that. What I did so far is: assuming you have executed the code of the demo mentioned above, you can plot a 2d scatter plot of the data ("the shots on the target") via a simple plot(x,y) and add a contour plot via par(new=TRUE) contour(denobj, nlevels=8) So the contour levels are about what I am looking for, but... 1. how to make the contour plot display exactly standard deviations as contours and not arbitrary levels? 2. how to project the standard deviation contours onto the 3d surface? Probably there are (as always) lots of different solutions as well. I'd appreciate any kind of help very much! Greetings from Munich, Felix
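For question 1, a parametric sketch (assuming the ellipse package, and taking the "SD contours" in the normal-theory sense): in two dimensions the mass inside the k-SD ellipse is pchisq(k^2, df = 2), so those levels can be requested explicitly.

## Rings at 1, 2 and 3 "SD" for roughly bivariate normal data (sketch;
## assumes the ellipse package and x, y from the demo)
library(ellipse)
plot(x, y)
for (k in 1:3)
  lines(ellipse(cor(x, y), scale = c(sd(x), sd(y)),
                centre = c(mean(x), mean(y)),
                level = pchisq(k^2, df = 2)),   # 0.393, 0.865, 0.989
        col = "red")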
Re: [R] splitting strings effriciently
Hi Andrew, I am aware that this is an R mailing list, but for such tasks (I deal a lot with huge genomic datasets) I tend to use awk and sed for preprocessing of data, in case I run into performance problems. Otherwise, for handling of strings in R I recommend the stringr library, but I don't know about its performance... Felix > Folks, > > I have a data frame with 4861469 rows that contains an ip address > xxx.xxx.xxx.xxx as one of the columns. I want to assign a site to each > row based on IP ranges. To do this I have a function to split the ip > address as character into class A, B, C and D components. It works but is > horribly inefficient in terms of speed. I can't quite see how one of the > l/s/m/t/apply functions could be brought to bear on the problem. Does > anyone have any thoughts? > > for(i in 1:4861469) > { > lst <- unlist(strsplit(data$ComputerName[i], "\\.")) > data$IPA[i] <- lst[[1]] > data$IPB[i] <- lst[[2]] > data$IPC[i] <- lst[[3]] > data$IPD[i] <- lst[[4]] > rm(lst) > } > > Andrew > > Andrew Roberts > Children's Orthopaedic Surgeon > RJAH, Oswestry, UK
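Staying within R, a vectorized alternative to the row-by-row loop (a sketch; it assumes every ComputerName contains exactly four dot-separated fields): split all addresses at once and bind the pieces into a four-column matrix.

## Vectorized split of all IP addresses at once (sketch; assumes four
## dot-separated fields per ComputerName)
parts <- do.call(rbind, strsplit(data$ComputerName, ".", fixed = TRUE))
data$IPA <- parts[, 1]
data$IPB <- parts[, 2]
data$IPC <- parts[, 3]
data$IPD <- parts[, 4]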
Re: [R] colouring a table, data.frame or matrix in an interactive R session
On 07.01.12 21:12, Michael Friendly wrote: > On 1/7/2012 5:48 AM, drflxms wrote: >> Hi everybody, >> >> as I am dealing with complex confusion matrices, I wonder whether there >> might be a way to colour text/tabular data in R, e.g. imagine >> highlighting the true positive values or certain classes in a table. >> >> I know how to colour text in graphical output as well as how to sweave >> or odfWeave coloured tables. But is there a way to do it directly in an >> interactive R session? >> Of course I understand that terminals originally haven't been designed >> for coloured output and that one can easily extract every information >> needed by subscripting. But still, coloured text in the interactive >> session would sometimes just be convenient. >> > > It depends on what kind of terminal you are using, but maybe not worth > the effort. > lattice::levelplot can do a nice job of colouring cells in a table. > Well, not exactly what I was looking for, I guess, as I'd like to colour text in the terminal in this case - but definitely a great package which is very useful for another project ;) Thanks a lot for your help! Felix
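A minimal illustration of the levelplot suggestion (a sketch; the matrix values below are made up for illustration):

## Colouring confusion-matrix cells with lattice::levelplot (sketch)
library(lattice)
cm <- matrix(c(50, 3, 2, 45), nrow = 2,
             dimnames = list(prediction = c("pos", "neg"),
                             reference  = c("pos", "neg")))
levelplot(cm, xlab = "prediction", ylab = "reference")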
Re: [R] colouring a table, data.frame or matrix in an interactive R session
On 08.01.12 13:52, Vincent Zoonekynd wrote: > On 7 January 2012 19:48, drflxms wrote: > >> as I am dealing with complex confusion matrices, I wonder whether there >> might be a way to colour text/tabular data in R, e.g. imagine >> highlighting the true positive values or certain classes in a table. > > The "colorout" package does part of what you want: colouring the > output, in a terminal. > But some more work may be needed if you want to change the colours > depending on the values of your matrix. > > -- Vincent Great! That is in fact almost exactly what I was looking for. Thank you very much for the hint!!! Working on a MacBook Pro (OS X Lion) and using Aquamacs (GNU Emacs) with ESS, I can confirm that it runs on Mac just as described for Unix in general. Pretty cool, although - as you already mentioned - colouring depending on conditions is not implemented yet. I'll send an e-mail to the author asking whether he thinks the library can be developed in that direction - I don't know too much about the technical details of the shell, unfortunately. But that may change ;) Felix
[R] colouring a table, data.frame or matrix in an interactive R session
Hi everybody, as I am dealing with complex confusion matrices, I wonder whether there might be a way to colour text/tabular data in R, e.g. imagine highlighting the true positive values or certain classes in a table. I know how to colour text in graphical output as well as how to sweave or odfWeave coloured tables. But is there a way to do it directly in an interactive R session? Of course I understand that terminals originally haven't been designed for coloured output and that one can easily extract every information needed by subscripting. But still, coloured text in the interactive session would sometimes just be convenient. Looking forward to reading from you, greetings from Munich, Felix
Re: [R] sorting a data.frame (df) by a vector (which is not contained in the df) - unexpected behaviour of match and factor
Jeff, thanks a lot for your quick reply and the hint! Meanwhile I found a solution that works - at least for my case ;) The code to get the job done is df[order(match(df$level,desiredOrder)),] So we seem to need one more order() call. I found this solution by doing it stepwise: ## sorting the levels of the level column in the data.frame df$level <- factor(df$level,levels=desiredOrder) ## sorting the data frame by the newly sorted level column df[order(df$level),] Maybe this solution is of help for someone else as well? But honestly, I still do not exactly understand why df[match(df$level,desiredOrder),] doesn't work... Cheers, Felix On 29.12.11 10:58, Jeff Newmiller wrote: > Your desiredOrder vector is a vector of strings. Convert it to numeric and it > should work. > --- > Jeff Newmiller > Sent from my phone. Please excuse my brevity. > > drflxms wrote: > >> Dear R colleagues, >> >> consider my data.frame named "df" with 3 columns - being level, >> prevalence and sensitivity - and 7 rows of data (see dump below). >> >> df <- >> structure(list(level = structure(1:7, .Label = c("0", "1", "10", >> "100", "1010", "11", "110"), class = "factor"), prevalence = >> structure(c(4L, >> 2L, 3L, 5L, 6L, 1L, 7L), .Label = c("0.488", "0.5", "0.754", >> "0.788", "0.803", "0.887", "0.905"), class = "factor"), sensitivity = >> structure(c(6L, >> 1L, 5L, 4L, 3L, 2L, 1L), .Label = c("0", "0.05", "0.091", "0.123", >> "0.327", "0.933"), class = "factor")), .Names = c("level", >> "prevalence", >> "sensitivity"), class = "data.frame", row.names = c(NA, -7L)) >> >> I'd like to order df by a vector which is NOT contained in the >> data.frame. Let's call this vector desiredOrder (see dump below). >> >> desiredOrder <- c("0", "1", "10", "100", "11", "110", "1010") >> >> So after sorting, the order of the level column (df$level) should be in >> the order of the vector desiredOrder (as well as the associated data in >> the other columns). >> I know that this is not an easy task to achieve by order(...) as the >> order of desiredOrder isn't a natural one. But I would expect both of >> the following to work: >> >> ## using match >> df[match(df$level,desiredOrder),] >> >> ## using factor >> df[factor(df$level,levels=desiredOrder),] >> >> Unfortunately the result isn't what I expected: I get a data.frame with >> the level column in the order 0,1,10,100,110,1010,11 instead of the >> order in desiredOrder (0,1,10,100,11,110,1010). >> >> Does anybody see what I am doing wrong? >> I'd appreciate any kind of help very much! >> Best regards, Felix
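For anyone else wondering why match() alone fails (a sketch using the dump below): match(df$level, desiredOrder) returns, for each row, the position of that row's level within desiredOrder - a rank, not a row index. Used directly as row subscripts, those ranks just pick rows by position; order() is what turns the ranks into the permutation that sorts them.

## Why df[match(...), ] fails but df[order(match(...)), ] works (sketch;
## df and desiredOrder as in the dump below)
ranks <- match(df$level, desiredOrder)  # 1 2 3 4 7 5 6 for this df
df[ranks, ]         # wrong: selects rows 1,2,3,4,7,5,6 by position
df[order(ranks), ]  # right: rearranges rows so their ranks ascend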
[R] sorting a data.frame (df) by a vector (which is not contained in the df) - unexpected behaviour of match and factor
Dear R colleagues, consider my data.frame named "df" with 3 columns - being level, prevalence and sensitivity - and 7 rows of data (see dump below). df <- structure(list(level = structure(1:7, .Label = c("0", "1", "10", "100", "1010", "11", "110"), class = "factor"), prevalence = structure(c(4L, 2L, 3L, 5L, 6L, 1L, 7L), .Label = c("0.488", "0.5", "0.754", "0.788", "0.803", "0.887", "0.905"), class = "factor"), sensitivity = structure(c(6L, 1L, 5L, 4L, 3L, 2L, 1L), .Label = c("0", "0.05", "0.091", "0.123", "0.327", "0.933"), class = "factor")), .Names = c("level", "prevalence", "sensitivity"), class = "data.frame", row.names = c(NA, -7L)) I'd like to order df by a vector which is NOT contained in the data.frame. Let's call this vector desiredOrder (see dump below). desiredOrder <- c("0", "1", "10", "100", "11", "110", "1010") So after sorting, the order of the level column (df$level) should be in the order of the vector desiredOrder (as well as the associated data in the other columns). I know that this is not an easy task to achieve by order(...) as the order of desiredOrder isn't a natural one. But I would expect both of the following to work: ## using match df[match(df$level,desiredOrder),] ## using factor df[factor(df$level,levels=desiredOrder),] Unfortunately the result isn't what I expected: I get a data.frame with the level column in the order 0,1,10,100,110,1010,11 instead of the order in desiredOrder (0,1,10,100,11,110,1010). Does anybody see what I am doing wrong? I'd appreciate any kind of help very much! Best regards, Felix
Re: [R] How to use a variable after $ ($x) for subscripting a list, How to source from a character object (source(x))
Thank you very much! foreach(i=1:length(levels)) %do% {listResult[[i]][[input]]} (instead of listResult[[i]]$input) does the trick!!! Sorry for being a little bit confusing in describing the problem. My example object listResult is meant to be a list object with several levels. More concretely: it contains the output of the confusionMatrix function from library(caret) as well as other output for several confusion matrices. I hope this is a little clearer now? Thank you very much for your examples, which help me to understand how to program similar problems. Thanks and all the best, happy Felix On 13.09.11 10:29, R. Michael Weylandt wrote: > I had a little bit of trouble following what exactly you mean to do > (specifically, I'm having a little bit of trouble understanding your sample > object "listResult" -- is it a list or a set of factors or a list of lists > or something else entirely), but I believe that replacing the `$` operator > with additional copies of `[` or `[[` as appropriate should get the job > done. $ is, for almost all purposes, a less flexible version of `[[`. > > Consider the following: > > R> a = list(a1 = rnorm(5), rnorm(6), rnorm(7)) > R> b = list(runif(5), b2 = runif(6), runif(7)) > R> c = list(-runif(5), -runif(6), c3 = -runif(7)) > > R> A = list(a = a, b = b, c = c) > > R> for( i in seq_along(A)) { print(A[[i]][[3]][i]) } > > which seems to work for me (even if it is frustratingly opaque as to its > function). Specifically, > > R> identical(A$a, A[["a"]]) > TRUE > > R> identical(A$a, A[[1]]) > TRUE > > If this doesn't work, please clarify what exactly the object you have is and > what you are trying to get out of it and I'll give a more concrete answer. > > On Tue, Sep 13, 2011 at 4:11 AM, drflxms wrote: > >> Dear R colleagues, >> >> as the result of a function, a huge list is generated. From this result list >> I'd like to extract information. >> >> Think of the list e.g. as an object named "listResult" with the >> following form: >> >> [[a]] >> [1] [2] [3] [4] [5] >> >> [[b]] >> [1] [2] [3] [4] [5] >> >> [[c]] >> [1] [2] [3] [4] [5] >> >> where levels=c(a,b,c) >> >> I'd like to extract a data.frame like >> >> a [2] >> b [2] >> c [2] >> >> What I tried is a function like this: >> >> library(foreach) >> loopExtract <- function(input) { >> foreach(i=1:length(levels)) %do% {listResult[[i]]$input} >> ... >> >> where the name of the variable [2] is meant to be the input. >> >> Unfortunately it turned out that the "input" variable after the $ is >> not substituted as expected. Subscripting the list with a variable after >> the $ seems not possible. >> The only workaround I found for that is a function like >> >> loopExtract <- function(input) { >> codetemplate <- as.character("result <- foreach(i=1:length(levels)) %do% >> {listResult[[i]]$input)}") >> require(stringr) >> write(str_replace_all(codetemplate, "input", input), file="tmp.r") >> source("tmp.r") >> return(result) >> } >> >> in other words, I stored a template of the desired code as characters in >> an object, substituted the "input" string in that character object, >> wrote the result to a file and sourced the code of that file. >> >> I stored the result in a file because the expression source(codetemplate) >> did not work. From the documentation I learned that one can only source >> "connections", and there seems to be no way to establish a connection to an >> object. (Probably no one else, except me, has such a strange idea.)
>> >> Well, it works that way, but even though I am not a real programmer, I >> realize how dirty this solution is. There must be a much simpler and >> more elegant way. >> >> To sum it all up, my problem can be reduced to the following two questions: >> >> (1) How to subscript a list with a variable after $ >> (2) How to source from an object containing a code template >> or (probably the most promising) (3) How to avoid both ;) >> >> Unfortunately I am not experienced enough to find the solution. Please >> help! >> >> Greetings from sunny Munich, Felix >>
[R] How to use a variable after $ ($x) for subscripting a list, How to source from a character object (source(x))
Dear R colleagues, as the result of a function, a huge list is generated. From this result list I'd like to extract information. Think of the list e.g. as an object named "listResult" with the following form: [[a]] [1] [2] [3] [4] [5] [[b]] [1] [2] [3] [4] [5] [[c]] [1] [2] [3] [4] [5] where levels=c(a,b,c) I'd like to extract a data.frame like a [2] b [2] c [2] What I tried is a function like this: library(foreach) loopExtract <- function(input) { foreach(i=1:length(levels)) %do% {listResult[[i]]$input} ... where the name of the variable [2] is meant to be the input. Unfortunately it turned out that the "input" variable after the $ is not substituted as expected. Subscripting the list with a variable after the $ seems not possible. The only workaround I found for that is a function like loopExtract <- function(input) { codetemplate <- as.character("result <- foreach(i=1:length(levels)) %do% {listResult[[i]]$input)}") require(stringr) write(str_replace_all(codetemplate, "input", input), file="tmp.r") source("tmp.r") return(result) } in other words, I stored a template of the desired code as characters in an object, substituted the "input" string in that character object, wrote the result to a file and sourced the code of that file. I stored the result in a file because the expression source(codetemplate) did not work. From the documentation I learned that one can only source "connections", and there seems to be no way to establish a connection to an object. (Probably no one else, except me, has such a strange idea.) Well, it works that way, but even though I am not a real programmer, I realize how dirty this solution is. There must be a much simpler and more elegant way. To sum it all up, my problem can be reduced to the following two questions: (1) How to subscript a list with a variable after $ (2) How to source from an object containing a code template or (probably the most promising) (3) How to avoid both ;) Unfortunately I am not experienced enough to find the solution. Please help! Greetings from sunny Munich, Felix
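Two minimal sketches answering the numbered questions directly (the listResult contents below are made up for illustration): `$` treats what follows it as a literal element name, while `[[` evaluates its argument; and a textConnection() makes a character object sourceable without a temporary file.

## (1) Subscripting with a variable: use `[[`, not `$` (sketch)
listResult <- list(sensitivity = 0.9, specificity = 0.8)
input <- "sensitivity"
listResult$input     # NULL -- looks for an element literally named "input"
listResult[[input]]  # 0.9  -- uses the value of the variable
## (2) Running code held in a character object (sketch)
codetemplate <- "result <- 1 + 1"
eval(parse(text = codetemplate))  # or: source(textConnection(codetemplate))
result                            # 2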
[R] how to sort the levels of a table
Dear colleagues, I am having real trouble sorting the results of a table according to certain features of the levels in the table. Prerequisites: It all starts with a fairly simple data set which stores observations of 21 observers (horizontally from 1 to 21; 21 is the reference/gold standard) who diagnosed 42 videos (vertically from 1 to 42). See the dump of the R object "input" below in case you are interested in the original data. Each observation is a series of 1s and 0s (of different lengths!) which was obtained by concatenating answers from a multiple choice questionnaire. I created a confusion matrix (table) displaying the observations of the 20 observers (nr. 1 to 20) versus the reference (nr. 21) via the following code:

## observations of the observers
obsValues<-factor(unlist(input[,-c(1,ncol(input))]))
## observations of the reference
refValues<-factor(input[,ncol(input)])
## data.frame that relates observations of observers and reference
RefObs<-data.frame(refValues, obsValues)
mtx<-table( RefObs$obsValues, RefObs$refValues, dnn=c("observers", "reference"), useNA=c("always") )

And now the problem: I need to sort the levels/classes of the table. Both axes shall be ordered 1st according to how often 1 occurs in the string that codes the observation and 2nd treating the observation string (formed by 0 and 1) as a number. E.g. "1" "0" "1010" "1" "100" "11" should be sorted as "0" "1" "1" "100" "11" "1010". I am aware of the order function and tried a command of the form mtx[order(row1stsort,row2ndsort),order(col1stsort,col2ndsort)]. Sorting e.g. the levels of the observers and the reference works well this way:

## sorted levels generated by the reference choices.
refLevels <- unique(input[,ncol(input)])[order(
  sapply(unique(input[,ncol(input)]),
         FUN=function(x) sum(as.numeric(unlist(strsplit(x,""))))),
  as.numeric(as.vector(unique(input[,ncol(input)])))
)]
## sorted levels generated by the observers' choices.
obsLevels <- unique(unlist(input[,-c(1,ncol(input))]))[order(
  sapply(unique(unlist(input[,-c(1,ncol(input))])),
         FUN=function(x) sum(as.numeric(unlist(strsplit(x,""))))),
  as.numeric(as.vector(unique(unlist(input[,-c(1,ncol(input))]))))
)]

Unfortunately this code does not do the trick in the case of the table mtx (see above). At least I was not successful with the following code:

mtx<-table( RefObs$obsValues, RefObs$refValues, dnn=c("observers", "reference"), useNA=c("always") )[order(
  ## 1st key of row sorting: sum of digits of the classes chosen by the
  ## observers; structural missings coded as w are replaced by NA.
  sapply(as.vector(unique(RefObs$obsValues)),
         FUN=function(x) sum(as.numeric(replace(unlist(strsplit(x,"")), which(x=="w"), NA)))),
  ## 2nd key of row sorting: the classes chosen by the observers as numbers.
  as.numeric(as.vector(unique(RefObs$obsValues)))
), order(
  ## 1st key of column sorting: sum of digits of the classes chosen by the
  ## reference; structural missings coded as w are replaced by NA.
  sapply(as.vector(unique(RefObs$refValues)),
         FUN=function(x) sum(as.numeric(replace(unlist(strsplit(x,"")), which(x=="w"), NA)))),
  ## 2nd key of column sorting: the classes chosen by the reference as numbers.
  as.numeric(as.vector(unique(RefObs$refValues)))
)]

Any suggestions? I'd appreciate any kind of help very much!
Greetings from Munich, Felix dump of data.frame: options(width=180) # to view it as a whole input <- structure(list(video = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42"), `1` = c("110", "0", "0", "1", "0", "0", "0", "1", "0", "0", "0", "1", "0", "100", "1110", "10110", "110", "1100", "0", "1000", "11100", "0", "11000", "0", "0", "0", "1110", "0", "0", "10110", "1000", "10010", "10001", "1", "100", "0", "100", "111", "1000", "0", "0", "0"), `2` = c("0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "100", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"), `3` = c("0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "110", "0", "0", "0", "0", "0", "10", "0", "0", "0", "0", "0", "0", "1000", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "1100"), `4` = c("100", "10100", "10100", "10010", "10100", "1", "0", "10100", "1000", "1", "10100", "1", "10100", "1", "10100", "1", "1", "1", "0", "11000", "11000", "0", "1000", "1000", "0", "0", "0", "0", "1", "1", "1", "1", "10100", "1", "0", "0", "
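The two-key ordering itself, demonstrated on a small vector of level names (a sketch of the rule stated above):

## Two-key sort: digit sum first, numeric value second (sketch)
levs <- c("1", "0", "1010", "1", "100", "11")
digitsum <- sapply(strsplit(levs, ""), function(d) sum(d == "1"))
levs[order(digitsum, as.numeric(levs))]
## "0" "1" "1" "100" "11" "1010"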
Re: [R] sum of digits or how to slice a number into its digits
Hi Dimitris, thank you very much for your quick and efficient help! Your solution is perfect for me: it does exactly what I was looking for when combined with unlist and as.numeric before using sum. Now I can keep on with my real problem ;)... Thanks again!!! Best, Felix On 04.03.2011 14:25, Dimitris Rizopoulos wrote: > one way is using function strsplit(), e.g., > > x <- c("100100110", "1001001", "1101", "00101") > sapply(strsplit(x, ""), function (x) sum(x == 1)) > > > I hope it helps. > > Best, > Dimitris > > > On 3/4/2011 2:18 PM, drflxms wrote: >> Dear R colleagues, >> >> I face a seemingly simple problem I couldn't find a solution for myself >> so far: >> >> I have to sum the digits of numbers. Example: 1010 -> 2, 100100110 -> 4. >> Unfortunately there seems not to be a function for this task. So my idea >> was to use sum(x) for it. But I did not figure out how to slice a number >> into a vector of its digits. Example (continued from above): 1010 -> >> c(1,0,1,0), 100100110 -> c(1,0,0,1,0,0,1,1,0). >> >> Does anyone know either a function for calculating the sum of the digits >> of a number, or how to slice a number into a vector of its digits as >> described above? >> >> I'd appreciate any kind of help very much! >> Thanks in advance and greetings from cloudy Munich, >> Felix
[R] sum of digits or how to slice a number into its digits
Dear R colleagues, I face a seemingly simple problem I couldn't find a solution for myself so far: I have to sum the digits of numbers. Example: 1010 -> 2, 100100110 -> 4. Unfortunately there seems not to be a function for this task. So my idea was to use sum(x) for it. But I did not figure out how to slice a number into a vector of its digits. Example (continued from above): 1010 -> c(1,0,1,0), 100100110 -> c(1,0,0,1,0,0,1,1,0). Does anyone know either a function for calculating the sum of the digits of a number, or how to slice a number into a vector of its digits as described above? I'd appreciate any kind of help very much! Thanks in advance and greetings from cloudy Munich, Felix
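If the values arrive as numbers rather than character strings (an assumption; the data in the related threads are strings), converting first does the trick:

## Sum of digits of a number, via character conversion (sketch)
x <- 100100110
sum(as.numeric(strsplit(as.character(x), "")[[1]]))  # 4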
[R] selecting only corresponding categories from a confusion matrix
Dear R colleagues, as a result of my calculations regarding the inter-observer variability in bronchoscopy, I get a confusion matrix like the following (the first column holds the categories found among the observers, the top row the categories found by the reference, i.e. the "gold standard"):

        0    1 1001 1010  11
0     609   11   54   36   6
1       1  260    2
10     14    0    0    8   4
100     4    0    0    0   0
1000   23    7   12   10   5
1001    0    0    4    0   0
1010    4    0    0    3   0
1011    1    0    1    0   2
11      0    0    3    3   1
1101    0    0    0    0   0
1100    2    0    0    0   0
1110    1    0    0    0   0

I am looking for a way (a general algorithm) to extract from the above confusion matrix a data.frame with only the corresponding categories of observers and reference. "Corresponding" means in this case that a category was chosen by both observers and reference. In this example the corresponding categories are simply all categories chosen by the reference (0, 1, 1001, 1010, 11), but in general there might also be categories found by the reference only (and not among the observers in the first column). So the solution data.frame for the above example would look like:

        0    1 1001 1010  11
0     609   11   54   36   6
1       1  260    2
1001    0    0    4    0   0
1010    4    0    0    3   0
11      0    0    3    3   1

All categories found among observers only were omitted. If the solution also included a way to list the omitted categories and to count their number, as well as the number of omitted cases, it would be just perfect for me. I'd be happy to read from you soon! Thanks in advance for any kind of help with this. Greetings from snowy Munich, Felix
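A general sketch of the extraction (mtx being the confusion table with observer categories in rows and reference categories in columns, as in the code used elsewhere in these threads):

## Keep only the categories chosen by both observers and reference (sketch)
common  <- intersect(rownames(mtx), colnames(mtx))
mtx.sub <- mtx[common, , drop = FALSE]     # the reduced confusion matrix
omitted <- setdiff(rownames(mtx), common)  # categories found by observers only
length(omitted)                            # number of omitted categories
sum(mtx[omitted, ])                        # number of omitted cases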
Re: [R] can't load rJava in R 2.8.1 on Windows XP
Dear Mr. Murdoch, Dear Mr. Ripley, Dear Mr. Wang, thank you very much for your quick and efficient help! It is exactly as Duncan explained it: Including jvm.dll in PATH solved the problem immediately. Everything works fine now. The only thing is that I do not understand why I had to do this manually, as I installed the Java Runtime Environment a long time ago, updated it frequently and never encountered any difficulties in using Java software. Anyway, I am happy now :-) Thanx again for your great support! Greetings from Munich, Germany, Felix Duncan Murdoch schrieb: > On 1/23/2009 7:38 AM, drflxms wrote: >> Dear community, >> >> unfortunately I did not manage to load the rJava package, receiving the >> following error message: >> >> >>> library("rJava") >> Error in inDL(x, as.logical(local), as.logical(now), ...) : kann >> shared library 'C:/Programme/R/2.8.1/library/rJava/libs/rJava.dll' nicht >> laden: >> LoadLibrary failure: Das angegebene Modul wurde nicht gefunden. >> >> Error : .onLoad in 'loadNamespace' für 'rJava' fehlgeschlagen >> Fehler: Laden von Paket/Namensraum für 'rJava' fehlgeschlagen >> >> >> Translation: can't load shared library ... rJava.dll >> LoadLibrary failure: the specified module was not found > > That message comes from Windows, not R, and it's misleading. It does > not say that rJava.dll was not found, it says that a DLL needed by it > is not found. It would be helpful if it told you which one. You > should complain to Microsoft about it. If rJava.dll had been missing, > the English message would have been > > shared library 'rJava' not found > > The pedump utility (in the Rtools set, see > www.murdoch-sutherland.com/Rtools) can tell you what the dependencies > are: > > pedump -i rJava.dll > > shows that it imports things from these dlls: > > R.dll > KERNEL32.dll > msvcrt.dll > jvm.dll > > The first 3 are routine; without those R wouldn't work. (Without > KERNEL32.dll, nothing in Windows would work.) So as Brian said, it's > likely jvm.dll that it can't find, or possibly a DLL that it depends on. > Did you install Java first, as rJava requires? > > Duncan Murdoch > >> >> Reinstalling the package did not help, and installing the latest development >> version didn't help either. >> >> The shared library rJava.dll is in place (exactly where R is looking >> for it)! >> >> Are there any ideas what's wrong? >> I'd appreciate any kind of help very much, as I need rJava urgently >> to use RWeka and iPlots. >> >> Best regards, >> Felix >> >> __ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
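For readers hitting the same problem, a minimal sketch of the fix at the R level (the JRE path below is an assumption - check where jvm.dll actually lives in your Java installation):

# append the directory containing jvm.dll to PATH for the current R session,
# then load rJava:
Sys.setenv(PATH = paste(Sys.getenv("PATH"),
                        "C:/Program Files/Java/jre6/bin/client", sep = ";"))
library(rJava)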
[R] can't load rJava in R 2.8.1 on Windows XP
Dear community, unfortunately I did not manage to load the rJava package, receiving the following error message: > library("rJava") Error in inDL(x, as.logical(local), as.logical(now), ...) : kann shared library 'C:/Programme/R/2.8.1/library/rJava/libs/rJava.dll' nicht laden: LoadLibrary failure: Das angegebene Modul wurde nicht gefunden. Error : .onLoad in 'loadNamespace' für 'rJava' fehlgeschlagen Fehler: Laden von Paket/Namensraum für 'rJava' fehlgeschlagen Translation: can't load shared library ... rJava.dll LoadLibrary failure: the specified module was not found Reinstalling the package did not help, and installing the latest development version didn't help either. The shared library rJava.dll is in place (exactly where R is looking for it)! Are there any ideas what's wrong? I'd appreciate any kind of help very much, as I need rJava urgently to use RWeka and iPlots. Best regards, Felix __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to address last and all but last column in dataframe
Hello Mr. Burns, Hello Mr. Dwinseminus, thank you very much for your incredibly quick and efficient reply! I was completely successful with the following command: pairs<-data.frame(pred=factor(unlist(input[,-c(1,ncol(input))])),ref=factor(input[,ncol(input)])) For the "input" example data.frame I sent with my question, the above code is equivalent to: pairs<-data.frame(pred=factor(unlist(input[2:17])),ref=factor(input[,18])) Great! That is exactly what I was looking for! This simple code will save me hours! Patrick, your book looks in fact very interesting and will be my perfect reading material for the following nights :-) (probably not only the first chapter ;-). Thanks for the hint - and the free book of course. David, the "input" data.frame is the result of the reshape command I performed. I just copied it from the R console into the e-mail. In fact the first column "video" is not part of the data, but needed for analysis with the kappam.fleiss function of the irr package. Sorry, you are absolutely correct, I should have mentioned this in my question. I will improve when I ask my next question :-). Again I'd like to thank you for your help and wish you a pleasant Sunday. Greetings from Munich, Felix Patrick Burns wrote: > If I understand properly, you want > > input[, -c(1, ncol(input))] > > rather than > > input[[, -c(video, 21)]] > > Chapter 1 of S Poetry might be of interest to you. > > Patrick Burns > [EMAIL PROTECTED] > +44 (0)20 8525 0696 > http://www.burns-stat.com > (home of S Poetry and "A Guide for the Unwilling S User") > > drflxms wrote: >> Dear R-colleagues, >> >> another question from a newbie: I am creating a lot of simple >> pivot-charts from my raw data using the reshape-package. In these charts >> we have medical doctors judging videos in the columns and the videos >> they judge in the rows. A simple example of chart/data.frame "input" with >> two categories 1/0:
>>
>>    video 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
>> 1      1 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>> 2      2 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
>> 3      3 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>> 4      4 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>> 5      5 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  1  0
>> 6      6 0 0 0 0 0 0 0 0 0  0  0  0  0  1  0  0  0  0  0  0  0
>> 7      7 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>> 8      8 0 0 0 0 0 0 0 0 0  0  0  0  0  0  1  0  0  0  0  0  0
>> 9      9 0 0 0 0 0 0 0 0 0  1  0  1  1  0  1  1  0  0  0  1  0
>> 10    10 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>>
>> I recently learned that I can easily create a confusion matrix out of >> this data using the following commands: >> >> pairs<-data.frame(pred=factor(unlist(input[2:21])),ref=factor(input[,22])) >> >> pred<-pairs$pred >> ref <- pairs$ref >> library (caret) >> confusionMatrix(pred, ref, positive=1) >> >> - where column 21 is the reference/goldstandard. >> >> My problem now is that I analyse data.frames with an unknown count of >> columns. So to get rid of the first and last column for the "pred" >> variable and to select the last column for the "ref" variable, I have to >> look at the data.frame before doing the above commands to set the proper >> column numbers. >> >> It would be very convenient if I could address the last column not by >> number (where I have to count beforehand) but by a variable "last >> column". >> >> Probably there is an easier solution for this problem using the names >> of the columns as well: the reference is always number "21", the first >> column is always called "video".
So I tried: >> >> attach(input) >> pairs<-data.frame(pred=factor(unlist(input[[,-c(video,21)]])),ref=factor(input[[21]])) >> >> >> which does not work unfortunately :-(. >> >> I'd be very happy in case someone could help me out, cause I am really >> tired of counting - there are a lot of tables to analyse... >> >> Cheers and greetings from Munich, >> Felix >> >> __ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >> >> __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
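A minimal sketch of the position-independent selection, restating the accepted solution above (it assumes, as in the thread, that "video" is the first column and the reference/gold standard the last):

pred <- factor(unlist(input[, -c(1, ncol(input))]))  # all rater columns, stacked
ref  <- factor(input[, ncol(input)])                 # the last column, whatever its index
# ref is recycled to the length of pred when the data.frame is built:
pairs <- data.frame(pred = pred, ref = ref)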
[R] how to address last and all but last column in dataframe
Dear R-colleagues, another question from a newbie: I am creating a lot of simple pivot-charts from my raw data using the reshape-package. In these charts we have medical doctors judging videos in the columns and the videos they judge in the rows. A simple example of chart/data.frame "input" with two categories 1/0:

   video 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
1      1 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
2      2 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
3      3 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
4      4 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
5      5 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  1  0
6      6 0 0 0 0 0 0 0 0 0  0  0  0  0  1  0  0  0  0  0  0  0
7      7 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
8      8 0 0 0 0 0 0 0 0 0  0  0  0  0  0  1  0  0  0  0  0  0
9      9 0 0 0 0 0 0 0 0 0  1  0  1  1  0  1  1  0  0  0  1  0
10    10 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0

I recently learned that I can easily create a confusion matrix out of this data using the following commands: pairs<-data.frame(pred=factor(unlist(input[2:21])),ref=factor(input[,22])) pred<-pairs$pred ref <- pairs$ref library (caret) confusionMatrix(pred, ref, positive=1) - where column 21 is the reference/goldstandard. My problem now is that I analyse data.frames with an unknown count of columns. So to get rid of the first and last column for the "pred" variable and to select the last column for the "ref" variable, I have to look at the data.frame before doing the above commands to set the proper column numbers. It would be very convenient if I could address the last column not by number (where I have to count beforehand) but by a variable "last column". Probably there is an easier solution for this problem using the names of the columns as well: the reference is always number "21", the first column is always called "video". So I tried: attach(input) pairs<-data.frame(pred=factor(unlist(input[[,-c(video,21)]])),ref=factor(input[[21]])) which does not work unfortunately :-(. I'd be very happy in case someone could help me out, because I am really tired of counting - there are a lot of tables to analyse... Cheers and greetings from Munich, Felix __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Free SQL Database with R
Hello Chibisi, you might be looking for something like "Rpad" (http://www.rpad.org/Rpad/). There are some other systems like SNetscape, Rserve, RSOAP, R.NET.Web as well. Unfortunately I have no personal experience with these systems so far, so I can't give you any real advice. I just know that they exist and surfed some test pages. To me Rpad looks like the most feasible system for the task you describe (also from a cost vs. benefit point of view). As far as I understand/remember, users need no unusual prerequisites to use Rpad. For a general overview of GUIs see e.g. http://www.sciviews.org/_rgui/. According to your latest description of the database associated with the webpage, SQLite will probably suffice and might be easier to handle than a whole MySQL server - as Hadley Wickham already proposed in his e-mail. But this all of course depends on your resources as well (e.g. do you have access to your own root server, or just some megabytes of ordinary webspace...?). Kind regards and greetings from Munich, Felix. Chibisi Chima-Okereke schrieb: > Dear Felix, > > Thanks for the reply, > > If you haven't already guessed, I am new to web programming. > > The sort of webpage I want to build is one that presents quantitative > information in graphs and charts that people can interact with, e.g. > select parts of charts to zoom into, highlight values, click buttons > to do analysis on the data displayed, so yes, some sort of interactive > GUI. I initially thought of using Flash as a front end, but I don't > know any ActionScript, so learning that to a suitable standard would > take a lot of extra time, and I think it would be best if everything > could be done in R as much as possible. > > If I used an R GUI I guess I would be using the playwith package? Do > the consumers of the website need to have R to consume stuff displayed > with an R GUI? > > The database itself would be pretty static, just being queried for > information, unless some analysis was required, in which case R would > query the database, do the analysis and write the results back to the > database (I'm guessing that is the way it would be done), before it gets > displayed on the web page. > > Kind Regards > > Chibisi > > On Wed, Sep 3, 2008 at 11:39 AM, drflxms <[EMAIL PROTECTED]> wrote: > > Hello Chibisi, > > I am not sure whether I completely understand your needs: Do you want > to build a webpage which relies on a content management system > (cms)? Do > you want to collect data (e.g. a survey) which shall later on be > analysed > using R? Or shall it be a webpage with an interactive R GUI? What > else? > > Personally, I would prefer MySQL as the backend for websites, as most > professional (open source) cms (e.g. Typo3, WordPress, etc.) are created > with MySQL in mind. > There is a good interface between R and MySQL called RMySQL as well. I > use it on a daily basis, as all my data is stored in a local MySQL > database (more flexible than always reading in text files - at > least in > my opinion). > > Hope this personal view might help a little bit. > > Cheers, > Felix > > __ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
> > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
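To make the SQLite suggestion concrete, a minimal sketch using the RSQLite package (the file, table and column names are made up for illustration):

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "site.db")  # file-based database, no server needed
dbWriteTable(con, "results", data.frame(id = 1:3, value = c(2.1, 3.4, 1.9)))
dbGetQuery(con, "SELECT * FROM results WHERE value > 2")
dbDisconnect(con)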
Re: [R] Free SQL Database with R
Hello Chibisi, I am not sure whether I completely understand your needs: Do you want to build a webpage which relies on a content management system (cms)? Do you want to collect data (e.g. a survey) which shall later on be analysed using R? Or shall it be a webpage with an interactive R GUI? What else? Personally, I would prefer MySQL as the backend for websites, as most professional (open source) cms (e.g. Typo3, WordPress, etc.) are created with MySQL in mind. There is a good interface between R and MySQL called RMySQL as well. I use it on a daily basis, as all my data is stored in a local MySQL database (more flexible than always reading in text files - at least in my opinion). Hope this personal view might help a little bit. Cheers, Felix __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
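A minimal sketch of that daily RMySQL workflow (host, credentials, database and table name are placeholders, not real settings):

library(RMySQL)
con <- dbConnect(MySQL(), host = "localhost", user = "felix",
                 password = "secret", dbname = "study")
dat <- dbGetQuery(con, "SELECT * FROM measurements")  # read a table into a data.frame
dbDisconnect(con)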
Re: [R] convenient way to calculate specificity, sensitivity and accuracy from raw data
1 0
> 6  6  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
> 7  7  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 8  8  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
> 9  9  0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 1 0
> 10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 11 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 12 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 13 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 14 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 15 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 16 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 17 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 18 18 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> 19 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 20 20 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 21 21 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> 22 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 23 23 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0
> 24 24 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 1
> 25 25 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
> 26 26 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
> 27 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 28 28 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 29 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 30 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 31 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 32 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 33 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 34 34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 35 35 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 36 36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 37 37 0 1 1 0 1 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1
> 38 38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 39 39 0 1 0 0 1 0 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1
> 40 40 1 1 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 1 0 0 1
> 41 41 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
> 42 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"),
> header = TRUE)
> closeAllConnections()
>
> goldstand <- dat$X21
> prev <- sum(goldstand)
> cprev <- sum(!goldstand)
> n <- prev + cprev
> lapply(dat[-1], function(x){
>   tab <- table(x, goldstand)
>   cS <- colSums(tab)
>   if(nrow(tab) > 1 && ncol(tab) > 1) {
>     out <- c(sp = tab[1,1], sn = tab[2,2]) / cS
>     c(out, ac = (out[1] * cprev + out[2] * prev) / n)
>   }
> })
>
> I hope it helps.
>
> Best,
> Dimitris
>
> Quoting drflxms <[EMAIL PROTECTED]>: >> Dear R-colleagues, >> >> this is a question from an R-newbie medical doctor: >> >> I am evaluating data on inter-observer reliability in endoscopy. 20 >> medical doctors judged 42 videos, filling out a multiple-choice survey >> for each video. The overall data is organized in a classical way: >> observations (items from the multiple-choice survey) as columns, each >> case (identified by the two columns "number of medical doctor" and >> "number of video") in a row. In addition there is medical doctor >> number 21, who is assumed to be the gold standard. >> >> As a measure of inter-observer agreement I calculated kappa according to >> Fleiss and simple agreement in percent, using the routines >> "kappam.fleiss" and "agree" from the irr package. Everything worked fine >> so far. >> >> Now I'd like to calculate specificity, sensitivity and accuracy for each >> item (compared to the gold standard), as these are well-known and easy >> to understand quantities for medical doctors. >> >> Unfortunately I haven't found a feasible way to do this in R so far. All >> solutions I found describe the calculation of specificity, sensitivity and >> accuracy from a contingency-table / confusion-matrix only.
For me it is >> very difficult to create such contingency-tables / confusion-matrices >> from the raw data I have. >> >> So I started to do it in Excel by hand - a lot of work! If I keep >> on doing this, I'll miss the deadline. So maybe someone can help me out: >> >> It would be very convenient if there were a way to calculate specificity, >> sensitivity and accuracy from the very same data.frames I created for
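A commented sketch of the calculation that the reply above implements (assuming, as there, a data.frame dat with rater columns X1..X20 and the gold standard in column X21, all coded 0/1):

goldstand <- dat$X21
lapply(dat[paste("X", 1:20, sep = "")], function(x) {
  tab <- table(factor(x, levels = 0:1), factor(goldstand, levels = 0:1))
  sens <- tab["1", "1"] / sum(tab[, "1"])  # true positives / all gold-standard positives
  spec <- tab["0", "0"] / sum(tab[, "0"])  # true negatives / all gold-standard negatives
  acc  <- sum(diag(tab)) / sum(tab)        # overall agreement with the gold standard
  c(sensitivity = sens, specificity = spec, accuracy = acc)
})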
[R] convenient way to calculate specificity, sensitivity and accuracy from raw data
Dear R-colleagues, this is a question from an R-newbie medical doctor: I am evaluating data on inter-observer reliability in endoscopy. 20 medical doctors judged 42 videos, filling out a multiple-choice survey for each video. The overall data is organized in a classical way: observations (items from the multiple-choice survey) as columns, each case (identified by the two columns "number of medical doctor" and "number of video") in a row. In addition there is medical doctor number 21, who is assumed to be the gold standard. As a measure of inter-observer agreement I calculated kappa according to Fleiss and simple agreement in percent, using the routines "kappam.fleiss" and "agree" from the irr package. Everything worked fine so far. Now I'd like to calculate specificity, sensitivity and accuracy for each item (compared to the gold standard), as these are well-known and easy-to-understand quantities for medical doctors. Unfortunately I haven't found a feasible way to do this in R so far. All solutions I found describe the calculation of specificity, sensitivity and accuracy from a contingency-table / confusion-matrix only. For me it is very difficult to create such contingency-tables / confusion-matrices from the raw data I have. So I started to do it in Excel by hand - a lot of work! If I keep on doing this, I'll miss the deadline. So maybe someone can help me out: It would be very convenient if there were a way to calculate specificity, sensitivity and accuracy from the very same data.frames I created for the calculation of kappa and agreement. In these data.frames, which were generated from the overall data table described above using the "reshape" package, we have the judging medical doctors in the columns and the videos in the rows. In the cells there are the coded answer options from the multiple-choice survey.
Please see a simple example with answer-options 0/1 (copied from the R console) below:

   video 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
1      1 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
2      2 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
3      3 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
4      4 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
5      5 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  1  0
6      6 0 0 0 0 0 0 0 0 0  0  0  0  0  1  0  0  0  0  0  0  0
7      7 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
8      8 0 0 0 0 0 0 0 0 0  0  0  0  0  0  1  0  0  0  0  0  0
9      9 0 0 0 0 0 0 0 0 0  1  0  1  1  0  1  1  0  0  0  1  0
10    10 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
11    11 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
12    12 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
13    13 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
14    14 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
15    15 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
16    16 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
17    17 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
18    18 0 0 0 0 1 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
19    19 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
20    20 0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
21    21 0 0 0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0  1
22    22 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
23    23 0 1 0 0 1 0 1 0 0  1  0  0  1  1  0  0  1  0  0  0  0
24    24 0 0 0 0 0 0 0 0 0  0  0  0  1  1  1  1  0  1  0  0  1
25    25 0 0 0 0 0 0 0 0 0  0  0  1  0  0  1  1  0  0  0  0  0
26    26 0 0 0 0 0 0 0 0 0  0  0  1  0  0  0  0  0  0  0  0  0
27    27 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
28    28 0 1 0 1 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
29    29 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
30    30 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
31    31 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
32    32 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
33    33 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
34    34 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
35    35 0 0 0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0  0
36    36 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
37    37 0 1 1 0 1 0 0 1 0  0  0  0  1  1  1  0  1  0  0  1  1
38    38 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
39    39 0 1 0 0 1 0 0 1 0  1  1  0  1  1  0  0  1  1  0  1  1
40    40 1 1 1 1 1 0 1 0 0  0  0  1  1  1  1  0  0  1  0  0  1
41    41 0 0 0 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0  1
42    42 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0

What I did in Excel is: create the very same tables using pivot-charts, compare columns 1-20 to column 21 (the gold standard), and sum up the count of values that are identical to column 21. I repeated this for each answer-option. From the results, one can easily calculate specificity, sensitivity and accuracy. How to do this, or something similar leading to the
Re: [R] simple generation of artificial data with defined features
Hello all, besides saying thank you again for your help, I'd like to present the final solution of my problem and the results of the kappa calculation:

> election.2005 <- c(16194,13136,3494,3838,4648,4118)
#data obtained via the genesis database of "Statistisches Bundesamt" www.destatis.de
#simply cut off the last 3 digits because of the limited calculation power of my laptop
> attr(election.2005, "class") <- "table"
> attr(election.2005, "dim") <- c(1,6)
> attr(election.2005, "dimnames") <- list(c("votes"), c(1,2,3,4,5,6))
#used numbers instead of names of parties for easier handling later on
#1=spd, 2=cdu, 3=csu, 4=gruene, 5=fdp, 6=pds
> head(election.2005)
      [,1]  [,2] [,3] [,4] [,5] [,6]
[1,] 16194 13136 3494 3838 4648 4118

#convert the frequency table to a data.frame and replicate rows according to it:
> el.dt <- as.data.frame(election.2005)
> el.dt.exp <- el.dt[rep(1:nrow(el.dt), el.dt$Freq), -ncol(el.dt)]
> el.dt.exp$id=seq(1:nrow(el.dt.exp)) #add voter id
> el.dt.exp$year=2005 #add column with year of election
#remove a column we don't need:
> el.dt.exp<-subset(el.dt.exp, select=-c(Var1))
> dim(el.dt.exp)
[1] 45428 3
> head(el.dt.exp)
    Var2 id year
1      1  1 2005
1.1    1  2 2005
1.2    1  3 2005
1.3    1  4 2005
1.4    1  5 2005
1.5    1  6 2005
> el.dt.exp<-as.data.frame(el.dt.exp, row.names=seq(1:nrow(el.dt.exp))) #get rid of the unusual numbering of rows
> head(el.dt.exp)
  Var2 id year
1    1  1 2005
2    1  2 2005
3    1  3 2005
4    1  4 2005
5    1  5 2005
6    1  6 2005
> summary(el.dt.exp)
 Var2          id             year
 1:16194   Min.   :    1   Min.   :2005
 2:13136   1st Qu.:11358   1st Qu.:2005
 3: 3494   Median :22715   Median :2005
 4: 3838   Mean   :22715   Mean   :2005
 5: 4648   3rd Qu.:34071   3rd Qu.:2005
 6: 4118   Max.   :45428   Max.   :2005

Var2 is of type character, which is uncomfortable for further processing. I changed its type to number with the data editor using fix(el.dt.exp).

#create the dataframe for the calculation of kappa
> library(reshape)
> el.dt.exp.molten<-melt(el.dt.exp, id=c(2,3), na.rm=FALSE)
> kappa.frame<-cast(el.dt.exp.molten, year ~ id)
> dim(kappa.frame)
[1]     1 45429

#calculate kappa
> library(irr)
> kappam.fleiss(kappa.frame, exact=FALSE, detail=TRUE)
 Fleiss' Kappa for m Raters
 Subjects = 1
   Raters = 45428
    Kappa = -2.2e-05
        z = -1.35
  p-value = 0.176
  Kappa      z p.value
1 0.000 -0.707   0.479
2 0.000 -0.707   0.479
3 0.000 -0.707   0.479
4 0.000 -0.707   0.479
5 0.000 -0.707   0.479
6 0.000 -0.707   0.479

What a surprise! So Greg was absolutely right that this is probably not a good example for Kappa. But still a very interesting one, if you ask me! My theory: Kappa doesn't simply express agreement. As far as I learned from the Handbook of Inter-Rater Reliability (Gwet, Kilem 2001; STATAXIS Publishing Company; www.stataxis.com), Kappa tries to measure how different an observed agreement is from an agreement that arises from chance alone. So in this case this probably means that the results of the 2005 election are not significantly different from results that could have arisen by chance. Anyway, I personally learned a very interesting lesson about Kappa and R. Thank you all for your professional and quick help to a newbie! Greetings from Munich, Felix __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
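The chance correction described in that last paragraph can be stated compactly: with P.obs the observed agreement among raters and P.e the agreement expected by chance alone,

  kappa = (P.obs - P.e) / (1 - P.e)

so a kappa near 0, as in the output above, means the observed agreement is about what pure chance would produce.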
Re: [R] simple generation of artificial data with defined features
Hi Christoph, perfect! Your code worked out of the box (copy and paste ;-). I had expected at least some lines of code, but this is really easy! So once you get used to the command line, this is much more flexible (and comfortable!) than all these coloured windows. Can't tell you how happy I am that I seem to be getting away from this terrible SPSS license hassle. Taking into account that these are my first weeks with R, and that I am learning (My)SQL in parallel (using RMySQL) and knew not too much (not to say nothing) about statistics before (like nearly all medical doctors...), I even wouldn't say the learning curve is too steep. Anyway, thank you for your quick and efficient help for a newbie! One of the reasons for my delayed answer is some trouble I experienced in the following steps. I'll write a concluding e-mail to all about that soon. Greetings from Munich, Felix Christoph Meyer schrieb: > Hi, > > to add voter.id and election.year to your data frame you could try: > > el.dt.exp$voter.id=seq(1:nrow(el.dt.exp)) > > el.dt.exp$election.year=2005 > > Cheers, > > Christoph Meyer > > > *** > Dr. Christoph Meyer > Institute of Experimental Ecology > University of Ulm > Albert-Einstein-Allee 11 > D-89069 Ulm > Germany > Phone: ++49-(0)731-502-2675 > Fax: ++49-(0)731-502-2683 > Mobile: ++49-(0)1577-156-7049 > E-mail: [EMAIL PROTECTED] > http://www.uni-ulm.de/index.php?id=7885 > *** > > Saturday, August 23, 2008, 1:25:05 PM, you wrote: > >> Dear Mr. Christos Hatzis, >> >> thank you so much for your answer, which is in my eyes just brilliant! I >> followed it step by step (great and detailed explanation) and nearly >> everything is fine - except a problem at the very end I haven't found >> a solution for until now. (Despite playing around quite a lot...) >> Please let me explain: >> >>> election.2005 <- c(16194,13136,3494,3838,4648,4118) #cut off last 3 >> digits, because my laptop can't handle millions of rows... >>> attr(election.2005, "class") <- "table" >>> attr(election.2005, "dim") <- c(1,6) >>> attr(election.2005, "dimnames") <- list(c("votes"), c("spd", "cdu", >> "csu", "gruene", "fdp", "pds")) >>> head(election.2005) >> spd cdu csu gruene fdp pds >> votes 16194 13136 3494 3838 4648 4118 >>> el.dt <- as.data.frame(election.2005) >>> el.dt.exp <- el.dt[rep(1:nrow(el.dt), el.dt$Freq), -ncol(el.dt)] >>> dim(el.dt.exp) >> [1] 45428 2 >>> head(el.dt.exp) >> Var1 Var2 >> 1 votes spd >> 1.1 votes spd >> 1.2 votes spd >> 1.3 votes spd >> 1.4 votes spd >> 1.5 votes spd >> >> My problem now is that I would need either an autoincrementing >> identifier instead of "votes" in Var1 or the possibility to access the >> numbering by a column name (i.e. Var0). In addition I need a 3rd >> variable for the year of the election (2005, which is the same for all, >> but needed later on). So this is what it should look like:
>>
>>     voter.id party election.year
>> 1          1   spd          2005
>> 1.1        2   spd          2005
>> 1.2        3   spd          2005
>> 1.3        4   spd          2005
>> 1.4        5   spd          2005
>> 1.5        6   spd          2005
>>
> ... > > > > > > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] simple generation of artificial data with defined features
Hello Mr. Greg Snow! Thank you very much for your prompt answer. > I don't think that the election data is the right data to demonstrate Kappa, > you need subjects that are classified by 2 or more different raters/methods. > The election data could be considered classifying the voters into which party > they voted for, but you only have 1 rater. I think it should be possible to calculate kappa if one takes a slightly different point of view from the one you described above: Take the voters as raters who "judge" the category "election" with one party out of the six mentioned in my previous e-mail (which are simply the top six). This makes sense to me, because an election is somehow nothing else but a survey with the question "who should lead our country" - given six options in this example. As kappa is a measure of agreement, it should be able to illustrate the agreement of the voters' answers to this question. For me this is - in principle - no different from asking "Where is the stenosis in the video of this endoscopy" offering six options, each representing an anatomic location. > Otherwise you may want to stick with the sample datasets. The example data sets are of excellent quality and very interesting. I am sure there would be brilliant examples among them. But I have to admit that, as a newbie, I do not have a good overview of the available datasets at the moment. I just wanted to give an example out of everyday life that everybody is familiar with. An election is something which came to my mind spontaneously. > There are other packages that compute Kappa values as well (I don't know if > others calculate this particular version), but some of those take the summary > data as input rather than the raw data, which may be easier if you just have > the summary tables. I chose Fleiss' Kappa because it is a more general form of Cohen's Kappa, allowing m raters and n categories (instead of only two raters and two categories when using Cohen's kappa). Looking for another package calculating it from summary tables might be the simplest solution to my problem. Thank you very much for this hint! On the other hand it would be nice to use the very same method for the example as for the "real" data. The example will be part of the "methods" section. Thank you again very much for your tips and the quick reply. Have a nice weekend! Greetings from Munich, Felix Mueller-Sarnowski >> -Original Message- >> From: [EMAIL PROTECTED] >> [mailto:[EMAIL PROTECTED] On Behalf Of drflxms >> Sent: Friday, August 22, 2008 6:12 AM >> To: r-help@r-project.org >> Subject: [R] simple generation of artificial data with >> defined features >> >> Dear R-colleagues, >> >> I am quite a newbie to R fighting my stupidity to solve a >> probably quite simple problem of generating artificial data >> with defined features. >> >> I am conducting a study of inter-observer agreement in >> child bronchoscopy. One of the most important measures is >> Kappa according to Fleiss, which is very comfortably >> available in R through the irr-package. >> Unfortunately medical doctors like me don't really understand >> much of statistics. Therefore I'd like to give the reader an >> easily understandable example of Fleiss' Kappa in the Methods >> part.
To achieve this, I obtained a table with the results of >> the German election from 2005:
>>
>> party    number of votes   percent
>> SPD             16194665      34,2
>> CDU             13136740      27,8
>> CSU              3494309       7,4
>> Gruene           3838326       8,1
>> FDP              4648144       9,8
>> PDS              4118194       8,7
>>
>> I want to show the agreement of voters measured by >> Fleiss' Kappa. To calculate this with the >> kappam.fleiss function of irr, I need a data.frame like this: >> >> (id of 1st voter) (id of 2nd voter) >> >> party spd cdu >> >> Of course I don't plan to calculate this with the millions of >> cases mentioned in the table above (I am working on a small >> laptop). A division by 1000 would be more than perfect for >> this example. The exact format of the table is generally not >> so important, as I could reshape nearly every format with the >> help of the reshape-package. >> >> Unfortunately I could not figure out how to create such a >> fictive/artificial dataset as described above. Any data.frame >> that keeps at least the percentages would be nice. String-IDs >> of parties could be substituted by numbers of course (would >> be even
Re: [R] simple generation of artificial data with defined features
Dear Mr. Christos Hatzis, thank you so much for your answer, which is in my eyes just brilliant! I followed it step by step (great and detailed explanation) and nearly everything is fine - except a problem at the very end I haven't found a solution for until now. (Despite playing around quite a lot...) Please let me explain:

> election.2005 <- c(16194,13136,3494,3838,4648,4118) #cut off last 3 digits, because my laptop can't handle millions of rows...
> attr(election.2005, "class") <- "table"
> attr(election.2005, "dim") <- c(1,6)
> attr(election.2005, "dimnames") <- list(c("votes"), c("spd", "cdu", "csu", "gruene", "fdp", "pds"))
> head(election.2005)
        spd   cdu  csu gruene  fdp  pds
votes 16194 13136 3494   3838 4648 4118
> el.dt <- as.data.frame(election.2005)
> el.dt.exp <- el.dt[rep(1:nrow(el.dt), el.dt$Freq), -ncol(el.dt)]
> dim(el.dt.exp)
[1] 45428 2
> head(el.dt.exp)
     Var1 Var2
1   votes  spd
1.1 votes  spd
1.2 votes  spd
1.3 votes  spd
1.4 votes  spd
1.5 votes  spd

My problem now is that I would need either an autoincrementing identifier instead of "votes" in Var1, or the possibility to access the numbering by a column name (i.e. Var0). In addition I need a 3rd variable for the year of the election (2005, which is the same for all, but needed later on). So this is what it should look like:

    voter.id party election.year
1          1   spd          2005
1.1        2   spd          2005
1.2        3   spd          2005
1.3        4   spd          2005
1.4        5   spd          2005
1.5        6   spd          2005

The reason for that is the input format of the kappam.fleiss function of the irr package I use for calculation. It accepts a data.frame with the categories as rows (here we would have only one category: the year of the election) and the raters (here the voters) as columns. In the data.frame there will be the chosen party for each combination of election year and voter. This format can be easily achieved using the reshape package. Assuming voter.id were an autoincrementing identifier, the command should be: >library(reshape) >el.dt.exp.molten<-melt(el.dt.exp, id=c("voter.id")) #which would probably not really change anything in this case, because the data is already in a "molten" form >kappa.frame<-cast(el.dt.exp.molten, election.year ~ voter.id, subset=variable=="party") I'd be extremely happy in case you might help me out again! Have a nice weekend and many thanks so far! Greetings from Munich, Felix Mueller-Sarnowski Christos Hatzis wrote: > On the general question on how to create a dataset that matches the > frequencies in a table, function as.data.frame can be useful. It takes as > argument an object of class 'table' and returns a data frame of > frequencies. > > Consider for example table 6.1 of Fleiss et al (3rd Ed): > > >> birth.weight <- c(10,15,40,135) >> attr(birth.weight, "class") <- "table" >> attr(birth.weight, "dim") <- c(2,2) >> attr(birth.weight, "dimnames") <- list(c("A", "Ab"), c("B", "Bb")) >> birth.weight >> > B Bb > A 10 40 > Ab 15 135 > >> summary(birth.weight) >> > Number of cases in table: 200 > Number of factors: 2 > Test for independence of all factors: > Chisq = 3.429, df = 1, p-value = 0.06408 > >> bw.dt <- as.data.frame(birth.weight) >> > > Observations (rows) in this table can then be replicated according to their > corresponding frequencies to yield the expanded dataset that conforms with > the original table.
> > >> bw.dt.exp <- bw.dt[rep(1:nrow(bw.dt), bw.dt$Freq), -ncol(bw.dt)] >> dim(bw.dt.exp) >> > [1] 200 2 > >> table(bw.dt.exp) >> > Var2 > Var1 B Bb > A 10 40 > Ab 15 135 > > The above approach is not restricted to 2x2 tables, and it should be > straightforward to generate datasets that conform to arbitrary nxm frequency > tables. > > -Christos Hatzis > > > >> -Original Message- >> From: [EMAIL PROTECTED] >> [mailto:[EMAIL PROTECTED] On Behalf Of Greg Snow >> Sent: Friday, August 22, 2008 12:41 PM >> To: drflxms; r-help@r-project.org >> Subject: Re: [R] simple generation of artificial data with >> defined features >> >> I don't think that the election data is the right data to >> demonstrate Kappa, you need subjects that are classified by 2 >> or more different raters/methods. The
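A minimal sketch of the two steps asked for here, mirroring the final solution presented earlier in this thread (column names follow the e-mails; adjust to your data):

el.dt.exp <- subset(el.dt.exp, select = -c(Var1))  # drop the constant "votes" column
el.dt.exp$voter.id <- seq(1:nrow(el.dt.exp))       # autoincrementing identifier
el.dt.exp$election.year <- 2005                    # constant election year
library(reshape)
el.molten <- melt(el.dt.exp, id = c("voter.id", "election.year"))
kappa.frame <- cast(el.molten, election.year ~ voter.id)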
[R] simple generation of artificial data with defined features
Dear R-colleagues, I am quite a newbie to R, fighting my stupidity to solve a probably quite simple problem of generating artificial data with defined features. I am conducting a study of inter-observer agreement in child bronchoscopy. One of the most important measures is Kappa according to Fleiss, which is very comfortably available in R through the irr-package. Unfortunately medical doctors like me don't really understand much of statistics. Therefore I'd like to give the reader an easily understandable example of Fleiss' Kappa in the Methods part. To achieve this, I obtained a table with the results of the German election from 2005:

party    number of votes   percent
SPD             16194665      34,2
CDU             13136740      27,8
CSU              3494309       7,4
Gruene           3838326       8,1
FDP              4648144       9,8
PDS              4118194       8,7

I want to show the agreement of voters measured by Fleiss' Kappa. To calculate this with the kappam.fleiss function of irr, I need a data.frame like this: (id of 1st voter) (id of 2nd voter) party spd cdu Of course I don't plan to calculate this with the millions of cases mentioned in the table above (I am working on a small laptop). A division by 1000 would be more than perfect for this example. The exact format of the table is generally not so important, as I could reshape nearly every format with the help of the reshape-package. Unfortunately I could not figure out how to create such a fictive/artificial dataset as described above. Any data.frame that keeps at least the percentages would be nice. String-IDs of parties could be substituted by numbers of course (would be even better for function kappam.fleiss in irr!). I would appreciate any kind of help very much indeed. Greetings from Munich, Felix Mueller-Sarnowski __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
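A minimal sketch of one way to build such a fictive data.frame (using the divided-by-1000 counts; the rep()-based expansion echoes the approach worked out in the replies above):

votes <- c(spd = 16194, cdu = 13136, csu = 3494,
           gruene = 3838, fdp = 4648, pds = 4118)  # vote counts divided by 1000
voters <- data.frame(id = seq_len(sum(votes)),
                     party = factor(rep(names(votes), votes)))
head(voters)
prop.table(table(voters$party))  # the percentages are preserved exactly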