Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution

2012-03-03 Thread drflxms
Please allow one (hopefully ;) last question:
Do you think the code I adopted from Hänggi is valid for selecting a
contour which encloses e.g. 68% of the scatter-plot data? - I am still
not completely sure... I am still looking for the reason for the differing
results of Hänggi's and Forester's code, which should be equivalent in theory.
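
One empirical way to probe this (a sketch, reusing x, y, den and z from the
code posted earlier in the thread): draw a large fresh sample from the same
distribution, look up the estimated density at each new point, and check
which fraction of them lies inside the contour.

level <- quantile(z, probs = 1 - 0.6827, na.rm = TRUE)
x2 <- rnorm(10000); y2 <- rnorm(10000)
# clamp the grid lookup so points outside the kde2d grid stay indexable
ix <- pmin(pmax(findInterval(x2, den$x), 1), length(den$x))
iy <- pmin(pmax(findInterval(y2, den$y), 1), length(den$y))
mean(den$z[cbind(ix, iy)] >= level)  # should land near 0.68 if the contour is valid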

Felix

On 03.03.12 22:13, peter dalgaard wrote:
> 
> On Mar 3, 2012, at 20:25 , drflxms wrote:
> 
>> "Once you go into two dimensions, SD loses all meaning, and adding
>> nonparametric density estimation into the mix doesn't help, so just stop
>> thinking in those terms!"
>>
>> This makes me really think a lot! Is plotting the 0.68 confidence
>> interval in 2D as the equivalent of +-1 SD really nonsense!?
> 
> 
> Nono.., a 68% CI (or rather, prediction region) is a 68% region, no problem
> in that. And choosing that particular value because it is traditional in 1D
> is quite defensible. It's just that there is nothing (at least not anything
> obvious) that it corresponds to 1 SD of.
>



Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution

2012-03-03 Thread drflxms
Puhh, I am really happy to read that the idea was not completely
senseless. That would have heavily damaged my already shaky view of the
world of statistics ;)
In any case I need to get to know more about bivariate normal
distributions! Any literature recommendations?

Felix

On 03.03.12 22:13, peter dalgaard wrote:
> 
> On Mar 3, 2012, at 20:25 , drflxms wrote:
> 
>> "Once you go into two dimensions, SD loses all meaning, and adding
>> nonparametric density estimation into the mix doesn't help, so just stop
>> thinking in those terms!"
>>
>> This makes me really think a lot! Is plotting the 0.68 confidence
>> interval in 2D as the equivalent of +-1 SD really nonsense!?
> 
> 
> Nono.., a 68% CI (or rather, prediction region) is a 68% region, no problem
> in that. And choosing that particular value because it is traditional in 1D
> is quite defensible. It's just that there is nothing (at least not anything
> obvious) that it corresponds to 1 SD of.
> 
>



Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution

2012-03-03 Thread drflxms
Greg, extremely cool thoughts! Thank you for delving into it so deeply.

As I mentioned, I am a neurologist with unfortunately poor statistical
training. You are professional statisticians, so I'd like to apologize
beforehand for any unprofessional nomenclature and strange thoughts.

As my previous postings show, I think a lot in pictures. Therefore I
found the interactive demo(bivar) in the rgl package very impressive.

I believe everything you are writing is completely true for
theoretical/parametric normal distributions. When you plot the density
of such a probability distribution it will have the form of a sugar
loaf (skewed or unskewed doesn't matter in this case). The projection of
the standard deviations of all possible slices onto the x-y-plane below
the sugar loaf will be an ellipse for sure. Correspondingly, you
would get nested ellipses when you apply contour() to the 2d scatter
plot of the data of that bivariate distribution.

But the samples generated via rnorm() are random. demo(bivar) shows
both for such an rnorm() sample: 1.) the "sugar loaf" parametric
theoretical 3d normal density, and 2.) the crumpled non-parametric
"real" 3d density (which looks like a mountain), derived from the
data of the 2d scatter plot. I picked up the terms parametric and
non-parametric from the code of demo(bivar), by the way.
In this case contour() will not draw nested ellipses but contour lines
similar to those on a mountain map.
As far as I understand, this is closer to reality because fewer
assumptions are made compared to the parametric approach. That's why I
preferred contours over ellipses.
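
A small sketch of that contrast on a hypothetical standard bivariate normal
sample (the parametric 68% boundary is a smooth circle here, while the
contour() lines follow the bumps of the estimated surface):

library(MASS)
set.seed(1)
x <- rnorm(500); y <- rnorm(500)
den <- kde2d(x, y, n = 100)
plot(x, y, col = "grey")
contour(den, nlevels = 8, add = TRUE)     # "mountain map" contour lines
theta <- seq(0, 2*pi, length.out = 200)
r <- sqrt(qchisq(0.68, df = 2))           # parametric 68% boundary
lines(r*cos(theta), r*sin(theta), col = "red", lwd = 2)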

Hopefully I have made my thoughts understood and they are not
completely ridiculous.
Thank you again for your detailed explanations, which I am going to study
in detail right now... :) I just wanted to answer before that more
detailed study, as it will certainly take some time...

Best, Felix



On 03.03.12 22:14, Greg Snow wrote:
> To further explain: if you want contours of a bivariate normal, then
> you want ellipses. The density for a bivariate normal (with 0
> correlation to keep things simple, but the theory extends to
> correlated cases) is proportional to exp( -1/2 (x1^2/v1 + x2^2/v2) ),
> so a contour of the distribution will be all points such that x1^2/v1
> + x2^2/v2 = c for some constant c (each c will give a different
> contour), but that is the definition of an ellipse (well, divide both
> sides by c so that the right side is 1 to get the canonical form).
> The ellipse function in the ellipse package chooses c from the chi-
> squared distribution, since if x1 and x2 are normally distributed with
> mean 0 (or have the mean subtracted), then x1^2/v1 + x2^2/v2 is chi-
> squared distributed with 2 degrees of freedom.
> 
> So if you really want to you can try to approximate the contours in
> some other way, but any decent approach will just converge to the
> ellipse.
> 
> On Sat, Mar 3, 2012 at 1:26 PM, drflxms  wrote:
>> Wow, David,
>>
>> thank you for these sources, which I just screened. bagplot looks most
>> promising to me. I found it in the package ‘aplpack’ as well as in the R
>> Graph Gallery
>> http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=112
>>
>> Ellipses are not exactly what I am heading for. I am looking for a 2D
>> equivalent of plotting 1D standard deviation boundaries. In other words:
>> how to plot SD boundaries on a 2D scatter plot.
>> So I started searching for contour/boundaries etc. instead of ellipse,
>> which led me to Pascal Hänggi:
>> http://www.mail-archive.com/r-help@r-project.org/msg24013.html
>>
>> To describe it in an image: I want to cut the density mountain above the
>> scatter plot (see demo(bivar) in the rgl package) in a way so that the
>> part of the mountain that covers 68% of the data on the x-y-plane below
>> it (+-1 SD) is removed. Then I'd like to project the edge that results
>> from the cut onto the x-y-plane below the mountain. This should be the 2d
>> equivalent of 1 SD boundaries.
>>
>> I think this might be achieved by Hänggi's code as well as by the
>> function of Forester. Unfortunately they result in slightly different
>> boundaries, which shouldn't be the case. And I did not figure out which
>> one is correct - if one is correct at all (!?).
>>
>> Can anyone explain the difference?
>>
>> I compared them with this code:
>>
>> # parameters:
>> n<-100
>>
>> # generate samples:
>> set.seed(138813)
>> x<-rnorm(n); y<-rnorm(n)
>> a<-list("x"=x,"y"=y) # input for Forester's function (appended at the very end)
>>
>> # estimate non-parametric density surface via kern

Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution

2012-03-03 Thread drflxms
Wow, David,

thank you for these sources, which I just screened. bagplot looks most
promising to me. I found it in the package ‘aplpack’ as well as in the R
Graph Gallery
http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=112

Ellipses are not exactly what I am heading for. I am looking for a 2D
equivalent of plotting 1D standard deviation boundaries. In other words:
how to plot SD boundaries on a 2D scatter plot.
So I started searching for contour/boundaries etc. instead of ellipse,
which led me to Pascal Hänggi:
http://www.mail-archive.com/r-help@r-project.org/msg24013.html

To describe it in an image: I want to cut the density mountain above the
scatter plot (see demo(bivar) in the rgl package) in a way so that the
part of the mountain that covers 68% of the data on the x-y-plane below
it (+-1 SD) is removed. Then I'd like to project the edge that results
from the cut onto the x-y-plane below the mountain. This should be the 2d
equivalent of 1 SD boundaries.

I think this might be achieved by Hänggi's code as well as by the
function of Forester. Unfortunately they result in slightly different
boundaries, which shouldn't be the case. And I did not figure out which
one is correct - if one is correct at all (!?).

Can anyone explain the difference?

I compared them with this code:

# parameters:
n<-100

# generate samples:
set.seed(138813)
x<-rnorm(n); y<-rnorm(n)
a<-list("x"=x,"y"=y) # input for Forester's function (appended at the very end)

# estimate non-parametric density surface via kernel smoothing
library(MASS)
den<-kde2d(x, y, n=n)

z <- array()
for (i in 1:n){
    z.x <- max(which(den$x < x[i]))
    z.y <- max(which(den$y < y[i]))
    z[i] <- den$z[z.x, z.y]
}

# store the class/level border of the confidence interval in a variable
confidence.border <- quantile(z, probs=0.05, na.rm = TRUE) # 0.05 corresponds to 0.95 in draw.contour

plot(x,y)
draw.contour(a, alpha=0.95)
par(new=TRUE)
contour(den, levels=confidence.border, col = "red", add = TRUE)


###
## drawcontour.R
## Written by J.D. Forester, 17 March 2008
###

## This function draws an approximate density contour based on empirical, bivariate data.

##change testit to FALSE if sourcing the file
testit=TRUE

draw.contour<-function(a, alpha=0.95, plot.dens=FALSE, line.width=2,
                       line.type=1, limits=NULL, density.res=800,
                       spline.smooth=-1, ...){
  ## a is a list or matrix of x and y coordinates
  ##   (e.g., a=list("x"=rnorm(100),"y"=rnorm(100)))
  ## if a is a list or dataframe, the components must be labeled "x" and "y"
  ## if a is a matrix, the first column is assumed to be x, the second y
  ## alpha is the contour level desired
  ## if plot.dens==TRUE, then the joint density of x and y is plotted,
  ## otherwise the contour is added to the current plot.
  ## density.res controls the resolution of the density plot

  ## A key assumption of this function is that very little probability mass
  ## lies outside the limits of the x and y values in "a". This is likely
  ## reasonable if the number of observations in a is large.

  require(MASS)
  require(ks)
  if(length(line.width)!=length(alpha)){
line.width <- rep(line.width[1],length(alpha))
  }

  if(length(line.type)!=length(alpha)){
line.type <- rep(line.type[1],length(alpha))
  }

  if(is.matrix(a)){
a=list("x"=a[,1],"y"=a[,2])
  }
  ##generate approximate density values
  if(is.null(limits)){
limits=c(range(a$x),range(a$y))
  }
  f1<-kde2d(a$x,a$y,n=density.res,lims=limits)

  ##plot empirical density
  if(plot.dens) image(f1,...)

  if(is.null(dev.list())){
##ensure that there is a window in which to draw the contour
plot(a,type="n",xlab="X",ylab="Y")
  }

  ##estimate critical contour value
  ## assume that density outside of plot is very small

  zdens <- rev(sort(f1$z))
  Czdens <- cumsum(zdens)
  Czdens <- (Czdens/Czdens[length(zdens)])
  for(cont.level in 1:length(alpha)){
    ##This loop allows for multiple contour levels
    crit.val <- zdens[max(which(Czdens<=alpha[cont.level]))]

    ##determine coordinates of critical contour
    b.full=contourLines(f1,levels=crit.val)
    for(c in 1:length(b.full)){
      ##This loop is used in case the density is multimodal or if the
      ##  desired contour extends outside the plotting region
      b=list("x"=as.vector(unlist(b.full[[c]][2])),
             "y"=as.vector(unlist(b.full[[c]][3])))

      ##plot desired contour
      line.dat<-xspline(b,shape=spline.smooth,open=TRUE,draw=FALSE)
      lines(line.dat,lty=line.type[cont.level],lwd=line.width[cont.level])
    }
  }
}

##
##Test the function
##

##generate data
if(testit){
  n=1
  a<-list("x"=rnorm(n,400

Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution

2012-03-03 Thread drflxms
Thank you very much for your thoughts!

Exactly what you mention is what I have been thinking about during the
last hours: what is the relation between the den$z distribution and the
z distribution?
That's why I asked for ecdf(distribution)(value)->percentile earlier
today (thank you again for your quick and insightful answer on
that!). I used it to compare certain values in both distributions by
their percentile.

I really think you are completely right: I urgently need some lessons on
bivariate/multivariate normal distributions. (I am a neurologist and
unfortunately did not learn too much about statistics at university :-()
I'll take your statement as a starter:

"Once you go into two dimensions, SD loses all meaning, and adding
nonparametric density estimation into the mix doesn't help, so just stop
thinking in those terms!"

This makes me really think a lot! Is plotting the 0.68 confidence
interval in 2D as the equivalent of +-1 SD really nonsense!?

By the way: it all started very harmlessly. I was asked to draw an example
of the well-known target analogy for accuracy and precision based on "real"
(=simulated) data (see e.g.
http://en.wikipedia.org/wiki/Accuracy_and_precision for a simple hand-made
2d graphic).

Well, I did so by

set.seed(138813)
x<-rnorm(n); y<-rnorm(n)
plot(x,y)

I was asked whether it might be possible to add a histogram with a
superimposed normal curve to the drawing: no problem. "And where is the
standard deviation?" - well, abline(v=sd(... OK.

Then I realized that this is of course only true for one of the
distributions (x) and only in one "slice" of the scatterplot of x and y.
The real thing is a 3d density map above the scatterplot. A very nice
example of this is demo(bivar) in the rgl package (for a picture see e.g.
http://rgl.neoscientists.org/gallery.shtml, upper right corner).

Great! But how to correctly draw the standard deviation boundaries for
the "shots on the target" (the scatterplot of x and y)...

I'd be grateful for hints on what to read on that matter (book, website,
etc.).

Greetings from Munich, Felix.


On 03.03.12 19:22, peter dalgaard wrote:
> 
> On Mar 3, 2012, at 17:01 , drflxms wrote:
> 
>> # this is the critical block, which I still do not comprehend in detail
>> z <- array()
>> for (i in 1:n){
>>z.x <- max(which(den$x < x[i]))
>>z.y <- max(which(den$y < y[i]))
>>z[i] <- den$z[z.x, z.y]
>> }
> 
> As far as I can tell, the point is to get at density values corresponding to 
> the values of (x,y) that you actually have in your sample, as opposed to 
> den$z which is for an extended grid of all possible (x_i, y_j) combinations. 
> 
> It's unclear to me what happens if you look at quantiles for the entire 
> den$z. I kind of suspect that it is some sort of approximate numerical 
> integration, but maybe not of the right thing 
> 
> Re SD: Once you go into two dimensions, SD loses all meaning, and adding 
> nonparametric density estimation into the mix doesn't help, so just stop 
> thinking in those terms! 
>



Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution

2012-03-03 Thread drflxms
Thanx a lot, Greg, for the hint and for not leaving me alone with this!

I tried ellipse and it works well. But I am looking for something more
precise: the ellipse approximates the real border by the nearest possible
ellipse. I want the "real" contour, if possible.

Meanwhile I found an interesting function named draw.contour
http://quantitative-ecology.blogspot.com/2008/03/plotting-contours.html
provided by "Forester", an "Assistant Professor in the Department of
Fisheries". - I love how statistics brings people from completely
different specialities together. - I am a neurologist.

This function does not result in exactly the same curve I got with the
code I posted last, but it is indeed very close (parallel). So I think
both solutions are probably correct and differ only because of rounding
and method (..!?).

Still I am not completely happy with

1. the numeric solution of setting probs to 68% to get 1 SD. I'd prefer a
symbolic, formula-driven solution instead.
2. me not comprehending the code completely.

While 2 will hopefully be solved soon by my delving into the code, 1
remains...
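
On point 1, for the parametric case there is a formula-driven route (a
sketch for a standard bivariate normal): the region holding a fraction p of
the mass is bounded by the Mahalanobis radius r with r^2 = qchisq(p, 2), so
the 68% never has to be hard-coded.

p <- pnorm(1) - pnorm(-1)     # 1-SD coverage in 1D, computed rather than typed
r <- sqrt(qchisq(p, df = 2))  # radius of the corresponding 2D region
f <- exp(-r^2 / 2) / (2*pi)   # density level along that contour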

Best, Felix


On 03.03.12 17:28, Greg Snow wrote:
> Look at the ellipse package (and the ellipse function in the package)
> for a simple way of showing a confidence region for bivariate data on
> a plot (a 68% confidence interval is about 1 SD if you just want to
> show 1 SD).
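
A minimal sketch of that suggestion (assuming the ellipse package is
installed, and using the identity covariance of this thread's standard
bivariate normal sample):

library(ellipse)
plot(x, y)
lines(ellipse(diag(2), level = 0.68), col = "blue", lwd = 2)
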
> 
> On Sat, Mar 3, 2012 at 7:54 AM, drflxms  wrote:
>> Dear all,
>>
>> I created a bivariate normal distribution:
>>
>> set.seed(138813)
>> n<-100
>> x<-rnorm(n); y<-rnorm(n)
>>
>> and plotted a scatterplot of it:
>>
>> plot(x,y)
>>
>> Now I'd like to add the 2D-standard deviation.
>>
>> I found a thread regarding plotting arbitrary confidence boundaries from
>> Pascal Hänggi
>> http://www.mail-archive.com/r-help@r-project.org/msg24013.html
>> which cites the even older thread
>> http://tolstoy.newcastle.edu.au/R/help/03b/5384.html
>>
>> As I am unfortunately only a very poor R programmer, the code of Pascal
>> Hänggi is a mystery to me and I am not sure whether I was able to translate
>> the recommendation of Brian Ripley in the later thread (which provides
>> no code) into the correct R code. Brian wrote:
>>
>> You need a 2D density estimate (e.g. kde2d in MASS) then compute the
>> density values at the points and draw the contour of the density which
>> includes 95% of the points (at a level computed from the sorted values
>> via quantile()). [95% confidence interval was desired in thread instead
>> of standard deviation...]
>>
>> So I tried this...
>>
>> den<-kde2d(x, y, n=n) #as I chose n to be the same as during creating
>> the distributions x and y (see above), a z-value is assigned to every
>> combination of x and y.
>>
>> # create a sorted vector of z-values (instead of the matrix stored
>> inside the den object)
>> den.z <-sort(den$z)
>>
>> # set desired confidence border to draw and store it in variable
>> confidence.border <- quantile(den.z, probs=0.6827, na.rm = TRUE)
>>
>> # draw a line representing confidence.border on the existing scatterplot
>> par(new=TRUE)
>> contour(den, levels=confidence.border, col = "red", add = TRUE)
>>
>> Unfortunately I doubt very much this is correct :( In fact I am sure
>> this is wrong, because the border for probs=0.05 is drawn outside the
>> values. So please help and check.
>> Pascal Hänggi's code seems to work, but I don't understand the magic he
>> does with
>>
>> pp <- array()
>> for (i in 1:1000){
>>z.x <- max(which(den$x < x[i]))
>>z.y <- max(which(den$y < y[i]))
>>pp[i] <- den$z[z.x, z.y]
>> }
>>
>> before doing the very same as I did above:
>>
>> confidencebound <- quantile(pp, 0.05, na.rm = TRUE)
>>
>> plot(x, y)
>> contour(den, levels = confidencebound, col = "red", add = TRUE)
>>
>>
>> My problems:
>>
>> 1.) setting probs=0.6827 is somehow a dirty trick which I can only use
>> by simply knowing that this is the percentage of values inside +-1 SD
>> when a distribution is normal. Is there a way of doing this with the
>> "native" sd function?
>> sd(den.z) is not correct, as den.z, in contrast to x and y, is not normal
>> any more. So ecdf(den.z)(sd(den.z)) results in a percentile of 0.5644 in
>> this example instead of the desired 0.6827.
>>
>> 2.) I would like to have code that works with any desired confidence.
>> Unfortunately setting probs to the desired confidence would probably b

Re: [R] contour for plotting confidence interval on scatter plot of bivariate normal distribution

2012-03-03 Thread drflxms
OK, the following seems to work.
I still do not understand exactly why...

library(MASS)

# parameters:
n<-100

# generate samples:
set.seed(138813)
#seed <- .Random.seed
x<-rnorm(n); y<-rnorm(n)

# estimate non-parametric density surface via kernel smoothing
den<-kde2d(x, y, n=n)
# store z values of density estimate in an extra variable
den.z <-den$z

# this is the critical block, which I still do not comprehend in detail
z <- array()
for (i in 1:n){
    z.x <- max(which(den$x < x[i]))
    z.y <- max(which(den$y < y[i]))
    z[i] <- den$z[z.x, z.y]
}
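
A hedged reading of that block: for each sample point it finds the kde2d
grid cell the point falls into and records the estimated density there.
Nearly the same lookup can be written without the loop (a sketch;
findInterval uses <= where the loop uses <, which only matters for points
lying exactly on a grid line):

z.alt <- den$z[cbind(findInterval(x, den$x), findInterval(y, den$y))]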

# store class/level borders of confidence interval in variables
confidence.border <- quantile(z, probs=1-0.6827, na.rm = TRUE) # +-1sd

plot(x,y)
par(new=TRUE)
contour(den, levels=confidence.border, col = "red", add = TRUE)

The results look at least plausible. I do not know if they are really true.
Still hoping for some replies ;), Felix


On 03.03.12 15:54, drflxms wrote:
> Dear all,
> 
> I created a bivariate normal distribution:
> 
> set.seed(138813)
> n<-100
> x<-rnorm(n); y<-rnorm(n)
> 
> and plotted a scatterplot of it:
> 
> plot(x,y)
> 
> Now I'd like to add the 2D-standard deviation.
> 
> I found a thread regarding plotting arbitrary confidence boundaries from
> Pascal Hänggi
> http://www.mail-archive.com/r-help@r-project.org/msg24013.html
> which cites the even older thread
> http://tolstoy.newcastle.edu.au/R/help/03b/5384.html
> 
> As I am unfortunately only a very poor R programmer, the code of Pascal
> Hänggi is a mystery to me and I am not sure whether I was able to translate
> the recommendation of Brian Ripley in the later thread (which provides
> no code) into the correct R code. Brian wrote:
> 
> You need a 2D density estimate (e.g. kde2d in MASS) then compute the
> density values at the points and draw the contour of the density which
> includes 95% of the points (at a level computed from the sorted values
> via quantile()). [95% confidence interval was desired in thread instead
> of standard deviation...]
> 
> So I tried this...
> 
> den<-kde2d(x, y, n=n) #as I chose n to be the same as during creating
> the distributions x and y (see above), a z-value is assigned to every
> combination of x and y.
> 
> # create a sorted vector of z-values (instead of the matrix stored
> inside the den object)
> den.z <-sort(den$z)
> 
> # set desired confidence border to draw and store it in variable
> confidence.border <- quantile(den.z, probs=0.6827, na.rm = TRUE)
> 
> # draw a line representing confidence.border on the existing scatterplot
> par(new=TRUE)
> contour(den, levels=confidence.border, col = "red", add = TRUE)
> 
> Unfortunately I doubt very much this is correct :( In fact I am sure
> this is wrong, because the border for probs=0.05 is drawn outside the
> values. So please help and check.
> Pascal Hänggi's code seems to work, but I don't understand the magic he
> does with
> 
> pp <- array()
> for (i in 1:1000){
> z.x <- max(which(den$x < x[i]))
> z.y <- max(which(den$y < y[i]))
> pp[i] <- den$z[z.x, z.y]
> }
> 
> before doing the very same as I did above:
> 
> confidencebound <- quantile(pp, 0.05, na.rm = TRUE)
> 
> plot(x, y)
> contour(den, levels = confidencebound, col = "red", add = TRUE)
> 
> 
> My problems:
> 
> 1.) setting probs=0.6827 is somehow a dirty trick which I can only use
> by simply knowing that this is the percentage of values inside +-1 SD
> when a distribution is normal. Is there a way of doing this with the
> "native" sd function?
> sd(den.z) is not correct, as den.z, in contrast to x and y, is not normal
> any more. So ecdf(den.z)(sd(den.z)) results in a percentile of 0.5644 in
> this example instead of the desired 0.6827.
> 
> 2.) I would like to have code that works with any desired confidence.
> Unfortunately setting probs to the desired confidence would probably be
> wrong (?!) as it relates to den.z instead of x and y, which are the
> underlying distributions I am interested in. To put it short I want the
> confidence of x/y and not of den.z.
> 
> 
> I am really completely stuck. Please help me out of this! Felix
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] contour for plotting confidence interval on scatter plot of bivariate normal distribution

2012-03-03 Thread drflxms
Dear all,

I created a bivariate normal distribution:

set.seed(138813)
n<-100
x<-rnorm(n); y<-rnorm(n)

and plotted a scatterplot of it:

plot(x,y)

Now I'd like to add the 2D-standard deviation.

I found a thread regarding plotting arbitrary confidence boundaries from
Pascal Hänggi
http://www.mail-archive.com/r-help@r-project.org/msg24013.html
which cites the even older thread
http://tolstoy.newcastle.edu.au/R/help/03b/5384.html

As I am unfortunately only a very poor R programmer, the code of Pascal
Hänggi is a mystery to me and I am not sure whether I was able to translate
the recommendation of Brian Ripley in the later thread (which provides
no code) into the correct R code. Brian wrote:

You need a 2D density estimate (e.g. kde2d in MASS) then compute the
density values at the points and draw the contour of the density which
includes 95% of the points (at a level computed from the sorted values
via quantile()). [95% confidence interval was desired in thread instead
of standard deviation...]

So I tried this...

den<-kde2d(x, y, n=n) # as I chose n to be the same as when creating
the distributions x and y (see above), a z-value is assigned to every
combination of x and y.

# create a sorted vector of z-values (instead of the matrix stored
inside the den object)
den.z <-sort(den$z)

# set desired confidence border to draw and store it in variable
confidence.border <- quantile(den.z, probs=0.6827, na.rm = TRUE)

# draw a line representing confidence.border on the existing scatterplot
par(new=TRUE)
contour(den, levels=confidence.border, col = "red", add = TRUE)

Unfortunately I doubt very much this is correct :( In fact I am sure
this is wrong, because the border for probs=0.05 is drawn outside the
values. So please help and check.
Pascal Hänggi's code seems to work, but I don't understand the magic he
does with

pp <- array()
for (i in 1:1000){
    z.x <- max(which(den$x < x[i]))
    z.y <- max(which(den$y < y[i]))
    pp[i] <- den$z[z.x, z.y]
}

before doing the very same as I did above:

confidencebound <- quantile(pp, 0.05, na.rm = TRUE)

plot(x, y)
contour(den, levels = confidencebound, col = "red", add = TRUE)


My problems:

1.) setting probs=0.6827 is somehow a dirty trick which I can only use
by simply knowing that this is the percentage of values inside +-1 SD
when a distribution is normal. Is there a way of doing this with the
"native" sd function?
sd(den.z) is not correct, as den.z, in contrast to x and y, is not normal
any more. So ecdf(den.z)(sd(den.z)) results in a percentile of 0.5644 in
this example instead of the desired 0.6827.

2.) I would like to have code that works with any desired confidence.
Unfortunately setting probs to the desired confidence would probably be
wrong (?!), as it relates to den.z instead of x and y, which are the
underlying distributions I am interested in. To put it short: I want the
confidence of x/y and not of den.z.


I am really completely stuck. Please help me out of this! Felix



Re: [R] percentile of a given value: is there a "reverse" quantile function?

2012-03-03 Thread drflxms
Thank you a lot, Peter, Stefan and Pascal,

for your quick and inspiring answers.

ecdf(distribution)(value) -> percentile was exactly what I was looking
for, as it is, in my eyes, somehow the counterpart of
quantile(distribution, percentile) -> value, isn't it?
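
A tiny round trip on a hypothetical sample, showing the pairing:

set.seed(1)
x <- rnorm(1000)
v <- quantile(x, 0.75)  # percentile -> value
ecdf(x)(v)              # value -> percentile: 0.75 again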

Greetings from sunny Munich, Felix


On 03.03.12 14:33, peter dalgaard wrote:
> 
> On Mar 3, 2012, at 13:37 , drflxms wrote:
> 
>> Dear all,
>>
>> I am familiar with obtaining the value corresponding to a chosen
>> probability via the quantile function.
>> Now I am facing the opposite problem: I have a value and want to know its
>> corresponding percentile in the distribution. So is there a function for
>> this as well?
> 
> For a single value, mean(x <= a) will do. Otherwise check ecdf().
> 
>> x <- rnorm(100)
>> mean(x <= 2)
> [1] 0.97
>> ecdf(x)(2)
> [1] 0.97
>> ecdf(x)(-3:3)
> [1] 0.00 0.01 0.14 0.48 0.80 0.97 1.00
> 
> if you need values for your original data points, rank(x)/length(x) should do 
> (bar missing value issues).
> 
>>
>> Thank you for your support in advance, Felix
>>
>



[R] percentile of a given value: is there a "reverse" quantile function?

2012-03-03 Thread drflxms
Dear all,

I am familiar with obtaining the value corresponding to a chosen
probability via the quantile function.
Now I am facing the opposite problem: I have a value and want to know its
corresponding percentile in the distribution. So is there a function for
this as well?

Thank you for your support in advance, Felix



[R] plotting standard deviation of multivariate normal distribution (preferred in rgl package)

2012-03-02 Thread drflxms
Dear R colleagues,

for a statistics tutorial I want to develop a nice 3d graphic of the
well-known target comparison/analogy of accuracy and precision (see e.g.
http://en.wikipedia.org/wiki/Accuracy_and_precision for a simple hand-made
2d graphic).

The code for a really beautiful graphic is already provided as
demo(bivar) in the rgl package (for a picture see e.g.
http://rgl.neoscientists.org/gallery.shtml, upper right corner).

Now I'd like to add the standard deviation to the 3d plot. Unfortunately
I couldn't figure out how to do that.

What I did so far is:

Assuming you have executed the code of the demo mentioned above, you can
plot a 2d scatter plot of the data ("the shots on the target") via a simple

plot(x,y)

and add a contour plot via

par(new=TRUE)
contour(denobj, nlevels=8)

So the contour levels are about what I am looking for but...

1. how do I make the contour plot display standard deviations as
contours rather than arbitrary levels?

2. how do I project the standard deviation contours onto the 3d surface?
(one possible direction is sketched below)
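
One speculative direction for question 2 (untested; it assumes denobj is the
kde2d-style density object from the demo and that the rgl scene uses the
same data coordinates): take the contour coordinates from contourLines()
and draw them at the corresponding height with lines3d().

library(rgl)
lev <- quantile(denobj$z, probs = 0.95)  # an arbitrary level, for illustration
for (cl in contourLines(denobj$x, denobj$y, denobj$z, levels = lev))
    lines3d(cl$x, cl$y, lev, col = "red", lwd = 2)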

Probably there are (as always) lots of different solutions as well.
I'd appreciate any kind of help very much!

Greetings from Munich, Felix



Re: [R] splitting strings effriciently

2012-01-08 Thread drflxms
Hi Andrew,

I am aware that this is an R mailing list, but for such tasks (I deal a
lot with huge genomic datasets) I tend to use awk and sed for
preprocessing of the data, in case I run into performance problems.
Otherwise, for handling strings in R, I recommend the stringr library,
but I don't know about its performance...

Felix
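
For Andrew's loop quoted below, a vectorized base-R sketch (untested against
his data; it assumes every ComputerName splits into exactly four parts):

parts <- do.call(rbind, strsplit(data$ComputerName, ".", fixed = TRUE))
data$IPA <- parts[, 1]
data$IPB <- parts[, 2]
data$IPC <- parts[, 3]
data$IPD <- parts[, 4]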

> Folks,
> 
> I have a data frame with 4861469 rows that contains an ip address 
> xxx.xxx.xxx.xxx as one of the columns. I want to assign a site to each 
> row based on IP ranges. To do this I have a function to split the ip 
> address as character into class A,B,C and D components. It works but is 
> horribly inefficient in terms of speed. I can't quite see how one of the 
> l/s/m/t/apply functions could be brought to bear on the problem. Does 
> anyone have any thoughts?
> 
> for(i in 1:4861469)
> {
>   lst <- unlist(strsplit(data$ComputerName[i], "\\."))
>   data$IPA[i] <- lst[[1]]
>   data$IPB[i] <- lst[[2]]
>   data$IPC[i] <- lst[[3]]
>   data$IPD[i] <- lst[[4]]
>   rm(lst)
> }
> 
> Andrew
> 
> Andrew Roberts
> Children's Orthopaedic Surgeon
> RJAH, Oswestry, UK



Re: [R] colouring a table, data.frame or matrix in an interactive R session

2012-01-08 Thread drflxms
On 07.01.12 21:12, Michael Friendly wrote:
> On 1/7/2012 5:48 AM, drflxms wrote:
>> Hi everybody,
>>
>> as I am dealing with complex confusion matrices, I wonder whether there
>> might be a way to colour text/tabular data in R. E.g. imagine
>> highlighting the true positive values or certain classes in a table.
>>
>> I know how to colour text in graphical output as well as how to sweave
>> or odfWeave coloured tables. But is there a way to do it directly in an
>> interactive R session?
>> Of course I understand that terminals originally haven't been designed
>> for coloured output and that one can easily extract every information
>> needed by subscripting. But still coloured text in the interactive
>> session would sometimes just be comfortable.
>>
> 
> It depends on what kind of terminal you are using, but maybe not worth
> the effort.
> lattice::levelplot can do a nice job of coloring cells in a table.
> 

Well, not exactly what I was looking for, I guess, as I'd like to colour
text in the terminal in this case, but definitely a great package which
is very useful for another project ;) Thanx a lot for your help!
Felix



Re: [R] colouring a table, data.frame or matrix in an interactive R session

2012-01-08 Thread drflxms
On 08.01.12 13:52, Vincent Zoonekynd wrote:
> On 7 January 2012 19:48, drflxms  wrote:
> 
>> as I am dealing with complex confusion matrices, I wonder whether there
>> might be a way to colour text/tabular data in R. E.g. imagine
>> highlighting the true positive values or certain classes in a table.
> 
> The "colorout" package does part of what you want: colouring the
> output, in a terminal.
> But some more work may be needed if you want to change the colours
> depending on the values of your matrix.
> 
> -- Vincent

Great! That is in fact almost exactly what I was looking for. Thank you
very much for the hint!!!
Working on a MacBook Pro (OS X Lion) and using Aquamacs (GNU Emacs) with
ESS, I can confirm that it runs on the Mac just as described for Unix in
general. Pretty cool, although, as you already mentioned, colouring
depending on conditions is not implemented yet.

I'll send an e-mail to the author asking whether he thinks the library
can be developed in that direction - I don't know too much about the
technical details of the shell, unfortunately. But that may change ;)

Felix



[R] colouring a table, data.frame or matrix in an interactive R session

2012-01-07 Thread drflxms
Hi everybody,

as I am dealing with complex confusion matrices, I wonder whether there
might be a way to colour text/tabular data in R. E.g. imagine
highlighting the true positive values or certain classes in a table.

I know how to colour text in graphical output as well as how to Sweave
or odfWeave coloured tables. But is there a way to do it directly in an
interactive R session?
Of course I understand that terminals originally weren't designed
for coloured output and that one can easily extract every piece of
information needed by subscripting. But still, coloured text in the
interactive session would sometimes just be convenient.

Looking forward to read from you,
Greetings from Munich, Felix



Re: [R] sorting a data.frame (df) by a vector (which is not contained in the df) - unexpected behaviour of match and factor

2011-12-29 Thread drflxms
Jeff,

thanks a lot for your quick reply and the hint!

Meanwhile I found a solution that works - at least for my case ;)
The code to get the job done is

df[order(match(df$level,desiredOrder)),]

So we seem to need one more order() call. I found this solution by
doing it stepwise:

## sorting the levels of the level column in the data.frame
df$level <- factor(df$level,levels=desiredOrder)
## sorting the data frame by the newly sorted level column
df[order(df$level),]

Maybe this solution is of help for someone else as well?

But honestly, I still do not exactly understand why
df[match(df$level,desiredOrder),] doesn't work...
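
A brief illustration of why the extra order() is needed (hypothetical
three-element example): match() returns, for each row, the position its
level occupies in desiredOrder, i.e. the rank the row should receive, but
subscripting needs the opposite information, namely which row to take
first, second, third; order() performs exactly that inversion.

lev  <- c("c", "a", "b")
want <- c("a", "b", "c")
match(lev, want)              # 3 1 2 : target position of each row
lev[match(lev, want)]         # "b" "c" "a" : the wrong order
lev[order(match(lev, want))]  # "a" "b" "c" : the desired order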

Cheers, Felix

On 29.12.11 10:58, Jeff Newmiller wrote:
> Your desiredOrder vector is a vector of strings. Convert it to numeric and it 
> should work.
> 
> drflxms  wrote:
> 
>> Dear R colleagues,
>>
>> consider my data.frame named "df" with 3 columns - being level,
>> prevalence and sensitivity - and 7 rows of data (see dump below).
>>
>> df <-
>> structure(list(level = structure(1:7, .Label = c("0", "1", "10",
>> "100", "1010", "11", "110"), class = "factor"), prevalence =
>> structure(c(4L,
>> 2L, 3L, 5L, 6L, 1L, 7L), .Label = c("0.488", "0.5", "0.754",
>> "0.788", "0.803", "0.887", "0.905"), class = "factor"), sensitivity =
>> structure(c(6L,
>> 1L, 5L, 4L, 3L, 2L, 1L), .Label = c("0", "0.05", "0.091", "0.123",
>> "0.327", "0.933"), class = "factor")), .Names = c("level",
>> "prevalence",
>> "sensitivity"), class = "data.frame", row.names = c(NA, -7L))
>>
>> I'd like to order df by a vector which is NOT contained in the
>> data.frame. Let's call this vector desiredOrder (see dump below).
>>
>> desiredOrder <- c("0", "1", "10", "100", "11", "110", "1010")
>>
>> So after sorting, the order of the level column (df$level) should be in
>> the order of the vector desiredOrder (as well as the associated data in
>> the other columns).
>> I know that this is not an easy task to achieve by order(...) as the
>> order of desiredOrder isn't a natural one. But I would expect both of
>> the following to work:
>>
>> ## using match
>> df[match(df$level,desiredOrder),]
>>
>> ## using factor
>> df[factor(df$level,levels=desiredOrder),]
>>
>> Unfortunately the result isn't what I expected: I get a data.frame with
>> the level column in the order 0,1,10,100,110,1010,11 instead of the
>> order in desiredOrder (0,1,10,100,11,110,1010).
>>
>> Does anybody see, what I am doing wrong?
>> I'd appreciate any kind of help very much!
>> Best regards, Felix
>>
>



[R] sorting a data.frame (df) by a vector (which is not contained in the df) - unexpected behaviour of match and factor

2011-12-29 Thread drflxms
Dear R colleagues,

consider my data.frame named "df" with 3 columns - being level,
prevalence and sensitivity - and 7 rows of data (see dump below).

df <-
structure(list(level = structure(1:7, .Label = c("0", "1", "10",
"100", "1010", "11", "110"), class = "factor"), prevalence =
structure(c(4L,
2L, 3L, 5L, 6L, 1L, 7L), .Label = c("0.488", "0.5", "0.754",
"0.788", "0.803", "0.887", "0.905"), class = "factor"), sensitivity =
structure(c(6L,
1L, 5L, 4L, 3L, 2L, 1L), .Label = c("0", "0.05", "0.091", "0.123",
"0.327", "0.933"), class = "factor")), .Names = c("level", "prevalence",
"sensitivity"), class = "data.frame", row.names = c(NA, -7L))

I'd like to order df by a vector which is NOT contained in the
data.frame. Let's call this vector desiredOrder (see dump below).

desiredOrder <- c("0", "1", "10", "100", "11", "110", "1010")

So after sorting, the order of the level column (df$level) should be in
the order of the vector desiredOrder (as well as the associated data in
the other columns).
I know that this is not an easy task to achieve by order(...) as the
order of desiredOrder isn't a natural one. But I would expect both of
the following to work:

## using match
df[match(df$level,desiredOrder),]

## using factor
df[factor(df$level,levels=desiredOrder),]

Unfortunately the result isn't what I expected: I get a data.frame with
the level column in the order 0,1,10,100,110,1010,11 instead of the
order in desiredOrder (0,1,10,100,11,110,1010).

Does anybody see, what I am doing wrong?
I'd appreciate any kind of help very much!
Best regards, Felix



Re: [R] How to use a variable after $ ($x) for subscripting a list, How to source from a character object (source(x))

2011-09-13 Thread drflxms
Thank you very much!

foreach(i=1:length(levels)) %do% {listResult[[i]][[input]]} (instead of
listResult[[i]]$input) does the trick!!!

Sorry for being a little bit confusing in describing the problem.
My example object listResult is meant to be a list object with several
levels. More concretely: it contains the output of the confusionMatrix
function from library(caret), as well as other output, for several
confusion matrices.
I hope this is a little bit clearer now?

Thank you very much for your examples, which help me to understand how
to program similar problems.

Thanx and all the best, happy Felix

On 13.09.11 10:29, R. Michael Weylandt wrote:
> I had a little bit of trouble following what exactly you mean to do
> (specifically, I'm having a little bit of trouble understanding your sample
> object "listResult" -- is it a list or a set of factors or a list of lists
> or something else entirely), but I believe that replacing the `$` operator
> with additional copies of  `[` or `[[` as appropriate should get the job
> done. $ is, for almost all purposes, a less flexible version of `[[`.
> 
> Consider the following:
> 
> R> a = list(a1 = rnorm(5), rnorm(6), rnorm(7))
> R> b = list(runif(5), b2 = runif(6), runif(7))
> R> c = list(-runif(5), -runif(6), c3 = -runif(7))
> 
> R> A = list(a = a,b = b,c = c)
> 
> R> for( i in seq_along(A)) { print(A[[i]][[3]][i]) }
> 
> which seems to work for me (even if it is frustratingly opaque as to its
> function). Specifically,
> 
> R> identical(A$a, A[["a"]])
> TRUE
> 
> R> identical(A$a,A[[1]])
> TRUE
> 
> If this doesn't work, please clarify what exactly the object you have is and
> what you are trying to get out of it and I'll give a more concrete answer.
> 
> On Tue, Sep 13, 2011 at 4:11 AM, drflxms  wrote:
> 
>> Dear R colleagues,
>>
>> as result of a function a huge list is generated. From this result list
>> I'd like to extract information.
>>
>> Think of the list i.e. as an object named "listResult" with the
>> following form:
>>
>> [[a]]
>>  [1] [2] [3] [4] [5]
>>
>> [[b]]
>>  [1] [2] [3] [4] [5]
>>
>> [[c]]
>>  [1] [2] [3] [4] [5]
>>
>> where levels=c(a,b,c)
>>
>> I'd like to extract a data.frame like
>>
>> a [2]
>> b [2]
>> c [2]
>>
>> What I tried, is a function like this:
>>
>> library(foreach)
>> loopExtract <- function(input) {
>> foreach(i=1:length(levels)) %do% {listResult[[i]]$input}
>> ...
>>
>> where the name of the variable [2] is meant to be the input.
>>
>> Unfortunately it turned out, that the "input"-variable after the $ is
>> not substituted as expected. Subscripting the list with a variable after
>> the $ seems not possible.
>> The only workaround I found for that is a function like
>>
>> loopExtract <- function(input) {
>> codetemplate <- as.character("result <- foreach(i=1:length(levels)) %do%
>> {listResult[[i]]$input)}")
>> require(stringr)
>> write(str_replace_all(codetemplate, "input", input), file="tmp.r")
>> source("tmp.r")
>> return(result)
>> }
>>
>> in other words I stored a template of the desired code as characters in
>> an object, substituted the "input" string in that character object,
>> wrote the result to a file and sourced the code of that file.
>>
>> I stored the result in a file cause the expression source(codetemplate)
>> did not work. From the documentation I learned, that one can only source
>> "connections". And there seems no chance to establish a connection to an
>> object. (Probably no one else, except me, has such a strange idea.)
>>
>> Well, it works that way, but even though I am not a real programmer, I
>> realize how dirty this solution is. There must be a much simpler and
>> elegant way.
>>
>> To sum it all up, my problem can be reduced to the following two questions
>>
>> (1) How to subscript a list with a variable after $
>> (2) How to source from an object containing a code template
>> or (probably the most promising) (3) How to avoid both ;)
>>
>> Unfortunately I am not experienced enough to find the solution. Please
>> help!
>>
>> Greetings from sunny Munich, Felix
>>
>>
>



[R] How to use a variable after $ ($x) for subscripting a list, How to source from a character object (source(x))

2011-09-13 Thread drflxms
Dear R colleagues,

as result of a function a huge list is generated. From this result list
I'd like to extract information.

Think of the list i.e. as an object named "listResult" with the
following form:

[[a]]
 [1] [2] [3] [4] [5]

[[b]]
 [1] [2] [3] [4] [5]

[[c]]
 [1] [2] [3] [4] [5]

where levels=c(a,b,c)

I'd like to extract a data.frame like

a [2]
b [2]
c [2]

What I tried, is a function like this:

library(foreach)
loopExtract <- function(input) {
foreach(i=1:length(levels)) %do% {listResult[[i]]$input}
...

where the name of the variable [2] is meant to be the input.

Unfortunately it turned out that the "input" variable after the $ is
not substituted as expected. Subscripting the list with a variable after
the $ seems not to be possible.
The only workaround I found for that is a function like

loopExtract <- function(input) {
codetemplate <- as.character("result <- foreach(i=1:length(levels)) %do%
{listResult[[i]]$input)}")
require(stringr)
write(str_replace_all(codetemplate, "input", input), file="tmp.r")
source("tmp.r")
return(result)
}

in other words I stored a template of the desired code as characters in
an object, substituted the "input" string in that character object,
wrote the result to a file and sourced the code of that file.

I stored the result in a file because the expression source(codetemplate)
did not work. From the documentation I learned that one can only source
"connections". And there seems to be no way to establish a connection to
an object. (Probably no one else, except me, has such a strange idea.)

Well, it works that way, but even though I am not a real programmer, I
realize how dirty this solution is. There must be a much simpler and
more elegant way.

To sum it all up, my problem can be reduced to the following two questions:

(1) How to subscript a list with a variable after $
(2) How to source from an object containing a code template
or (probably the most promising) (3) How to avoid both ;)
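
On question (2), two standard alternatives to the temp-file detour (a
sketch; whether they fit the foreach construction above is untested):

code <- "result <- 1 + 1"
eval(parse(text = code))      # parse and evaluate the string directly
source(textConnection(code))  # or source it through a text connection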

Unfortunately I am not experienced enough to find the solution. Please help!

Greetings from sunny Munich, Felix



[R] how to sort the levels of a table

2011-08-16 Thread drflxms
Dear colleagues,

I have really heavy problems sorting the results of a table according
to certain features of the levels in the table.

Prerequisites:

It all starts with a fairly simple data set, which stores the observations
of 21 observers (horizontally from 1 to 21; 21 is the
reference/goldstandard) who diagnosed 42 videos (vertically from 1 to
42). See the dump of the R object "input" below in case you are interested
in the original data.
Each observation is a series of 1s and 0s (of different length!) which was
obtained by concatenating the answers of a multiple-choice questionnaire.

I created a confusion matrix (table) displaying the observations of the
20 observers (nr. 1 to 20) versus the reference (nr. 21) via the
following code:

## observations of the observers
obsValues <- factor(unlist(input[,-c(1,ncol(input))]))
## observations of the reference
refValues <- factor(input[,ncol(input)])
## data.frame that relates observations of observers and reference
RefObs <- data.frame(refValues, obsValues)
mtx <- table(
    RefObs$obsValues,
    RefObs$refValues,
    dnn=c("observers", "reference"),
    useNA=c("always")
)

And now the problem:

I need to sort the levels/classes of the table. Both axes shall be
ordered 1st according to how often 1 occurs in the string that codes the
observation and 2nd by treating the observation string (formed of 0s and
1s) as a number.

e.g. "1" "0" "1010" "1" "100" "11" should be sorted as
"0" "1" "1" "100" "11" "1010"

I am aware of the order function and tried a command of the form
mtx[order(row1stsort,row2ndsort),order(col1stsort,col2ndsort)].
Sorting e.g. the levels of the observers and the reference works well
this way:

## sorted levels generated by the reference choices.
refLevels <- unique(input[,ncol(input)])[order(
    as.numeric(sapply(unique(input[,ncol(input)]), FUN=function(x)
        sum(as.numeric(unlist(strsplit(x,"")))))),
    as.numeric(as.vector(unique(input[,ncol(input)])))
)]
## sorted levels generated by the observers choices.
obsLevels <- unique(unlist(input[,-c(1,ncol(input))]))[order(
    as.numeric(sapply(unique(unlist(input[,-c(1,ncol(input))])), FUN=function(x)
        sum(as.numeric(unlist(strsplit(x,"")))))),
    as.numeric(as.vector(unique(unlist(input[,-c(1,ncol(input))]))))
)]

Unfortunately this code does not do the trick in the case of the table
mtx (see above).

At least I was not successful with the following code:

mtx <- table(
    RefObs$obsValues,
    RefObs$refValues,
    dnn=c("observers", "reference"),
    useNA=c("always")
)[order(
    ## sum of digits of the classes chosen by the observers: 1st key of
    ## vertical (row) sorting. Structural missings coded as "w" become NA.
    as.numeric(sapply(as.vector(unique(RefObs$obsValues)), FUN=function(x)
        sum(as.numeric(replace(unlist(strsplit(x,"")), which(x=="w"), NA))))),
    ## the classes chosen by the observers as numbers: 2nd key of vertical
    ## (row) sorting.
    as.numeric(as.vector(unique(RefObs$obsValues)))
),
order(
    ## sum of digits of the classes chosen by the reference: 1st key of
    ## horizontal (column) sorting. Structural missings coded as "w" become NA.
    as.numeric(sapply(as.vector(unique(RefObs$refValues)), FUN=function(x)
        sum(as.numeric(replace(unlist(strsplit(x,"")), which(x=="w"), NA))))),
    ## the classes chosen by the reference as numbers: 2nd key of
    ## horizontal (column) sorting.
    as.numeric(as.vector(unique(RefObs$refValues)))
)]

Any suggestions? I'd appreciate any kind of help very much!
Greetings from Munich, Felix



dump of data.frame:
options(width=180) # to view it as a whole

input <- structure(list(video = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
"20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30",
"31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41",
"42"), `1` = c("110", "0", "0", "1", "0", "0", "0", "1",
"0", "0", "0", "1", "0", "100", "1110", "10110", "110",
"1100", "0", "1000", "11100", "0", "11000", "0", "0", "0", "1110",
"0", "0", "10110", "1000", "10010", "10001", "1", "100",
"0", "100", "111", "1000", "0", "0", "0"), `2` = c("0", "0",
"0", "0", "0", "0", "0", "0", "0", "0", "100", "0", "0", "0",
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",
"0", "0"), `3` = c("0", "0", "0", "0", "0", "0", "0", "0", "0",
"0", "110", "0", "0", "0", "0", "0", "10", "0", "0", "0", "0",
"0", "0", "1000", "0", "0", "0", "0", "0", "0", "0", "0", "0",
"0", "0", "0", "0", "0", "0", "0", "0", "1100"), `4` = c("100",
"10100", "10100", "10010", "10100", "1", "0", "10100", "1000",
"1", "10100", "1", "10100", "1", "10100", "1",
"1", "1", "0", "11000", "11000", "0", "1000", "1000",
"0", "0", "0", "0", "1", "1", "1", "1", "10100",
"1", "0", "0", "

Re: [R] sum of digits or how to slice a number into its digits

2011-03-04 Thread drflxms
Hi Dimitris,

thank you very much for your quick and efficient help! Your solution is
perfect for me. It does exactly what I was looking for if combined with
unlist and as.numeric before using sum.
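
A sketch of that combination (summing the digits directly instead of
counting the 1s, so it would also cover digits other than 0 and 1):

x <- c("100100110", "1001001", "1101", "00101")
sapply(strsplit(x, ""), function(d) sum(as.numeric(d)))
# 4 3 3 2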

Now I can keep on with my real problem ;)...
Thanx Again!!!

Best, Felix

On 04.03.2011 14:25, Dimitris Rizopoulos wrote:
> one way is using function strsplit(), e.g.,
> 
> x <- c("100100110", "1001001", "1101", "00101")
> sapply(strsplit(x, ""), function (x) sum(x == 1))
> 
> 
> I hope it helps.
> 
> Best,
> Dimitris
> 
> 
> On 3/4/2011 2:18 PM, drflxms wrote:
>> Dear R colleagues,
>>
>> I face a seemingly simple problem I couldn't find a solution for myself
>> so far:
>>
>> I have to sum the digits of numbers. Example: 1010 -> 2, 100100110 -> 4.
>> Unfortunately there seems to be no function for this task. So my idea
>> was to use sum(x) for it. But I did not figure out how to slice a number
>> into a vector of its digits. Example (continued from above): 1010 ->
>> c(1,0,1,0), 100100110 -> c(1,0,0,1,0,0,1,1,0).
>>
>> Does anyone know either a function for calculating the sum of the digits
>> of a number, or how to slice a number into a vector of its digits as
>> described above?
>>
>> I'd appreciate any kind of help very much!
>> Thanx in advance and greetings from cloudy Munich,
>> Felix
>>
>>
>



[R] sum of digits or how to slice a number into its digits

2011-03-04 Thread drflxms
Dear R colleagues,

I face a seemingly simple problem I couldn't find a solution for myself
so far:

I have to sum the digits of numbers. Example: 1010 -> 2, 100100110 -> 4.
Unfortunately there seems to be no function for this task. So my idea
was to use sum(x) for it. But I did not figure out how to slice a number
into a vector of its digits. Example (continued from above): 1010 ->
c(1,0,1,0), 100100110 -> c(1,0,0,1,0,0,1,1,0).

Does anyone know either a function for calculating the sum of the digits
of a number, or how to slice a number into a vector of its digits as
described above?

I'd appreciate any kind of help very much!
Thanx in advance and greetings from cloudy Munich,
Felix



[R] selecting only corresponding categories from a confusion matrix

2010-11-29 Thread drflxms
Dear R colleagues,

as a result of my calculations regarding the inter-observer-variability
in bronchoscopy, I get a confusion matrix like the following:

        0    1  1001  1010  11
0     609   11    54    36   6
1       1    2     6     0   2
101     4    0     0     8   4
100     4    0     0     0   0
1000   23    7    12    10   5
1001    0    0     4     0   0
1010    4    0     0     3   0
1011    1    0     1     0   2
11      0    0     3     3   1
1101    0    0     0     0   0
1100    2    0     0     0   0
1110    1    0     0     0   0

The first column represents the categories found among observers, the
top row represents the categories found by the reference ("goldstandard").
I am looking for a way (a general algorithm) to extract a data.frame with
only the corresponding categories among observers and reference from the
above confusion matrix. "Corresponding" means in this case that a
category has been chosen by both observers and reference.
In this example the corresponding categories are simply all categories
that have been chosen by the reference (0, 1, 1001, 1010, 11), but in
general there might also be categories which are found by the reference
only (and not among observers, i.e. not in the first column).
So the solution-dataframe for the above example would look like:

        0    1  1001  1010  11
0     609   11    54    36   6
1       1    2     6     0   2
1001    0    0     4     0   0
1010    4    0     0     3   0
11      0    0     3     3   1

All the categories found among observers only, were omitted.

If the solution algorithm included a method to list the omitted
categories and to count their number as well as the number of omitted
cases, it would be just perfect for me.

I'd be happy to read from you soon! Thanks in advance for any kind of
help with this.
Greetings from snowy Munich, Felix
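
(For later readers, a sketch of one possible approach; it assumes the
matrix is available as a matrix or table "cm" with the observer
categories as row names and the reference categories as column names,
and it is not from the original thread:)

keep    <- rownames(cm) %in% colnames(cm)  # categories chosen by both sides
matched <- cm[keep, , drop = FALSE]        # corresponding categories only
omitted <- rownames(cm)[!keep]             # categories found among observers only
length(omitted)                            # number of omitted categories
sum(cm[!keep, ])                           # number of omitted cases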

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] can't load rJava in R 2.8.1 on Windows XP

2009-01-23 Thread drflxms
Dear Mr. Murdoch, Dear Mr. Ripley, Dear Mr. Wang,

thank you very much for your quick and efficient help! It is exactly as
Duncan explained: including jvm.dll's directory in PATH solved the problem
immediately. Everything works fine now.
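
(For later readers, a hypothetical way to make jvm.dll visible for the
current R session only; the JRE path below is an assumption and varies
by Java version and installation:)

jvm.dir <- "C:/Programme/Java/jre6/bin/client"              # hypothetical path
Sys.setenv(PATH = paste(jvm.dir, Sys.getenv("PATH"), sep = ";"))
library(rJava)                                              # should now find jvm.dll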

The only thing is that I do not understand why I had to do this
manually, as I installed the Java Runtime Environment a long time ago,
updated it frequently and never encountered any difficulties in using
Java software.

Anyway, I am happy now :-)
Thanx again for your great support!
Greetings from Munich, Germany,

Felix

Duncan Murdoch schrieb:
> On 1/23/2009 7:38 AM, drflxms wrote:
>> Dear community,
>>
>> unfortunately I did not manage to load the rJava package, receiving the
>> following error message:
>> error-message:
>>
>>
>>> library("rJava")
>> Error in inDL(x, as.logical(local), as.logical(now), ...) : kann
>> shared library 'C:/Programme/R/2.8.1/library/rJava/libs/rJava.dll' nicht
>> laden:
>> LoadLibrary failure:  Das angegebene Modul wurde nicht gefunden.
>>
>> Error : .onLoad in 'loadNamespace' für 'rJava' fehlgeschlagen
>> Fehler: Laden von Paket/Namensraum für 'rJava' fehlgeschlagen
>>
>>
>> Translation: can't load library ... rJava.dll
>> LoadLibrary failure: the module was not found
>
> That message comes from Windows, not R, and it's misleading.  It does
> not say that rJava.dll was not found, it says that a DLL needed by it
> is not found.  It would be helpful if it told you which one.  You
> should complain to Microsoft about it.  If rJava.dll had been missing,
> the English message would have been
>
>   shared library 'rJava' not found
>
> The pedump utility (in the Rtools set, see
> www.murdoch-sutherland.com/Rtools) can tell you what the dependencies
> are:
>
> pedump -i rJava.dll
>
> shows that it imports things from these dlls:
>
>  R.dll
>  KERNEL32.dll
>  msvcrt.dll
>  jvm.dll
>
> The first 3 are routine; without those R wouldn't work. (Without
> KERNEL32.dll, nothing in Windows would work.)  So as Brian said, it's
> likely jvm.dll that it can't find, or possibly a DLL that it depends on.
> Did you install Java first, as rJava requires?
>
> Duncan Murdoch
>
>>
>> Reinstalling the package did not help; installing the latest
>> development version didn't help either.
>>
>> The shared library rJava.dll is in place (exactly where R is looking
>> for it)!
>>
>> Any ideas what's wrong?
>> I'd appreciate any kind of help very much, as I need rJava urgently
>> to use RWeka and iPlots.
>>
>> Best regards,
>> Felix
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] can't load rJava in R 2.8.1 on Windows XP

2009-01-23 Thread drflxms
Dear community,

unfortunately I did not manage to load the rJava package, receiving the
following error message:


> library("rJava")
Error in inDL(x, as.logical(local), as.logical(now), ...) : 
kann shared library 'C:/Programme/R/2.8.1/library/rJava/libs/rJava.dll' nicht
laden:
LoadLibrary failure:  Das angegebene Modul wurde nicht gefunden.

Error : .onLoad in 'loadNamespace' für 'rJava' fehlgeschlagen
Fehler: Laden von Paket/Namensraum für 'rJava' fehlgeschlagen


Translation: 
can't load library ... rJava.dll
LoadLibrary failure: the module was not found

Reinstalling the package did not help; installing the latest development
version didn't help either.

The shared library rJava.dll is in place (exactly where R is looking for it)!

Any ideas what's wrong?
I'd appreciate any kind of help very much, as I need rJava urgently to use 
RWeka and iPlots.

Best regards,
Felix

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to address last and all but last column in dataframe

2008-09-06 Thread drflxms
Hello Mr. Burns, hello Mr. Winsemius,

thank you very much for your incredibly quick and efficient reply! I was
completely successful with the following command:

pairs<-data.frame(pred=factor(unlist(input[,-c(1,ncol(input))])),ref=factor(input[,ncol(input)]))

In case of the "input" example data.frame I sent with my question the
above code is equivalent to:
 
pairs<-data.frame(pred=factor(unlist(input[2:21])),ref=factor(input[,22]))

Great! That is exactly what I was looking for! This simple code will
save me hours!

Patrick, your book looks in fact very interesting and will be my perfect
reading material for the following nights :-) (probably not only the
first chapter ;-). Thanks for the hint - and the free book of course.

David, the "input" data.frame is the result of the reshape-command I
performed; I just copied it from the R console into the e-mail. In fact
the first column "video" is not part of the data, but is needed for
analysis with the kappam.fleiss function of the irr package. Sorry, you are
absolutely correct, I should have mentioned this in my question. I will
improve when I ask my next question :-).

Again I like to thank you for your help and wish you a pleasant Sunday.
Greetings from Munich,
Felix

Patrick Burns wrote:
> If I understand properly, you want
>
> input[, -c(1, ncol(input))]
>
> rather than
>
> input[[, -c(video, 21)]]
>
> Chapter 1 of S Poetry might be of interest to you.
>
> Patrick Burns
> [EMAIL PROTECTED]
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
> drflxms wrote:
>> Dear R-colleagues,
>>
>> another question from a newbie: I am creating a lot of simple
>> pivot-charts from my raw data using the reshape-package. In these charts
>> we have medical doctors judging videos in the columns and the videos
>> they judge in the rows. Simple example of chart/data.frame "input" with
>> two categories 1/0:
>>
>> video 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
>>
>> 1  1 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>> 2  2 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
>> 3  3 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>> 4  4 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>> 5  5 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  1  0
>> 6  6 0 0 0 0 0 0 0 0 0  0  0  0  0  1  0  0  0  0  0  0  0
>> 7  7 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>> 8  8 0 0 0 0 0 0 0 0 0  0  0  0  0  0  1  0  0  0  0  0  0
>> 9  9 0 0 0 0 0 0 0 0 0  1  0  1  1  0  1  1  0  0  0  1  0
>> 10 10 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
>>
>> I recently learned, that I can easily create a confusion matrix out of
>> this data using the following commands:
>>
>> pairs<-data.frame(pred=factor(unlist(input[2:21])),ref=factor(input[,22]))
>>
>> pred<-pairs$pred
>> ref <- pairs$ref
>> library (caret)
>> confusionMatrix(pred, ref, positive=1)
>>
>> - where column 21 is the reference/goldstandard.
>>
>> My problem now is that I analyse data.frames with an unknown number of
>> columns. So to get rid of the first and last column for the "pred"
>> variable and to select the last column for the "ref" variable, I have to
>> look at the data.frame before running the above commands, to set the
>> proper column numbers.
>>
>> It would be very comfortable, if I could address the last column not by
>> number (where I have to count beforehand) but by a variable "last
>> column".
>>
>> Probably there is an easier solution to this problem using the names
>> of the columns as well: the reference is always number "21"; the first
>> column is always called "video". So I tried:
>>
>> attach(input)
>> pairs<-data.frame(pred=factor(unlist(input[[,-c(video,21)]])),ref=factor(input[[21]]))
>>
>>
>> which does not work unfortunately :-(.
>>
>> I'd be very happy in case someone could help me out, cause I am really
>> tired of counting - there are a lot of tables to analyse...
>>
>> Cheers and greetings from Munich,
>> Felix
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] how to address last and all but last column in dataframe

2008-09-06 Thread drflxms
Dear R-colleagues,

another question from a newbie: I am creating a lot of simple
pivot-charts from my raw data using the reshape-package. In these charts
we have medical doctors judging videos in the columns and the videos
they judge in the rows. Simple example of chart/data.frame "input" with
two categories 1/0:

video 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

1  1 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
2  2 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
3  3 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
4  4 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
5  5 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  1  0
6  6 0 0 0 0 0 0 0 0 0  0  0  0  0  1  0  0  0  0  0  0  0
7  7 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
8  8 0 0 0 0 0 0 0 0 0  0  0  0  0  0  1  0  0  0  0  0  0
9  9 0 0 0 0 0 0 0 0 0  1  0  1  1  0  1  1  0  0  0  1  0
10 10 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0

I recently learned, that I can easily create a confusion matrix out of
this data using the following commands:

pairs<-data.frame(pred=factor(unlist(input[2:21])),ref=factor(input[,22]))
pred<-pairs$pred
ref <- pairs$ref
library (caret)
confusionMatrix(pred, ref, positive=1)

- where column 21 is the reference/goldstandard.

My problem now is that I analyse data.frames with an unknown number of
columns. So to get rid of the first and last column for the "pred"
variable and to select the last column for the "ref" variable, I have to
look at the data.frame before running the above commands, to set the
proper column numbers.

It would be very comfortable, if I could address the last column not by
number (where I have to count beforehand) but by a variable "last column".

Probably there is an easier solution to this problem using the names
of the columns as well: the reference is always number "21"; the first
column is always called "video". So I tried:

attach(input)
pairs<-data.frame(pred=factor(unlist(input[[,-c(video,21)]])),ref=factor(input[[21]]))

which does not work unfortunately :-(.

I'd be very happy in case someone could help me out, cause I am really
tired of counting - there are a lot of tables to analyse...
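
(For later readers, a minimal sketch of the indexing idiom from the reply
above, with "input" as the hypothetical data frame:)

last  <- ncol(input)                           # index of the last column
ref   <- factor(input[, last])                 # reference/goldstandard column
pred  <- factor(unlist(input[, -c(1, last)]))  # drop "video" and reference
pairs <- data.frame(pred = pred, ref = ref)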

Cheers and greetings from Munich,
Felix

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Free SQL Database with R

2008-09-03 Thread drflxms
Hello Chibisi,

you might be looking for something like "Rpad"
(http://www.rpad.org/Rpad/). There are some other systems like
SNetscape, Rserve, RSOAP, R.NET.Web as well. Unfortunately I have no
personal experience with these systems so far, so I can't give you any
real advice. I just know that they exist and have surfed some test pages.

To me Rpad looks like probably the most feasible system for the task you
describe (also from a cost vs. benefit point of view). As far as I
understand/remember, users need no unusual prerequisites to use Rpad.

For a general overview of GUIs see i.e. http://www.sciviews.org/_rgui/.

According to your latest description of the database associated with the
webpage, SQLite will probably suffice and might be easier to handle than
a whole MySQL server - as Hadley Wickham already proposed in his e-mail.
But all of this of course depends on your resources as well (i.e. do you
have access to your own root server, or just some megabytes of ordinary
webspace...?).
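
(For later readers, a minimal RSQLite sketch; the file and table names
are hypothetical. SQLite needs no server - the database lives in a
single file:)

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "site.db")              # hypothetical file
dbWriteTable(con, "votes", data.frame(party = c("spd", "cdu"), n = c(2, 1)))
dbGetQuery(con, "SELECT party, n FROM votes")               # read it back
dbDisconnect(con)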

Kind regards and greetings from Munich,
Felix.

Chibisi Chima-Okereke schrieb:
> Dear Felix,
>
> Thanks for the reply,
>
> If you haven't already guessed I am new to web programming.
>
> The sort of webpage I want to build is one that presents quantitative
> information in graphs and charts, that people can interact with, e.g.
> select parts of charts to zoom into, highlight values, click buttons
> to do analysis on the data displayed, so yes some sort of interactive
> GUI. I initially thought of using flash as a front end but I don't
> know any actionscript, so learning that would to a suitable standard
> take alot of extra time, and I think it would be best if everything
> could be done in R as much as possible.
>
> If I used an RGUI I guess I would be using the playwith package? Do
> the consumers of the website need to have R to consume stuff displayed
> with an RGUI?
>
> The database itself would just be pretty static just being queried for
> information, unless some analysis was required in which case R would
> query the database do the analysis and write the analysis back to the
> database (I guessing that is the way it would be done), before it gets
> displayed on the web page.
>
> Kind Regards
>
> Chibisi
>
> On Wed, Sep 3, 2008 at 11:39 AM, drflxms <[EMAIL PROTECTED]> wrote:
>
> Hello Chibisi,
>
> I am not sure whether I completely understand your needs: Do you want
> to build a webpage which relies on a content management system
> (cms)? Do
> you want to collect data (i.e. survey) which later on shall be
> analysed
> using R? Or shall it be a webpage with an interactive R GUI? What
> else?
>
> But personally I would prefer MySQL as backend for websites, as most
> professional (opensource) cms (i.e. typo3, wordpress etc.) are created
> with MySQL in mind.
> There is a good interface between R and MySQL called RMySQL as well. I
> use this on a daily basis, as all my data is stored in a local MySQL
> database (more flexible than always reading in text files - at
> least in
> my opinion).
>
> Hope this personal view might help a little bit.
>
> Cheers,
> Felix
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Free SQL Database with R

2008-09-03 Thread drflxms
Hello Chibisi,

I am not sure whether I completely understand your needs: Do you want
to build a webpage which relies on a content management system (cms)? Do
you want to collect data (e.g. a survey) which shall later on be analysed
using R? Or shall it be a webpage with an interactive R GUI? What else?

But personally I would prefer MySQL as a backend for websites, as most
professional (open source) cms (e.g. Typo3, WordPress etc.) are created
with MySQL in mind.
There is a good interface between R and MySQL called RMySQL as well. I
use this on a daily basis, as all my data is stored in a local MySQL
database (more flexible than always reading in text files - at least in
my opinion).
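
(For later readers, a minimal RMySQL sketch; the connection details and
table name are hypothetical:)

library(RMySQL)
con <- dbConnect(MySQL(), dbname = "mydb", user = "me",
                 password = "secret", host = "localhost")
dat <- dbGetQuery(con, "SELECT * FROM votes")   # query result as a data.frame
dbDisconnect(con)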

Hope this personal view might help a little bit.

Cheers,
Felix

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] convenient way to calculate specificity, sensitivity and accuracy from raw data

2008-09-02 Thread drflxms
1  0
> 6  6 0 0 0 0 0 0 0 0 0  0  0  0  0  1  0  0  0  0  0  0  0
> 7  7 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 8  8 0 0 0 0 0 0 0 0 0  0  0  0  0  0  1  0  0  0  0  0  0
> 9  9 0 0 0 0 0 0 0 0 0  1  0  1  1  0  1  1  0  0  0  1  0
> 10 10 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 11 11 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 12 12 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 13 13 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 14 14 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 15 15 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 16 16 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 17 17 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 18 18 0 0 0 0 1 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
> 19 19 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 20 20 0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 21 21 0 0 0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0  1
> 22 22 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 23 23 0 1 0 0 1 0 1 0 0  1  0  0  1  1  0  0  1  0  0  0  0
> 24 24 0 0 0 0 0 0 0 0 0  0  0  0  1  1  1  1  0  1  0  0  1
> 25 25 0 0 0 0 0 0 0 0 0  0  0  1  0  0  1  1  0  0  0  0  0
> 26 26 0 0 0 0 0 0 0 0 0  0  0  1  0  0  0  0  0  0  0  0  0
> 27 27 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 28 28 0 1 0 1 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 29 29 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 30 30 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 31 31 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 32 32 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 33 33 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 34 34 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 35 35 0 0 0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 36 36 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 37 37 0 1 1 0 1 0 0 1 0  0  0  0  1  1  1  0  1  0  0  1  1
> 38 38 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
> 39 39 0 1 0 0 1 0 0 1 0  1  1  0  1  1  0  0  1  1  0  1  1
> 40 40 1 1 1 1 1 0 1 0 0  0  0  1  1  1  1  0  0  1  0  0  1
> 41 41 0 0 0 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0  1
> 42 42 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0"),
> header = TRUE)
> closeAllConnections()
>
> goldstand <- dat$X21                    # column 21 is the gold standard
> prev <- sum(goldstand)                  # number of positive cases
> cprev <- sum(!goldstand)                # number of negative cases
> n <- prev + cprev
> lapply(dat[-1], function(x){
> tab <- table(x, goldstand)              # rater vs. gold standard
> cS <- colSums(tab)
> if(nrow(tab) > 1 && ncol(tab) > 1) {
> out <- c(sp = tab[1,1], sn = tab[2,2]) / cS        # specificity, sensitivity
> c(out, ac = (out[1] * cprev + out[2] * prev) / n)  # overall accuracy
> }
> })
>
>
> I hope it helps.
>
> Best,
> Dimitris
>
> Quoting drflxms <[EMAIL PROTECTED]>:
>
>> Dear R-colleagues,
>>
>> this is a question from an R-newbie medical doctor:
>>
>> I am evaluating data on inter-observer-reliability in endoscopy. 20
>> medical doctors judged 42 videos filling out a multiple choice survey
>> for each video. The overall-data is organized in a classical way:
>> observations (items from the multiple choice survey) as columns, each
>> case (identified by the two columns "number of medical doctor" and
>> "number of video") in a row. In addition there is a medical doctor
>> number 21 who is assumed to be a gold-standard.
>>
>> As a measure of inter-observer agreement I calculated kappa according to
>> Fleiss and simple agreement in percent, using the routines
>> "kappam.fleiss" and "agree" from the irr package. Everything worked fine
>> so far.
>>
>> Now I'd like to calculate specificity, sensitivity and accuracy for each
>> item (compared to the gold-standard), as these are well-known and easy
>> to understand quantities for medical doctors.
>>
>> Unfortunately I haven't found a feasible way to do this in R so far. All
>> solutions I found describe the calculation of specificity, sensitivity and
>> accuracy from a contingency table / confusion matrix only. For me it is
>> very difficult to create such contingency tables / confusion matrices
>> from the raw data I have.
>>
>> So I started to do it in Excel by hand - a lot of work! If I keep
>> on doing this, I'll miss the deadline. So maybe someone can help me out:
>>
>> It would be very convenient, if there is a way to calculate specificity,
>> sensitivity and accuracy from the very same data.frames I created for

[R] convenient way to calculate specificity, sensitivity and accuracy from raw data

2008-09-01 Thread drflxms
Dear R-colleagues,

this is a question from an R-newbie medical doctor:

I am evaluating data on inter-observer-reliability in endoscopy. 20
medical doctors judged 42 videos filling out a multiple choice survey
for each video. The overall-data is organized in a classical way:
observations (items from the multiple choice survey) as columns, each
case (identified by the two columns "number of medical doctor" and
"number of video") in a row. In addition there is a medical doctor
number 21 who is assumed to be a gold-standard.

As a measure of inter-observer agreement I calculated kappa according to
Fleiss and simple agreement in percent, using the routines
"kappam.fleiss" and "agree" from the irr package. Everything worked fine
so far.

Now I'd like to calculate specificity, sensitivity and accuracy for each
item (compared to the gold-standard), as these are well-known and easy
to understand quantities for medical doctors.

Unfortunately I haven't found a feasible way to do this in R so far. All
solutions I found describe the calculation of specificity, sensitivity and
accuracy from a contingency table / confusion matrix only. For me it is
very difficult to create such contingency tables / confusion matrices
from the raw data I have.

So I started to do it in Excel by hand - a lot of work! If I keep
on doing this, I'll miss the deadline. So maybe someone can help me out:

It would be very convenient, if there is a way to calculate specificity,
sensitivity and accuracy from the very same data.frames I created for
the calculation of kappa and agreement. In these data.frames, which were
generated from the overall-data-table described above using the
"reshape" package, we have the judging medical doctor in the columns and
the videos in the rows. In the cells there are the coded answer-options
from the multiple choice survey. Please see an simple example with
answer-options 0/1 (copied from R console) below:

 video 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
1  1 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
2  2 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
3  3 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
4  4 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
5  5 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  1  0
6  6 0 0 0 0 0 0 0 0 0  0  0  0  0  1  0  0  0  0  0  0  0
7  7 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
8  8 0 0 0 0 0 0 0 0 0  0  0  0  0  0  1  0  0  0  0  0  0
9  9 0 0 0 0 0 0 0 0 0  1  0  1  1  0  1  1  0  0  0  1  0
10 10 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
11 11 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
12 12 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
13 13 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
14 14 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
15 15 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
16 16 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
17 17 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
18 18 0 0 0 0 1 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  1
19 19 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
20 20 0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
21 21 0 0 0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0  1
22 22 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
23 23 0 1 0 0 1 0 1 0 0  1  0  0  1  1  0  0  1  0  0  0  0
24 24 0 0 0 0 0 0 0 0 0  0  0  0  1  1  1  1  0  1  0  0  1
25 25 0 0 0 0 0 0 0 0 0  0  0  1  0  0  1  1  0  0  0  0  0
26 26 0 0 0 0 0 0 0 0 0  0  0  1  0  0  0  0  0  0  0  0  0
27 27 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
28 28 0 1 0 1 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
29 29 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
30 30 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
31 31 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
32 32 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
33 33 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
34 34 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
35 35 0 0 0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0  0
36 36 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
37 37 0 1 1 0 1 0 0 1 0  0  0  0  1  1  1  0  1  0  0  1  1
38 38 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0
39 39 0 1 0 0 1 0 0 1 0  1  1  0  1  1  0  0  1  1  0  1  1
40 40 1 1 1 1 1 0 1 0 0  0  0  1  1  1  1  0  0  1  0  0  1
41 41 0 0 0 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0  1
42 42 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0

What I did in Excel is: create the very same tables using pivot charts,
compare columns 1-20 to column 21 (gold standard), and sum up the count
of values identical to column 21. I repeated this for each answer
option. From the results one can easily calculate specificity,
sensitivity and accuracy.

How to do this, or something similar leading to the 
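
(For later readers, a minimal sketch of one way to compute these measures
directly from such a data frame; it assumes "dat" is laid out as above,
with the video id in column 1 and the gold standard in the last column,
and it is not the thread's original solution:)

gold  <- dat[, ncol(dat)]
stats <- sapply(dat[, -c(1, ncol(dat))], function(rater) {
  c(sensitivity = sum(rater == 1 & gold == 1) / sum(gold == 1),
    specificity = sum(rater == 0 & gold == 0) / sum(gold == 0),
    accuracy    = mean(rater == gold))
})
round(t(stats), 2)   # one row per rater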

Re: [R] simple generation of artificial data with defined features

2008-08-24 Thread drflxms
Hello all,

beside saying again thank you for your help, I'd like to present the
final solution of my problem and the results of the kappa-calculation:

> election.2005 <- c(16194,13136,3494,3838,4648,4118)
#data obtained via the GENESIS database of the "Statistisches Bundesamt",
www.destatis.de
#simply cut off the last 3 digits because of the limited computing power of my laptop
> attr(election.2005, "class") <- "table"
> attr(election.2005, "dim") <- c(1,6)
> attr(election.2005, "dimnames") <- list(c("votes"), c(1,2,3,4,5,6))
#used numbers instead of party names for easier handling later on
#1=spd,2=cdu,3=csu,4=gruene,5=fdp,6=pds
> head(election.2005)
  [,1]  [,2] [,3] [,4] [,5] [,6]
[1,] 16194 13136 3494 3838 4648 4118
> el.dt <- as.data.frame(election.2005) #convert the table to a frequency data.frame
#replicate rows according to the frequency table:
> el.dt.exp <- el.dt[rep(1:nrow(el.dt), el.dt$Freq), -ncol(el.dt)]
> el.dt.exp$id=seq(1:nrow(el.dt.exp)) #add voter id
> el.dt.exp$year=2005 #add column with year of election
# remove a column we don't need:
> el.dt.exp<-subset(el.dt.exp, select=-c(Var1))
> dim(el.dt.exp)
[1] 45428 3
> head(el.dt.exp)
     Var2 id year
1       1  1 2005
1.1     1  2 2005
1.2     1  3 2005
1.3     1  4 2005
1.4     1  5 2005
1.5     1  6 2005
> el.dt.exp<-as.data.frame(el.dt.exp, row.names=seq(1:nrow(el.dt.exp)))
# get rid of the unusual numbering of rows
> head(el.dt.exp)
  Var2 id year
1    1  1 2005
2    1  2 2005
3    1  3 2005
4    1  4 2005
5    1  5 2005
6    1  6 2005
> summary(el.dt.exp)
 Var2            id             year
 1:16194   Min.   :    1   Min.   :2005
 2:13136   1st Qu.:11358   1st Qu.:2005
 3: 3494   Median :22715   Median :2005
 4: 3838   Mean   :22715   Mean   :2005
 5: 4648   3rd Qu.:34071   3rd Qu.:2005
 6: 4118   Max.   :45428   Max.   :2005

Var2 is of type character, which is inconvenient for further processing,
so I changed its type to numeric in the data editor, using fix(el.dt.exp).

#create the dataframe for the calculation of kappa
> library(reshape)
> el.dt.exp.molten<-melt(el.dt.exp, id=c(2,3), na.rm=FALSE)
> kappa.frame<-cast(el.dt.exp.molten, year ~ id)
> dim(kappa.frame)
[1] 1 45429
#calculate kappa
> library(irr)
> kappam.fleiss(kappa.frame, exact=FALSE, detail=TRUE)
 Fleiss' Kappa for m Raters

 Subjects = 1
   Raters = 45428
Kappa = -2.2e-05

z = -1.35
  p-value = 0.176

   Kappa  z p.value
1  0.000 -0.707   0.479
2  0.000 -0.707   0.479
3  0.000 -0.707   0.479
4  0.000 -0.707   0.479
5  0.000 -0.707   0.479
6  0.000 -0.707   0.479

What a surprise! So Greg was absolutely right that this is probably not
a good example for Kappa. But still a very interesting one, if you ask me!

My theory: Kappa doesn't simply express agreement. As far as I learned
from the Handbook of Inter-Rater Reliability (Gwet, Kilem 2001; STATAXIS
Publishing Company; www.stataxis.com), Kappa tries to measure how
different an observed agreement is from an agreement that arises from
chance.
So in this case this probably means that the results of the 2005
election are not significantly different from results that could have
arisen by chance.
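
(For later readers: in formula form, kappa's standard definition is
kappa = (P_obs - P_chance) / (1 - P_chance), i.e. the agreement achieved
beyond chance as a fraction of the maximum possible agreement beyond
chance. With P_obs roughly equal to P_chance, as in this election
example, kappa comes out close to 0.)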

Anyway I personally learned a very interesting lesson about Kappa and R.
Thank you all for your professional and quick help to a newbie!
Greetings from Munich,

Felix

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] simple generation of artificial data with defined features

2008-08-24 Thread drflxms
Hi Christoph,

perfect! Your code worked out of the box (copy and paste ;-). I had
expected at least some lines of code, but this is really easy!

So once you get used to the command line, this is much more flexible (and
comfortable!) than all these coloured windows. I can't tell you how happy
I am that I seem to be getting away from this terrible SPSS licence
hassle. Taking into account that these are my first weeks with R, that
I am learning (My)SQL in parallel (using RMySQL), and that I knew not too
much (not to say nothing) about statistics before (like nearly all
medical doctors...), I wouldn't even say the learning curve is too steep.

Anyway thank you for your quick and efficient help for a newbie!

One of the reasons for my delayed answer is some trouble I experienced
in the following steps. I'll write a concluding e-mail to all about that
soon.
Greetings from Munich,

Felix

Christoph Meyer schrieb:
> Hi,
>
> to add voter.id and election.year to your data frame you could try:
>
> el.dt.exp$voter.id=seq(1:nrow(el.dt.exp))
>
> el.dt.exp$election.year=2005
>
> Cheers,
>
> Christoph Meyer
>
>
> ***
> Dr. Christoph Meyer
> Institute of Experimental Ecology
> University of Ulm
> Albert-Einstein-Allee 11
> D-89069 Ulm
> Germany
> Phone:  ++49-(0)731-502-2675
> Fax:++49-(0)731-502-2683
> Mobile: ++49-(0)1577-156-7049
> E-mail: [EMAIL PROTECTED]
> http://www.uni-ulm.de/index.php?id=7885
> ***
>
> Saturday, August 23, 2008, 1:25:05 PM, you wrote:
>
>   
>> Dear Mr. Christos Hatzis,
>> 
>
>   
>> thank you so much for your answer which is in my eyes just brilliant! I
>> followed it step by step (great and detailed explanation) and nearly
>> everything is fine. - Except a problem in the very end, I haven't found
>> a solution for until now. (Despite playing around quite a lot...)
>> Please let me explain:
>> 
>
>   
>>> election.2005 <- c(16194,13136,3494,3838,4648,4118) #cut of last 3
>>>   
>> digits, cause my laptop can't handle millions of rows...
>> 
>>> attr(election.2005, "class") <- "table"
>>> attr(election.2005, "dim") <- c(1,6)
>>> attr(election.2005, "dimnames") <- list(c("votes"), c("spd", "cdu",
>>>   
>> "csu", "gruene", "fdp", "pds"))
>> 
>>> head(election.2005)
>>>   
>> spd   cdu  csu gruene  fdp  pds
>> votes 16194 13136 3494   3838 4648 4118
>> 
>>> el.dt <- as.data.frame(election.2005)
>>> el.dt.exp <- el.dt[rep(1:nrow(el.dt), el.dt$Freq), -ncol(el.dt)]
>>> dim(el.dt.exp)
>>>   
>> [1] 45428 2
>> 
>>> head(el.dt.exp)
>>>   
>>  Var1 Var2
>> 1   votes  spd
>> 1.1 votes  spd
>> 1.2 votes  spd
>> 1.3 votes  spd
>> 1.4 votes  spd
>> 1.5 votes  spd
>> 
>
>   
>> My problem now is, that I would need either an autoincrementing
>> identifier instead of "votes" in Var1 or the possibility to access the
>> numbering by a column name (i.e. Var0). In addition I need a 3rd
>> Variable for the year oft the election (2005, which is the same for all,
>> but needed later on). So this is what it should look like:
>> 
>
>   
>>     voter.id party election.year
>> 1          1   spd          2005
>> 1.1        2   spd          2005
>> 1.2        3   spd          2005
>> 1.3        4   spd          2005
>> 1.4        5   spd          2005
>> 1.5        6   spd          2005
>> 
>
> ...
>
>
>
>
>
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] simple generation of artificial data with defined features

2008-08-23 Thread drflxms
Hello Mr. Greg Snow!

Thank you very much for your prompt answer.
> I don't think that the election data is the right data to demonstrate Kappa, 
> you need subjects that are classified by 2 or more different raters/methods.  
> The election data could be considered classifying the voters into which party 
> they voted for, but you only have 1 rater.
I think it should be possible to calculate kappa if one takes a slightly
different point of view from the one you described above: take
the voters as raters who "judge" the category "election" with one party
out of the six mentioned in my previous e-mail (which are simply the top
six).
This makes sense to me, because an election is somehow nothing else but
a survey with the question "who should lead our country" - given six
options in this example. As kappa is a measure of agreement, it should
be able to illustrate the agreement of the voters' answers to this question.
For me this is - in principle - no different from asking "Where is the
stenosis in the video of this endoscopy?" offering six options, each
representing an anatomic location.
> Otherwise you may want to stick with the sample datasets.
>   
The example data sets are of excellent quality and very interesting. I
am sure there would be brilliant examples among them. But I have to
admit that I do not have a good overview of the available datasets at
the moment (as a newbie). I just wanted to give an example out of
everyday life that everybody is familiar with. An election is something
which came to my mind spontaneously.
> There are other packages that compute Kappa values as well (I don't know if 
> others calculate this particular version), but some of those take the summary 
> data as input rather than the raw data, which may be easier if you just have 
> the summary tables.
>
>   
I chose Fleiss' Kappa because it is a more general form of Cohen's Kappa,
allowing m raters and n categories (instead of only two raters and two
categories with Cohen's kappa). Looking for another package that
calculates it from summary tables might be the simplest solution to my
problem. Thank you very much for this hint!
On the other hand it would be nice to use the very same method for the
example as for the "real" data. The example will be part of the
"methods" section.

Thank you again very much for your tips and the quick reply. Have a nice
weekend!
Greetings from Munich,

Felix Mueller-Sarnowski
>> -Original Message-
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of drflxms
>> Sent: Friday, August 22, 2008 6:12 AM
>> To: r-help@r-project.org
>> Subject: [R] simple generation of artificial data with
>> defined features
>>
>> Dear R-colleagues,
>>
>> I am quite a newbie to R fighting my stupidity to solve a
>> probably quite simple problem of generating artificial data
>> with defined features.
>>
>> I am conducting a study of inter-observer-agreement in
>> child-bronchoscopy. One of the most important measures is
>> Kappa according to Fleiss, which is very comfortable
>> available in R through the irr-package.
>> Unfortunately medical doctors like me don't really understand
>> much of statistics. Therefore I'd like to give the reader an
>> easy understandable example of Fleiss-Kappa in the Methods
>> part. To achieve this, I obtained a table with the results of
>> the German election from 2005:
>>
>> party    number of votes    percent
>>
>> SPD         16194665          34.2
>> CDU         13136740          27.8
>> CSU          3494309           7.4
>> Gruene       3838326           8.1
>> FDP          4648144           9.8
>> PDS          4118194           8.7
>>
>> I want to show the agreement of voters measured by
>> Fleiss-Kappa. To calculate this with the
>> kappam.fleiss-function of irr, I need a data.frame like this:
>>
>> (id of 1st voter) (id of 2nd voter)
>>
>> party spd cdu
>>
>> Of course I don't plan to calculate this with the millions of
>> cases mentioned in the table above (I am working on a small
>> laptop). A division by 1000 would be more than perfect for
>> this example. The exact format of the table is generally not
>> so important, as I could reshape nearly every format with the
>> help of the reshape-package.
>>
>> Unfortunately I could not figure out how to create such a
>> fictive/artificial dataset as described above. Any data.frame
>> would be nice, that keeps at least the percentage. String-IDs
>> of parties could be substituted by numbers of course (would
>> be even 

Re: [R] simple generation of artificial data with defined features

2008-08-23 Thread drflxms
Dear Mr. Christos Hatzis,

thank you so much for your answer, which is in my eyes just brilliant! I
followed it step by step (great and detailed explanation) and nearly
everything is fine - except for a problem at the very end that I haven't
found a solution for until now (despite playing around quite a lot...).
Please let me explain:

> election.2005 <- c(16194,13136,3494,3838,4648,4118) #cut off the last 3
digits, because my laptop can't handle millions of rows...
> attr(election.2005, "class") <- "table"
> attr(election.2005, "dim") <- c(1,6)
> attr(election.2005, "dimnames") <- list(c("votes"), c("spd", "cdu",
"csu", "gruene", "fdp", "pds"))
> head(election.2005)
spd   cdu  csu gruene  fdp  pds
votes 16194 13136 3494   3838 4648 4118
> el.dt <- as.data.frame(election.2005)
> el.dt.exp <- el.dt[rep(1:nrow(el.dt), el.dt$Freq), -ncol(el.dt)]
> dim(el.dt.exp)
[1] 45428 2
> head(el.dt.exp)
 Var1 Var2
1   votes  spd
1.1 votes  spd
1.2 votes  spd
1.3 votes  spd
1.4 votes  spd
1.5 votes  spd

My problem now is that I would need either an autoincrementing
identifier instead of "votes" in Var1, or the possibility to access the
numbering by a column name (i.e. Var0). In addition I need a 3rd
variable for the year of the election (2005, which is the same for all
rows, but needed later on). So this is what it should look like:

     voter.id party election.year
1           1   spd          2005
1.1         2   spd          2005
1.2         3   spd          2005
1.3         4   spd          2005
1.4         5   spd          2005
1.5         6   spd          2005

The reason for that is the input format of the kappam.fleiss function of
the irr package I use for the calculation. It accepts a data.frame with the
categories as rows (here we would have only one category: the year of the
election) and the raters (here the voters) as columns. In the data.frame
there will be the chosen party for each combination of election year and
voter.

This format can easily be achieved using the reshape package. Assuming
voter.id were an autoincrementing identifier, the commands should be:

>library(reshape)
>el.dt.exp.molten<-melt(el.dt.exp, id=c("voter.id")) #which would
probably not change anything in this case, because the data is
already in a "molten" form
>kappa.frame<-cast(el.dt.exp.molten, election.year ~ voter.id,
subset=variable=="party")

I'd be extremely happy in case you might help me out again!
Have a nice weekend and many thanks so far!
Greetings from Munich,

Felix Mueller-Sarnowski


Christos Hatzis wrote:
> On the general question on how to create a dataset that matches the
> frequencies in a table, function as.data.frame can be useful.  It takes as
> argument an object of a class 'table' and returns a data frame of
> frequencies.
>
> Consider for example table 6.1 of Fleiss et al (3rd Ed):
>
>   
>> birth.weight <- c(10,15,40,135)
>> attr(birth.weight, "class") <- "table"
>> attr(birth.weight, "dim") <- c(2,2)
>> attr(birth.weight, "dimnames") <- list(c("A", "Ab"), c("B", "Bb"))
>> birth.weight
>> 
>  B  Bb
> A   10  40
> Ab  15 135
>   
>> summary(birth.weight)
>> 
> Number of cases in table: 200 
> Number of factors: 2 
> Test for independence of all factors:
> Chisq = 3.429, df = 1, p-value = 0.06408
>   
>> bw.dt <- as.data.frame(birth.weight)
>> 
>
> Observations (rows) in this table can then be replicated according to their
> corresponding frequencies to yield the expanded dataset that conforms with
> the original table. 
>
>   
>> bw.dt.exp <- bw.dt[rep(1:nrow(bw.dt), bw.dt$Freq), -ncol(bw.dt)]
>> dim(bw.dt.exp)
>> 
> [1] 200   2
>   
>> table(bw.dt.exp)
>> 
> Var2
> Var1   B  Bb
>   A   10  40
>   Ab  15 135 
>
> The above approach is not restricted to 2x2 tables, and it should be
> straightforward to generate datasets that conform to arbitrary nxm frequency
> tables.
>
> -Christos Hatzis
>
>
>   
>> -Original Message-
>> From: [EMAIL PROTECTED] 
>> [mailto:[EMAIL PROTECTED] On Behalf Of Greg Snow
>> Sent: Friday, August 22, 2008 12:41 PM
>> To: drflxms; r-help@r-project.org
>> Subject: Re: [R] simple generation of artificial data with 
>> defined features
>>
>> I don't think that the election data is the right data to 
>> demonstrate Kappa, you need subjects that are classified by 2 
>> or more different raters/methods.  The 

[R] simple generation of artificial data with defined features

2008-08-22 Thread drflxms
Dear R-colleagues,

I am quite a newbie to R fighting my stupidity to solve a probably quite
simple problem of generating artificial data with defined features.

I am conducting a study of inter-observer-agreement in
child-bronchoscopy. One of the most important measures is Kappa
according to Fleiss, which is very comfortable available in R through
the irr-package.
Unfortunately medical doctors like me don't really understand much of
statistics. Therefore I'd like to give the reader an easily understandable
example of Fleiss-Kappa in the Methods part. To achieve this, I obtained
a table with the results of the German election from 2005:

party    number of votes    percent

SPD         16194665          34.2
CDU         13136740          27.8
CSU          3494309           7.4
Gruene       3838326           8.1
FDP          4648144           9.8
PDS          4118194           8.7

I want to show the agreement of voters measured by Fleiss-Kappa. To
calculate this with the kappam.fleiss function of irr, I need a
data.frame like this:

        (id of 1st voter)  (id of 2nd voter)

party   spd                cdu

Of course I don't plan to calculate this with the millions of cases
mentioned in the table above (I am working on a small laptop). A
division by 1000 would be more than perfect for this example. The exact
format of the table is generally not so important, as I could reshape
nearly every format with the help of the reshape package.

Unfortunately I could not figure out how to create such a
fictive/artificial dataset as described above. Any data.frame that keeps
at least the percentages would be nice. String IDs of parties could of
course be substituted by numbers (which would even be better for the
kappam.fleiss function in irr!).
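
(For later readers, a minimal sketch along the lines of the solutions
found elsewhere in this thread; the votes are scaled down by 1000 as
suggested above:)

votes  <- c(spd = 16194, cdu = 13136, csu = 3494,
            gruene = 3838, fdp = 4648, pds = 4118)
voters <- data.frame(voter.id = seq_len(sum(votes)),
                     party    = rep(names(votes), votes),
                     year     = 2005)
head(voters)   # one row per (artificial) voter, percentages preserved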

I would appreciate any kind of help very much indeed.
Greetings from Munich,

Felix Mueller-Sarnowski

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.