[R] confidence intervals for differences in proportions from complex survey design?
All:

I need to generate confidence intervals for differences in proportions using data from a complex survey design. An example follows where I attempt to estimate the difference in depression prevalence by sex.

# Data might look something like this:
Dfr <- data.frame(depression=sample(c("yes","no"), size=30, replace=TRUE),
                  sex=sample(c("M","F"), size=30, replace=TRUE),
                  cluster=rep(1:10, times=3),
                  stratum=rep(1:5, each=2, times=3),
                  pweight=runif(n=30, min=1, max=3))
Dfr

library(survey)
msdesign <- svydesign(id=~cluster, strata=~stratum, weights=~pweight,
                      nest=TRUE, data=Dfr)

# When searching online, one recommendation was to use svyglm() to generate an
# approximation as follows:
confint(with(Dfr, svyglm(I(depression=="yes")~sex,
                         family=gaussian(link="identity"), msdesign)),
        level=0.95, method="Wald")

This question has been asked before on the listserv (circa 2007) and I contacted the original poster, who indicated that they never received a reply. Here is the question as described by the original poster:

> I'm trying to get confidence intervals of proportions (sometimes for
> subgroups) estimated from complex survey data. Because a function like
> prop.test() does not exist for the survey package I tried the following:
>
> 1) Define a survey object (PSU of clustered sample, population weights);
> 2) Use svyglm() of the package survey to estimate a binary logistic
>    regression (family='binomial'): For the confidence interval of a single
>    proportion regress the binary dependent variable on a constant (1), for
>    confidence intervals of that variable for subgroups regress this
>    variable on the groups (factor) variable;
> 3) Use predict() to obtain estimated logits and the respective standard
>    errors (mod.dat specifying either the constant or the subgroups):
>
>    pred=predict(model,mod.dat,type='link',se.fit=T)
>
>    and apply the following to obtain the proportion with its confidence
>    intervals (for example, for conf.level=.95):
>
>    lo.e = pred[1:length(pred)]-qnorm((1+conf.level)/2)*SE(pred)
>    hi.e = pred[1:length(pred)]+qnorm((1+conf.level)/2)*SE(pred)
>    prop = 1/(1+exp(-pred[1:length(pred)]))
>    lo = 1/(1+exp(-lo.e))
>    hi = 1/(1+exp(-hi.e))
>
> I think that in that way I get CI's based on asymptotic normality - either
> for a single proportion or split up into subgroups.
>
> Question: Is this a correct or a defensible procedure? Or should I use a
> different approach?
>
> Note that this approach should also allow to estimate CI's for proportions
> of subgroups taking into account the complex survey design.

Thanks in advance for any help that you can provide.

Tony

--
Tony N. Brown, Ph.D.
Associate Chair and Associate Professor of Sociology
Google Scholar Profile: http://tinyurl.com/lozlht8
LinkedIn Profile: https://www.linkedin.com/pub/tony-nicholas-brown/a6/64/31a

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
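
For anyone finding this thread later, here is a rough sketch of the two ideas discussed above. It is not from the original posters and should be read as one possible approach under stated assumptions. The helper name `logit_ci` and the logit estimates passed to it are made up for illustration. The survey part runs only if the survey package is installed, uses simulated data with an arbitrary seed, and relies on `svyciprop()` (which the survey package provides for a single proportion) and on `confint()` applied to a `svyglm()` fit.

```r
## Back-transform logit-scale estimates and standard errors into proportions
## with Wald-type intervals (interval built on the logit scale, then mapped
## through the inverse logit). `logit_ci` is a hypothetical helper name.
logit_ci <- function(est, se, conf.level = 0.95) {
  z <- qnorm((1 + conf.level) / 2)
  data.frame(prop  = plogis(est),
             lower = plogis(est - z * se),
             upper = plogis(est + z * se))
}

## Made-up logit estimates for two subgroups, for illustration only
print(logit_ci(est = c(-0.4, 0.3), se = c(0.35, 0.40)))

## Design-based versions, assuming the survey package is available
if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)
  set.seed(1)  # arbitrary seed so the simulated data are reproducible
  Dfr <- data.frame(depression = sample(c("yes", "no"), size = 30, replace = TRUE),
                    sex        = sample(c("M", "F"), size = 30, replace = TRUE),
                    cluster    = rep(1:10, times = 3),
                    stratum    = rep(1:5, each = 2, times = 3),
                    pweight    = runif(n = 30, min = 1, max = 3))
  msdesign <- svydesign(id = ~cluster, strata = ~stratum, weights = ~pweight,
                        nest = TRUE, data = Dfr)

  ## Overall prevalence with a logit-based interval
  print(svyciprop(~I(depression == "yes"), msdesign, method = "logit"))

  ## Difference in prevalence (M minus F): the "sexM" coefficient of a
  ## linear probability model, with a design-based Wald interval
  fit <- svyglm(I(depression == "yes") ~ sex, design = msdesign,
                family = gaussian())
  print(confint(fit, level = 0.95))
}
```

The interval from the linear model is on the difference scale directly, so it can extend outside [-1, 1] in small samples; the logit-scale interval for a single proportion always stays inside (0, 1).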
Re: [R] graphically representing frequency of words in a speech?
Yihui,

This is quite impressive, thanks for helping me think about how to make tag clouds in R.

Tony

-----Original Message-----
From: Yihui Xie [mailto:xieyi...@gmail.com]
Sent: Wednesday, June 10, 2009 3:15 AM
To: Brown, Tony Nicholas
Cc: r-help@r-project.org
Subject: Re: [R] graphically representing frequency of words in a speech?

Hi,

As Gregor Gorjanc mentioned, it's very inconvenient to let R decide the font size and placement of words in a plot. There are already very mature tag cloud applications; one that I'm relatively familiar with is the WordPress plugin wp-cumulus, which uses a Flash object to generate the tag cloud and gives the cloud a fantastic 3D rotation effect. I've spent a couple of hours porting it into R; see the source code and effect here:

http://yihui.name/en/2009/06/creating-tag-cloud-using-r-and-flash-javascript-swfobject/

HTH.

Regards,
Yihui
--
Yihui Xie xieyi...@gmail.com
Phone: +86-(0)10-82509086 Fax: +86-(0)10-82509086
Mobile: +86-15810805877
Homepage: http://www.yihui.name
School of Statistics, Room 1037, Mingde Main Building,
Renmin University of China, Beijing, 100872, China

On Mon, Jun 8, 2009 at 2:41 AM, Brown, Tony Nicholas <tony.n.br...@vanderbilt.edu> wrote:
> Dear all,
>
> I recently saw a graph on television that displayed selected
> words/phrases in a speech scaled in size according to their frequency.
> So words/phrases that were often used appeared large and words that
> were rarely used appeared small. The closest thing I can find on the
> web to approximate what I saw can be found here:
> http://stateoftheunion.onetwothree.net/
> The example at that website is more complicated but captures the
> general idea. Would someone point me in the right direction in terms
> of replicating such a graph?
>
> Thanks in advance,
>
> Tony
>
> Tony N. Brown, Ph.D.
> Editor-Elect, American Sociological Review
> Associate Professor of Sociology and Human and Organizational Development (secondary)
> Program Faculty, Effective Health Communication and African American Diaspora Studies
> Faculty Head of Hank Ingram House, The Commons
> Vanderbilt University
> (615) 322-7518
> (615) 322-7505 fax
[R] graphically representing frequency of words in a speech?
Dear all,

I recently saw a graph on television that displayed selected words/phrases in a speech scaled in size according to their frequency. So words/phrases that were often used appeared large and words that were rarely used appeared small. The closest thing I can find on the web to approximate what I saw can be found here:

http://stateoftheunion.onetwothree.net/

The example at that website is more complicated but captures the general idea. Would someone point me in the right direction in terms of replicating such a graph?

Thanks in advance,

Tony

Tony N. Brown, Ph.D.
Editor-Elect, American Sociological Review
Associate Professor of Sociology and Human and Organizational Development (secondary)
Program Faculty, Effective Health Communication and African American Diaspora Studies
Faculty Head of Hank Ingram House, The Commons
Vanderbilt University
(615) 322-7518
(615) 322-7505 fax
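
As a starting point for the question above, the core idea (count the words, then map counts to text size) can be sketched in base R. This is only an illustration under assumptions: the helper names `word_freq` and `freq_to_cex` are made up, the example sentence is invented, the size mapping is a plain linear rescaling, and the words are placed at random positions, so they may overlap.

```r
## Count word frequencies in a character vector of text
word_freq <- function(txt) {
  words <- tolower(unlist(strsplit(txt, "[^[:alpha:]']+")))
  words <- words[nzchar(words)]          # drop empty strings from the split
  sort(table(words), decreasing = TRUE)
}

## Map frequencies linearly onto a range of character expansion factors
freq_to_cex <- function(freq, cex = c(0.8, 3)) {
  rng <- range(freq)
  if (diff(rng) == 0) return(rep(cex[1], length(freq)))  # all equally frequent
  cex[1] + (as.numeric(freq) - rng[1]) / diff(rng) * diff(cex)
}

## Invented example text, for illustration only
speech <- "jobs jobs jobs and peace and hope for the people and jobs for the people"
freq <- word_freq(speech)
size <- freq_to_cex(freq)

## Crude display: random positions, so words may overlap
set.seed(42)
plot.new()
text(runif(length(freq)), runif(length(freq)), labels = names(freq), cex = size)
```

Avoiding overlap is the hard part; it needs the rendered width and height of each word, which base graphics expose through `strwidth()` and `strheight()`.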
Re: [R] graphically representing frequency of words in a speech?
Thank you so much Marc and Gregor. The basic information, suggestions, and R code that you provided are most helpful.

Tony

-----Original Message-----
From: Gorjanc Gregor [mailto:gregor.gorj...@bfro.uni-lj.si]
Sent: Sunday, June 07, 2009 2:17 PM
To: Marc Schwartz; Brown, Tony Nicholas
Cc: rhelp help
Subject: RE: [R] graphically representing frequency of words in a speech?

> The only thing that I found for R is by Gregor Gorjanc, but the
> information seems to be dated:
>
> http://www.bfro.uni-lj.si/MR/ggorjan/software/R/index.html#tagCloud

Hi,

Yes, I have tried to create a tag cloud plot in R, but I abandoned the project due to other things. The main obstacle was that in R we need to take care of the font sizes and placement of words, while this is very easy with, say, browsers, which do all the rendering. I tracked down the last version of the R file, which is pasted below. I must say that I do not remember the status of the code, so use it as you wish. If anyone wishes to take this project further, please do so!
gg

### tagCloud.R
###------------------------------------------------------------------------
### What: Tag cloud plot functions
### Time-stamp: 2006-09-10 02:53:29 ggorjan
###------------------------------------------------------------------------

library(grid) ## gpar(), unit(), viewport(), textGrob(), ... come from grid

tagCloud <- function(x, n=100, decreasing=TRUE, threshold=NULL,
                     fontsize=c(12, 36), align=TRUE, expandRow=TRUE,
                     justRow="bottom", title, textGpar=gpar(col="navy"),
                     rectGpar=gpar(col="white"), titleGpar=gpar(),
                     viewGpar=gpar(), mar=c(1, 1, 1, 1))
{
  UseMethod("tagCloud")
}

tagCloud.default <- function(x, n=100, decreasing=TRUE, threshold=NULL,
                             fontsize=c(12, 36), align=TRUE, expandRow=TRUE,
                             justRow="bottom", title,
                             textGpar=gpar(col="navy"),
                             rectGpar=gpar(col="white"), titleGpar=gpar(),
                             viewGpar=gpar(), mar=c(1, 1, 1, 1))
{
  if(!is.null(dim(x))) stop("'x' must be a vector")
  tagCloud.table(table(x), n=n, decreasing=decreasing, fontsize=fontsize,
                 threshold=threshold, align=align, expandRow=expandRow,
                 justRow=justRow, title=title, textGpar=textGpar,
                 rectGpar=rectGpar, titleGpar=titleGpar, viewGpar=viewGpar,
                 mar=mar)
}

tagCloud.table <- function(x, n=100, decreasing=TRUE, threshold=NULL,
                           fontsize=c(12, 36), align=TRUE, expandRow=TRUE,
                           justRow="bottom", title,
                           textGpar=gpar(col="navy"),
                           rectGpar=gpar(col="white"), titleGpar=gpar(),
                           viewGpar=gpar(), mar=c(1, 1, 1, 1))
{
  ## --- Check ---
  if(length(dim(x)) != 1) stop("'x' must be one dimensional table")

  ## --- Threshold ---
  if(!is.null(threshold)) x <- x[x >= threshold]

  ## --- Number of units ---
  N <- length(x) ## length of table
  if(is.null(n)) { ## if n=NULL, plot all units
    n <- N
  } else {
    if(n > N) n <- N ## if n is too big, decrease it
    if(n < 1) n <- round(N * n) ## if n is percentage of units
  }
  fontsizeLength <- length(fontsize)
  if(fontsizeLength != 2) stop("'fontsize' must be of length two")

  ## --- Sort and subset ---
  if(n < N) { ## only if we want to plot subset of units
    tmp <- sort(x, decreasing=decreasing)
    x <- x[names(x) %in% names(tmp[1:n])]
  }

  ## --- Get relative freq ---
  x <- prop.table(x)

  ## --- Fontsize ---
  fontsizeDiff <- diff(fontsize)
  xDiff <- max(x) - min(x)
  if(xDiff != 0) {
    off <- ifelse(fontsizeDiff > 0, min(x), max(x))
    fontsize <- (x - off) / xDiff * fontsizeDiff + min(fontsize)
  } else { ## all units have the same frequency
    fontsize <- rep(min(fontsize), times=n)
  }

  ## --- Viewport and rectangle ---
  grid.newpage()
  width <- unit(1, "npc")
  height <- unit(1, "npc")
  vp <- viewport(y=unit(mar[1], "lines"), x=unit(mar[2], "lines"),
                 width=width - unit(mar[2] + mar[4], "lines"),
                 height=height - unit(mar[1] + mar[3], "lines"),
                 just=c("left", "bottom"), gp=viewGpar, name="main")
  pushViewport(vp)
  if(!missing(title)) grid.text(title, y=height, gp=titleGpar, name="title")
  grid.rect(gp=rectGpar, name="cloud")

  ## --- Grobs ---
  tag <- vector(mode="list", length=4)
  names(tag) <- c("fontsize", "grob", "width", "height")
  tag[[1]] <- tag[[2]] <- tag[[3]] <- tag[[4]] <- vector(mode="list", length=n)
  for(i in 1:n) {
    tag$fontsize[[i]] <- fontsize[i]
    tag$grob[[i]] <- textGrob(names(x[i]), gp=gpar(fontsize=fontsize[i]))
    tag$width[[i]] <- convertWidth(grobWidth(tag$grob[[i]]), unitTo="npc",
                                   valueOnly=TRUE)
    tag$height[[i]] <- convertHeight
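
The placement problem mentioned in this message (R has to work out where each word goes) can be illustrated with a small greedy row-packing sketch in grid. This is not a reconstruction of the missing part of tagCloud(); `pack_words` is a hypothetical helper, the words and font sizes are made up, and the layout rule (fill a row left to right, then move down) is just one simple choice.

```r
library(grid)

## Greedy row packing: place words left to right and start a new row when
## the next word would run past the right edge. Widths and heights are
## measured by grid in "npc" units (fractions of the current viewport).
pack_words <- function(words, fontsize, gap = 0.02) {
  x <- y <- numeric(length(words))
  curX <- 0; curY <- 1; rowH <- 0
  for (i in seq_along(words)) {
    g <- textGrob(words[i], gp = gpar(fontsize = fontsize[i]))
    w <- convertWidth(grobWidth(g), unitTo = "npc", valueOnly = TRUE)
    h <- convertHeight(grobHeight(g), unitTo = "npc", valueOnly = TRUE)
    if (curX > 0 && curX + w > 1) {   # row is full: move down one row
      curX <- 0; curY <- curY - rowH - gap; rowH <- 0
    }
    x[i] <- curX; y[i] <- curY        # top-left corner of this word
    curX <- curX + w + gap
    rowH <- max(rowH, h)              # tallest word sets the row height
  }
  data.frame(word = words, fontsize = fontsize, x = x, y = y,
             stringsAsFactors = FALSE)
}

grid.newpage()
lay <- pack_words(c("jobs", "peace", "hope", "people", "future"),
                  fontsize = c(36, 24, 12, 18, 30))
for (i in seq_len(nrow(lay)))
  grid.text(lay$word[i], x = lay$x[i], y = lay$y[i], just = c("left", "top"),
            gp = gpar(fontsize = lay$fontsize[i]))
```

Because the measurements depend on the size of the open graphics device, the same words can wrap differently on a differently sized device.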
Re: [R] randomly sample within clustered data?
Thierry,

Thanks so much. Your solution works perfectly.

Tony

-----Original Message-----
From: ONKELINX, Thierry [mailto:[EMAIL PROTECTED]]
Sent: Monday, September 15, 2008 2:56 AM
To: Brown, Tony Nicholas; r-help@r-project.org
Subject: RE: [R] randomly sample within clustered data?

Something like this?

do.call(rbind,
  lapply(
    split(Dataf, Dataf$id),
    function(x){
      x[sample(seq_len(nrow(x)), size=2), ]
    }
  )
)

HTH,

Thierry

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
[EMAIL PROTECTED]
www.inbo.be

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

-----Original message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of Brown, Tony Nicholas
Sent: Monday 15 September 2008 9:40
To: r-help@r-project.org
Subject: [R] randomly sample within clustered data?

> Dear useRs,
>
> What is an efficient way to randomly sample from clustered data such
> that I get equal representation from each cluster? For example, let's
> say I want to randomly sample two cases from each cluster created by
> the id variable in the following data frame:
>
> id <- c(rep(100, 4), rep(101, 3), rep(102, 6), rep(103, 7))
> sex <- sample(c("m","f"), 20, replace=TRUE)
> weight <- rnorm(n=20, mean=150, sd=3)
> attitude <- sample(1:7, 20, replace=TRUE)
> Dataf <- data.frame(id,sex,weight,attitude)
> Dataf
>
>     id sex   weight attitude
> 1  100   m 146.5064        6
> 2  100   f 150.2317        4
> 3  100   f 149.3686        5
> 4  100   m 144.7218        7
> 5  101   m 147.9071        4
> 6  101   m 148.3802        6
> 7  101   m 154.4634        1
> 8  102   m 153.2719        5
> 9  102   m 148.9821        5
> 10 102   f 148.0656        1
> 11 102   f 148.8949        6
> 12 102   m 146.9963        4
> 13 102   m 153.0542        4
> 14 103   m 148.1558        1
> 15 103   f 148.0482        4
> 16 103   m 151.8044        2
> 17 103   f 155.4976        4
> 18 103   m 150.0423        1
> 19 103   f 146.0487        5
> 20 103   m 154.6651        7
>
> Here's the R code I wrote that obviously does not work:
>
> sapply(split(Dataf, Dataf$id), sample, size=2)
>
> I would prefer a data frame (i.e., Dataf2) as the final output and it
> should look something like this:
>
> Dataf2
>
>    id sex   weight attitude
> 1 100   m 146.5064        6
> 2 100   m 144.7218        7
> 3 101   m 147.9071        4
> 4 101   m 154.4634        1
> 5 102   m 153.2719        5
> 6 102   m 148.9821        5
> 7 103   f 155.4976        4
> 8 103   f 146.0487        5
>
> Thanks in advance for your assistance.
>
> Tony
>
> --
> Tony N. Brown, Ph.D.
> Associate Professor of Sociology
> Faculty Head of Hank Ingram House, The Commons
> Research Fellow, Vanderbilt Center for Nashville Studies
> Vanderbilt University
> (615) 322-7518
> (615) 322-7505 fax
> [EMAIL PROTECTED]

The views expressed in this message and any annex are purely those of the writer and may not be regarded as stating an official position of INBO, as long as the message is not confirmed by a duly signed document.
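
A short note on why the `sapply()` attempt in this thread fails, plus a sketch that wraps the `split()`/`lapply()` answer in a function. Applied to a data frame, `sample()` treats it as a list of columns, so `sample(x, size=2)` picks two columns rather than two rows; sampling row indices avoids that. The helper name `sample_within`, the `min()` guard for clusters smaller than `size`, and the seed are additions for illustration, not part of the posted solution.

```r
## Draw `size` rows at random from every level of `group` in data frame `df`.
## `sample_within` is a hypothetical helper name used for illustration.
sample_within <- function(df, group, size = 2) {
  pieces <- lapply(split(df, df[[group]]), function(x) {
    ## sample row indices, never asking for more rows than the cluster has
    x[sample(seq_len(nrow(x)), size = min(size, nrow(x))), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL   # drop the "100.3"-style row names rbind creates
  out
}

set.seed(2008)  # arbitrary seed so the example is reproducible
Dataf <- data.frame(id = c(rep(100, 4), rep(101, 3), rep(102, 6), rep(103, 7)),
                    sex = sample(c("m", "f"), 20, replace = TRUE),
                    weight = rnorm(n = 20, mean = 150, sd = 3),
                    attitude = sample(1:7, 20, replace = TRUE))
Dataf2 <- sample_within(Dataf, "id", size = 2)
print(table(Dataf2$id))
```

Sampling `seq_len(nrow(x))` rather than `1:nrow(x)` also keeps the code safe if a cluster ever turns out to be empty.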
[R] randomly sample within clustered data?
Dear useRs,

What is an efficient way to randomly sample from clustered data such that I get equal representation from each cluster? For example, let's say I want to randomly sample two cases from each cluster created by the id variable in the following data frame:

id <- c(rep(100, 4), rep(101, 3), rep(102, 6), rep(103, 7))
sex <- sample(c("m","f"), 20, replace=TRUE)
weight <- rnorm(n=20, mean=150, sd=3)
attitude <- sample(1:7, 20, replace=TRUE)
Dataf <- data.frame(id,sex,weight,attitude)
Dataf

    id sex   weight attitude
1  100   m 146.5064        6
2  100   f 150.2317        4
3  100   f 149.3686        5
4  100   m 144.7218        7
5  101   m 147.9071        4
6  101   m 148.3802        6
7  101   m 154.4634        1
8  102   m 153.2719        5
9  102   m 148.9821        5
10 102   f 148.0656        1
11 102   f 148.8949        6
12 102   m 146.9963        4
13 102   m 153.0542        4
14 103   m 148.1558        1
15 103   f 148.0482        4
16 103   m 151.8044        2
17 103   f 155.4976        4
18 103   m 150.0423        1
19 103   f 146.0487        5
20 103   m 154.6651        7

Here's the R code I wrote that obviously does not work:

sapply(split(Dataf, Dataf$id), sample, size=2)

I would prefer a data frame (i.e., Dataf2) as the final output and it should look something like this:

Dataf2

   id sex   weight attitude
1 100   m 146.5064        6
2 100   m 144.7218        7
3 101   m 147.9071        4
4 101   m 154.4634        1
5 102   m 153.2719        5
6 102   m 148.9821        5
7 103   f 155.4976        4
8 103   f 146.0487        5

Thanks in advance for your assistance.

Tony

--
Tony N. Brown, Ph.D.
Associate Professor of Sociology
Faculty Head of Hank Ingram House, The Commons
Research Fellow, Vanderbilt Center for Nashville Studies
Vanderbilt University
(615) 322-7518
(615) 322-7505 fax
[EMAIL PROTECTED]