Thanks Dennis, looks like there's even less boiler plate code with plyr. By the way, what I labelled "W.SE" is meant to represent the weighted standard error of the mean. Your "WSE" calculations appear to be providing the weighted standard deviation of the variable. Is this a matter of needing to change my labels to W.SEM to avoid this kind of confusion, or is there literature suggesting that I should be using the standard deviation of the variable to estimate the weighted standard error of the mean?
Thanks, -Solomon On Jan 17, 2011, at 1:11 PM, Dennis Murphy wrote: > Hi: > > Does this do what you need? > > wstats <- function(d) { > require(Hmisc) > N <- length(d$response[!is.na(d$response)]) > c(WM = wtd.mean(d$response, d$weights), > WSE = sqrt(wtd.var(d$response, d$weights)), > N = N) > } > library(plyr) > ddply(mydata, .(group), wstats) > group WM WSE N > 1 1 0.1302255752 1.1911298 20 > 2 2 -0.2814664362 0.8582928 20 > 3 3 -0.3640550516 1.2618343 20 > 4 4 0.0002852392 1.1463205 20 > 5 5 -0.0070283053 1.2315683 20 > > The trick to writing this function for input into plyr is that the argument > is a data frame. When called in ddply(), the function wstats() will be > applied to each sub-frame corresponding to the grouping factor(s). Inside it, > the variables of interest are extracted relative to the input data frame and > the three quantities are computed. I used wtd.mean() and wtd.var() from > Hmisc, as both will remove NAs by default. In the ddply call, the function > name is simply cited since a sub-data frame is the sole argument of the > function. > > I couldn't figure out how to get doBy to get this to work, as it seems best > suited to functions of one argument (a single response), but here's an > alternative using the data.table package: > > library(data.table) > # Assumes Hmisc is already loaded... > myDT <- data.table(mydata, key = 'group') > myDT[, list(N = length(response[!is.na(response)]), > wtdMean = wtd.mean(response, weights), > wtdSE = sqrt(wtd.var(response, weights))), by = 'group'] > group N wtdMean wtdSE > [1,] 1 20 0.1302255752 1.1911298 > [2,] 2 20 -0.2814664362 0.8582928 > [3,] 3 20 -0.3640550516 1.2618343 > [4,] 4 20 0.0002852392 1.1463205 > [5,] 5 20 -0.0070283053 1.2315683 > > data.table uses a different model of data organization from data frames. A > simplistic description is that it you can think of a data.table as analogous > to a table in a DBMS. Notice that the 'function call' is indexed inside the > data table: the first 'subscript' corresponds to what are called I() > operations (analogous to 'select' statements in an SQL); the second > 'subscript' corresponds to J() operations, (analogous to 'where' > statements), while the third argument is the by group(s), or sub-data tables, > to which (in this case) the J() operations apply. > > For functions that take multiple arguments and that are meant to be applied > in a groupwise fashion, I find plyr and data.table to be very good options. > There are also base package alternatives (e.g., some combination of lapply(), > mapply() and do.call()) and several other packages, but plyr and data.table > are generally pretty good at handling most of the niggling details. Having > said that, both have learning curves - data.table, in particular, will be > much easier to pick up if you have some background in SQLs, since its syntax > uses primary principles of SQL. > > data.table has a vignette and FAQ, along with an independent help list - for > details, see its page on R-forge: > http://r-forge.r-project.org/projects/datatable/ > For plyr's documentation, see > http://had.co.nz/plyr/ > A link to its mailing list is found on that page as well. > > HTH, > Dennis > > On Mon, Jan 17, 2011 at 10:24 AM, Solomon Messing <solomon.mess...@gmail.com> > wrote: > Thanks Josh. I built on your example and ended up with the code below--if > you or anyone sees any issues please let me know. It would be great if there > were a slicker way to get these kinds of summary stats in R, but this gets > the job done. > > # takes data frame z with weights w and data x, returns weighted mean, > weighted SE, and N > msenw = function(z){ > N = length(na.omit(z)$response) > i = which(!is.na(z$response)) > return( > c( W.M = weighted.mean(z$response, z$weights, na.rm=T), > W.SE = sqrt(wtd.var(z$response, weights = > z$weights))/sqrt(sum(z$weights[i])), > N=N ) ) > } > > library(doBy) > library(Hmisc) > ## make up some data (easier) > mydata <- data.frame(response = rnorm(100), > group = rep(1:5, each = 20), weights = runif(100, 0, 1)) > > xy <- by(mydata, mydata$group, msenw) > data.frame( group = names(c(xy)), do.call(rbind, xy) ) > > ## can be extended to other data using: > xy <- by(data.frame(response = mydata$response, weights = mydata$weights), > mydata$group, msenw) > > > Solomon Messing > www.stanford.edu/~messing > > > > On Jan 16, 2011, at 11:16 PM, Joshua Wiley wrote: > > > Dear Solomon, > > > > On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing > > <solomon.mess...@gmail.com> wrote: > >> Dear Soren and R users: > >> > >> I am trying to use the summaryBy function with weights. Is this possible? > >> An example that illustrates what I am trying to do follows: > >> > >> library(doBy) > >> ## make up some data > >> response = rnorm(100) > >> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20)) > >> weights = runif(100, 0, 1) > >> mydata = data.frame(response,group,weights) > >> > >> ## run summaryBy without weights: > >> summaryBy(response~group, data = mydata, FUN = mean) > >> > >> ## attempt to run summaryBy with weights, throws error > >> summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights ) > >> > >> ## throws the error: > >> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : > >> # arguments must have same length > >> > >> My guess is that summaryBy is not giving weighted.mean() each group of > >> weights, but instead is passing all of the weights in the data set each > >> time it calls weighted.mean(). > > > > Yes, of course. It has no way of knowing that the weights should also > > be being broken down by group....they are not in the formula. > > > >> Do you know if there is some way to get summaryBy to pass weights to > >> weighted.mean() only for each group? > > > > Ideally there would be a way to pass more than one variable to a > > function (e.g., response and weights) or just an entire object > > (mydata) broken down by group. Then you would just make a wrapper > > function to pass the right values to the x and w arguments of > > weighted.mean. Instead here is a somewhat hacked version: > > > > library(doBy) > > ## make up some data (easier) > > mydata <- data.frame(response = rnorm(100), > > group = rep(1:5, each = 20), weights = runif(100, 0, 1)) > > > > ## manually compute weighted mean > > tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum) > > tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum)) > > tmp ## weighted means > > > > ## here's the 'problem', if you will, even with +, they are passed > > one at a time > > summaryBy(response + weights ~ group, data = mydata, FUN = str) > > summaryBy(mydata ~ group, data = mydata, FUN = str) > > > > ## here is an option using by(): > > xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response, > > z$weights)) > > xy > > ## if you don't like the formatting.... > > data.frame(group = names(c(xy)), weighted.mean = c(xy)) > > > > HTH, > > > > Josh > > > >> > >> I suspect this functionality would be a tremendous benefit to R users who > >> regularly work with weighted data, such as myself. > >> > >> Thanks, > >> > >> Solomon Messing > >> www.stanford.edu/~messing > >> > >> PS I know this basic example can be done using lapply(split(...)) approach > >> referenced here: > >> > >> http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg12349.html > >> > >> but for more complex tasks the lapply approach will mean writing a lot of > >> extra code to run everything and then to get things formatted as nicely as > >> summaryBy() was designed to do. > >> > >> > >> [[alternative HTML version deleted]] > >> > >> ______________________________________________ > >> R-help@r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide > >> http://www.R-project.org/posting-guide.html > >> and provide commented, minimal, self-contained, reproducible code. > >> > > > > > > > > -- > > Joshua Wiley > > Ph.D. Student, Health Psychology > > University of California, Los Angeles > > http://www.joshuawiley.com/ > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.