Re: [R] Using summaryBy with weighted data

Solomon Messing Mon, 17 Jan 2011 14:41:19 -0800

Thanks Dennis, looks like there's even less boiler plate code with plyr.  By 
the way, what I labelled "W.SE" is meant to represent the weighted standard 
error of the mean.  Your "WSE" calculations appear to be providing the weighted 
standard deviation of the variable.  Is this a matter of needing to change my 
labels to W.SEM to avoid this kind of confusion, or is there literature 
suggesting that I should be using the standard deviation of the variable to 
estimate the weighted standard error of the mean?


Thanks,

-Solomon

On Jan 17, 2011, at 1:11 PM, Dennis Murphy wrote:

> Hi:
> 
> Does this do what you need?
> 
> wstats <- function(d) {
>              require(Hmisc)
>              N <- length(d$response[!is.na(d$response)])
>              c(WM = wtd.mean(d$response, d$weights),
>                WSE = sqrt(wtd.var(d$response, d$weights)),
>                N = N)
>     }
> library(plyr)
> ddply(mydata, .(group), wstats)
>   group            WM       WSE  N
> 1     1  0.1302255752 1.1911298 20
> 2     2 -0.2814664362 0.8582928 20
> 3     3 -0.3640550516 1.2618343 20
> 4     4  0.0002852392 1.1463205 20
> 5     5 -0.0070283053 1.2315683 20
> 
> The trick to writing this function for input into plyr is that the argument 
> is a data frame. When called in ddply(), the function wstats() will be 
> applied to each sub-frame corresponding to the grouping factor(s). Inside it, 
> the variables of interest are extracted relative to the input data frame and 
> the three quantities are computed. I used wtd.mean() and wtd.var() from 
> Hmisc, as both will remove NAs by default. In the ddply call, the function 
> name is simply cited since a sub-data frame is the sole argument of the 
> function.
> 
> I couldn't figure out how to get doBy to get this to work, as it seems best 
> suited to functions of one argument (a single response), but here's an 
> alternative using the data.table package:
> 
> library(data.table)
> # Assumes Hmisc is already loaded...
> myDT <- data.table(mydata, key = 'group')
> myDT[, list(N = length(response[!is.na(response)]),
>              wtdMean = wtd.mean(response, weights),
>              wtdSE = sqrt(wtd.var(response, weights))), by = 'group']
>      group  N       wtdMean     wtdSE
> [1,]     1 20  0.1302255752 1.1911298
> [2,]     2 20 -0.2814664362 0.8582928
> [3,]     3 20 -0.3640550516 1.2618343
> [4,]     4 20  0.0002852392 1.1463205
> [5,]     5 20 -0.0070283053 1.2315683
> 
> data.table uses a different model of data organization from data frames. A 
> simplistic description is that it you can think of a data.table as analogous 
> to a table in a DBMS. Notice that the 'function call' is indexed inside the 
> data table: the first 'subscript' corresponds to what are called I() 
> operations  (analogous to 'select' statements in an SQL); the second 
> 'subscript' corresponds to J() operations, (analogous to  'where' 
> statements), while the third argument is the by group(s), or sub-data tables, 
> to which (in this case) the J() operations apply. 
> 
> For functions that take multiple arguments and that are meant to be applied 
> in a groupwise fashion, I find plyr and data.table to be very good options. 
> There are also base package alternatives (e.g., some combination of lapply(), 
> mapply() and do.call()) and several other packages, but plyr and data.table 
> are generally pretty good at handling most of the niggling details. Having 
> said that, both have learning curves - data.table, in particular, will be 
> much easier to pick up if you have some background in SQLs, since its syntax 
> uses primary principles of SQL. 
> 
> data.table has a vignette and FAQ, along with an independent help list - for 
> details, see its page on R-forge:
> http://r-forge.r-project.org/projects/datatable/
> For plyr's documentation, see
> http://had.co.nz/plyr/
> A link to its mailing list is found on that page as well.
> 
> HTH,
> Dennis
> 
> On Mon, Jan 17, 2011 at 10:24 AM, Solomon Messing <solomon.mess...@gmail.com> 
> wrote:
> Thanks Josh.  I built on your example and ended up with the code below--if 
> you or anyone sees any issues please let me know.  It would be great if there 
> were a slicker way to get these kinds of summary stats in R, but this gets 
> the job done.
> 
> # takes data frame z with weights w and data x, returns weighted mean, 
> weighted SE, and N
> msenw = function(z){
>        N = length(na.omit(z)$response)
>        i = which(!is.na(z$response))
>        return(
>                        c( W.M = weighted.mean(z$response, z$weights, na.rm=T),
>                        W.SE = sqrt(wtd.var(z$response, weights = 
> z$weights))/sqrt(sum(z$weights[i])),
>                        N=N ) )
> }
> 
> library(doBy)
> library(Hmisc)
> ## make up some data (easier)
> mydata <- data.frame(response = rnorm(100),
>                group = rep(1:5, each = 20), weights = runif(100, 0, 1))
> 
> xy <- by(mydata, mydata$group, msenw)
> data.frame( group = names(c(xy)), do.call(rbind, xy) )
> 
> ## can be extended to other data using:
> xy <- by(data.frame(response = mydata$response, weights = mydata$weights), 
> mydata$group, msenw)
> 
> 
> Solomon Messing
> www.stanford.edu/~messing
> 
> 
> 
> On Jan 16, 2011, at 11:16 PM, Joshua Wiley wrote:
> 
> > Dear Solomon,
> >
> > On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing
> > <solomon.mess...@gmail.com> wrote:
> >> Dear Soren and R users:
> >>
> >> I am trying to use the summaryBy function with weights.  Is this possible? 
> >>  An example that illustrates what I am trying to do follows:
> >>
> >> library(doBy)
> >> ## make up some data
> >> response = rnorm(100)
> >> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
> >> weights = runif(100, 0, 1)
> >> mydata = data.frame(response,group,weights)
> >>
> >> ## run summaryBy without weights:
> >> summaryBy(response~group, data = mydata, FUN = mean)
> >>
> >> ## attempt to run summaryBy with weights, throws error
> >> summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights )
> >>
> >> ## throws the error:
> >> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { :
> >> #                                       arguments must have same length
> >>
> >> My guess is that summaryBy is not giving weighted.mean() each group of 
> >> weights, but instead is passing all of the weights in the data set each 
> >> time it calls weighted.mean().
> >
> > Yes, of course.  It has no way of knowing that the weights should also
> > be being broken down by group....they are not in the formula.
> >
> >>  Do you know if there is some way to get summaryBy to pass weights to 
> >> weighted.mean() only for each group?
> >
> > Ideally there would be a way to pass more than one variable to a
> > function (e.g., response and weights) or just an entire object
> > (mydata) broken down by group.  Then you would just make a wrapper
> > function to pass the right values to the x and w arguments of
> > weighted.mean.  Instead here is a somewhat hacked version:
> >
> > library(doBy)
> > ## make up some data (easier)
> > mydata <- data.frame(response = rnorm(100),
> > group = rep(1:5, each = 20), weights = runif(100, 0, 1))
> >
> > ## manually compute weighted mean
> > tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum)
> > tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum))
> > tmp ## weighted means
> >
> > ## here's the 'problem', if you will, even with  +, they are passed
> > one at a time
> > summaryBy(response + weights ~ group, data = mydata, FUN = str)
> > summaryBy(mydata ~ group, data = mydata, FUN = str)
> >
> > ## here is an option using by():
> > xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response, 
> > z$weights))
> > xy
> > ## if you don't like the formatting....
> > data.frame(group = names(c(xy)), weighted.mean = c(xy))
> >
> > HTH,
> >
> > Josh
> >
> >>
> >> I suspect this functionality would be a tremendous benefit to R users who 
> >> regularly work with weighted data, such as myself.
> >>
> >> Thanks,
> >>
> >> Solomon Messing
> >> www.stanford.edu/~messing
> >>
> >> PS I know this basic example can be done using lapply(split(...)) approach 
> >> referenced here:
> >>
> >> http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg12349.html
> >>
> >> but for more complex tasks the lapply approach will mean writing a lot of 
> >> extra code to run everything and then to get things formatted as nicely as 
> >> summaryBy() was designed to do.
> >>
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide 
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
> >
> > --
> > Joshua Wiley
> > Ph.D. Student, Health Psychology
> > University of California, Los Angeles
> > http://www.joshuawiley.com/
> 
> 
>        [[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Using summaryBy with weighted data

Reply via email to