Lets say there are 10 students in the first group and denote x1 as (say) the number of red balls for student 1 and s1 the total balls. Then I was calculating the average the proportion ( x1/s1 + x2/s2 + ... + x10/s10 ) and you were calculating the average number of events (x1+x2 +...+x10)/(s1+s2+...+s10).
On second thoughts I think it is much better to calculate the a weighted average of the proportions. The weights should reflect the variance of the estimate of the proportions. ( w1*x1/s1 + w2*x2/s2 + ... + w10*x10/s10 ) On Mon, 2006-03-20 at 15:11 +0000, Dan Bolser wrote: > Adaikalavan Ramasamy wrote: > > Are you saying that your data might look like this ? > > > > set.seed(1) # For reproducibility only - remove this > > mydf <- data.frame( age=round(runif(100, min=5, max=65), digits=1), > > nred=rpois(100, lambda=10), > > nblue=rpois(100, lambda=5), > > ngreen=rpois(100, lambda=15) ) > > mydf$total <- rowSums( mydf[ , c("nred", "nblue", "ngreen")] ) > > > > head(mydf) > > age nred nblue ngreen total > > 1 20.9 11 7 15 33 > > 2 27.3 8 2 18 28 > > 3 39.4 11 4 8 23 > > 4 59.5 6 5 8 19 > > 5 17.1 10 3 16 29 > > 6 58.9 11 5 14 30 > > > > > > If so, then try this : > > > > mydf <- mydf[order(mydf$age), ] ## re-order by age > > mydf$cumtotal <- cumsum(mydf$total) ## cummulative total > > > > brk.pts <- seq(from=0, to=sum(mydf$total), len=9) > > mydf$grp <- cut( mydf$cumtotal , brk.pts, labels=F ) > > > > age nred nblue ngreen total cumtotal grp > > 27 5.8 9 5 8 22 22 1 > > 47 6.4 6 5 13 24 46 1 > > 92 8.5 8 4 18 30 76 1 > > 10 8.7 12 5 8 25 101 1 > > 55 9.2 10 7 13 30 131 1 > > 69 10.1 9 3 18 30 161 1 > > > > > > So here your 'grp' column is what you really want. Just to check > > > > tapply( mydf$total, mydf$grp, sum ) > > 1 2 3 4 5 6 7 8 > > 352 363 372 387 358 377 377 370 > > > > sapply( tapply( mydf$age, mydf$grp, range ), c ) > > 1 2 3 4 5 6 7 8 > > [1,] 5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7 > > [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5 > > > > The last command says that your youngest student in group 1 is aged 5.8 > > and oldest is aged 16.2. > > > > > > Taking this one step further, you can calculate the proportion of the > > red, green and blue for each of the 8 groups. > > > > props <- mydf[ , c("nred", "nblue", "ngreen")]/mydf$total # proportions > > apply( props, 2, function(v) tapply( v, mydf$grp, mean ) ) > > nred nblue ngreen > > 1 0.3459898 0.1776441 0.4763661 > > 2 0.3280712 0.1730796 0.4988492 > > 3 0.3061429 0.1748149 0.5190422 > > 4 0.3759380 0.2084694 0.4155926 > > 5 0.3548805 0.1587353 0.4863842 > > 6 0.3106835 0.1829349 0.5063816 > > 7 0.3525933 0.1599737 0.4874330 > > 8 0.3133796 0.1795567 0.5070637 > > > > Hope this of some use. > > Yes, this is very useful! I have just one remaining question, above you > take the mean of the group proportion... > > apply( props, 2, function(v) tapply( v, mydf$grp, mean ) ) > > > instead of explicitly recalculating the proportion for the group (what I > couldn't script real good) ... > > rbind( > colSums(mydf[ mydf$grp==1, c("nred", "nblue", "ngreen")])/ > sum (mydf[ mydf$grp==1, c("nred", "nblue", "ngreen")]), > ... > colSums(mydf[ mydf$grp==8, c("nred", "nblue", "ngreen")])/ > sum (mydf[ mydf$grp==8, c("nred", "nblue", "ngreen")]) > ) > > > Giving (from the same seed)... > > nred nblue ngreen > [1,] 0.3465909 0.1704545 0.4829545 > [2,] 0.3250689 0.1735537 0.5013774 > [3,] 0.3064516 0.1774194 0.5161290 > [4,] 0.3746770 0.2067183 0.4186047 > [5,] 0.3519553 0.1564246 0.4916201 > [6,] 0.3103448 0.1830239 0.5066313 > [7,] 0.3501326 0.1644562 0.4854111 > [8,] 0.3081081 0.1837838 0.5081081 > > > Which is *slightly* different from the 'mean' approach. > > > round(former-latter,4) > nred nblue ngreen > 1 -0.0006 0.0072 -0.0066 > 2 0.0030 -0.0005 -0.0025 > 3 -0.0003 -0.0026 0.0029 > 4 0.0013 0.0018 -0.0030 > 5 0.0029 0.0023 -0.0052 > 6 0.0003 -0.0001 -0.0002 > 7 0.0025 -0.0045 0.0020 > 8 0.0053 -0.0042 -0.0010 > > > I know this less a question about R, and more a question about general > stats, but why did you choose the former and not the latter method? Is > one wrong and one right? Or did the former better fit the situation as > described? > > Thanks for any insight into your decision, as this is something that has > always puzzled me. > > Thanks for the beautifully clear examples! > > > Dan. > > > > > Regards, Adai > > > > > > > > On Sun, 2006-03-19 at 18:58 +0000, Dan Bolser wrote: > > > >>Adaikalavan Ramasamy wrote: > >> > >>>Do you by any chance want to sample from each group equally to get an > >>>equal representation matrix ? > >> > >>No. > >> > >>I want to make groups of equal sizes, where size isn't simply number of > >>rows (allowing a simple 'gl'), but a sum of the variable. > >> > >>Thanks for the code though, it looks useful. > >> > >> > >> > >>Here is an analogy for what I want to do (in case it helps). > >> > >>A group of students have some bags of marbles - The marbles have > >>different colours. Each student has one bag, but can have between 5 and > >>50 marbles per bag with any given strange distribution you like. I line > >>the students up by age, and want to see if there is any systematic > >>difference between the number of each color of marble by age (older > >>students may find primary colours less 'cool'). > >> > >>Because the statistics of each individual student are bad (like the > >>proportion of each color per student -- has a high variance) I first put > >>all the students into 8 groups (for example). > >> > >>Thing is, for one reason or another, the number of marbles per bag may > >>systematically vary with age too. However, I am not interested in the > >>number of marbles per bag, so I would like to group the students into 8 > >>groups such that each group has the same total number of marbles. (Each > >>group having a different sized age range, none the less ordered by age). > >> > >>Then I can look at the proportion (or count) of colours in each group, > >>and I can compare the groups or any trend accross the groups. > >> > >>Does that make sense? > >> > >>Cheers, > >>Dan. > >> > >> > >> > >> > >> > >> > >> > >>>Here is an example of the input : > >>> > >>> mydf <- data.frame( value=1:100, value2=rnorm(100), > >>> grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) ) > >>> > >>>which has 35 observations from A, 15 from B, 30 from C and 20 from D. > >>> > >>> > >>>And here is a function that I wrote: > >>> > >>> sample.by.group <- function(df, grp, k, replace=FALSE){ > >>> > >>> if(length(k)==1){ k <- rep(k, length(unique(grp))) } > >>> > >>> if(!replace && any(k > table(grp))) > >>> stop( paste("Cannot take a sample larger than the population when > >>> 'replace = FALSE'.\n", "Please specify a value greater than", > >>> min(table(grp)), "or use 'replace = TRUE'.\n") ) > >>> > >>> > >>> ind <- model.matrix( ~ -1 + grp ) > >>> w.mat <- list(NULL) > >>> > >>> for(i in 1:ncol(ind)){ > >>> w.mat[[i]] <- sample( which( ind[,i]==1 ), k[i], replace=replace ) > >>> } > >>> > >>> out <- df[ unlist(w.mat), ] > >>> return(out) > >>> } > >>> > >>> > >>>And here are some examples of how to use it : > >>> > >>>mydf <- mydf[ sample(1:nrow(mydf)), ] # scramble it for fun > >>> > >>> > >>>out1 <- sample.by.group(mydf, mydf$grp, k=10 ) > >>>table( out1$grp ) > >>> > >>> out2 <- sample.by.group(mydf, mydf$grp, k=50, replace=T) # ie bootstrap > >>> table( out2$grp ) > >>> > >>>and you can even do bootstrapping or sampling with weights via: > >>> > >>> out3 <- sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=T) > >>> table( out3$grp ) > >>> > >>> > >>>Regards, Adai > >>> > >>> > >>> > >>>On Fri, 2006-03-17 at 16:01 +0000, Dan Bolser wrote: > >>> > >>> > >>>>Hi, > >>>> > >>>>I have tuples of data in rows of a data.frame, each column is a variable > >>>>for the 'items' (one per row). > >>>> > >>>>One of the variables is the 'size' of the item (row). > >>>> > >>>>I would like to cut my data.frame into groups such that each group has > >>>>the same *total size*. So, assuming that we order by size, some groups > >>>>should have several small items while other groups have a few large > >>>>items. All the groups should have approximately the same total size. > >>>> > >>>>I have tried various combinations of cut, quantile, and ecdf, and I just > >>>>can't work out how to do this! > >>>> > >>>>Any help is greatly appreciated! > >>>> > >>>>All the best, > >>>>Dan. > >>>> > >>>>______________________________________________ > >>>>R-help@stat.math.ethz.ch mailing list > >>>>https://stat.ethz.ch/mailman/listinfo/r-help > >>>>PLEASE do read the posting guide! > >>>>http://www.R-project.org/posting-guide.html > >>>> > >>> > >>> > >> > > > > ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html