On Aug 6, 2013, at 4:02 PM, Mike Miller wrote:
> I received two additional suggestions, one off-list, both appended below.
> Both helped me to learn a bit more about how to get what I want.
>
> First, the aggregate() function is in package:stats. It provides the numbers
> I needed, but I don't like the output format as much as I liked the format
> from doBy::summaryBy(). Here it is:
>
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x,
>> function(x) c(mean=mean(x), sd=sd(x), quantile(x), N=length(x)))
>    Generation Zygosity    Sex Cohort ESstatus   Age.mean    Age.sd     Age.0%    Age.25%    Age.50%    Age.75%   Age.100%       Age.N
> 1   Offspring       DZ Female     11       ES 17.7852830 0.3535863 16.9300000 17.6000000 17.7750000 17.9650000 18.9200000 106.0000000
> 2      Parent       DZ Female     11       ES 44.6151240 5.1246314 32.1700000 41.3400000 44.6800000 48.2800000 57.9500000 121.0000000
> snipped
> 23  Offspring       MZ   Male     17    notES 17.4911446 0.3961757 16.6500000 17.1775000 17.5000000 17.8100000 18.3500000 332.0000000
> 24     Parent       MZ   Male     17    notES 46.6929771 5.2421896 34.4500000 43.1500000 45.8900000 49.0050000 63.8000000 131.0000000
>
> That's great but there are two things I didn't like: (1) There are too many
> digits, especially on the integers in the last column. I thought five digits
> to the right of the decimal was more than enough but here we have seven, even
> for integers. (2) The ordering of levels within factors implied by the right
> side of the formula is not honored -- it looks like it used the order Cohort,
> ESstatus, Sex, Zygosity, Generation. Unlike doBy::summaryBy(), it does not
> accept an order=T argument (that is the default in doBy::summaryBy()).
>
> One thing both suggestions taught me was to use names in function definitions
> so that I always get correct column headings on output. This was in the
> documentation for doBy::summaryBy(), but I didn't understand it when I first
> read it.
> Using that naming concept, I created this function:
>
> descriptivefun <- function(x, ...){c(mean=mean(x, ...), sd=sd(x, ...),
> quantile(x, ...), N=sum(!is.na(x)), NAs=sum(is.na(x)))}
>
> That will allow me to feed the na.rm=T argument to the mean, sd and quantile
> functions. By not naming the quantile function (e.g., not using
> q=quantile(x, ...)) I allow the built-in column names to be used unaltered
> (i.e., 0%, 25%, 50%, 75%, 100%). I also did not use length() because it will
> count NA values and I want to see the sample sizes used for mean, sd and
> quantile. To deal with that problem I added an element named "N" to count
> those sample sizes and one named "NAs" to count the number of NAs. Then I
> do this:
>
>> summaryBy(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x,
>> FUN=descriptivefun, na.rm=T)
>    Generation Zygosity    Sex Cohort ESstatus Age.mean    Age.sd Age.0% Age.25% Age.50% Age.75% Age.100% Age.N Age.NAs
> 1   Offspring       DZ Female     11       ES 17.78528 0.3535863  16.93 17.6000  17.775 17.9650    18.92   106       0
> 2   Offspring       DZ Female     11    notES 18.13679 0.5555968  16.76 17.8525  18.190 18.4575    19.50   162       0
> snipped
> 22     Parent       MZ   Male     11       ES 43.40787 5.3507439  31.28 39.9700  43.440 46.4800    64.65   197       0
> 23     Parent       MZ   Male     11    notES 41.56363 4.6564818  32.10 38.0250  41.390 44.6450    65.29   331       0
> 24     Parent       MZ   Male     17    notES 46.69298 5.2421896  34.45 43.1500  45.890 49.0050    63.80   131       0
>
> I think that output looks very nice. One thing that I don't understand is
> why my function produces %.5f output for every value but the
> doBy::summaryBy() function uses different formats in different columns.

Look at the code. You are attributing behavior to `summaryBy` that should be
ascribed to `print.data.frame`, and to `format.data.frame`. Your function is
returning a numeric vector and getting displayed by `print.default`.

-- David.

> Compare the above output with this output:
>
>> descriptivefun(x$Age)
>       mean         sd         0%        25%        50%        75%       100%          N        NAs
>   28.49302   13.29077   16.55000   17.65000   18.23000   42.25500   65.29000 4434.00000    0.00000
>
> It's not a big deal, but it would be cool if I could tell doBy::summaryBy()
> how to format the numbers using something like format=c(rep("%.2f",7),
> rep("%d",2)).
>
> Mike
>
> --
> Michael B. Miller, Ph.D.
> Minnesota Center for Twin and Family Research
> Department of Psychology
> University of Minnesota
>
>
>
> On Mon, 5 Aug 2013, David Carlson wrote:
>
>> This is a bit simpler. The function quantile() labels the output whereas
>> fivenum() does not:
>>
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort +
>>           ESstatus, data=x,
>>           function(x) c(mean=mean(x), sd=sd(x), quantile(x)))
>
>
> On Mon, 5 Aug 2013, Dr. Thomas W. MacFarland wrote:
>
>> Dear Dr. Miller:
>>
>> Pasted below is syntax that should mostly answer your recent question to the
>> R mailing list:
>>
>> descriptivefun <- function(x, ...){
>>   c(m=mean(x, ...), sd=sd(x, ...), l=length(x))
>> }
>>
>> doBy::summaryBy(Final ~ Method.recode +
>>                 ComCol.recode,
>>                 data=Final.table,
>>                 FUN=descriptivefun,
>>                 na.rm=TRUE,
>>                 keep.names=TRUE,
>>                 order=TRUE)
>>
>> I go into far more detail on this package::function and similar functions in
>> my recent text on Twoway ANOVA,
>> http://www.springer.com/statistics/social+sciences+%26+law/book/978-1-4614-2133-7.
>>
>> Best wishes.
>>
>> Tom

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
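
The distinction drawn in the reply (a numeric vector goes through `print.default`, a data frame through `print.data.frame` and `format.data.frame`) can be seen in a small sketch. It uses made-up ages, not the thread's data. `descriptivefun` is copied from Mike's message. `format_cols` is a hypothetical helper, not part of doBy; it only illustrates one way to approximate the format=c(...) idea by applying sprintf() to each column after summarising.

```r
# Illustrative data only: these ages are invented for the sketch.
age <- c(16.93, 17.6, 17.775, 17.965, 18.92, NA)

# Mike's function from the thread: named elements become output labels,
# and the unnamed quantile() call keeps its own labels (0%, 25%, ...).
descriptivefun <- function(x, ...) {
  c(mean = mean(x, ...), sd = sd(x, ...), quantile(x, ...),
    N = sum(!is.na(x)), NAs = sum(is.na(x)))
}

res <- descriptivefun(age, na.rm = TRUE)

# res is a named numeric vector, so it is shown by print.default(),
# which formats all elements together with one common number of decimals.
print(res)

# The same numbers as a one-row data frame are shown by print.data.frame(),
# which formats each column on its own, so the counts lose their decimals.
df <- data.frame(t(res), check.names = FALSE)
print(df)

# Hypothetical helper (an assumption of this sketch, not a doBy feature):
# apply one sprintf() format per column and return character columns.
format_cols <- function(d, fmt) {
  stopifnot(length(fmt) == ncol(d))
  out <- d
  out[] <- Map(function(col, f) sprintf(f, col), d, fmt)
  out
}

# Counts are stored as doubles after c(), so "%.0f" is used for them here.
formatted <- format_cols(df, c(rep("%.2f", 7), rep("%.0f", 2)))
print(formatted)
```

The same helper could be applied to the numeric columns of a summaryBy() result, at the cost of turning them into character strings, so it is best kept as a final display step.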