Re: [R] descriptive stats by cells in factorial design

David Winsemius Tue, 06 Aug 2013 17:23:31 -0700

On Aug 6, 2013, at 4:02 PM, Mike Miller wrote:

> I received two additional suggestions, one off-list, both appended below. 
> Both helped me to learn a bit more about how to get what I want.
> 
> First, the aggregate() function is in package:stats, it provides the numbers 
> I needed, but I don't like the output format as much as I liked the format 
> from doBy:summaryBy().  Here it is:
> 
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, 
>> function(x) c(mean=mean(x), sd=sd(x), quantile(x), N=length(x)))
>   Generation Zygosity    Sex Cohort ESstatus    Age.mean      Age.sd      
> Age.0%     Age.25%     Age.50%     Age.75%    Age.100%       Age.N
> 1   Offspring       DZ Female     11       ES  17.7852830   0.3535863  
> 16.9300000  17.6000000  17.7750000  17.9650000  18.9200000 106.0000000
> 2      Parent       DZ Female     11       ES  44.6151240   5.1246314  
> 32.1700000  41.3400000  44.6800000  48.2800000  57.9500000 121.0000000
> 
snipped
> 23  Offspring       MZ   Male     17    notES  17.4911446   0.3961757  
> 16.6500000  17.1775000  17.5000000  17.8100000  18.3500000 332.0000000
> 24     Parent       MZ   Male     17    notES  46.6929771   5.2421896  
> 34.4500000  43.1500000  45.8900000  49.0050000  63.8000000 131.0000000
> 
> That's great but there are two things I didn't like:  (1) There are too many 
> digits, especially on the integers in the last column.  I thought five digits 
> to the right of the decimal was more than enough but here we have seven, even 
> for integers.  (2) The ordering of levels within factors implied by the right 
> side of the formula is not honored -- it looks like it used the order Cohort, 
> ESstatus, Sex, Zygosity, Generation.  Unlike doBy::summaryBy(), it does not 
> accept an order=T argument (that is the default in doBy::summaryBy()).
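
Both points can be handled after the fact. What follows is only a sketch on made-up data (the data frame `d` and its values are invented; only the column names echo the thread). When FUN returns a vector, aggregate() stores the results as a single matrix column, and a matrix column is printed with one shared format, which is why even the counts show seven decimals. Flattening that column with do.call(data.frame, ...) lets each statistic print with its own format, and order() puts the rows in whatever factor order is wanted.

```r
# Toy data: column names mirror the thread, values are made up.
d <- data.frame(
  Generation = rep(c("Offspring", "Parent"), each = 6),
  Sex        = rep(c("Female", "Male"), times = 6),
  Age        = c(17.2, 17.9, 18.1, 17.5, 17.7, 18.3,
                 44.1, 46.8, 41.3, 49.0, 45.5, 43.2)
)

res <- aggregate(Age ~ Generation + Sex, data = d,
                 FUN = function(x) c(mean = mean(x), sd = sd(x), N = length(x)))

# 'Age' is one matrix column, so print() formats all of its cells alike.
is.matrix(res$Age)

# Flatten the matrix column into ordinary columns Age.mean, Age.sd, Age.N.
flat <- do.call(data.frame, res)

# Sort so that the leftmost factor varies slowest.
flat <- flat[order(flat$Generation, flat$Sex), ]
rownames(flat) <- NULL
print(flat)
```

With the real data, the same order() call would simply list all five grouping variables in the desired sequence.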
> 
> One thing both suggestions taught me was to use names in function definitions 
> so that I always get correct column headings on output.  This was in the 
> documentation for doBy::summaryBy(), but I didn't understand it when I first 
> read it.  Using that naming concept, I created this function:
> 
> descriptivefun <- function(x, ...){c(mean=mean(x, ...), sd=sd(x, ...), 
> quantile(x, ...), N=sum(!is.na(x)), NAs=sum(is.na(x)))}
> 
> That will allow me to feed the na.rm=T argument to the mean, sd and quantile 
> functions.  By not naming the quantile function (e.g., not using 
> q=quantile(x, ...)) I allow the builtin column names to be used unaltered 
> (i.e., 0%, 25%, 50%, 75%, 100%).  I also did not use length() because it will 
> count NA values and I want to see the sample sizes used for mean, sd and 
> quantile.  To deal with that problem I created a function with output named 
> "N" to count those sample sizes and one with output named "NAs" to count the 
> number of NAs.  Then I do this:
> 
>> summaryBy(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, 
>> FUN=descriptivefun, na.rm=T)
>   Generation Zygosity    Sex Cohort ESstatus Age.mean    Age.sd Age.0% 
> Age.25% Age.50% Age.75% Age.100% Age.N Age.NAs
> 1   Offspring       DZ Female     11       ES 17.78528 0.3535863  16.93 
> 17.6000  17.775 17.9650    18.92   106       0
> 2   Offspring       DZ Female     11    notES 18.13679 0.5555968  16.76 
> 17.8525  18.190 18.4575    19.50   162       0
> 
snipped
> 22     Parent       MZ   Male     11       ES 43.40787 5.3507439  31.28 
> 39.9700  43.440 46.4800    64.65   197       0
> 23     Parent       MZ   Male     11    notES 41.56363 4.6564818  32.10 
> 38.0250  41.390 44.6450    65.29   331       0
> 24     Parent       MZ   Male     17    notES 46.69298 5.2421896  34.45 
> 43.1500  45.890 49.0050    63.80   131       0
> 
> I think that output looks very nice.  One thing that I don't understand is 
> why my function produces %.5f output for every value but the 
> doBy::summaryBy() function uses different formats in different columns.


Look at the code. The formatting you are attributing to `summaryBy` comes from 
`print.data.frame` and `format.data.frame`, which format each column 
separately. Your function returns a plain numeric vector, which is displayed 
by `print.default` using one common format for all elements.
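
A small illustration of the difference, as a sketch only: `descriptivefun` is the function quoted above, while the `age` values are made up. The same nine numbers are printed once as a bare vector and once as a one-row data frame.

```r
# descriptivefun as defined earlier in the thread.
descriptivefun <- function(x, ...) {
  c(mean = mean(x, ...), sd = sd(x, ...), quantile(x, ...),
    N = sum(!is.na(x)), NAs = sum(is.na(x)))
}

age <- c(16.93, 17.6, 17.775, 17.965, 18.92, NA)  # made-up values
v <- descriptivefun(age, na.rm = TRUE)

# A bare numeric vector goes through print.default():
# every element gets the same number of decimals.
print(v)

# As a one-row data frame, each column is formatted on its own,
# so the counts print as plain integers.
df <- data.frame(t(v), check.names = FALSE)
print(df)
```

The check.names = FALSE argument is there only to keep the quantile labels (0%, 25%, ...) from being rewritten into syntactic names.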

-- 
David.

> Compare the above output with this output:
> 
>> descriptivefun(x$Age)
>      mean         sd         0%        25%        50%        75%       100%   
>        N        NAs
>  28.49302   13.29077   16.55000   17.65000   18.23000   42.25500   65.29000 
> 4434.00000    0.00000
> 
> It's not a big deal, but it would be cool if I could tell doBy::summaryBy() 
> how to format the numbers using something like format=c(rep("%.2f",7), 
> rep("%d",2)).
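
Nothing in this thread suggests summaryBy() has such a format argument, but the effect can be approximated afterwards. A minimal sketch, assuming the summary is an ordinary data frame: `format_cols` is a hypothetical helper (not part of doBy or base R) and the table `s` is made up to stand in for summaryBy() output.

```r
# Hypothetical helper: apply one sprintf() format per named column.
# The formatted columns become character vectors.
format_cols <- function(df, fmts) {
  out <- df
  for (nm in names(fmts)) {
    out[[nm]] <- sprintf(fmts[[nm]], df[[nm]])
  }
  out
}

# Made-up summary table standing in for summaryBy() output.
s <- data.frame(
  Sex      = c("Female", "Male"),
  Age.mean = c(17.78528, 17.49114),
  Age.sd   = c(0.3535863, 0.3961757),
  Age.N    = c(106, 332)
)

# "%d" is accepted here because the counts are whole numbers.
fmts <- c(Age.mean = "%.2f", Age.sd = "%.2f", Age.N = "%d")
print(format_cols(s, fmts))
```

Because the result holds strings, it is suited to display only; the unformatted data frame should be kept for any further computation.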
> 
> Mike
> 
> --
> Michael B. Miller, Ph.D.
> Minnesota Center for Twin and Family Research
> Department of Psychology
> University of Minnesota
> 
> 
> 
> On Mon, 5 Aug 2013, David Carlson wrote:
> 
>> This is a bit simpler. The function quantile() labels the output whereas 
>> fivenum() does not:
>> 
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort +
>> ESstatus, data=x,
>>   function(x) c(mean=mean(x), sd=sd(x), quantile(x)))
> 
> 
> On Mon, 5 Aug 2013, Dr. Thomas W. MacFarland wrote:
> 
>> Dear Dr. Miller:
>> 
>> Pasted below is syntax that should mostly answer your recent question to the 
>> R mailing list:
>> 
>> descriptivefun <- function(x, ...){
>> c(m=mean(x, ...), sd=sd(x, ...), l=length(x))
>> }
>> 
>> doBy::summaryBy(Final ~ Method.recode +
>> ComCol.recode,
>> data=Final.table,
>> FUN=descriptivefun,
>> na.rm=TRUE,
>> keep.names=TRUE,
>> order=TRUE)
>> 
>> I go into far more detail on this package::function and similar functions in 
>> my recent text on Twoway ANOVA,
>> http://www.springer.com/statistics/social+sciences+%26+law/book/978-1-4614-2133-7.
>> 
>> Best wishes.
>> 
>> Tom

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
