Re: [R] How to do aggregate operations with non-scalar functions

2005-04-07 Thread Gabor Grothendieck
On Apr 7, 2005 1:18 AM, Itay Furman [EMAIL PROTECTED] wrote:
 
 On Tue, 5 Apr 2005, Gabor Grothendieck wrote:
 
  On Apr 5, 2005 6:59 PM, Itay Furman [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I have a data set, the structure of which is something like this:
 
  a <- rep(c("a", "b"), c(6,6))
  x <- rep(c("x", "y", "z"), c(4,4,4))
  df <- data.frame(a=a, x=x, r=rnorm(12))
 
  The true data set has 1 million rows. The factors a and x
  have about 70 levels each; combined together they subset 'df'
  into ~900 data frames.
  For each such subset I'd like to compute various statistics
  including quantiles, but I can't find an efficient way of
 
 [snip]
 
  I would like to end up with a data frame like this:
 
    a x         0%        25%
  1 a x -0.7727268  0.1693188
  2 a y -0.3410671  0.1566322
  3 b y -0.2914710 -0.2677410
  4 b z -0.8502875 -0.6505710
 
 [snip]
 
  One can use
 
do.call(rbind, by(df, list(a = a, x = x), f))
 
  where f is the appropriate function.
 
  In this case f can be described in terms of df.quantile which
  is like quantile except it returns a one row data frame:
 
df.quantile <- function(x,p)
    as.data.frame(t(data.matrix(quantile(x, p))))

f <- function(df, p = c(0.25, 0.5))
    cbind(df[1,1:2], df.quantile(df[,"r"], p))
 
 
 Thanks!  Just what I wanted.
 
 A minor point is that for some reason the row numbers in the
 final data frame are not sequential (see below -- this is not a
 consequence of my changes).

These are the original row numbers of the first row of
each combo of a and x.  If z is the result of do.call
you can always do this:   row.names(z) <- 1:nrow(z)
if this is needed.
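
For readers following along, the pieces quoted above can be put together as one self-contained sketch. It assumes the toy data from the original post; the set.seed() call is an addition made here only so the run is repeatable, and z is just the name used above for the result of do.call().

```r
# Toy data from the original post (set.seed() added for reproducibility)
set.seed(1)
a <- rep(c("a", "b"), c(6, 6))
x <- rep(c("x", "y", "z"), c(4, 4, 4))
df <- data.frame(a = a, x = x, r = rnorm(12))

# Like quantile(), but returns a one-row data frame
df.quantile <- function(x, p)
  as.data.frame(t(data.matrix(quantile(x, p))))

# Per-subset function: key columns from the first row, then the quantiles
f <- function(df, p = c(0.25, 0.5))
  cbind(df[1, 1:2], df.quantile(df[, "r"], p))

# by() yields NULL for combinations that do not occur; rbind skips them
z <- do.call(rbind, by(df, list(a = a, x = x), f))

# Replace the inherited row names with 1..n
row.names(z) <- 1:nrow(z)
z
```

The row names before the reset are those of the first row of each subset, which is why they are not sequential.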

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] How to do aggregate operations with non-scalar functions

2005-04-06 Thread Itay Furman
On Tue, 5 Apr 2005, Gabor Grothendieck wrote:
On Apr 5, 2005 6:59 PM, Itay Furman [EMAIL PROTECTED] wrote:
Hi,
I have a data set, the structure of which is something like this:
a <- rep(c("a", "b"), c(6,6))
x <- rep(c("x", "y", "z"), c(4,4,4))
df <- data.frame(a=a, x=x, r=rnorm(12))
The true data set has 1 million rows. The factors a and x
have about 70 levels each; combined together they subset 'df'
into ~900 data frames.
For each such subset I'd like to compute various statistics
including quantiles, but I can't find an efficient way of
[snip]
I would like to end up with a data frame like this:
  a x         0%        25%
1 a x -0.7727268  0.1693188
2 a y -0.3410671  0.1566322
3 b y -0.2914710 -0.2677410
4 b z -0.8502875 -0.6505710
[snip]
One can use
do.call(rbind, by(df, list(a = a, x = x), f))
where f is the appropriate function.
In this case f can be described in terms of df.quantile which
is like quantile except it returns a one row data frame:
df.quantile <- function(x,p)
    as.data.frame(t(data.matrix(quantile(x, p))))
f <- function(df, p = c(0.25, 0.5))
    cbind(df[1,1:2], df.quantile(df[,"r"], p))
Thanks!  Just what I wanted.
A minor point is that for some reason the row numbers in the 
final data frame are not sequential (see below -- this is not a 
consequence of my changes).

Actually, seeing your code I became greedy and decided to 
extract more summary statistics in one blow like this:

df.summary <- function(x, qtils=(0:4)/4)
    cbind(data.frame(mean=mean(x), var=var(x),
                     length=length(x)),
          as.data.frame(t(data.matrix(quantile(x, qtils)))))
f <- function(x, qtils=(0:4)/4)
    cbind(x[1,1:2], df.summary(x[,"r"], qtils))
do.call(rbind, by(df, list(a = a, x = x), f))
  a x       mean         var length         0%        25%        50%
1 a x  0.2901207 0.522191469      4 -0.7727268  0.1693188  0.5523356
5 a y  0.6543314 1.981636402      2 -0.3410671  0.1566322  0.6543314
7 b y -0.2440109 0.004504928      2 -0.2914710 -0.2677410 -0.2440109
9 b z  0.4523763 1.841469995      4 -0.8502875 -0.6505710  0.4717093
         75%       100%
1  0.6731375  0.8285385
5  1.1520307  1.6497299
7 -0.2202808 -0.1965508
9  1.5746565  1.7163741
What remains a puzzle to me is why R has a native subsetting 
function that returns a scalar per subset [aggregate()],  another 
one that returns a list [by()],  but no function that is able to 
return a vector per subset.  Is there less demand for such an
operation (like extracting summary statistics in one blow)?  Is
it less general?  Or technically more difficult to achieve?
I'm just curious.
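
A note for later readers: in more recent versions of R, aggregate() does accept a function that returns a vector, and stores the per-subset results as a single matrix-valued column. The sketch below assumes such a version of R and the toy data from this thread; set.seed() and the names res and flat are additions for illustration.

```r
set.seed(1)
df <- data.frame(a = rep(c("a", "b"), c(6, 6)),
                 x = rep(c("x", "y", "z"), c(4, 4, 4)),
                 r = rnorm(12))

# FUN returns five quantiles per subset; aggregate() keeps them
# together as one matrix-valued column named "r"
res <- aggregate(r ~ a + x, data = df, FUN = quantile, probs = (0:4) / 4)

# Spread the matrix column into ordinary columns next to the keys
flat <- cbind(res[c("a", "x")], res$r)
flat
```

Only combinations of a and x that actually occur in the data appear in the result.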

Itay

[EMAIL PROTECTED]  /  +1 (206) 543 9040  /  U of Washington


Re: [R] How to do aggregate operations with non-scalar functions

2005-04-06 Thread Itay Furman

On Wed, 6 Apr 2005, Rich FitzJohn wrote:
[snip]
## This does the hard work of calculating the statistics over your
## combinations, and over the values in `p'
y <- lapply(p, function(y)
   tapply(df$r, list(a=a, x=x), quantile, probs=y))
Rich, thank you for your reply.  Gabor G has proposed a different 
solution that seems to me to be easier to maintain and scale up.
Please see my follow up to his reply.

Your solution introduced me to some R functions I was not 
familiar with: expand.grid(), colSums(), and names().  Thanks for 
that, too.

## Then, we need to work out what combinations of a & x are possible:
## these are the header columns.  aggregate() does this in a much more
## complicated way, which may handle more difficult cases than this
## (e.g. if there are lots of missing values points, or something).
vars <- expand.grid(dimnames(y[[1]]))
In Gabor G's solution this is magically done (I think!) by 
do.call().

Thanks,
Itay

[EMAIL PROTECTED]  /  +1 (206) 543 9040  /  U of Washington


Re: [R] How to do aggregate operations with non-scalar functions

2005-04-06 Thread Itay Furman
On Wed, 6 Apr 2005 [EMAIL PROTECTED] wrote:
Here is a method that I use in this situation.  I work with the indices of
the rows so that copies are not made and it is fast.
Result <- lapply(split(seq(nrow(df)), df$a), function(.a){
  # partition on the first variable
  lapply(split(.a, df$z[.a]), function(.z){
    # partition on the second variable -- notice the subsetting
    c(quantile(df$r[.z]), ...anything else you want to compute)
  })
})
Result <- do.call('rbind', Result)  # create a matrix - now you have your results
Jim
Jim,
Thank you for your reply.  For some reason, when I try your 
proposed solution I get:

Error in sort(unique.default(x), na.last = TRUE) :
`x' must be atomic
Eventually, I used the solution proposed by Gabor G in this 
thread.  One advantage of his solution, I believe, is that it is 
easier to scale up;  for example, to the case where 3 factors 
together subset the data frame.
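
One possible cause of the error above, offered as a guess rather than a diagnosis: the example data frame has columns a, x and r but no column z, so df$z is NULL and split() has nothing to group by. Below is a sketch of the same index-based idea with the second factor taken from df$x. It is not the nested lapply() from the quoted message; it uses a single split() on both factors, and set.seed() and the names idx and stats are additions for illustration.

```r
set.seed(1)
df <- data.frame(a = rep(c("a", "b"), c(6, 6)),
                 x = rep(c("x", "y", "z"), c(4, 4, 4)),
                 r = rnorm(12))

# Split row indices (not the data) by both factors at once;
# drop = TRUE discards combinations that never occur
idx <- split(seq_len(nrow(df)), list(df$a, df$x), drop = TRUE)

# One named numeric vector of statistics per subset
stats <- lapply(idx, function(i) {
  r <- df$r[i]
  c(mean = mean(r), quantile(r, c(0.25, 0.5)))
})

# Stack into a numeric matrix, one row per subset ("a.x", "a.y", ...)
Result <- do.call(rbind, stats)
Result
```

Adding a third factor only means adding it to the list passed to split().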

Regards,
Itay

[EMAIL PROTECTED]  /  +1 (206) 543 9040  /  U of Washington


Re: [R] How to do aggregate operations with non-scalar functions

2005-04-05 Thread Rich FitzJohn
Hi Itay,

Not sure if by() can do it directly, but this does it from first
principles, using lapply() and tapply() (which aggregate uses
internally).  It would be reasonably straightforward to wrap this up
in a function.

a <- rep(c("a", "b"), c(6,6))
x <- rep(c("x", "y", "z"), c(4,4,4))
df <- data.frame(a=a, x=x, r=rnorm(12))
## Probabilities for quantile
p <- c(.25, .5, .75)

## This does the hard work of calculating the statistics over your
## combinations, and over the values in `p'
y <- lapply(p, function(y)
            tapply(df$r, list(a=a, x=x), quantile, probs=y))

## Then, we need to work out what combinations of a & x are possible:
## these are the header columns.  aggregate() does this in a much more
## complicated way, which may handle more difficult cases than this
## (e.g. if there are lots of missing values points, or something).
vars <- expand.grid(dimnames(y[[1]]))

## Finish up by converting `y' into a true data.frame, and omitting
## all the cases where all the values in `y' are NA: these are
## combinations of a and x that we did not encounter.
y <- as.data.frame(lapply(y, as.vector))
names(y) <- paste(p, "%", sep="")
i <- colSums(apply(y, 1, is.na)) != ncol(y)
y <- cbind(vars, y)[i,]
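
As mentioned at the top, these steps can be wrapped up in a function. What follows is only a sketch: quantiles_by() is a hypothetical name, the columns are labelled with p * 100 (so "25%" rather than "0.25%"), and the NA filter uses rowSums() instead of the apply() call above. set.seed() is added for repeatability.

```r
# Hypothetical wrapper around the tapply()/expand.grid() steps
quantiles_by <- function(r, groups, p = c(0.25, 0.5, 0.75)) {
  # One table of per-cell quantiles for each probability in p
  y <- lapply(p, function(prob) tapply(r, groups, quantile, probs = prob))
  # All combinations of the grouping levels, in the same order as as.vector()
  vars <- expand.grid(dimnames(y[[1]]))
  # One column per probability
  m <- do.call(cbind, lapply(y, as.vector))
  colnames(m) <- paste(p * 100, "%", sep = "")
  # Keep only combinations that were actually encountered
  keep <- rowSums(!is.na(m)) > 0
  data.frame(vars, m, check.names = FALSE)[keep, ]
}

set.seed(1)
df <- data.frame(a = rep(c("a", "b"), c(6, 6)),
                 x = rep(c("x", "y", "z"), c(4, 4, 4)),
                 r = rnorm(12))
res <- quantiles_by(df$r, list(a = df$a, x = df$x))
res
```

The groups argument is passed straight to tapply(), so it takes the same named list of factors used above.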

Cheers,
Rich

On Apr 6, 2005 10:59 AM, Itay Furman [EMAIL PROTECTED] wrote:
 
 Hi,
 
 I have a data set, the structure of which is something like this:
 
  a <- rep(c("a", "b"), c(6,6))
  x <- rep(c("x", "y", "z"), c(4,4,4))
  df <- data.frame(a=a, x=x, r=rnorm(12))
 
 The true data set has 1 million rows. The factors a and x
 have about 70 levels each; combined together they subset 'df'
 into ~900 data frames.
 For each such subset I'd like to compute various statistics
 including quantiles, but I can't find an efficient way of
 doing this.  Aggregate() gives me the desired structure -
 namely, one row per subset - but I can use it only to compute
 a single quantile.
 
  aggregate(df[,"r"], list(a=a, x=x), quantile, probs=0.25)
   a x          x
 1 a x  0.1693188
 2 a y  0.1566322
 3 b y -0.2677410
 4 b z -0.6505710
 
 With by() I could compute several quantiles per subset at
 each shot, but the structure of the output is not
 convenient for further analysis and visualization.
 
  by(df[,"r"], list(a=a, x=x), quantile, probs=c(0, 0.25))
 a: a
 x: x
         0%        25%
 -0.7727268  0.1693188
 --
 a: b
 x: x
 NULL
 --
 
 [snip]
 
 I would like to end up with a data frame like this:
 
   a x         0%        25%
 1 a x -0.7727268  0.1693188
 2 a y -0.3410671  0.1566322
 3 b y -0.2914710 -0.2677410
 4 b z -0.8502875 -0.6505710
 
 I checked sweep() and apply() and didn't see how to harness
 them for that purpose.
 
 So, is there a simple way to convert the object returned
 by by() into a data.frame?
 Or, is there a better way to go with this?
 Finally, if I should roll my own coercion function: any tips?
 
Thank you very much in advance,
Itay
 
 
 [EMAIL PROTECTED]  /  +1 (206) 543 9040  /  U of Washington
 
 


-- 
Rich FitzJohn
rich.fitzjohn at gmail.com   |http://homepages.paradise.net.nz/richa183
  You are in a maze of twisty little functions, all alike
