Just in case anyone is still interested, here are some
comparisons of the time it takes to compute grouped medians
via sapply(split(x,group),median) and gm(x,group), which
uses the trick used by rle() to find the first and last
entries in each group.
Which method is fastest depends on the nature of the data.
Another application of that technique can be used to quickly compute
medians by groups:

gm <- function(x, group){ # medians by group: sapply(split(x,group),median)
   o <- order(group, x)
   group <- group[o]
   x <- x[o]
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))
   last <- which(c(changes, TRUE))
   # x is sorted within each group's run, so each group's median is the
   # mean of the two middle elements of its run
   (x[floor((first+last)/2)] + x[ceiling((first+last)/2)])/2
}
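A self-contained check (the data and seed below are made up, not from the post) that the first/last-index trick agrees with sapply(split(x,group),median):

```r
# medians by group via sorted first/last run indices, as described above
gm <- function(x, group) {
  o <- order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  # each group occupies x[first[i]:last[i]], already sorted,
  # so its median is the mean of the run's two middle elements
  (x[floor((first + last) / 2)] + x[ceiling((first + last) / 2)]) / 2
}

set.seed(1)
x <- rnorm(1000)
g <- sample(1:10, 1000, replace = TRUE)
stopifnot(isTRUE(all.equal(unname(gm(x, g)),
                           unname(sapply(split(x, g), median)))))
```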
Arg, the 'sapply(...)' in the function was in the initial comment,

gm <- function(x, group){ # medians by group: sapply(split(x,group),median)

but someone's mailer put a newline before the sapply,

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)

so it got executed as the first statement of the function body.
-Original Message-
From: hadley wickham [mailto:h.wick...@gmail.com]
Sent: Sunday, January 04, 2009 8:56 PM
To: William Dunlap
Cc: gallon...@gmail.com; R help
Subject: Re: [R] the first and last observation for each subject
library(plyr)
# ddply is for splitting up data
Here are some more timings of Bill's function. Although in this
example sapply has a clear performance advantage for smaller numbers
of groups (k), gm is substantially faster for k > 1000:
gm <- function(x, group){ # medians by group:
o <- order(group, x)
group <- group[o]
x <- x[o]
whoops -- I left the group size unchanged so k became greater than
the length of the group vector. When I increase the size to 1e7,
sapply is faster until it gets to k = 1e6.
warning: this takes a while (particularly on my machine, which seems to
be using just 1 of its 2 CPUs)
for(k in
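The loop above is truncated; a sketch of what such a timing loop might look like (n, the k values, and the harness are my assumptions, not the original code):

```r
# Time sapply(split(...), median) against the sorted first/last trick
# for a varying number of groups k. Sizes here are assumptions.
gm <- function(x, group) { # medians by group, via sorted run indices
  o <- order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  (x[floor((first + last) / 2)] + x[ceiling((first + last) / 2)]) / 2
}

set.seed(2)
n <- 1e5
x <- rnorm(n)
for (k in c(10, 100, 1000)) {
  group <- sample(seq_len(k), n, replace = TRUE)
  t1 <- system.time(r1 <- sapply(split(x, group), median))["elapsed"]
  t2 <- system.time(r2 <- gm(x, group))["elapsed"]
  stopifnot(isTRUE(all.equal(unname(r1), unname(r2))))  # same answers
  cat(sprintf("k=%d sapply/split %.3fs gm %.3fs\n", k, t1, t2))
}
```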
[R] the first and last observation for each subject
hadley wickham h.wickham at gmail.com
Fri Jan 2 14:52:42 CET 2009
On Fri, Jan 2, 2009 at 3:20 AM, gallon li gallon.li at gmail.com
wrote:
I have the following data
ID x y time
1 10 20 0
1 10 30 1
1 10 40 2
2 12 23 0
2 12 25 1
2 12 28 2
2 12 38 3
3 5 10 0
3 5 15 2
.
x is time invariant, ID is the subject id number, y is changing over time.
I want to find out the difference between the first and last observed y
value for each subject.
Hello,
First, order your data by ID and time.
The columns you want in your output dataframe are then
unique(ID),
tapply( x, ID, function( z ) z[ 1 ] )
and
tapply( y, ID, function( z ) z[ length( z ) ] - z[ 1 ] )
Best regards,
Carlos J. Gil Bellosta
http://www.datanalytics.com
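Put together on the data from the question (the data.frame construction is mine), the recipe runs as:

```r
# The question's data, then the order + unique/tapply recipe described above.
mydata <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)
mydata <- mydata[order(mydata$ID, mydata$time), ]  # first, order by ID and time
out <- with(mydata, data.frame(
  ID = unique(ID),
  x  = tapply(x, ID, function(z) z[1]),
  y  = tapply(y, ID, function(z) z[length(z)] - z[1])
))
out  # y differences by subject: 20, 15, 5
```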
Try this:
Lines <- "ID x y time
+ 1 10 20 0
+ 1 10 30 1
+ 1 10 40 2
+ 2 12 23 0
+ 2 12 25 1
+ 2 12 28 2
+ 2 12 38 3
+ 3 5 10 0
+ 3 5 15 2"
DF <- read.table(textConnection(Lines), header = TRUE)
aggregate(DF[3], DF[1:2], function(x) tail(x, 1) - head(x, 1))
ID x y
1 3 5 5
2 1 10 20
3 2 12 15
Dear Gallon,
Assuming that your data is called mydata, something like this should do
the job:
newdf <- data.frame(
ID = unique(mydata$ID),
x = unique(mydata$x),
y = with(mydata,tapply(y,ID,function(m) tail(m,1)-head(m,1)))
)
newdf
HTH,
Jorge
On Fri, Jan 2, 2009 at 3:20 AM, gallon li gallon...@gmail.com wrote:
Here is a fast approach using the Hmisc package's summarize function.
g <- function(w) {
+ time <- w[,'time']; y <- w[,'y']
+ c(y[which.min(time)], y[which.max(time)])}
with(DF, summarize(DF, ID, g, stat.name=c('first','last')))
  ID first last
1  1    20   40
2  2    23   38
3  3    10   15
I think there's a pretty simple solution here, though probably not the
most efficient:
t(sapply(split(a,a$ID),
function(q) with(q,c(ID=unique(ID),x=unique(x),y=max(y)-min(y)))))
Using 'unique' instead of min or [[1]] has the advantage that if x is
in fact not time-invariant, this gives an error rather than silently
picking an arbitrary value.
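Filled out with the thread's data (the data-frame construction is mine), the split/sapply approach runs as:

```r
# The question's data, then the split/sapply one-liner from above.
a <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)
res <- t(sapply(split(a, a$ID),
                function(q) with(q, c(ID = unique(ID), x = unique(x),
                                      y = max(y) - min(y)))))
res  # y column by subject: 20, 15, 5
```

Note that max(y)-min(y) coincides with the last-minus-first difference here only because y never decreases within a subject in this data.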
Hello,
Is it truly
y=max(y)-min(y)
what you want below?
Best regards,
Carlos J. Gil Bellosta
http://www.datanalytics.com
On Fri, 2009-01-02 at 13:16 -0500, Stavros Macrakis wrote:
I think there's a pretty simple solution here, though probably not the
most efficient: