Hi Steve,
Thanks for testing.
When I run a slightly bigger dataset:
set.seed(1254)
name<- sample(letters,1e7,replace=TRUE)
number<- sample(1:10,1e7,replace=TRUE)
datTest<- data.frame(name,number,stringsAsFactors=FALSE)
library(data.table)
dtTest<- data.table(datTest)
system.time(res3<- dtTest[,list(Sum_Number=sum(number)),by=name])
#user system elapsed
# 0.592 0.028 0.623
#Then I tried this:
dtTest1<- data.table(datTest,key=name)
#Error: C stack usage is too close to the limit
Cstack_info()
# size current direction eval_depth
# 8388608 7320 1 2
#So, I tried this:
dtTest<- data.table(datTest)
setkey(dtTest,name)
system.time(res3<- dtTest[,list(Sum_Number=sum(number)),by=name])
# user system elapsed
# 0.104 0.040 0.144
# At first I thought the problem was with data.table(..., key=) itself,
# but the error likely comes from passing the bare vector (key = name)
# instead of the quoted column name, key = "name".
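For what it's worth, key= expects column names as a character vector, so key = name hands data.table the whole 1e7-element vector rather than a column name, which may be what blows the C stack. A minimal sketch on a smaller data set (assuming the data.table package is installed):

```r
library(data.table)
set.seed(1254)
d <- data.frame(name = sample(letters, 1e5, replace = TRUE),
                number = sample(1:10, 1e5, replace = TRUE),
                stringsAsFactors = FALSE)
# key= quoted: the column name, not the column's contents
dtSmall <- data.table(d, key = "name")
key(dtSmall)
# "name"
```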
#Previous example with "setkey(..)"
dt1<- data.table(dat2)
keycols=c("Date","Time")
system.time({
setkeyv(dt1,keycols)
ans <- dt1[, .SD[.N], by='Date']
})
# user system elapsed
# 40.532 0.008 40.614
dt2<- data.table(dat2)
keycols=c("Date","Time")
system.time({
setkeyv(dt2,keycols)
i1 <- dt2[,which.max(Time),by=Date][[2]]
i2<- dt2[,.N,by=Date][[2]]
res<-dt2[i1+cumsum(i2)-i2,]
})
# user system elapsed
# 0.256 0.000 0.258  # still not the same timing as you got.
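To see why i1 + cumsum(i2) - i2 picks the right rows: i1 is each group's within-group position of the max Time, i2 the group sizes, and cumsum(i2) - i2 the number of rows preceding each group. A toy sketch with made-up Date/Time values (assuming data.table is installed; toy names chosen to avoid clobbering the objects above):

```r
library(data.table)
toy <- data.table(Date = c("d1", "d1", "d2", "d2", "d2"),
                  Time = c(1, 3, 2, 5, 4))
setkeyv(toy, c("Date", "Time"))
j1 <- toy[, which.max(Time), by = Date][[2]]  # within-group position of max Time
j2 <- toy[, .N, by = Date][[2]]               # group sizes
# cumsum(j2) - j2 = rows preceding each group, so adding j1 gives the
# global row index of each group's max-Time row
toyRes <- toy[j1 + cumsum(j2) - j2, ]
toyRes$Time
# 3 5  (the max Time per Date)
```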
resNew<- as.data.frame(res)
system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,])
row.names(resNew)<- row.names(res1)
attr(resNew,"row.names")<- attr(res1,"row.names")
identical(resNew,res1)
#[1] TRUE
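The base-R one-liner above uses the diff-over-factor-codes idiom: on Date-sorted data, a positive difference in the integer codes marks the last row of each Date run, and the appended 1 keeps the final row. A small sketch with made-up values:

```r
dat <- data.frame(Date = c("d1", "d1", "d2", "d2", "d2"),
                  Time = c(1, 2, 1, 2, 3))
# factor codes: 1 1 2 2 2; diff: 0 1 0 0; append 1 -> TRUE at rows 2 and 5
lastRows <- dat[c(diff(as.numeric(as.factor(dat$Date))), 1) > 0, ]
lastRows$Time
# 2 3  (the last Time within each Date run)
```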
A.K.
----- Original Message -----
From: Steve Lianoglou <[email protected]>
To: arun <[email protected]>
Cc: R help <[email protected]>
Sent: Thursday, August 15, 2013 4:48 PM
Subject: Re: [R] How to extract last value in each group
Hi,
On Thu, Aug 15, 2013 at 1:38 PM, arun <[email protected]> wrote:
> I tried it again on a fresh start using the data.table alone:
> Now.
>
> dt1 <- data.table(dat2, key=c('Date', 'Time'))
> system.time(ans <- dt1[, .SD[.N], by='Date'])
> # user system elapsed
> # 40.908 0.000 40.981
> #Then tried:
> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
> # user system elapsed
> # 0.148 0.000 0.151 #same time as before
Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
specs to your machine):
R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
R> system.time(ans <- dt1[, .SD[.N], by='Date'])
user system elapsed
0.064 0.009 0.073
R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
user system elapsed
0.148 0.016 0.165
On one of our compute servers, running who knows what processor on some
version of Linux; it shouldn't really matter, though, since we're
comparing relative times here:
R> system.time(ans <- dt1[, .SD[.N], by='Date'])
user system elapsed
0.160 0.012 0.170
R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
user system elapsed
0.292 0.004 0.294
There's got to be some other explanation for the heavily degraded
performance you're observing... our R & data.table versions also
match.
-steve
--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.