Hi Steve,

Thanks for testing.

When I run a slightly bigger dataset:
set.seed(1254)
name<- sample(letters,1e7,replace=TRUE)
number<- sample(1:10,1e7,replace=TRUE)

datTest<- data.frame(name,number,stringsAsFactors=FALSE)
library(data.table)

dtTest<- data.table(datTest)

system.time(res3<- dtTest[,list(Sum_Number=sum(number)),by=name])
#   user  system elapsed
#  0.592   0.028   0.623

#Then I tried this:

dtTest1<- data.table(datTest,key=name)
#Error: C stack usage is too close to the limit

Cstack_info()
#      size    current  direction eval_depth 
 #  8388608       7320          1          2 
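
A possible explanation for the error above, offered as an assumption rather than a confirmed diagnosis: `key=name` is unquoted, so R evaluates it to the global character vector `name` (10 million letters) rather than to the column name. `data.table()` expects `key` to be a character vector of column names, so quoting it should avoid the problem. A minimal sketch on a smaller, hypothetical sample (`datSmall` is made up for illustration):

```r
library(data.table)

set.seed(1254)
# Smaller hypothetical sample so the sketch runs quickly
name   <- sample(letters, 1e4, replace = TRUE)
number <- sample(1:10, 1e4, replace = TRUE)
datSmall <- data.frame(name, number, stringsAsFactors = FALSE)

# Quote the column name: key = "name", not key = name
dtKeyed <- data.table(datSmall, key = "name")
key(dtKeyed)

# Equivalent two-step form, as used with setkey() in this message
dtKeyed2 <- data.table(datSmall)
setkey(dtKeyed2, name)
identical(key(dtKeyed), key(dtKeyed2))
```

Note that `setkey()` accepts the bare column name, whereas the `key` argument of `data.table()` needs a string.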

#So, I tried this:

 dtTest<- data.table(datTest)
setkey(dtTest,name)

system.time(res3<- dtTest[,list(Sum_Number=sum(number)),by=name])
#   user  system elapsed
#  0.104   0.040   0.144

# I thought the problem might be related to directly using data.table(...,key='name'), but...

#Previous example with "setkey(..)"

dt1<- data.table(dat2)
keycols=c("Date","Time")
system.time({
setkeyv(dt1,keycols)
ans <- dt1[, .SD[.N], by='Date']
})
#   user  system elapsed
# 40.532   0.008  40.614

dt2<- data.table(dat2)
keycols=c("Date","Time")
system.time({
setkeyv(dt2,keycols)
i1 <- dt2[,which.max(Time),by=Date][[2]]
i2<- dt2[,.N,by=Date][[2]]
res<-dt2[i1+cumsum(i2)-i2,]
})
#   user  system elapsed
#  0.256   0.000   0.258  # still not the same as the timing you got
resNew<- as.data.frame(res)

system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,])

 row.names(resNew)<- row.names(res1)
 attr(resNew,"row.names")<- attr(res1,"row.names")
 identical(resNew,res1)
#[1] TRUE
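
To make the index arithmetic above easier to follow, here is a small sketch on a hypothetical toy table (`toy`, with made-up `Date`, `Time`, and `Value` columns; it is not the `dat2` used in the timings). It assumes the rows are sorted by `Date` and `Time`, as they are after `setkeyv()`, and checks that the approaches in this thread pick the same rows. The `.I[.N]` variant is an extra idiom that was not used above.

```r
library(data.table)

# Hypothetical toy data: 3 dates with 2, 3, and 1 rows
toy <- data.frame(
  Date  = c("2013-08-01", "2013-08-01",
            "2013-08-02", "2013-08-02", "2013-08-02",
            "2013-08-03"),
  Time  = c(1, 2, 1, 2, 3, 1),
  Value = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)
dt <- data.table(toy)
setkeyv(dt, c("Date", "Time"))

# 1. Within-group position of the max Time, shifted by the
#    number of rows in all earlier groups (cumsum(n) - n)
i1 <- dt[, which.max(Time), by = Date][[2]]
i2 <- dt[, .N, by = Date][[2]]
rowsOffset <- i1 + cumsum(i2) - i2

# 2. Run-length encoding: cumulative run lengths are the last rows
rowsRle <- cumsum(rle(toy$Date)$lengths)

# 3. A row is "last" when the group code changes on the next row;
#    the appended 1 marks the final row of the table
rowsDiff <- which(c(diff(as.numeric(as.factor(toy$Date))), 1) > 0)

# 4. Last row per group via .SD[.N], and its row number via .I[.N]
lastSD <- dt[, .SD[.N], by = "Date"]
rowsI  <- dt[, .I[.N], by = Date]$V1

rowsOffset
```

On this toy table each approach selects rows 2, 5, and 6. The offset, `rle`, and `diff` versions only compute row numbers, which may be part of why they avoid the per-group overhead of `.SD[.N]` seen in the timings.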

A.K.

----- Original Message -----
From: Steve Lianoglou <lianoglou.st...@gene.com>
To: arun <smartpink...@yahoo.com>
Cc: R help <r-help@r-project.org>
Sent: Thursday, August 15, 2013 4:48 PM
Subject: Re: [R] How to extract last value in each group

Hi,

On Thu, Aug 15, 2013 at 1:38 PM, arun <smartpink...@yahoo.com> wrote:
> I tried it again on a fresh start using the data.table alone:
> Now.
>
>  dt1 <- data.table(dat2, key=c('Date', 'Time'))
>  system.time(ans <- dt1[, .SD[.N], by='Date'])
> #   user  system elapsed
> # 40.908   0.000  40.981
> #Then tried:
> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
>  #  user  system elapsed
>  # 0.148   0.000   0.151  #same time as before

Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
specs to your machine):

R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
R> system.time(ans <- dt1[, .SD[.N], by='Date'])
   user  system elapsed
  0.064   0.009   0.073

R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
   user  system elapsed
  0.148   0.016   0.165

On one of our compute servers, running who knows what processor on some
version of Linux (which shouldn't really matter, as we're comparing
times relative to each other here):

R> system.time(ans <- dt1[, .SD[.N], by='Date'])
   user  system elapsed
  0.160   0.012   0.170

R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
   user  system elapsed
  0.292   0.004   0.294

There's got to be some other explanation for the heavily degraded
performance you're observing... our R & data.table versions also
match.

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech


