Re: [R] problem of data manipulation

Matthew Dowle Wed, 20 Jan 2010 05:50:03 -0800

The user wrote in their first post :
> I have a lot of observations in my dataset


Heres one way to do it with a data.table :
         a=data.table(a)
         ans = a[ , list(dt=dt[dt-min(dt)<7]) , by="var1,var2,var3"]
         class(ans$dt) = "Date"

Timings are below comparing the 3 methods. In this example, data.table 
appears to be 28 times faster than plyr, and 24 times faster than sqldf.  I 
excluded the one off time to build the key, since thats realistic, but even 
including that time, data.table is still 16 times faster than plyr (134 / 
(1.03+2.16+4.71)).  With even more rows, it should be even bigger speedups.

> a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = 
> c("c",
"n", "s"), class = "factor"), var2 = c(1L, 1L, 1L, 2L, 2L, 2L), var3 = c(2L,
2L, 2L, 1L, 1L, 1L), dt = structure(c(10592, 10997, 11000, 10998,
11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3",
"dt"), row.names = c(NA, -6L), class = "data.frame")
>
> a = data.frame(lapply(a,function(x)rep(x,each=1000000)))
> dim(a)
[1] 6000000       4
> library(plyr)
> system.time({ans1 <<- ddply(a, c("var1", "var2", "var3"), subset, dt - 
> min(dt) < 7)})
   user  system elapsed
 131.39    3.11  134.80
> library(sqldf)
> system.time({ans2 <<- sqldf("select var1, var2, var3, dt from a, (select 
> var1, var2, var3, min(dt) mindt from a group by var1, var2, var3) 
> using(var1, var2, var3) where dt - mindt < 7")})
   user  system elapsed
 110.26    2.24  113.32
> mapply(identical,ans1,ans2[order(ans2$var1),])
var1 var2 var3   dt
TRUE TRUE TRUE TRUE
>
> library(data.table)
> system.time({adt<<-data.table(a)})
   user  system elapsed
   0.90    0.13    1.03
> system.time({setkey(adt,var1,var2,var3)})
   user  system elapsed
   1.89    0.27    2.16
> system.time({ans3 <<- 
> adt[,list(dt=dt[dt-min(dt)<7]),by="var1,var2,var3"]})
   user  system elapsed
   3.92    0.78    4.71
> class(ans3$dt) = "Date"
> mapply(identical,ans1,ans3)
var1 var2 var3   dt
TRUE TRUE TRUE TRUE

Note that in the documentaton ?"[.data.table" where I say that 'by' is slow, 
I mean relative to how fast it could be.  Its seems, in this specific 
example anyway, and with the code posted so far, to be significantly faster 
than sqldf and plyr.


"Gabor Grothendieck" <ggrothendi...@gmail.com> wrote in message 
news:971536df1001191350x3bd5d982j9879e05453760...@mail.gmail.com...
> Using data frame, a, from the post below this is how it would be done
> in SQL using sqldf.  We join together the original table, a,  with a
> table of minimums (computed by the nested select) and then choose only
> the rows where dt - mindt < 7 (in the where clause).
>
>> library(sqldf)
>> sqldf("select var1, var2, var3, dt from a, (select var1, var2, var3, 
>> min(dt) mindt from a group by var1, var2, var3) using(var1, var2, var3) 
>> where dt - mindt < 7")
>  var1 var2 var3         dt
> 1    s    1    2 1999-01-01
> 2    c    1    2 2000-02-10
> 3    c    1    2 2000-02-13
> 4    n    2    1 2000-02-11
> 5    n    2    1 2000-02-15
>
>
> On Tue, Jan 19, 2010 at 4:22 PM, hadley wickham <h.wick...@gmail.com> 
> wrote:
>> On Mon, Jan 18, 2010 at 1:54 PM, Bert Gunter <gunter.ber...@gene.com> 
>> wrote:
>>> One way to do it:
>>>
>>> 1. Convert your date column to the Date class using the as.Date() 
>>> function.
>>> This allows you to do the necessary arithmetic on the dates below.
>>> dt <- as.Date(a[,4],"%d/%m/%Y")
>>>
>>> 2. Create a factor out of your first three columns whose levels are in 
>>> the
>>> same order as the unique rows. Something likes the following should do 
>>> it:
>>> fac <- do.call(paste,a[,-4])
>>> fac <- factor(fac, levels=unique(fac))
>>>
>>> This allows you to choose the groups of rows whose dates you wish to 
>>> compare
>>> and maintain their correct order in the data frame
>>>
>>> 3. Then use tapply:
>>> a[unlist(tapply(dt,fac,function(x)x-min(x) < 7)),]
>>>
>>> (unlist is needed to remove the list structure and concatenate the 
>>> logical
>>> indices to obtain the subscripting vector).
>>
>> Here's the same basic approach with the plyr package:
>>
>> a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = 
>> c("c",
>> "n", "s"), class = "factor"), var2 = c(1, 1, 1, 2, 2, 2), var3 = c(2,
>> 2, 2, 1, 1, 1), dt = structure(c(10592, 10997, 11000, 10998,
>> 11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3",
>> "dt"), row.names = c(NA, -6L), class = "data.frame")
>>
>> library(plyr)
>> ddply(a, c("var1", "var2", "var3"), subset, dt - min(dt) < 7)
>>
>> Hadley
>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] problem of data manipulation

Reply via email to