Re: [R] Faster Subsetting

Doran, Harold Wed, 28 Sep 2016 09:31:25 -0700

Thank you very much. I don’t know tidyverse, so I’ll look at that now. I did run 
some tests with the data.table package, but it was much slower on my machine; 
see the examples below.


tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

idList <- unique(tmp$id)

system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))

system.time(replicate(500, subset(tmp, id == idList[1])))


library(data.table)

tmp2 <- as.data.table(tmp)     # data.table

system.time(replicate(500, tmp2[which(tmp2$id == idList[1]),]))

system.time(replicate(500, subset(tmp2, id == idList[1])))
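
One possible explanation for the timings above, offered as an assumption rather 
than a measured result: on a 2,000-row table the fixed per-call overhead of 
data.table's `[` method may outweigh any gain, and a logical scan of `id` does 
not use an index. The sketch below shows the keyed lookup and the grouped 
`by =` form that data.table provides. `tmp`, `tmp2` and `idList` are rebuilt as 
in the example above; the seed, the names `one_id` and `by_id`, the column name 
`mean_foo`, and the use of mean() as the per-id function are illustrative only.

```r
library(data.table)

set.seed(1)  # illustrative seed, not in the original post
tmp    <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
tmp2   <- as.data.table(tmp)
idList <- unique(tmp$id)

# Sort by id once and mark it as the key; lookups on the key use binary search
setkey(tmp2, id)

# Keyed lookup: all rows whose id equals idList[1]
one_id <- tmp2[.(idList[1])]

# Grouped computation in a single call instead of one subset per id
# (mean() stands in for whatever functions are applied to each piece)
by_id <- tmp2[, .(mean_foo = mean(foo)), by = id]
```

Whether either form is faster on the real 13-million-row data would need to be 
timed there; the toy table is too small to tell.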

From: Dominik Schneider [mailto:dosc3...@colorado.edu]
Sent: Wednesday, September 28, 2016 12:27 PM
To: Doran, Harold <hdo...@air.org>
Cc: r-help@r-project.org
Subject: Re: [R] Faster Subsetting

I regularly crunch through this amount of data with tidyverse. You can also try 
the data.table package. They are optimized for speed, as long as you have the 
memory.
Dominik

On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <hdo...@air.org> wrote:
I have an extremely large data frame (~13 million rows) whose structure 
resembles that of the object tmp in the reproducible code below. In my real 
data, the variable 'id' may or may not be ordered, but I think that is 
irrelevant.

I have a process that requires subsetting the data by id and then running each 
smaller data frame through a set of functions. One example below uses indexing 
and the other uses an explicit call to subset(); both return the same result, 
but indexing is faster.

The problem is that, in my real data, indexing must scan millions of rows to 
evaluate the condition, and this is expensive and a bottleneck in my code. I'm 
curious whether anyone can recommend an improvement that would be less 
expensive and faster.

Thank you
Harold


tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

idList <- unique(tmp$id)

### Fast, but not fast enough
system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))

### Not fast at all, a big bottleneck
system.time(replicate(500, subset(tmp, id == idList[1])))
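
Both calls above rescan the whole `id` column for every id. One base R 
alternative, sketched here under the assumption that the per-id work only needs 
the rows for that id, is to partition the row indices once with split() and 
then index by position. The seed and the names `idx`, `summarise_piece`, 
`results` and `first_piece` are illustrative and not part of the original post.

```r
set.seed(1)  # illustrative seed, not in the original post
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# One pass over id: a named list holding the row numbers that belong to each id
idx <- split(seq_len(nrow(tmp)), tmp$id)

# Stand-in for the real set of functions applied to each smaller data frame
summarise_piece <- function(d) c(n = nrow(d), mean_foo = mean(d$foo))

# Subset by precomputed row numbers; no comparison against id is repeated
results <- lapply(idx, function(rows) summarise_piece(tmp[rows, , drop = FALSE]))

# A single id can be pulled out by name in the same way
first_piece <- tmp[idx[["1"]], , drop = FALSE]
```

Because split() does not depend on row order, this works whether or not `id` is 
sorted. Its benefit on the full data set would still need to be timed there.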

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



