Hi,

I'm surprised nobody suggested split(). Splitting the data.frame
upfront is faster than repeatedly subsetting it:

  tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
  idList <- unique(tmp$id)

  system.time(for (i in idList) tmp[which(tmp$id == i),])
  #   user  system elapsed
  # 16.286   0.000  16.305

  system.time(split(tmp, tmp$id))
  #   user  system elapsed
  #  5.637   0.004   5.647
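
Once the data.frame is split, each piece can be passed to your functions
directly, without touching tmp$id again. Below is a minimal sketch of that,
not a definitive implementation: summarize_piece() is a hypothetical stand-in
for your real set of functions, and the data is smaller than in the timings
above so that it runs quickly.

```r
# Smaller illustrative data than in the timings above.
tmp <- data.frame(id = rep(1:2000, each = 10), foo = rnorm(20000))

# Split once; 'pieces' is a named list with one data.frame per id.
pieces <- split(tmp, tmp$id)

# Hypothetical per-id computation, standing in for the real functions.
summarize_piece <- function(d) c(n = nrow(d), mean_foo = mean(d$foo))

# Apply it to every piece; the result is a 2-row matrix, one column per id.
res <- vapply(pieces, summarize_piece, numeric(2))

# A single id can be looked up by name instead of re-scanning tmp$id.
piece42 <- pieces[["42"]]
```

The names of the list returned by split() are the id values converted to
character, which is why the lookup uses "42" rather than 42.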

Cheers,
H.

On 09/28/2016 09:09 AM, Doran, Harold wrote:
I have an extremely large data frame (~13 million rows) that resembles the 
structure of the object tmp below in the reproducible code. In my real data, 
the variable 'id' may or may not be ordered, but I think that is irrelevant.

I have a process that requires subsetting the data by id and then running each 
smaller data frame through a set of functions. One example below uses indexing 
and the other uses an explicit call to subset(). Both return the same result, 
but indexing is faster.

The problem is that, in my real data, indexing must scan millions of rows to 
evaluate the condition for every id, which is expensive and a bottleneck in my 
code. Can anyone recommend an approach that would be less expensive and 
faster?

Thank you
Harold


tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

idList <- unique(tmp$id)

### Fast, but not fast enough
system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))

### Not fast at all, a big bottleneck
system.time(replicate(500, subset(tmp, id == idList[1])))

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
