On 09/28/2016 02:53 PM, Hervé Pagès wrote:
Hi,

I'm surprised nobody suggested split(). Splitting the data.frame
upfront is faster than repeatedly subsetting it:

  tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
  idList <- unique(tmp$id)

  system.time(for (i in idList) tmp[which(tmp$id == i),])
  #   user  system elapsed
  # 16.286   0.000  16.305

  system.time(split(tmp, tmp$id))
  #   user  system elapsed
  #  5.637   0.004   5.647
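Once the data is split, the per-id processing can run over the resulting list. A minimal sketch, on a smaller data set so it runs quickly; summarize_piece() is a hypothetical stand-in for whatever set of functions each sub-data.frame has to go through:

```r
# Smaller data than the timing example above, so the sketch runs quickly.
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# Split once; each list element is the sub-data.frame for one id,
# named by the id value ("1", "2", ...).
pieces <- split(tmp, tmp$id)

# Hypothetical per-id processing: here just the mean of 'foo'.
summarize_piece <- function(d) mean(d$foo)

# Apply the processing to every piece; the result is a named numeric vector.
res <- vapply(pieces, summarize_piece, numeric(1))
head(res)
```

vapply() is used instead of sapply() so that a piece returning something other than a single number fails loudly.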

An odd speed-up is to provide (non-sequential) row names, e.g.,

> system.time(split(tmp, tmp$id))
   user  system elapsed
  4.472   0.648   5.122
> row.names(tmp) = rev(seq_len(nrow(tmp)))
> system.time(split(tmp, tmp$id))
   user  system elapsed
  0.588   0.000   0.587

for reasons explained here:

http://stackoverflow.com/questions/39545400/why-is-split-inefficient-on-large-data-frames-with-many-groups/39548316#39548316
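A related variant, offered only as a sketch and not benchmarked here, is to split the row indices (a plain integer vector) instead of the data.frame itself, and to subset inside the loop. It still pays the per-group subsetting cost, but it avoids holding every sub-data.frame in memory at once. process_piece() is a hypothetical placeholder for the real per-id work:

```r
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# Split the row numbers by id; this is a split of an integer vector,
# not of the data.frame.
idx <- split(seq_len(nrow(tmp)), tmp$id)

# Hypothetical per-id processing: here just the sum of 'foo'.
process_piece <- function(d) sum(d$foo)

# Subset on demand, one id at a time, and process each piece.
res2 <- vapply(idx,
               function(i) process_piece(tmp[i, , drop = FALSE]),
               numeric(1))
head(res2)
```

Whether this helps on the real data is something to check with system.time() on a representative sample.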

Martin



Cheers,
H.

On 09/28/2016 09:09 AM, Doran, Harold wrote:
I have an extremely large data frame (~13 million rows) that resembles
the structure of the object tmp in the reproducible code below. In my
real data, the variable 'id' may or may not be ordered, but I think
that is irrelevant.

I have a process that requires subsetting the data by id and then
running each smaller data frame through a set of functions. One
example below uses indexing and the other uses an explicit call to
subset(). Both return the same result, but indexing is faster.

The problem is that in my real data, indexing must scan millions of
rows to evaluate the condition. This is expensive and a bottleneck
in my code. Can anyone recommend an approach that would be less
expensive and faster?

Thank you
Harold


tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

idList <- unique(tmp$id)

### Fast, but not fast enough
system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))

### Not fast at all, a big bottleneck
system.time(replicate(500, subset(tmp, id == idList[1])))

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.