On Thu, 17 Sep 2015, Berend Hasselman wrote:


On 17 Sep 2015, at 01:42, Dénes Tóth <toth.de...@ttk.mta.hu> wrote:



On 09/16/2015 04:41 PM, Bert Gunter wrote:
Yes! Chuck's use of mapply is exactly the split/combine strategy I was
looking for. In retrospect, exactly how one should think about it.
Many thanks to all for a constructive discussion.

-- Bert


Bert Gunter


Use mapply like this on large problems:

unsplit(
  mapply(
      function(x,z) eval( x, list( y=z )),
      expression( A=y*2, B=y+3, C=sqrt(y) ),
      split( dat$Flow, dat$ASB ),
      SIMPLIFY=FALSE),
  dat$ASB)

Chuck
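
For readers following along, here is a small worked sketch of the split/apply/combine steps in the mapply call above. The six-row data frame is made up for illustration (it is not from the thread) and is chosen so each result can be checked by hand; the intermediate names exprs, pieces and results are hypothetical.

```r
# Toy data (hypothetical values, picked so the results are easy to verify)
dat <- data.frame(
  ASB  = factor(c("B", "A", "C", "A", "B", "C")),
  Flow = c(1, 4, 9, 16, 25, 36)
)

# One expression per level of ASB, in level order (A, B, C)
exprs <- expression(A = y * 2, B = y + 3, C = sqrt(y))

# split: one vector of Flow values per group
pieces <- split(dat$Flow, dat$ASB)

# apply: evaluate each expression with y bound to its group's values
results <- mapply(
  function(x, z) eval(x, list(y = z)),
  exprs, pieces,
  SIMPLIFY = FALSE
)

# combine: put the per-group results back into the original row order
out <- unsplit(results, dat$ASB)
print(out)  # expected: 4 8 3 32 28 6
```

Note that the pairing of expressions with groups is positional, so the expressions have to be listed in the same order as the levels of ASB.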



Is there any reason not to use data.table for this purpose, especially if 
efficiency is of concern?

---

# load data.table and microbenchmark
library(data.table)
library(microbenchmark)
#
# prepare data
DF <- data.frame(
   ASB = rep_len(factor(LETTERS[1:3]), 3e5),
   Flow = rnorm(3e5)^2)
DT <- as.data.table(DF)
DT[, ASB := as.character(ASB)]
#
# define functions
#
# Chuck's version
fnSplit <- function(dat) {
   unsplit(
       mapply(
           function(x,z) eval( x, list( y=z )),
           expression( A=y*2, B=y+3, C=sqrt(y) ),
           split( dat$Flow, dat$ASB ),
           SIMPLIFY=FALSE),
       dat$ASB)
}
#
# data.table-way (IMHO, much easier to read)
fnDataTable <- function(dat) {
   dat[,
       result :=
           if (.BY == "A") {
               2 * Flow
           } else if (.BY == "B") {
               3 + Flow
           } else if (.BY == "C") {
               sqrt(Flow)
           },
       by = ASB]
}
#
# benchmark
#
microbenchmark(fnSplit(DF), fnDataTable(DT))
identical(fnSplit(DF), fnDataTable(DT)[, result])

---

Actually, in Chuck's version the unsplit() part is slow. If the row order is not of 
concern (e.g., DF is sorted by ASB before calling fnSplit, so the unsplit() step 
can be dropped), fnSplit is comparable in speed to the data.table version.
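
A minimal sketch of that point, assuming the data can be sorted by ASB once up front: with sorted input, plain concatenation of the per-group results already matches the row order, so unsplit() is not needed. The helper name applyByGroup, the seed, and the 30-row sample are illustrative assumptions, not part of the original code, and no timing is claimed here.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
DF <- data.frame(
  ASB  = sample(factor(LETTERS[1:3]), 30, replace = TRUE),
  Flow = rnorm(30)^2
)

# Reorder once so that each group is a contiguous block of rows
DFs <- DF[order(DF$ASB), ]

# Hypothetical helper: the split/mapply part of fnSplit without unsplit()
applyByGroup <- function(dat) {
  mapply(
    function(x, z) eval(x, list(y = z)),
    expression(A = y * 2, B = y + 3, C = sqrt(y)),
    split(dat$Flow, dat$ASB),
    SIMPLIFY = FALSE
  )
}

# With sorted input, concatenating the groups gives the same vector
# that unsplit() would produce
viaUnlist  <- unlist(applyByGroup(DFs), use.names = FALSE)
viaUnsplit <- unsplit(applyByGroup(DFs), DFs$ASB)
print(identical(viaUnlist, viaUnsplit))
```

The result refers to the rows of the sorted DFs, not of the original DF, which is why this shortcut only applies when the original order does not matter.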


But David’s version is faster than Chuck’s fnSplit. I modified David’s solution 
slightly to get a result that is identical to fnSplit.

# David's version
# my modification to return a vector just like fnSplit
fnDavid <- function(dat) {
   z <- mapply(
         function(x,z) eval( x, list( y=z )),
         expression(A= y*2, B=y+3, C=sqrt(y) ),
         split( dat$Flow, dat$ASB ),
         USE.NAMES=FALSE, SIMPLIFY=TRUE
       )
   as.vector(t(z))
}

I added this to Dénes's code, then benchmarked with the R package rbenchmark and 
checked the results like this:

library(rbenchmark)
benchmark(fnSplit(DF), fnDataTable(DT),fnDavid(DF))
identical(fnSplit(DF), fnDataTable(DT)[, result])
identical(fnSplit(DF), fnDavid(DF))

gave this:

             test replications elapsed relative user.self sys.self user.child sys.child
2 fnDataTable(DT)          100   0.829    1.000     0.762    0.066          0         0
3     fnDavid(DF)          100   1.615    1.948     1.515    0.098          0         0
1     fnSplit(DF)          100   2.878    3.472     2.685    0.190          0         0

identical(fnSplit(DF), fnDataTable(DT)[, result])
[1] TRUE
identical(fnSplit(DF), fnDavid(DF))
[1] TRUE

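The `TRUE' above depends on the structure of ASB here: rep_len() cycles through A, B, C in a fixed order with equal group sizes, so as.vector(t(z)) happens to interleave the results back into the original row order. In the general case identical(...) is often FALSE. A permutation of ASB is enough to show this: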

DF$ASB <- sample(DF$ASB)
identical(fnSplit(DF), fnDavid(DF))
[1] FALSE


unsplit() is the price you pay to cope with general orderings.

Chuck
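
To make that last point concrete, here is a rough sketch of the bookkeeping unsplit() does, written out by hand with index assignment. The six-row data frame and the names idx, manual and naive are made up for illustration; this is not how unsplit() is implemented internally, only an equivalent result for this case.

```r
# Toy data with the groups deliberately out of order (hypothetical values)
dat <- data.frame(
  ASB  = factor(c("C", "A", "B", "A", "C", "B")),
  Flow = c(4, 1, 2, 3, 9, 5)
)

results <- mapply(
  function(x, z) eval(x, list(y = z)),
  expression(A = y * 2, B = y + 3, C = sqrt(y)),
  split(dat$Flow, dat$ASB),
  SIMPLIFY = FALSE
)

# Row positions belonging to each group, in the same group order as results
idx <- split(seq_along(dat$ASB), dat$ASB)

# Write each group's results back to the rows they came from
manual <- numeric(nrow(dat))
for (g in names(results)) manual[idx[[g]]] <- results[[g]]

# Simply concatenating the groups ignores the original ordering
naive <- unlist(results, use.names = FALSE)

print(manual)  # expected: 2 2 5 6 3 8
print(identical(manual, unsplit(results, dat$ASB)))
print(identical(naive, manual))
```

The extra pass over the row indices is the cost that any approach has to pay somewhere if the output must line up with an arbitrarily ordered ASB.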
