On Thu, 17 Sep 2015, Berend Hasselman wrote:
On 17 Sep 2015, at 01:42, Dénes Tóth <toth.de...@ttk.mta.hu> wrote:
On 09/16/2015 04:41 PM, Bert Gunter wrote:
Yes! Chuck's use of mapply is exactly the split/combine strategy I was
looking for. In retrospect, exactly how one should think about it.
Many thanks to all for a constructive discussion.
-- Bert
Bert Gunter
Use mapply like this on large problems:
unsplit(
mapply(
function(x,z) eval( x, list( y=z )),
expression( A=y*2, B=y+3, C=sqrt(y) ),
split( dat$Flow, dat$ASB ),
SIMPLIFY=FALSE),
dat$ASB)
Chuck
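For readers following along, here is a minimal sketch of the same split/mapply/unsplit pattern on a toy data frame. The six rows and their values are made up for illustration and are not from the thread; the expressions are the ones Chuck uses above.

```r
# Toy data (hypothetical values, not from the thread)
dat <- data.frame(
  ASB  = factor(c("B", "A", "C", "A", "B", "C")),
  Flow = c(1, 4, 9, 16, 25, 36)
)

# One expression per level of ASB, in level order (A, B, C)
exprs <- expression(A = y * 2, B = y + 3, C = sqrt(y))

# Split Flow by group: A = c(4, 16), B = c(1, 25), C = c(9, 36)
pieces <- split(dat$Flow, dat$ASB)

# Evaluate each expression with y bound to its own group's values
transformed <- mapply(
  function(x, z) eval(x, list(y = z)),
  exprs, pieces,
  SIMPLIFY = FALSE
)

# Put the per-group results back into the original row order
result <- unsplit(transformed, dat$ASB)
result
```

Note that the pairing of expression and group relies on the expressions being listed in the same order as the factor levels; mapply() matches its arguments by position, not by name.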
Is there any reason not to use data.table for this purpose, especially if
efficiency is a concern?
---
# load data.table and microbenchmark
library(data.table)
library(microbenchmark)
#
# prepare data
DF <- data.frame(
ASB = rep_len(factor(LETTERS[1:3]), 3e5),
Flow = rnorm(3e5)^2)
DT <- as.data.table(DF)
DT[, ASB := as.character(ASB)]
#
# define functions
#
# Chuck's version
fnSplit <- function(dat) {
unsplit(
mapply(
function(x,z) eval( x, list( y=z )),
expression( A=y*2, B=y+3, C=sqrt(y) ),
split( dat$Flow, dat$ASB ),
SIMPLIFY=FALSE),
dat$ASB)
}
#
# data.table-way (IMHO, much easier to read)
fnDataTable <- function(dat) {
dat[,
result :=
if (.BY == "A") {
2 * Flow
} else if (.BY == "B") {
3 + Flow
} else if (.BY == "C") {
sqrt(Flow)
},
by = ASB]
}
#
# benchmark
#
microbenchmark(fnSplit(DF), fnDataTable(DT))
identical(fnSplit(DF), fnDataTable(DT)[, result])
---
Actually, in Chuck's version the unsplit() part is slow. If the order is not of
concern (e.g., DF is reordered before calling fnSplit), fnSplit is comparable
to the DT-version.
But David’s version is faster than Chuck’s fnSplit. I modified David’s solution
slightly to get a result that is identical to fnSplit.
# David's version
# my modification to return a vector just like fnSplit
fnDavid <- function(dat) {
z <- mapply(
function(x,z) eval( x, list( y=z )),
expression(A= y*2, B=y+3, C=sqrt(y) ),
split( dat$Flow, dat$ASB ),
USE.NAMES=FALSE, SIMPLIFY=TRUE
)
as.vector(t(z))
}
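To see why as.vector(t(z)) can reproduce the original row order, here is a small sketch on six made-up values. It assumes, as in the rep_len() test data above, that the groups cycle A, B, C and have equal sizes; the multipliers are arbitrary and chosen only to make the groups easy to tell apart.

```r
# Groups cycle A, B, C, as in the rep_len() test data (toy values)
g    <- factor(rep_len(LETTERS[1:3], 6))
flow <- c(1, 2, 3, 4, 5, 6)

# SIMPLIFY = TRUE binds the equal-length per-group results as columns
z <- mapply(
  function(x, z) eval(x, list(y = z)),
  expression(A = y * 10, B = y * 100, C = y * 1000),
  split(flow, g),
  USE.NAMES = FALSE, SIMPLIFY = TRUE
)
z  # 2 x 3 matrix: one column per group

# as.vector() reads column by column, so transposing first
# interleaves the groups: A1, B1, C1, A2, B2, C2
out <- as.vector(t(z))
out
```

The interleaving matches the input only because row i of the data belongs to group (i - 1) %% 3 + 1; with unequal group sizes mapply() would not even return a matrix.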
I added this to Dénes's code.
Benchmarking with the rbenchmark package and checking the results like this
library(rbenchmark)
benchmark(fnSplit(DF), fnDataTable(DT),fnDavid(DF))
identical(fnSplit(DF), fnDataTable(DT)[, result])
identical(fnSplit(DF), fnDavid(DF))
gave this:
             test replications elapsed relative user.self sys.self user.child sys.child
2 fnDataTable(DT)          100   0.829    1.000     0.762    0.066          0         0
3     fnDavid(DF)          100   1.615    1.948     1.515    0.098          0         0
1     fnSplit(DF)          100   2.878    3.472     2.685    0.190          0         0
identical(fnSplit(DF), fnDataTable(DT)[, result])
[1] TRUE
identical(fnSplit(DF), fnDavid(DF))
[1] TRUE
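The above `TRUE' depends on the structure of ASB here: the levels cycle
A, B, C with equal group sizes, so transposing the result matrix happens
to restore the original row order. In the general case identical(...) is
often FALSE. A permutation of ASB is enough to show this: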
DF$ASB <- sample(DF$ASB)
identical(fnSplit(DF), fnDavid(DF))
[1] FALSE
unsplit() is the price you pay to cope with general orderings.
Chuck
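The point about general orderings can be checked on a small hand-made example. This is only a sketch: the six values, the group order, and the per-group multipliers are invented for illustration, and the transpose step mimics the as.vector(t(z)) idea from fnDavid rather than calling it.

```r
# Groups in an arbitrary (non-cycling) order; toy values
g    <- factor(c("C", "A", "A", "B", "C", "B"))
flow <- c(1, 2, 3, 4, 5, 6)

# A = c(2, 3), B = c(4, 6), C = c(1, 5)
pieces <- split(flow, g)

# Scale each group by its own (arbitrary) factor
scaled <- mapply(function(v, k) v * k, pieces, c(10, 100, 1000),
                 SIMPLIFY = FALSE)

# unsplit() uses g to send every value back to its original row
by_unsplit <- unsplit(scaled, g)

# Transposing assumes the rows cycle A, B, C, which is false here
by_transpose <- as.vector(t(do.call(cbind, scaled)))

by_unsplit    # follows the original row order
by_transpose  # interleaved A, B, C regardless of g
```

Both vectors contain the same values, but only the unsplit() result lines up with the rows of the input.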
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help