whoops -- I left the group size unchanged so k became greather than the length of the group vector. When I increase the size to 1e7, sapply is faster until it gets to k = 1e6.
warning: this takes awhile (particularly on my machine which seems to be using just 1 of it's 2 cpus) > for(k in 10^(1:6)){ + group<-sample(1:k, size=1e7, replace=TRUE) + x<-rnorm(length(group))*10 + group + cat('number of groups:', k, '\n') + cat('sapply\n') + print(s <- unix.time(sapply(split(x,group), median))) + cat('gm\n') + print(g <- unix.time(-gm(x,group))) + cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n') + } number of groups: 10 sapply user system elapsed 1.18 0.38 1.56 gm user system elapsed 53.43 0.70 55.05 sapply/gm user ratio: 0.02208497 number of groups: 100 sapply user system elapsed 1.17 0.23 1.42 gm user system elapsed 49.89 0.61 51.60 sapply/gm user ratio: 0.02345159 number of groups: 1000 sapply user system elapsed 1.29 0.25 1.72 gm user system elapsed 45.23 0.34 46.55 sapply/gm user ratio: 0.02852089 number of groups: 10000 sapply user system elapsed 2.80 0.09 2.87 gm user system elapsed 41.04 0.55 42.85 sapply/gm user ratio: 0.06822612 number of groups: 1e+05 sapply user system elapsed 14.65 0.16 15.18 gm user system elapsed 38.28 0.65 39.55 sapply/gm user ratio: 0.3827064 number of groups: 1e+06 sapply user system elapsed 126.63 0.42 129.21 gm user system elapsed 37.56 0.84 38.78 sapply/gm user ratio: 3.371406 On Mon, Jan 5, 2009 at 2:37 PM, Kingsford Jones <kingsfordjo...@gmail.com> wrote: > Here's some more timing's of Bill's function. Although in this > example sapply has a clear performance advantage for smaller numbers > of groups (k) , gm is substantially faster for k >> 1000: > > gm <- function(x, group){ # medians by group: > o<-order(group, x) > group <- group[o] > x <- x[o] > changes <- group[-1] != group[-length(group)] > first <- which(c(TRUE, changes)) > last <- which(c(changes, TRUE)) > lowerMedian <- x[floor((first+last)/2)] > upperMedian <- x[ceiling((first+last)/2)] > median <- (lowerMedian+upperMedian)/2 > names(median) <- group[first] > median > } > > ## > > for(k in 10^(1:6)){ > group<-sample(1:k, size=100000, replace=TRUE) > x<-rnorm(length(group))*10 + group > cat('number of groups:', k, '\n') > cat('sapply\n') > print(s <- unix.time(sapply(split(x,group), median))) > cat('gm\n') > print(g <- unix.time(-gm(x,group))) > cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n') > } > > number of groups: 10 > sapply > user system elapsed > 0.01 0.00 0.01 > gm > user system elapsed > 0.14 0.00 0.16 > sapply/gm user ratio: 0.07142857 > > > number of groups: 100 > sapply > user system elapsed > 0.02 0.00 0.02 > gm > user system elapsed > 0.11 0.00 0.09 > sapply/gm user ratio: 0.1818182 > > > number of groups: 1000 > sapply > user system elapsed > 0.11 0.00 0.11 > gm > user system elapsed > 0.11 0.00 0.11 > sapply/gm user ratio: 1 > > > number of groups: 10000 > sapply > user system elapsed > 1.00 0.00 1.04 > gm > user system elapsed > 0.10 0.00 0.09 > sapply/gm user ratio: 10 > > > number of groups: 1e+05 > sapply > user system elapsed > 6.24 0.01 6.31 > gm > user system elapsed > 0.16 0.00 0.17 > sapply/gm user ratio: 39 > > > number of groups: 1e+06 > sapply > user system elapsed > 9.00 0.03 8.92 > gm > user system elapsed > 0.15 0.00 0.20 > sapply/gm user ratio: 60 > > >> R.Version() > $platform > [1] "i386-pc-mingw32" > > $arch > [1] "i386" > > $os > [1] "mingw32" > > $system > [1] "i386, mingw32" > > $status > [1] "" > > $major > [1] "2" > > $minor > [1] "8.0" > > $year > [1] "2008" > > $month > [1] "10" > > $day > [1] "20" > > $`svn rev` > [1] "46754" > > $language > [1] "R" > > $version.string > [1] "R version 2.8.0 (2008-10-20)" > > > Kingsford Jones > > > > On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap <wdun...@tibco.com> wrote: >> Arg, the 'sapply(...)' in the function was in the initial >> comment, >> gm <- function(x, group){ # medians by group: >> sapply(split(x,group),median) >> but someone's mailer put a newline before the sapply >> gm <- function(x, group){ # medians by group: >> sapply(split(x,group),median) >> so it got executed, wasting all that time. (Of course the >> same mailer will mess up this message in the same way - >> just delete the sapply() call from gm().) >> >> Bill Dunlap >> TIBCO Software Inc - Spotfire Division >> wdunlap tibco.com >> >>> -----Original Message----- >>> From: hadley wickham [mailto:h.wick...@gmail.com] >>> Sent: Monday, January 05, 2009 9:10 AM >>> To: William Dunlap >>> Cc: gallon...@gmail.com; R help >>> Subject: Re: [R] the first and last observation for each subject >>> >>> > Another application of that technique can be used to quickly compute >>> > medians by groups: >>> > >>> > gm <- function(x, group){ # medians by group: >>> > sapply(split(x,group),median) >>> > o<-order(group, x) >>> > group <- group[o] >>> > x <- x[o] >>> > changes <- group[-1] != group[-length(group)] >>> > first <- which(c(TRUE, changes)) >>> > last <- which(c(changes, TRUE)) >>> > lowerMedian <- x[floor((first+last)/2)] >>> > upperMedian <- x[ceiling((first+last)/2)] >>> > median <- (lowerMedian+upperMedian)/2 >>> > names(median) <- group[first] >>> > median >>> > } >>> > >>> > For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups >>> > (in random order) the times are: >>> > >>> >> group<-sample(1:30000, size=100000, replace=TRUE) >>> >> x<-rnorm(length(group))*10 + group >>> >> unix.time(z0<-sapply(split(x,group), median)) >>> > user system elapsed >>> > 2.72 0.00 3.20 >>> >> unix.time(z1<-gm(x,group)) >>> > user system elapsed >>> > 0.12 0.00 0.16 >>> >> identical(z1,z0) >>> > [1] TRUE >>> >>> I get: >>> >>> > unix.time(z0<-sapply(split(x,group), median)) >>> user system elapsed >>> 2.733 0.017 2.766 >>> > unix.time(z1<-gm(x,group)) >>> user system elapsed >>> 2.897 0.032 2.946 >>> >>> >>> Hadley >>> >>> >>> -- >>> http://had.co.nz/ >>> >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.