>>>>> "BDR" == Prof Brian Ripley <[EMAIL PROTECTED]> >>>>> on Fri, 27 Feb 2004 18:22:29 +0000 (GMT) writes:
  BDR> On 27 Feb 2004, Douglas Bates wrote:
  >> Martin Maechler <[EMAIL PROTECTED]> writes:
  >>
  >> > >>>>> "PD" == Peter Dalgaard <[EMAIL PROTECTED]>
  >> > >>>>>     on 26 Feb 2004 15:44:16 +0100 writes:
  >> >
  >> >     PD> Douglas Bates <[EMAIL PROTECTED]> writes:
  >> >     >> Have you tried configuring R with Goto's BLAS
  >> >     >> http://www.cs.utexas.edu/users/kgoto/
  >> >     >>
  >> >     >> I haven't worked with Opteron or Athlon64 computers but I
  >> >     >> understand that Goto's BLAS are very effective on those
  >> >     >> machines.  Furthermore, Goto's BLAS are (only) available as
  >> >     >> .so libraries, so you don't need to mess with creating the
  >> >     >> .so version.
  >> >
  >> >     PD> I tried it, yes.  Somewhat to my surprise, it seemed to be
  >> >     PD> not quite as fast as the threaded ATLAS, but I wasn't very
  >> >     PD> systematic about the benchmarking.
  >> >
  >> >     PD> (and the Goto items have license issues, which get in the
  >> >     PD> way for binary distributions.)
  >> >
  >> > Thanks a lot, Peter, Brian, Doug, for your feedback!
  >> > In the meantime, I have three running versions of R(-devel) on
  >> > the 64-bit Opteron:
  >> >   - "plain"
  >> >   - linked against threaded GOTO
  >> >   - linked against threaded (static) ATLAS (using -fPIC for
  >> >     compilation; "large" Rlapack)
  >> > and I find that GOTO is consistently faster than ATLAS
  >> > (by ~ 5-20%) for several tests (square matrices; %*% and solve).
  >> > ATLAS is still an order of magnitude faster than "plain" for
  >> > 3000x3000 matrices.
  >>
  >> Would you be willing to post a brief summary of comparative timings?
  >>
  >> I have thought at times that it may be worthwhile collecting
  >> comparative timings on "typical" tasks in R for different
  >> combinations of processor/OS/memory size and speed.  As with any
  >> benchmark, the results will be artificial, but they can be of some
  >> help when considering what hardware to purchase.  Bioconductor users
  >> may find it particularly helpful to be able to evaluate how much
  >> they will need to pay to be able to analyze large data sets
  >> reasonably quickly.
  >>
  >> One easily-obtained timing is at the end of
  >> $RSRC/tests/Examples/base-Ex.Rout after 'make; make check'.

  BDR> That one is I think rather too artificial, as it contains few even
  BDR> moderately large examples, and is dominated by a few atypical tasks.

  BDR> I tend to use the sum of the MASS scripts as an informal timing;
  BDR> ch06.R is also a pretty good indicator.

  BDR> I think you will find that BLAS differences are pretty small in
  BDR> real-life analyses, or at least I always have.

I've now done a bit more systematic testing, using more realistic code
than the large-matrix (1000^2 and 3000^2) number crunching I did last
week.
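(For concreteness, those raw large-matrix tests were essentially of the
following shape.  This is only a minimal sketch: the matrix sizes are
the ones mentioned above, but the use of rnorm() data, system.time(),
and a single replication are my assumptions, not the original test
script.)

n <- 1000                        ## also tried: n <- 3000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)             ## matrix multiply --> BLAS (dgemm)
system.time(solve(A))            ## matrix inverse  --> LAPACK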
As expected, the differences disappear for VR/scripts' "ch06.R"
(there's even a slight indication of GOTO being *worse* than no
optimized BLAS, but that was probably a random fluctuation), and also
for the "make check" outputs.

Here is a nice R function that others can use as well for collecting
the timings from the "make check" (or better, "make check-all")
outputs.  Note that it's interesting to also get the times for the
recommended packages.

#### After "make check-all" there are quite a few files with timings
#### ----------------------------------------------------------------
#### Get at these:

## In a Unix shell, it's as simple as
##    cd `R RHOME`/tests
##    grep '^Time elapsed' *.Rout Examples/*.Rout *.Rcheck/*.Rout

checkTimes <- function(Rhome = R.home())
{
    ## Purpose: Collect the "Time elapsed" timings of R's "make check-all"
    ##          into a numeric N x 3 matrix (with rownames!)
    ## ------------------------------------------------------------------
    ## Author: Martin Maechler, Date:  1 Mar 2004, 15:27

    tDir <- file.path(Rhome, "tests")
    dirLs <- c(tDir, file.path(tDir, "Examples"),
               file.path(tDir, list.files(tDir, pattern = "\\.Rcheck$")))
    iniStr <- "^Time elapsed:"
    endPat <- "\\.Rout$"
    ir <- length(rr <- list())
    for(d in dirLs) {
        files <- list.files(d, pattern = endPat)
        for(f in files) {
            lls <- readLines(file.path(d, f))
            if(length(i <- grep(iniStr, lls))) {
                ## parse the three numbers following "Time elapsed:"
                tC <- textConnection(sub(iniStr, '', lls[i]))
                nCPU <- scan(tC, quiet = TRUE)
                close(tC)
                f <- sub(endPat, '', f)
                rr[[(ir <- ir + 1)]] <- list(f, nCPU[1:3])
            }
        }
    }
    ## transform the list into an N x 3 matrix, one row per .Rout file
    t(matrix(sapply(rr, "[[", 2), 3, length(rr),
             dimnames = list(NULL, sapply(rr, "[[", 1))))
}

-----------

Now I did measure on the AMD Opteron (64-bit, dual processor; 4 GB RAM):

rM <- checkTimes()
nn <- nrow(rM)

## Look at the values --- in sorted order:
iS <- sort.list(rM[,1], decreasing = TRUE)
rM[iS, ]
plot(rM[iS, 3] / rM[iS, 1]) ## no systematic pattern --> only use "CPU[1]"
plot(rM[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
     main = paste("CPU used for checks in", file.path(R.home(), "tests")))

rM.A <- checkTimes("/usr/local/app/R/R-devel-ATLAS-inst")
rM.G <- checkTimes("/usr/local/app/R/R-devel-GOTO-inst")
rM.s <- checkTimes("/usr/local/app/R/R-devel-inst")

iS <- sort.list(rM.A[,1], decreasing = TRUE)
cbind(ATLAS = rM.A[iS,1], GOTO = rM.G[iS,1], std = rM.s[iS,1])
## gives
##                    ATLAS   GOTO    std
## boot-Ex            73.38  73.71  73.62
## nlme-Ex            31.92  34.18  31.91
## mgcv-Ex            29.20  31.69  29.35
## MASS-Ex            21.54  20.49  20.29
## stats-Ex           17.80  17.69  17.91
## lattice-Ex         11.38  11.37  11.05
## methods-Ex          6.87   6.53   6.58
## base-Ex             5.48   5.28   5.26
## graphics-Ex         4.71   4.73   4.70
## tools-Ex            3.86   3.66   3.82
## cluster-Ex          3.78   3.74   3.65
## utils-Ex            2.73   2.60   2.60
## p-r-random-tests    2.60   2.58   2.55
## survival-Ex         2.48   2.49   2.30
## ...
## .........

## Graphic:
pdf("CPU-checks.pdf")
plot(rM.A[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
     main = "AMD Opteron 246: CPU for R 'make check-all' tests & Examples")
iS. <- iS[1:12]
text(1:12, rM.A[iS., 1], rownames(rM.A)[iS.], adj = c(-.15, -.15), cex = 0.8)
points(1:nn + .1, rM.G[iS, 1], type = 'h', col = 2)
points(1:nn + .2, rM.s[iS, 1], type = 'h', col = 3)
legend(par("usr")[2], par("usr")[4], c("ATLAS", "GOTO", " std "),
       col = 1:3, lwd = 1, xjust = 1.1, yjust = 1.1)
if(.Device == "pdf") dev.off()

### Are ATLAS or GOTO better than "standard"?
matplot(1:nn, cbind(rM.A[iS,1] / rM.s[iS,1],
                    rM.G[iS,1] / rM.s[iS,1]),
        type = 'p', col = 1:8)
abline(h = 1, lty = 3, col = "gray")
## To the contrary: the points would have to be *below* 1,
## and they are rather above.
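(To attach a single number to that visual impression, one could
summarize the per-file ratios, e.g. as medians.  This is a small
follow-up sketch of mine, not part of the original analysis; it reuses
the rM.A, rM.G, rM.s matrices and the ordering iS from above.)

rat <- cbind(ATLAS = rM.A[iS, 1] / rM.s[iS, 1],
             GOTO  = rM.G[iS, 1] / rM.s[iS, 1])
## median CPU-time ratio vs. the R-internal BLAS;
## values < 1 would mean "faster than standard"
apply(rat, 2, median)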
-------------------

The PDF graphic is available as
ftp://ftp.stat.math.ethz.ch/U/maechler/R/CPU-checks.pdf

---

However, when I run something like the following "non-small" lm()
problem, things look a bit different:

-----------------------------------------------------------------------------
### Take a relatively large model.matrix() --- as in ./predict-lm.R
### "R BATCH --vanilla <this>"

if(paste(R.version$major, R.version$minor, sep = ".") >= 1.7)
    RNGversion("1.6")
set.seed(47)

## Here: want the usual "noisy" model; almost no printing
n <- 5000
x <- rnorm(n)
ldat <- data.frame(x1 = x,
                   x2 = sort(5*x - rnorm(n)),
                   f1 = factor(pmin(12, rpois(n, lam =  5))),
                   f2 = factor(pmin(20, rpois(n, lam =  9))),
                   f3 = factor(pmin(32, rpois(n, lam = 12))))
with(ldat,
     ldat$y <<- 10 + 4*x1 + 2*x2 + rnorm(n) +   ## no rounding here:
         10 * rnorm(nlevels(f1))[f1] +
         100 * rnorm(nlevels(f2))[f2])
str(ldat)

mylm <- lm(y ~ .^2, data = ldat)
proc.time() ## (~= 100 sec on P4 1.6 GHz "lynne")

str(mm <- model.matrix(mylm))
smlm <- summary(mylm)
p1 <- predict(mylm)
p2 <- predict(mylm, type = "terms")
str(myim <- influence.measures(mylm))

## R BATCH gives another "total" proc.time() here:
-----------------------------------------------------------------------------

Timings (the first 3 components of proc.time()) -- ATLAS measured only
3x, the others 5x:

1. after lm()
   # grep -n '^\[1\] [^1]' lm-tst-2.Rout-opteron-*

   ATLAS:  34.56 0.56 35.57
           33.90 0.59 34.57
           34.55 0.61 35.33

   GOTO:   28.17 1.82 34.68
           29.13 1.61 35.56
           26.90 2.05 32.99
           28.11 1.83 34.64
           28.26 1.92 34.90

   std:    34.61 0.62 35.62
           33.46 0.61 34.26
           34.79 0.65 35.58
           33.78 0.67 34.62
           35.49 0.70 36.37

2. total for the above R script
   # grep -n '^\[1\] 1' lm-tst-2.Rout-opteron-*

   ATLAS:  127.71  1.56 129.92
           130.42  1.66 132.28
           131.89  1.39 133.57

   GOTO:   129.51 25.17 212.02
           129.56 26.93 215.06
           137.36 27.43 221.95
           139.83 28.76 226.64
           137.40 27.98 221.86

   std:    159.58  1.59 161.88
           155.65  1.48 157.59
           159.01  1.67 161.21
           167.13  1.57 168.97
           166.70  1.58 168.70

This is a bit confusing to me: the picture differs considerably
depending on whether I "believe" the first component of proc.time(),
PT[1] (user CPU time), or the third one, PT[3] (elapsed wall-clock
time).  Only using PT[1] -- which I usually have done -- may be quite
wrong here: contrary to ATLAS and "std", GOTO shows a large gap
between PT[3] and PT[1], which may be a consequence of the way the
threading and the use of the two CPUs happen.

PT[1]: GOTO is about 20% faster than ATLAS (which is basically the
       same as "standard", i.e. the R-internal BLAS/LAPACK) for the
       first lm() measurement; but for the overall time {which adds
       summary.lm(), influence.measures(), etc.} GOTO and ATLAS are
       basically the same speed, both 20% faster than "standard".

PT[3]: For the lm() part itself: no difference.
       For the total:  ATLAS  >>  std  >>  GOTO
                       ~~~~~~~~~~~~~~~~~~~~~~~~
       (' >> ' := "clearly better than")

---
Comments welcome,

Martin Maechler <[EMAIL PROTECTED]>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><