Hi Mario:

On Sat, Jan 5, 2013 at 9:09 PM, Mario Bourgoin <m.bourg...@gmail.com> wrote:
> Hadley's guide to high performance computing using Rcpp was recommended on
> R-bloggers last November:
>
> https://github.com/hadley/devtools/wiki/Rcpp
>
> Could you say whether you've read it, or whether it was useful for your
> purpose?
>
I agree, that writeup is a nice survey. I'm stuck at a more fine-grained
level: I really want a profile of a large function that uses Armadillo, and I
am frustrated as hell trying to get one. Following "Writing R Extensions", I
am trying sprof and oprofile.

sprof seems to run, but it never generates the profile output. I followed the
approach described here:

http://greg-n-blog.blogspot.com/2010/01/profiling-shared-library-on-linux-using.html

$ export LD_PROFILE=/home/pauljohn/.../Amelia.so
$ export LD_PROFILE_OUTPUT=`pwd`
$ R -f testCode.R

It makes the R test job run slower, but I just can't find the output file,
and there are no error messages. For whatever reason, no profile file ever
appears. I'm still checking into it. I've also tried the advice in "Writing R
Extensions" about writing the profile into /var/tmp/... after creating the
directory as root; same outcome. I feel sure I'm missing something very
obvious.
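For the record, my reading of the ld.so and sprof man pages (unconfirmed,
because the file never shows up for me) is that the profile data lands in a
file literally named $LD_PROFILE_OUTPUT/$LD_PROFILE.profile, and that you then
read it back with sprof, roughly like this:

$ cd "$LD_PROFILE_OUTPUT"
$ sprof -p /home/pauljohn/R/x86_64-pc-linux-gnu-library/2.15/Amelia/libs/Amelia.so Amelia.so.profile | less   ## -p = flat profile

That file-naming rule also makes me wonder whether I should be setting
LD_PROFILE to just the soname "Amelia.so" rather than the full path, so the
.profile file gets a sane name. I haven't verified that yet.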
oprofile looks exciting if you have the newer edition (0.9.8). Debian does not
have a pre-built package for oprofile-0.9.8, so I compiled it myself; it is
interesting, but very complicated. It gives a nice general overview that says
the program is spending most of its time in BLAS and LAPACK, but that doesn't
help me very much: I need to know which parts of the C++ code are causing all
of that access to BLAS and LAPACK.

In case you have not seen oprofile: oprofile-0.9.8 has a new program, "operf",
that can run WITHOUT ROOT ACCESS, and it generates a large-ish folder of
profiler output:

$ operf --callgraph R -f profile-AmeliaSpeedup-4.R

After that, however, is where I'm frustrated, because I can't manage the
output. In case you've never seen this output, here's a quick paste. Perhaps
other list readers will have pointers for me on how to use opreport more
wisely, to trim the report down to the time spent in the functions inside the
package shared library I'm studying right now, which is Amelia.so.

$ opreport --callgraph | more
Using /home/pauljohn/sync/work/R-Amelia-Armadillo/oprofile_data/samples/ for samples directory.
warning: /no-vmlinux could not be found.
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 90000
warning: dropping hyperspace sample at offset 21e4c8 >= 221d8 for binary /lib/x86_64-linux-gnu/ld-2.13.so
warning: dropping hyperspace sample at offset 21e4c8 >= 221d8 for binary /lib/x86_64-linux-gnu/ld-2.13.so
warning: dropping hyperspace sample at offset 10fd0 >= fb7e for binary /home/pauljohn/R/x86_64-pc-linux-gnu-library/2.15/Amelia/libs/Amelia.so
warning: dropping hyperspace sample at offset 1100 >= 1048 for binary /lib/x86_64-linux-gnu/libnss_compat-2.13.so
warning: dropping hyperspace sample at offset 11a0 >= 1048 for binary /lib/x86_64-linux-gnu/libnss_compat-2.13.so
samples  %        image name        app name  symbol name
-------------------------------------------------------------------------------
     11  100.000  libc-2.13.so      R  malloc
  31109  31.5168  libblas.so.3.0    R  /usr/lib/atlas-base/atlas/libblas.so.3.0
  31109  100.000  libblas.so.3.0    R  /usr/lib/atlas-base/atlas/libblas.so.3.0 [self]
-------------------------------------------------------------------------------
      7  100.000  libRcpp.so        R  virtual thunk to Rcpp::Rostream<true>::~Rostream()
  18613  18.8570  libR.so           R  /usr/lib/R/lib/libR.so
  18613  100.000  libR.so           R  /usr/lib/R/lib/libR.so [self]
-------------------------------------------------------------------------------
     37  100.000  libc-2.13.so      R  malloc
  17443  17.6717  liblapack.so.3.0  R  /usr/lib/lapack/liblapack.so.3.0
  17443  100.000  liblapack.so.3.0  R  /usr/lib/lapack/liblapack.so.3.0 [self]
-------------------------------------------------------------------------------
   4521  100.000  libc-2.13.so      R  _int_malloc
   4521   4.5803  libc-2.13.so      R  _int_malloc
   4521  48.4306  libc-2.13.so      R  _int_malloc
   4521  48.4306  libc-2.13.so      R  _int_malloc [self]
    293   3.1387  no-vmlinux        R  /no-vmlinux
-------------------------------------------------------------------------------
   2789  100.000  Amelia.so         R  arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> >::extract(arma::Mat<double>&, arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> > const&)
   2789  49.9105  Amelia.so         R  arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> >::extract(arma::Mat<double>&, arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> > const&)
   2789  49.9105  Amelia.so         R  arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> >::extract(arma::Mat<double>&, arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> > const&) [self]
     10   0.1790  no-vmlinux        R  /no-vmlinux
...

The output has a bunch of stanzas like this, and I have trouble understanding
what the output means. The top line of the next stanza, for instance, says
there are 621 calls to the part of the shared library that uses
arma::subview_elem2.
What do the lines after that mean, where it has 0.6291 on line 2 and 49.9598
repeated on the two lines after that?

-------------------------------------------------------------------------------
    621  100.000  Amelia.so   R  void arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> >::inplace_op<arma::op_subview_elem_equ, arma::Op<arma::Mat<double>, arma::op_inv_sympd> >(arma::Base<double, arma::Op<arma::Mat<double>, arma::op_inv_sympd> > const&)
    621   0.6291  Amelia.so   R  void arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> >::inplace_op<arma::op_subview_elem_equ, arma::Op<arma::Mat<double>, arma::op_inv_sympd> >(arma::Base<double, arma::Op<arma::Mat<double>, arma::op_inv_sympd> > const&)
    621  49.9598  Amelia.so   R  void arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> >::inplace_op<arma::op_subview_elem_equ, arma::Op<arma::Mat<double>, arma::op_inv_sympd> >(arma::Base<double, arma::Op<arma::Mat<double>, arma::op_inv_sympd> > const&)
    621  49.9598  Amelia.so   R  void arma::subview_elem2<double, arma::Mat<unsigned int>, arma::Mat<unsigned int> >::inplace_op<arma::op_subview_elem_equ, arma::Op<arma::Mat<double>, arma::op_inv_sympd> >(arma::Base<double, arma::Op<arma::Mat<double>, arma::op_inv_sympd> > const&) [self]
      1   0.0805  no-vmlinux  R  /no-vmlinux
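If any opreport experts are reading: what I think I actually want is a plain
symbol-level report restricted to the Amelia.so image, rather than the full
callgraph. From my reading of the opreport documentation, the incantation
should be something like the line below, although I am not at all sure I have
the profile specification right, so treat it as a guess:

$ opreport -l --demangle=smart image:/home/pauljohn/R/x86_64-pc-linux-gnu-library/2.15/Amelia/libs/Amelia.so

(-l is --symbols; the image: part is supposed to restrict the report to that
one shared library, and --demangle=smart is supposed to shorten the huge
Armadillo template names.) Corrections welcome.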
Oh, well. It is fun when little glimmers of insight flash by. :)

pj

> Best,
> Mario
>
> On Sat, Jan 5, 2013 at 9:50 PM, Paul Johnson <pauljoh...@gmail.com> wrote:
>>
>> I received a big pile of Rcpp code from some folks and am using it as a way
>> to learn about Armadillo and Rcpp.
>>
>> From gazing at the code, and guessing about style and usage, I see some
>> ways to make the program run faster. But it is all by guessing and testing.
>>
>> I wish I could get a profile (via gcc) to find out where the Rcpp part is
>> spending its time.
>>
>> Has anybody made a list of "dos and don'ts" for making faster
>> RcppArmadillo code?
>>
>> So far, I'm absolutely sure that RcppArmadillo follows many of the usual
>> rules about fast code. Don't allocate huge matrices over and over for no
>> good reason, for example. Armadillo has the delayed processing feature that
>> automatically accelerates some calculations. That's quite amazing to me.
>>
>> But some of the practical details of writing code are uncertain. In
>> particular, I wonder about the slowdown from making calculations on
>> submatrices, and about the relative penalty you pay for re-calculating
>> things versus allocating space to store them. (I once saw a demonstration
>> that, in some cases, it is faster to recalculate multiplication problems
>> than it is to store the results....)
>>
>> Here are some of the things I would like to read about. I have noticed in R
>> that accessing subsections of a matrix is slow compared to analyzing the
>> whole thing. For example, it is SLOWER to repeatedly work on a subsection
>> of a matrix x than it is to extract that piece one time and work on the
>> copy. I'm pasting in the R code that demonstrates this at the end of this
>> message.
>>
>> Does Armadillo follow that same logic, so that repeatedly accessing a
>> subview of a matrix is slower than creating a new copy of the submatrix and
>> then working on it?
>>
>> Along those lines, has the performance of Armadillo's non-contiguous matrix
>> access been assessed? Are we better advised to allocate new matrices and
>> access the non-contiguous subviews one time, rather than to access them
>> over and over?
>>
>> Given a matrix g and vectors k1 and k2 that contain row numbers, we want to
>> do math on these sub-matrices:
>>
>> g(k1, k1)
>> g(k1, k2)
>> g(k2, k2)
>>
>> I am *guessing* that repeated access to a non-contiguous matrix subview is
>> slower than accessing a contiguous block out of the matrix, and that
>> accessing a whole matrix would be faster still. So maybe a person ought to
>> allocate
>>
>> arma::mat g11 = g(k1, k1);
>>
>> rather than grabbing g(k1, k1) over and over. Know what I mean?
>>
>> Oh, well. Here's the example about R submatrix access.
>>
>> ## fasterR-matrix-1.R
>> ## Paul Johnson
>> ## 2012-12-23
>>
>> nReps <- 5000
>> nVars <- 100
>> nRows <- 200
>>
>> ## fn1 works on the whole matrix: the best-case scenario
>> fn1 <- function(x){
>>     solve(x)
>> }
>>
>> ## extract 2:(nVars+1) rows and columns, return "anonymously"
>> fn2 <- function(x){
>>     y <- x[2:(nVars+1), 2:(nVars+1)]
>>     solve(y)
>> }
>>
>> ## name the new matrix before returning it
>> fn3 <- function(x){
>>     y <- x[2:(nVars+1), 2:(nVars+1)]
>>     yinv <- solve(y)
>> }
>>
>> ## manufacture a new matrix, copy the result into it
>> fn4 <- function(x){
>>     y <- x[2:(nVars+1), 2:(nVars+1)]
>>     yinv <- solve(y)
>>     z <- matrix(0, nVars, nVars)
>>     z <- yinv
>> }
>>
>> ## manufacture a bigger matrix, copy the result into part of it
>> fn5 <- function(x){
>>     y <- x[2:(nVars+1), 2:(nVars+1)]
>>     yinv <- solve(y)
>>     z <- matrix(0, (nVars+1), (nVars+1))
>>     z[2:(nVars+1), 2:(nVars+1)] <- yinv
>> }
>>
>> ## Now make up test data.
>> ## Manufacture symmetric matrices
>> Xlist1 <- vector("list", nReps)
>> for(i in 1:nReps){
>>     x <- matrix(rnorm(nRows * nVars), nRows, nVars)
>>     Xlist1[[i]] <- crossprod(x)
>> }
>>
>> ## create padded theta-like matrices
>> Xlist2 <- vector("list", nReps)
>> for(i in 1:nReps){
>>     y <- Xlist1[[i]]
>>     y <- rbind(c(rep(1, nVars)), y)
>>     y <- cbind(c(-1, rep(1, nVars)), y)
>>     Xlist2[[i]] <- y
>> }
>>
>> system.time(
>>     fn1l <- lapply(Xlist1, fn1)
>> )
>>
>> system.time(
>>     fn2l <- lapply(Xlist2, fn2)
>> )
>>
>> system.time(
>>     fn3l <- lapply(Xlist2, fn3)
>> )
>>
>> system.time(
>>     fn4l <- lapply(Xlist2, fn4)
>> )
>>
>> system.time(
>>     fn5l <- lapply(Xlist2, fn5)
>> )
>>
>> ## Conclusion: it IS slower to repeatedly extract rows & columns 2:(nVars+1)
>> ## than to just work with a whole matrix of nVars x nVars values.
>>
>> --
>> Paul E. Johnson
>> Professor, Political Science          Assoc. Director
>> 1541 Lilac Lane, Room 504             Center for Research Methods
>> University of Kansas                  University of Kansas
>> http://pj.freefaculty.org             http://quant.ku.edu

--
Paul E. Johnson
Professor, Political Science          Assoc. Director
1541 Lilac Lane, Room 504             Center for Research Methods
University of Kansas                  University of Kansas
http://pj.freefaculty.org             http://quant.ku.edu

_______________________________________________
Rcpp-devel mailing list
Rcpp-devel@lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel