Hi Robin,

On 2 March 2024 at 16:34, Robin Liu wrote:
| sessionInfo() was the right clue. Indeed the version of R on machine B was not
| linked to OpenBLAS. Switching to a version with OpenBLAS allows the test code
| to use all cores.
|
| A clear way to check which library is linked is to run the following:
|
| > extSoftVersion()["BLAS"]

Ah yes -- I keep forgetting about that one. Good reminder!

| Thanks for your help!

Always a pleasure. Glad you are all set.

Dirk

| On Sat, Feb 24, 2024 at 9:17 AM Dirk Eddelbuettel <e...@debian.org> wrote:
|
|
| On 24 February 2024 at 11:44, Robin Liu wrote:
| | Thank you Dirk for the response.
| |
| | I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both
| | machines and correctly see that machine A and B have 20 and 40 cores,
| | respectively. I also see that calling the setter changes this value.
| |
| | However, calling the setter does not seem to change the number of cores
| | used on either machine A or B. I have updated my code example as below:
| | the execution uses 20 cores on machine A and 1 core on machine B as
| | before, despite my setting the number of omp threads to 5. Do you have
| | any further hints?
|
| I fear you need to debug that on the machine 'B' in question. It's all open
| source. I do not think either Conrad or myself put code in to constrain you
| to one core on 'B' (and not on 'A', as you see).
|
| You can grep around both the RcppArmadillo wrapper code and the included
| Armadillo code. I suggest making a local copy and peppering in some print
| statements.
|
| Also keep in mind that (Rcpp)Armadillo hands off computation to the actual
| LAPACK / BLAS implementation on that machine. Lots of things can go wrong
| there: maybe R was compiled with its own embedded BLAS/LAPACK sources
| (preventing a call out to OpenBLAS even when the machine has it). Or maybe R
| was compiled correctly but a single-threaded set of libraries is on the
| machine.
|
| You have not supplied any of that information. Many bug report suggestions
| hint that showing `sessionInfo()` helps -- and it does show the BLAS/LAPACK
| libraries. You are not forced to show us this, but by not showing us you
| prevent us from being more focussed on suggestions. So maybe start at your
| end by glancing at sessionInfo() on A and B?
|
| Dirk
|
| | library(RcppArmadillo)
| | library(Rcpp)
| |
| | RcppArmadillo::armadillo_set_number_of_omp_threads(5)
| | print(sprintf("There are %d threads",
| |               RcppArmadillo::armadillo_get_number_of_omp_threads()))
| |
| | src <-
| | r"(#include <RcppArmadillo.h>
| |
| | // [[Rcpp::depends(RcppArmadillo)]]
| |
| | // [[Rcpp::export]]
| | arma::vec getEigenValues(arma::mat M) {
| |   return arma::eig_sym(M);
| | })"
| |
| | size <- 10000
| | m <- matrix(rnorm(size^2), size, size)
| | m <- m * t(m)
| |
| | # This line compiles the above code with the -fopenmp flag.
| | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
| | result <- getEigenValues(m)
| | print(result[1:10])
| |
| | On Fri, Feb 23, 2024 at 12:53 PM Dirk Eddelbuettel <e...@debian.org> wrote:
| |
| |
| | On 23 February 2024 at 09:35, Robin Liu wrote:
| | | Hi all,
| | |
| | | Here is an R script that uses Armadillo to decompose a large matrix and
| | | print the first 10 eigenvalues.
| | |
| | | library(RcppArmadillo)
| | | library(Rcpp)
| | |
| | | src <-
| | | r"(#include <RcppArmadillo.h>
| | |
| | | // [[Rcpp::depends(RcppArmadillo)]]
| | |
| | | // [[Rcpp::export]]
| | | arma::vec getEigenValues(arma::mat M) {
| | |   return arma::eig_sym(M);
| | | })"
| | |
| | | size <- 10000
| | | m <- matrix(rnorm(size^2), size, size)
| | | m <- m * t(m)
| | |
| | | # This line compiles the above code with the -fopenmp flag.
| | | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
| | | result <- getEigenValues(m)
| | | print(result[1:10])
| | |
| | | When I run this code on server A, I see that arma can implicitly leverage
| | | all available cores by running top -H. However, on server B it can only
| | | use one core despite multiple being available: there is just one process
| | | entry in top -H. Both processes successfully exit and return an answer.
| | | The process on server B is of course much slower.
| |
| | It is documented in the package how this is applied and the policy is to
| | NOT blindly enforce one use case (say all cores, or half, or a magically
| | chosen value of N for whatever value of N) but to follow the local admin
| | setting and respect standard environment variables.
| |
| | So I suspect that your machine 'B' differs from machine 'A' in this regard.
| |
| | Note that this is a _run-time_ and not _compile-time_ behavior. As it is
| | for multicore-enabled LAPACK and BLAS libraries, the OpenMP library and
| | basically most software of this type.
| |
| | You can override it, see
| |    RcppArmadillo::armadillo_set_number_of_omp_threads
| |    RcppArmadillo::armadillo_get_number_of_omp_threads
| |
| | Can you try and see if these help you?
| |
| | Dirk
| |
| | | Here is the compilation on server A:
| | | /usr/local/lib/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so'
| | |   'file197c21cbec564.cpp'
| | | g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I../inst/include
| | |   -fopenmp -I"/usr/local/lib/R/site-library/Rcpp/include"
| | |   -I"/usr/local/lib/R/site-library/RcppArmadillo/include"
| | |   -I"/tmp/RtmpwhGRi3/sourceCpp-x86_64-pc-linux-gnu-1.0.9"
| | |   -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat
| | |   -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g
| | |   -c file197c21cbec564.cpp -o file197c21cbec564.o
| | | g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib
| | |   -o sourceCpp_2.so file197c21cbec564.o -fopenmp -llapack -lblas
| | |   -lgfortran -lm -lquadmath -L/usr/local/lib/R/lib -lR
| | |
| | | and here it is for server B:
| | | /sw/R/R-4.2.3/lib64/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so'
| | |   'file158165b9c4ae1.cpp'
| | | g++ -std=gnu++11 -I"/sw/R/R-4.2.3/lib64/R/include" -DNDEBUG
| | |   -I../inst/include -fopenmp
| | |   -I"/home/my_username/.R/library/Rcpp/include"
| | |   -I"/home/my_username/.R/library/RcppArmadillo/include"
| | |   -I"/tmp/RtmpvfPt4l/sourceCpp-x86_64-pc-linux-gnu-1.0.10"
| | |   -I/usr/local/include -fpic -g -O2
| | |   -c file158165b9c4ae1.cpp -o file158165b9c4ae1.o
| | | g++ -std=gnu++11 -shared -L/sw/R/R-4.2.3/lib64/R/lib -L/usr/local/lib64
| | |   -o sourceCpp_2.so file158165b9c4ae1.o -fopenmp -llapack -lblas
| | |   -lgfortran -lm -lquadmath -L/sw/R/R-4.2.3/lib64/R/lib -lR
| | |
| | | I thought that the -fopenmp flag should let arma implicitly parallelize
| | | matrix computations. Any hints as to why this may not work on server B?
| | |
| | | The actual code I'm running is an R package that includes RcppArmadillo
| | | and RcppEnsmallen. Server B is the login node to an hpc cluster, but the
| | | code does not use all cores on the compute nodes either.
| | |
| | | Best,
| | | Robin

--
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
_______________________________________________
Rcpp-devel mailing list
Rcpp-devel@lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel