Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
Hi Robin,

On 2 March 2024 at 16:34, Robin Liu wrote:
| sessionInfo() was the right clue. Indeed the version of R on machine B was not
| linked to OpenBLAS. Switching to a version with OpenBLAS allows the test code
| to use all cores.
|
| A clear way to check which library is linked is to run the following:
|
| > extSoftVersion()["BLAS"]

Ah yes -- I keep forgetting about that one. Good reminder!

| Thanks for your help!

Always a pleasure. Glad you are all set.

Dirk
Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
Hi Dirk,

sessionInfo() was the right clue. Indeed, the version of R on machine B was not linked to OpenBLAS. Switching to a version with OpenBLAS allows the test code to use all cores.

A clear way to check which library is linked is to run the following:

> extSoftVersion()["BLAS"]

Thanks for your help!
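To make the linkage check repeatable on both machines, here is a minimal sketch in R, assuming R 3.4.0 or later (where `extSoftVersion()` reports the BLAS and `La_library()` is available). The helper name `blas_report` is hypothetical, and the list of environment variables is an assumption about which settings commonly influence threading; neither comes from the thread itself.

```r
# Hypothetical helper: collect the BLAS/LAPACK libraries this R session is
# linked against, plus thread-related environment variables.
blas_report <- function() {
  # Assumed set of variables; OpenMP and OpenBLAS read the first two,
  # MKL_NUM_THREADS only matters for MKL-linked builds.
  env_names <- c("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
  list(
    blas   = unname(extSoftVersion()["BLAS"]),   # path of the BLAS library, "" if unknown
    lapack = La_library(),                       # path of the LAPACK library in use
    env    = Sys.getenv(env_names, unset = NA)   # NA means the variable is not set
  )
}

report <- blas_report()
str(report)
```

Running this on machines A and B and comparing the `blas` entries should show whether one of them falls back to R's reference BLAS rather than OpenBLAS.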
Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
On 24 February 2024 at 11:44, Robin Liu wrote:
| Thank you Dirk for the response.
|
| I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both machines
| and correctly see that machine A and B have 20 and 40 cores, respectively. I
| also see that calling the setter changes this value.
|
| However, calling the setter does not seem to change the number of cores used on
| either machine A or B. I have updated my code example as below: the execution
| uses 20 cores on machine A and 1 core on machine B as before, despite my
| setting the number of omp threads to 5. Do you have any further hints?

I fear you need to debug that on the machine 'B' in question. It's all open
source. I do not think either Conrad or myself put code in to constrain you
to one core on 'B' (while not doing so on 'A', as you see).

You can grep around both the RcppArmadillo wrapper code and the included
Armadillo code; I suggest making a local copy and peppering in some print
statements.

Also keep in mind that (Rcpp)Armadillo hands off computation to the actual
LAPACK / BLAS implementation on that machine. Lots of things can go wrong
there: maybe R was compiled with its own embedded BLAS/LAPACK sources
(preventing a call out to OpenBLAS even when the machine has it). Or maybe R
was compiled correctly but a single-threaded set of libraries is on the
machine.

You have not supplied any of that information. Many bug report suggestions
hint that showing `sessionInfo()` helps -- and it does show the BLAS/LAPACK
libraries. You are not forced to show us this, but by not showing us you
prevent us from being more focussed on suggestions. So maybe start at your
end by glancing at sessionInfo() on A and B?

Dirk
Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
Thank you Dirk for the response.

I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both machines and correctly see that machine A and B have 20 and 40 cores, respectively. I also see that calling the setter changes this value.

However, calling the setter does not seem to change the number of cores used on either machine A or B. I have updated my code example as below: the execution uses 20 cores on machine A and 1 core on machine B as before, despite my setting the number of omp threads to 5. Do you have any further hints?

library(RcppArmadillo)
library(Rcpp)

RcppArmadillo::armadillo_set_number_of_omp_threads(5)
print(sprintf("There are %d threads",
              RcppArmadillo::armadillo_get_number_of_omp_threads()))

src <-
r"(#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec getEigenValues(arma::mat M) {
  return arma::eig_sym(M);
})"

size <- 1
m <- matrix(rnorm(size^2), size, size)
m <- m * t(m)

# This line compiles the above code with the -fopenmp flag.
sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
result <- getEigenValues(m)
print(result[1:10])
Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
On 23 February 2024 at 09:35, Robin Liu wrote:
| When I run this code on server A, I see that arma can implicitly leverage all
| available cores by running top -H. However, on server B it can only use one
| core despite multiple being available: there is just one process entry in top
| -H. Both processes successfully exit and return an answer. The process on
| server B is of course much slower.

It is documented in the package how this is applied, and the policy is to NOT
blindly enforce one use case (say all cores, or half, or a magically chosen
value of N for whatever value of N) but to follow the local admin setting and
respect standard environment variables.

So I suspect that your machine 'B' differs from machine 'A' in this regard.

Note that this is a _run-time_ and not a _compile-time_ behavior, as it is for
multicore-enabled LAPACK and BLAS libraries, the OpenMP library and basically
most software of this type.

You can override it, see
  RcppArmadillo::armadillo_set_number_of_omp_threads
  RcppArmadillo::armadillo_get_number_of_omp_threads

Can you try and see if these help you?

Dirk

--
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
___
Rcpp-devel mailing list
Rcpp-devel@lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
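One rough way to see whether the BLAS that R is linked against actually runs multithreaded at run time is to compare CPU time with wall-clock time for a BLAS-heavy operation. This is only a sketch under assumptions: the matrix size `n` is arbitrary, `crossprod()` stands in for any BLAS-backed call, and a ratio near 1 merely suggests (it does not prove) single-threaded execution.

```r
set.seed(42)
n <- 1000                        # arbitrary size, large enough to take measurable time
a <- matrix(rnorm(n * n), n, n)

# crossprod(a) computes t(a) %*% a via the BLAS that R is linked against.
timing <- system.time(s <- crossprod(a))

cpu     <- timing[["user.self"]] + timing[["sys.self"]]  # CPU seconds summed over all threads
elapsed <- max(timing[["elapsed"]], 1e-3)                # guard against a zero denominator
ratio   <- cpu / elapsed

cat(sprintf("CPU/elapsed ratio: %.2f\n", ratio))
```

A ratio well above 1 indicates that several threads were busy at once; comparing the number on two machines is more informative than either value alone.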
[Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
Hi all,

Here is an R script that uses Armadillo to decompose a large matrix and print the first 10 eigenvalues.

library(RcppArmadillo)
library(Rcpp)

src <-
r"(#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec getEigenValues(arma::mat M) {
  return arma::eig_sym(M);
})"

size <- 1
m <- matrix(rnorm(size^2), size, size)
m <- m * t(m)

# This line compiles the above code with the -fopenmp flag.
sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
result <- getEigenValues(m)
print(result[1:10])

When I run this code on server A, I see that arma can implicitly leverage all available cores by running top -H. However, on server B it can only use one core despite multiple being available: there is just one process entry in top -H. Both processes successfully exit and return an answer. The process on server B is of course much slower.

Here is the compilation on server A:

/usr/local/lib/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so' 'file197c21cbec564.cpp'
g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I../inst/include -fopenmp -I"/usr/local/lib/R/site-library/Rcpp/include" -I"/usr/local/lib/R/site-library/RcppArmadillo/include" -I"/tmp/RtmpwhGRi3/sourceCpp-x86_64-pc-linux-gnu-1.0.9" -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c file197c21cbec564.cpp -o file197c21cbec564.o
g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib -o sourceCpp_2.so file197c21cbec564.o -fopenmp -llapack -lblas -lgfortran -lm -lquadmath -L/usr/local/lib/R/lib -lR

and here it is for server B:

/sw/R/R-4.2.3/lib64/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so' 'file158165b9c4ae1.cpp'
g++ -std=gnu++11 -I"/sw/R/R-4.2.3/lib64/R/include" -DNDEBUG -I../inst/include -fopenmp -I"/home/my_username/.R/library/Rcpp/include" -I"/home/my_username/.R/library/RcppArmadillo/include" -I"/tmp/RtmpvfPt4l/sourceCpp-x86_64-pc-linux-gnu-1.0.10" -I/usr/local/include -fpic -g -O2 -c file158165b9c4ae1.cpp -o file158165b9c4ae1.o
g++ -std=gnu++11 -shared -L/sw/R/R-4.2.3/lib64/R/lib -L/usr/local/lib64 -o sourceCpp_2.so file158165b9c4ae1.o -fopenmp -llapack -lblas -lgfortran -lm -lquadmath -L/sw/R/R-4.2.3/lib64/R/lib -lR

I thought that the -fopenmp flag should let arma implicitly parallelize matrix computations. Any hints as to why this may not work on server B?

The actual code I'm running is an R package that includes RcppArmadillo and RcppEnsmallen. Server B is the login node to an HPC cluster, but the code does not use all cores on the compute nodes either.

Best,
Robin