Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores

2024-03-02 Thread Dirk Eddelbuettel

Hi Robin,

On 2 March 2024 at 16:34, Robin Liu wrote:
| sessionInfo() was the right clue. Indeed the version of R on machine B was not
| linked to OpenBLAS. Switching to a version with OpenBLAS allows the test code
| to use all cores.
| 
| A clear way to check which library is linked is to run the following:
| 
| > extSoftVersion()["BLAS"]

Ah yes -- I keep forgetting about that one. Good reminder!
 
| Thanks for your help!

Always a pleasure. Glad you are all set.

Dirk

 

Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores

2024-03-02 Thread Robin Liu
Hi Dirk,

sessionInfo() was the right clue. Indeed the version of R on machine B was
not linked to OpenBLAS. Switching to a version with OpenBLAS allows the
test code to use all cores.

A clear way to check which library is linked is to run the following:

> extSoftVersion()["BLAS"]

Thanks for your help!
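
For readers who find this thread later, the check above can be extended into a slightly fuller report. This is only a sketch using base R functions (`sessionInfo()`, `extSoftVersion()`, `La_library()`, `La_version()`); the variable names `blas_info` and `looks_reference` are made up for illustration, and the file-name test at the end is a heuristic rather than a reliable detector, since symlinks or the alternatives system can hide the real implementation.

```r
# Report which BLAS/LAPACK shared libraries this R session is using.
# Every function called here ships with base R (R >= 3.4.0).
si <- sessionInfo()

blas_info <- c(
  blas_ext     = unname(extSoftVersion()["BLAS"]),  # BLAS path from extSoftVersion()
  blas_session = si$BLAS,                           # BLAS path from sessionInfo()
  lapack       = La_library(),                      # LAPACK library path
  lapack_ver   = La_version()                       # LAPACK version string
)
print(blas_info)

# Heuristic only (an assumption, not a guarantee): a path containing
# "Rblas" usually means R's bundled reference BLAS, which is single-threaded.
looks_reference <- grepl("Rblas", blas_info[["blas_ext"]], fixed = TRUE)
print(looks_reference)
```

Running this on both machines and comparing the output should show quickly whether they are linked against different libraries.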


Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores

2024-02-24 Thread Dirk Eddelbuettel

On 24 February 2024 at 11:44, Robin Liu wrote:
| Thank you Dirk for the response.
| 
| I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both machines
| and correctly see that machine A and B have 20 and 40 cores, respectively. I
| also see that calling the setter changes this value.
| 
| However, calling the setter does not seem to change the number of cores used
| on either machine A or B. I have updated my code example as below: the
| execution uses 20 cores on machine A and 1 core on machine B as before,
| despite my setting the number of omp threads to 5. Do you have any further
| hints?

I fear you need to debug that on machine 'B' itself. It's all open source. I
do not think either Conrad or I put in code that would constrain you to one
core on 'B' but not on 'A'.

You can grep around both the RcppArmadillo wrapper code and the included
Armadillo code; I suggest making a local copy and peppering in some print
statements.

Also keep in mind that (Rcpp)Armadillo hands off computation to the actual
LAPACK / BLAS implementation on that machine. Lots of things can go wrong
there: maybe R was compiled with its own embedded BLAS/LAPACK sources
(preventing a call out to OpenBLAS even when the machine has it). Or maybe R
was compiled correctly but a single-threaded set of libraries is on the
machine.

You have not supplied any of that information. Many bug-reporting guides
suggest showing `sessionInfo()`, and it does show the BLAS/LAPACK libraries.
You are not forced to show us this, but by not showing it you prevent us from
making more focused suggestions. So maybe start at your end by glancing at
sessionInfo() on A and B?

Dirk
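
One rough way to test the "single-threaded set of libraries" possibility from within R is to time a BLAS-heavy call and compare CPU time with wall-clock time. The sketch below uses only base R; the matrix size `n` and the variable names are arbitrary choices for illustration, and the ratio is an indication rather than a proof, because timer resolution and other load on the machine affect it.

```r
# Rough check of whether R's BLAS runs multi-threaded: for a BLAS-heavy call,
# CPU time (user + system) well above wall-clock time suggests several threads.
set.seed(42)
n <- 1500                                  # hypothetical size; raise for a clearer signal
x <- matrix(rnorm(n * n), n, n)

tm <- system.time(xtx <- crossprod(x))     # crossprod() is computed by the BLAS
cpu     <- sum(tm[c("user.self", "sys.self")])
elapsed <- tm[["elapsed"]]
print(tm)

# A ratio near 1 is consistent with a single thread; a ratio well above 1
# indicates that the work was spread over several cores.
cat(sprintf("CPU/elapsed ratio: %.2f\n", cpu / max(elapsed, 1e-6)))
```

If machine A reports a ratio well above 1 and machine B stays near 1, the difference lies in the linked BLAS or its run-time settings rather than in the compiled Armadillo code.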

 

Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores

2024-02-24 Thread Robin Liu
Thank you Dirk for the response.

I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both
machines and correctly see that machine A and B have 20 and 40 cores,
respectively. I also see that calling the setter changes this value.

However, calling the setter does not seem to change the number of cores
used on either machine A or B. I have updated my code example as below: the
execution uses 20 cores on machine A and 1 core on machine B as before,
despite my setting the number of omp threads to 5. Do you have any further
hints?

library(RcppArmadillo)
library(Rcpp)

RcppArmadillo::armadillo_set_number_of_omp_threads(5)
print(sprintf("There are %d threads",
  RcppArmadillo::armadillo_get_number_of_omp_threads()))

src <-
r"(#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec getEigenValues(arma::mat M) {
  return arma::eig_sym(M);
})"

size <- 1
m <- matrix(rnorm(size^2), size, size)
m <- m * t(m)

# This line compiles the above code with the -fopenmp flag.
sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
result <- getEigenValues(m)
print(result[1:10])


Re: [Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores

2024-02-23 Thread Dirk Eddelbuettel


On 23 February 2024 at 09:35, Robin Liu wrote:
| Hi all,
| 
| Here is an R script that uses Armadillo to decompose a large matrix and print
| the first 10 eigenvalues.
| 
| library(RcppArmadillo)
| library(Rcpp)
| 
| src <-
| r"(#include <RcppArmadillo.h>
| 
| // [[Rcpp::depends(RcppArmadillo)]]
| 
| // [[Rcpp::export]]
| arma::vec getEigenValues(arma::mat M) {
|   return arma::eig_sym(M);
| })"
| 
| size <- 1
| m <- matrix(rnorm(size^2), size, size)
| m <- m * t(m)
| 
| # This line compiles the above code with the -fopenmp flag.
| sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
| result <- getEigenValues(m)
| print(result[1:10])
| 
| When I run this code on server A, I see that arma can implicitly leverage all
| available cores by running top -H. However, on server B it can only use one
| core despite multiple being available: there is just one process entry in top
| -H. Both processes successfully exit and return an answer. The process on
| server B is of course much slower.

It is documented in the package how this is applied, and the policy is to NOT
blindly enforce one use case (say all cores, or half, or a magically chosen
value of N for whatever value of N) but to follow the local admin setting and
respect standard environment variables.

So I suspect that your machine 'B' differs from machine 'A' in this regard.

Note that this is a _run-time_ and not a _compile-time_ behavior, as it is for
multicore-enabled LAPACK and BLAS libraries, the OpenMP library and basically
most software of this type.

You can override it, see
  RcppArmadillo::armadillo_set_number_of_omp_threads
  RcppArmadillo::armadillo_get_number_of_omp_threads

Can you try and see if these help you?

Dirk
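
Because these are run-time settings, a quick first comparison between two machines is to look at the thread-related environment variables each R session sees. The following is a minimal sketch, not part of RcppArmadillo itself: the list of variables is an assumption about what the local OpenMP runtime and BLAS honor (`OMP_NUM_THREADS` for OpenMP, `OPENBLAS_NUM_THREADS` for OpenBLAS, `MKL_NUM_THREADS` for Intel MKL), and the names `vars`, `thread_env` and `n_cores` are illustrative.

```r
# Thread-related environment variables that are commonly honored at run time.
# Which of them matter depends on the OpenMP runtime and BLAS actually in use.
vars <- c("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
thread_env <- Sys.getenv(vars, unset = NA)   # NA means "not set"
print(thread_env)

# Logical cores visible to R, for comparison.
n_cores <- parallel::detectCores()
print(n_cores)

# The RcppArmadillo getter named above; guarded so the snippet still
# runs on a machine where the package is not installed.
if (requireNamespace("RcppArmadillo", quietly = TRUE)) {
  print(RcppArmadillo::armadillo_get_number_of_omp_threads())
}
```

A value of 1 in any of these variables on one machine but not the other would explain a difference in core usage without any change to the compiled code.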


-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
___
Rcpp-devel mailing list
Rcpp-devel@lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel


[Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores

2024-02-23 Thread Robin Liu
Hi all,

Here is an R script that uses Armadillo to decompose a large matrix and
print the first 10 eigenvalues.

library(RcppArmadillo)
library(Rcpp)

src <-
r"(#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec getEigenValues(arma::mat M) {
  return arma::eig_sym(M);
})"

size <- 1
m <- matrix(rnorm(size^2), size, size)
m <- m * t(m)

# This line compiles the above code with the -fopenmp flag.
sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
result <- getEigenValues(m)
print(result[1:10])

When I run this code on server A, I see that arma can implicitly leverage
all available cores by running top -H. However, on server B it can only use
one core despite multiple being available: there is just one process entry
in top -H. Both processes successfully exit and return an answer. The
process on server B is of course much slower.

Here is the compilation on server A:
/usr/local/lib/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so'
'file197c21cbec564.cpp'
g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I../inst/include
-fopenmp  -I"/usr/local/lib/R/site-library/Rcpp/include"
-I"/usr/local/lib/R/site-library/RcppArmadillo/include"
-I"/tmp/RtmpwhGRi3/sourceCpp-x86_64-pc-linux-gnu-1.0.9"
-I/usr/local/include   -fpic  -g -O2 -fstack-protector-strong -Wformat
-Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c
file197c21cbec564.cpp -o file197c21cbec564.o
g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib -o
sourceCpp_2.so file197c21cbec564.o -fopenmp -llapack -lblas -lgfortran -lm
-lquadmath -L/usr/local/lib/R/lib -lR

and here it is for server B:
/sw/R/R-4.2.3/lib64/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so'
'file158165b9c4ae1.cpp'
g++ -std=gnu++11 -I"/sw/R/R-4.2.3/lib64/R/include" -DNDEBUG
-I../inst/include -fopenmp  -I"/home/my_username/.R/library/Rcpp/include"
-I"/home/ my_username/.R/library/RcppArmadillo/include"
-I"/tmp/RtmpvfPt4l/sourceCpp-x86_64-pc-linux-gnu-1.0.10"
-I/usr/local/include   -fpic  -g -O2  -c file158165b9c4ae1.cpp -o
file158165b9c4ae1.o
g++ -std=gnu++11 -shared -L/sw/R/R-4.2.3/lib64/R/lib -L/usr/local/lib64 -o
sourceCpp_2.so file158165b9c4ae1.o -fopenmp -llapack -lblas -lgfortran -lm
-lquadmath -L/sw/R/R-4.2.3/lib64/R/lib -lR

I thought that the -fopenmp flag should let arma implicitly parallelize
matrix computations. Any hints as to why this may not work on server B?

The actual code I'm running is an R package that includes RcppArmadillo and
RcppEnsmallen. Server B is the login node to an hpc cluster, but the code
does not use all cores on the compute nodes either.

Best,
Robin