Re: [R-pkg-devel] RFC: an interface to manage use of parallelism in packages

2023-11-21 Thread Ivan Krylov
On Mon, 20 Nov 2023 21:34:55 -0500
Andrew Robbins via R-package-devel
writes:

> In my (pending) package, I currently have a compile-time check for
> POSIX-threaded OpenBLAS which calls openblas_set_num_threads before
> and after OpenMP blocks to stop this behavior from occurring.

This may not be enough. In distributions like Debian, it's possible (if
not common) to install multiple BLAS implementations and switch between
them from one program run to the next using update-alternatives, so a
package can be compiled and linked against one BLAS and then run against
another.

May I suggest using dlsym() to check whether openblas_get_parallel()
is present in the current process and performing the check at run time?
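
For illustration, a minimal C sketch of such a run-time check (the
wrapper name is made up; RTLD_DEFAULT needs _GNU_SOURCE with glibc, and
older glibc also needs -ldl at link time):

    #define _GNU_SOURCE
    #include <dlfcn.h>

    typedef int (*get_parallel_t)(void);

    /* Returns -1 if the BLAS loaded into this process does not export
       openblas_get_parallel(); otherwise OpenBLAS's own code, which, as
       far as I know, is 0 for sequential, 1 for pthreads, 2 for OpenMP. */
    static int openblas_threading_mode(void)
    {
        get_parallel_t f =
            (get_parallel_t) dlsym(RTLD_DEFAULT, "openblas_get_parallel");
        return f == NULL ? -1 : f();
    }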

(Sorry about not returning to this topic earlier. I do think that a
semaphore is the right choice and I do intend to implement this
interface using semaphores within a few weeks if not a few days.)

-- 
Best regards,
Ivan

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] RFC: an interface to manage use of parallelism in packages

2023-11-21 Thread Andrew Robbins via R-package-devel

Hi all,

While we're on the topic of BLAS thread handling, there's also the
matter of OpenBLAS's two threading implementations: depending on how the
library is built, it uses either OpenMP or POSIX threads.


This can cause issues in packages that call BLAS functions inside an
OpenMP block, as the POSIX-threaded variant will attempt to spawn N
threads for each OpenMP thread (where N is the system-configured
OpenBLAS thread count). This explosion of threads causes a pretty severe
performance degradation. In my (pending) package, I currently have a 
compile-time check for POSIX-threaded OpenBLAS which calls 
openblas_set_num_threads before and after OpenMP blocks to stop this 
behavior from occurring. If R were to implement some kind of internal 
FlexiBLAS, this would need to be taken into account.
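
For concreteness, a sketch of what such a guard can look like in C
(assuming a pthreads OpenBLAS was detected; do_work() and the loop body
are placeholders):

    extern int  openblas_get_num_threads(void);
    extern void openblas_set_num_threads(int);

    void do_work(int n, double *x)
    {
        int saved = openblas_get_num_threads();
        openblas_set_num_threads(1);      /* one BLAS thread per OpenMP thread */
    #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            /* ... per-iteration work on x[i] that calls into BLAS ... */
        }
        openblas_set_num_threads(saved);  /* restore the previous setting */
    }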



Best,

--
Andrew Robbins
Systems Analyst, Welch Lab
University of Michigan
Department of Computational Medicine and Bioinformatics



__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] RFC: an interface to manage use of parallelism in packages

2023-11-03 Thread Vladimir Dergachev




On Wed, 25 Oct 2023, Ivan Krylov wrote:


Summary: at the end of this message is a link to an R package
implementing an interface for managing the use of execution units in R
packages. As a package maintainer, would you agree to use something
like this? Does it look sufficiently reasonable to become a part of R?
Read on for why I made these particular interface choices.

My understanding of the problem stated by Simon Urbanek and Uwe Ligges
[1,2] is that we need a way to set and distribute the CPU core
allowance between multiple packages that could be using very different
methods to achieve parallel execution on the local machine, including
threads and child processes. We could have multiple well-meaning
packages, each of them calling each other using a different parallelism
technology: imagine parallel::makeCluster(getOption('mc.cores'))
combined with parallel::mclapply(mc.cores = getOption('mc.cores')) and
with an OpenMP program that also spawns getOption('mc.cores') threads.
A parallel BLAS or custom multi-threading using std::thread could add
more fuel to the fire.



Hi Ivan,

  Generally, I like the idea. A few comments:

  * from a package developer's point of view, I would prefer to have a clear 
idea of how many threads I could use. So having a core R function like 
"getMaxThreads()" or similar would be useful. What that function returns 
could be governed by a package.


  In fact, it might be a good idea to allow several packages to implement 
"thread governors" for different situations.


  * it would make sense to think through whether we want (or not) to allow 
package developers to call omp_set_num_threads() themselves, or whether this 
should be done by R.


  This is hairier than you might think. Allowing it forces every package 
to call omp_set_num_threads() before each OpenMP block, because there is no 
way to know which package was called before.


  Not allowing packages to call omp_set_num_threads() might make it 
difficult to use all the threads, and would force R to initialize OpenMP on 
startup.
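
  One way to sidestep the global state (a sketch; getAllowedThreads() is a
hypothetical helper standing in for whatever allowance mechanism is agreed
on) is to pass the count per region through the num_threads clause:

    int getAllowedThreads(void);   /* hypothetical: option, env var, R API */

    void scale(int n, double *x)
    {
    #pragma omp parallel for num_threads(getAllowedThreads())
        for (int i = 0; i < n; i++)
            x[i] *= 2.0;           /* per-region thread count, no global call */
    }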


 * Speaking of OpenMP initialization, I have seen situations where 
spawning some regular pthreads and then initializing OpenMP forces 
all of the pthreads onto a single CPU.


  I think this is because OpenMP sets thread affinity for all of the 
process's threads, but only distributes its own across the cores.


 * This also raises the question of how affinity is managed. If you have 
called makeForkCluster() to create 10 R instances and each of them then 
uses 2 OpenMP threads, you do not want them all squeezed onto only 2 CPU 
execution threads instead of 20.


 * From the user's perspective, it might be useful to be able to limit the 
number of threads per package using patterns or regular expressions.
Often, the reason for limiting the number of threads is to reduce memory 
usage.


 * Speaking of memory usage, glibc has parameters like MALLOC_ARENA_MAX 
that have a great impact on the memory usage of multithreaded programs. I 
usually set it to 1, but then take extra care to make as few memory 
allocation calls as possible within individual threads.
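
  A rough sketch of setting the same knob from code rather than from the
environment (mallopt() and M_ARENA_MAX are glibc-specific and should run
before the extra threads start allocating):

    #include <malloc.h>

    static void limit_malloc_arenas(void)
    {
    #ifdef M_ARENA_MAX
        mallopt(M_ARENA_MAX, 1);   /* same effect as MALLOC_ARENA_MAX=1 */
    #endif
    }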


best

Vladimir Dergachev

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] RFC: an interface to manage use of parallelism in packages

2023-11-02 Thread Ivan Krylov
On Wed, 25 Oct 2023 13:54:53 -0700
"Reed A. Cartwright" writes:

> For a comparison, I'd recommend looking at how GNU make does parallel
> processing. It uses the concept of a job server and job slots. What I
> like about it is that it is implemented at the OS level because make
> needs to support interacting with non-make processes. On Windows it
> uses a named semaphore, and on Unix-like systems it uses named pipes
> or simple pipes to pass tokens around.

Thank you for pointing me towards the job server in GNU make! This is
exactly the kind of suggestion I was looking for. I also appreciate you
signing up for an account on Codeberg to give me a more detailed
writeup.

I agree that named semaphores seem to be a better fit for the job than
my first design. There are some corner cases to handle (what if the
parent process starts up and the semaphore already exists? what if the
user wants to reduce the allowance but the semaphore count is lower than
expected? what are the strongest restrictions on a POSIX semaphore name?
I've seen implementations placing them under /dev/shm%s and /tmp/%s.sem
so far), but it should be possible for me to implement the core concept
in a few days.
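
For reference, the creation corner case looks roughly like this in C (a
sketch; the name must start with '/' and contain no further slashes, and
what to do with a leftover semaphore is exactly the policy question above):

    #include <errno.h>
    #include <fcntl.h>
    #include <semaphore.h>

    static sem_t *open_allowance(const char *name, unsigned int initial)
    {
        /* O_EXCL tells us whether we are the one creating it. */
        sem_t *s = sem_open(name, O_CREAT | O_EXCL, 0600, initial);
        if (s != SEM_FAILED)
            return s;                /* fresh semaphore, count = initial */
        if (errno != EEXIST)
            return SEM_FAILED;       /* genuine failure */
        /* Left over from an earlier run: attach to it; whether to trust its
           current count or sem_unlink() and recreate it is a policy choice. */
        return sem_open(name, 0);
    }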

Additional care might be needed regarding the number of BLAS threads. R
might have to become its own FlexiBLAS and pass an additional
environment variable to children to ensure that the limit is taken into
account.

-- 
Best regards,
Ivan

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] RFC: an interface to manage use of parallelism in packages

2023-10-25 Thread Reed A. Cartwright
Hi Ivan,

Interesting package, and I'll provide more feedback later. For a
comparison, I'd recommend looking at how GNU make does parallel processing.
It uses the concept of a job server and job slots. What I like about it is
that it is implemented at the OS level because make needs to support
interacting with non-make processes. On Windows it uses a named semaphore,
and on Unix-like systems it uses named pipes or simple pipes to pass tokens
around.

https://www.gnu.org/software/make/manual/make.html#Job-Slots
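
The Unix side of the token handshake is tiny; a sketch (the descriptors
come from --jobserver-auth=R,W in MAKEFLAGS; error handling and the
implicit slot every child already owns are omitted):

    #include <unistd.h>

    /* Acquire an extra job slot: blocks until another job returns a token. */
    static int acquire_slot(int rfd, char *token)
    {
        return read(rfd, token, 1) == 1 ? 0 : -1;
    }

    /* Release the slot by writing the same token back into the pipe. */
    static void release_slot(int wfd, char token)
    {
        (void) write(wfd, &token, 1);
    }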

Cheers,
Reed


On Wed, Oct 25, 2023 at 5:55 AM Ivan Krylov  wrote:

> Summary: at the end of this message is a link to an R package
> implementing an interface for managing the use of execution units in R
> packages. As a package maintainer, would you agree to use something
> like this? Does it look sufficiently reasonable to become a part of R?
> Read on for why I made these particular interface choices.
>
> My understanding of the problem stated by Simon Urbanek and Uwe Ligges
> [1,2] is that we need a way to set and distribute the CPU core
> allowance between multiple packages that could be using very different
> methods to achieve parallel execution on the local machine, including
> threads and child processes. We could have multiple well-meaning
> packages, each of them calling each other using a different parallelism
> technology: imagine parallel::makeCluster(getOption('mc.cores'))
> combined with parallel::mclapply(mc.cores = getOption('mc.cores')) and
> with an OpenMP program that also spawns getOption('mc.cores') threads.
> A parallel BLAS or custom multi-threading using std::thread could add
> more fuel to the fire.
>
> Workarounds applied by the package maintainers nowadays are both
> cumbersome (sometimes one has to talk to some package that lives
> downstream in the call stack and isn't even an explicit dependency,
> because it's the one responsible for the threads) and not really enough
> (most maintainers forget to restore the state after they are done, so a
> single example() may slow down the operations that follow).
>
> The problem is complicated by the fact that not every parallel
> operation can explicitly accept the CPU core limit as a parameter. For
> example, data.table's implicit parallelism is very convenient, and so
> are parallel BLASes (which don't have a standard interface to change
> the number of threads), so we shouldn't be prohibiting implicit
> parallelism.
>
> It's also not always obvious how to split the cores between the
> potentially parallel sections. While it's typically best to start with
> the outer loop (e.g. better have 16 R processes solving relatively
> small linear algebra problems back to back than have one R process
> spinning 15 of its 16 OpenBLAS threads in sched_yield()), it may be
> more efficient to give all 16 threads back to BLAS (and save on
> transferring the problems and solutions between processes) once the
> problems become large enough to give enough work to all of the cores.
>
> So as a user, I would like an interface that would both let me give all
> of the cores to the program if that's what I need (something like
> setCPUallowance(parallelly::availableCores())) _and_ let me be more
> detailed when necessary (something like setCPUallowance(overall = 7,
> packages = c(foobar = 1), BLAS = 2) to limit BLAS threads to 2,
> disallow parallelism in the foobar package because it wastes too much
> time, and limit R as a whole to 7 cores because I want to surf the 'net
> on the remaining one while the Monte-Carlo simulation is going on). As
> a package developer, I'd rather not think about any of that and just
> use a function call like getCPUallowance() for the default number of
> cores in every situation.
>
> Can we implement such an interface? The main obstacle here is not being
> able to know when each parallel region begins and ends. Does the
> package call fork()? std::thread{}? Start a local mirai cluster? We
> have to trust (and verify during R CMD check) the package to create the
> given number of units of execution and tell us when they are done.
>
> The closest interface that I see being implementable is a system of
> tokens with reference semantics: getCPUallowance() returns a special
> object containing the number of tokens the caller is allowed to use and
> sets an environment variable with the remaining number of cores. Any R
> child processes pick up the number of cores from the environment
> variable. Any downstream calls to getCPUallowance(), aware of the
> tokens already handed out, return a reduced number of remaining CPU
> cores. Once the package is done executing a parallel section, it
> returns the CPU allowance back to R by calling something like
> close(token), which updates the internal allowance value (and the
> environment variable). (A finalizer can also be set on the tokens to
> ensure that CPU cores won't be lost.)
>
> Here's a package implementing this idea:
> <
> 

[R-pkg-devel] RFC: an interface to manage use of parallelism in packages

2023-10-25 Thread Ivan Krylov
Summary: at the end of this message is a link to an R package
implementing an interface for managing the use of execution units in R
packages. As a package maintainer, would you agree to use something
like this? Does it look sufficiently reasonable to become a part of R?
Read on for why I made these particular interface choices.

My understanding of the problem stated by Simon Urbanek and Uwe Ligges
[1,2] is that we need a way to set and distribute the CPU core
allowance between multiple packages that could be using very different
methods to achieve parallel execution on the local machine, including
threads and child processes. We could have multiple well-meaning
packages, each of them calling each other using a different parallelism
technology: imagine parallel::makeCluster(getOption('mc.cores'))
combined with parallel::mclapply(mc.cores = getOption('mc.cores')) and
with an OpenMP program that also spawns getOption('mc.cores') threads.
A parallel BLAS or custom multi-threading using std::thread could add
more fuel to the fire.

Workarounds applied by the package maintainers nowadays are both
cumbersome (sometimes one has to talk to some package that lives
downstream in the call stack and isn't even an explicit dependency,
because it's the one responsible for the threads) and not really enough
(most maintainers forget to restore the state after they are done, so a
single example() may slow down the operations that follow).

The problem is complicated by the fact that not every parallel
operation can explicitly accept the CPU core limit as a parameter. For
example, data.table's implicit parallelism is very convenient, and so
are parallel BLASes (which don't have a standard interface to change
the number of threads), so we shouldn't be prohibiting implicit
parallelism.

It's also not always obvious how to split the cores between the
potentially parallel sections. While it's typically best to start with
the outer loop (e.g. better have 16 R processes solving relatively
small linear algebra problems back to back than have one R process
spinning 15 of its 16 OpenBLAS threads in sched_yield()), it may be
more efficient to give all 16 threads back to BLAS (and save on
transferring the problems and solutions between processes) once the
problems become large enough to give enough work to all of the cores.

So as a user, I would like an interface that would both let me give all
of the cores to the program if that's what I need (something like
setCPUallowance(parallelly::availableCores())) _and_ let me be more
detailed when necessary (something like setCPUallowance(overall = 7,
packages = c(foobar = 1), BLAS = 2) to limit BLAS threads to 2,
disallow parallelism in the foobar package because it wastes too much
time, and limit R as a whole to 7 cores because I want to surf the 'net
on the remaining one while the Monte-Carlo simulation is going on). As
a package developer, I'd rather not think about any of that and just
use a function call like getCPUallowance() for the default number of
cores in every situation.

Can we implement such an interface? The main obstacle here is not being
able to know when each parallel region begins and ends. Does the
package call fork()? std::thread{}? Start a local mirai cluster? We
have to trust (and verify during R CMD check) the package to create the
given number of units of execution and tell us when they are done.

The closest interface that I see being implementable is a system of
tokens with reference semantics: getCPUallowance() returns a special
object containing the number of tokens the caller is allowed to use and
sets an environment variable with the remaining number of cores. Any R
child processes pick up the number of cores from the environment
variable. Any downstream calls to getCPUallowance(), aware of the
tokens already handed out, return a reduced number of remaining CPU
cores. Once the package is done executing a parallel section, it
returns the CPU allowance back to R by calling something like
close(token), which updates the internal allowance value (and the
environment variable). (A finalizer can also be set on the tokens to
ensure that CPU cores won't be lost.)
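
To make the bookkeeping concrete, here is a rough C sketch of the
environment-variable side of the idea (the variable name R_CPU_ALLOWANCE is
made up, and the real interface hands out a token object rather than bare
integers):

    #include <stdio.h>
    #include <stdlib.h>

    /* Claim up to `wanted` cores and export the reduced allowance so that
       child processes only see what is left.  Returns the number granted. */
    static int acquire_cores(int wanted)
    {
        const char *s = getenv("R_CPU_ALLOWANCE");
        int avail = s ? atoi(s) : 1;
        int granted = wanted < avail ? wanted : avail;
        char buf[32];
        snprintf(buf, sizeof buf, "%d", avail - granted);
        setenv("R_CPU_ALLOWANCE", buf, 1);
        return granted;
    }

    /* close(token) in the R interface boils down to giving the cores back. */
    static void release_cores(int granted)
    {
        const char *s = getenv("R_CPU_ALLOWANCE");
        char buf[32];
        snprintf(buf, sizeof buf, "%d", (s ? atoi(s) : 0) + granted);
        setenv("R_CPU_ALLOWANCE", buf, 1);
    }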

Here's a package implementing this idea:
. Currently missing are
terrible hacks to determine the BLAS type at runtime and resolve the
necessary symbols to set the number of BLAS threads, depending on
whether it's OpenBLAS, flexiblas, MKL, or something else. Does it feel
over-engineered? I hope that, even if not a good solution, this would
let us move towards a unified solution that could just work™ on
everything ranging from laptops to CRAN testing machines to HPCs.

-- 
Best regards,
Ivan

[1] https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009484.html

[2] https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009513.html

__
R-package-devel@r-project.org mailing list