Yet another question about CPU binding in a SLURM environment...
Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose
of CPU binding?
Full version: when you allocate a job like this, e.g.,
salloc --ntasks=2 --cpus-per-task=4
SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI task.
This is useful for hybrid jobs, where each MPI process spawns some
internal worker threads (e.g., OpenMP). The intention is that 2 MPI
procs are started, each of them 'bound' to its own 4 cores. SLURM will also
set an environment variable
SLURM_CPUS_PER_TASK=4
which should (probably?) be taken into account by the method that
launches the MPI processes to figure out each task's cpuset. In the
case of OpenMPI + mpirun, I think something should happen in
orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually
parsed. Unfortunately, it is never really used...
As a result, the cpuset of every task started on a given compute node
includes all CPU cores of all MPI tasks on that node, just as provided
by SLURM (in the above example, all 8).
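This is easy to see by printing each rank's affinity mask; a small
Linux-specific test program (just a sketch) is enough:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>

/* Print the set of CPUs the calling rank is allowed to run on. */
int main(int argc, char **argv)
{
    cpu_set_t mask;
    int rank, c;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("rank %d may run on cpus:", rank);
        for (c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &mask))
                printf(" %d", c);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with mpirun inside the allocation above,
both ranks report the same 8-core set instead of two disjoint 4-core
ones.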
In general, there is no simple way for the user code in the MPI procs
to 'split' the cores between themselves.
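The closest thing to a workaround I can come up with is re-binding by
hand from inside the application, combining SLURM_CPUS_PER_TASK with
the local rank that OpenMPI exports in OMPI_COMM_WORLD_LOCAL_RANK.
Roughly like this (just a sketch, and a fragile one: it assumes the
allocated cores on the node are numbered contiguously from 0, which
need not be true in general):

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <sched.h>

/* Bind the calling process to its own slice of the node's cores.
 * Fragile: assumes the allocated cores are contiguous and start
 * at 0, which need not hold for a real SLURM allocation. */
static void rebind_to_my_slice(void)
{
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    const char *ct = getenv("SLURM_CPUS_PER_TASK");
    cpu_set_t mask;
    int local_rank, cpus_per_task, c;

    if (!lr || !ct)
        return;
    local_rank = atoi(lr);
    cpus_per_task = atoi(ct);

    CPU_ZERO(&mask);
    for (c = 0; c < cpus_per_task; c++)
        CPU_SET(local_rank * cpus_per_task + c, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
}

Called right after MPI_Init, this keeps each proc (and any OpenMP
threads it spawns later, since they inherit the mask) on its own 4
cores, but it duplicates logic the launcher already has, so I'd much
rather mpirun did it for me.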
I imagine the original intention to support this in OpenMPI was
something like
mpirun --bind-to subtask_cpuset
with an artificial bind target that would cause OpenMPI to divide the
allocated cores between the MPI tasks. Is this right? If so, it seems
that at this point this is not implemented. Are there plans to do this?
If not, does anyone know another way to achieve that?
Thanks a lot!
Marcin