Dear Support,
the answer seems to be simple, but it also seems to be wrong!
Below you can see the description of how SLURM_OVERCOMMIT should operate.
SLURM_CPUS_PER_TASK
(default is 1) allows you to assign multiple CPUs to each
(multithreaded) process in your job to improve performance. SRUN's
-c (lowercase) option sets this variable. See the SRUN sections of
the SLURM Reference Manual <https://computing.llnl.gov/LCdocs/slurm>
for usage details.
SLURM_OVERCOMMIT
(default is NO) allows you to assign more than one process per CPU
(the opposite of the previous variable). SRUN's -O (uppercase)
option sets this variable, which is /not/ intended to facilitate
pthreads applications. See the SRUN sections of the SLURM Reference
Manual <https://computing.llnl.gov/LCdocs/slurm> for usage details.
I don't understand how you can derive what you have written below from
the description above! It is not that only one slot per node is
allowed; rather, more than one process per CPU (slot) is allowed!!!
Remark: In version 1.2.8, SLURM_OVERCOMMIT=1 still worked correctly!
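
For illustration, a small helper of the following kind (hypothetical,
not the attached mpi_hello.c) can be used to print the SLURM variables
in question, so that one can compare directly what the two queues
export:

/* print_slurm_env.c - hypothetical helper, for illustration only:
 * prints the SLURM variables discussed in this thread so the values
 * exported by the two queues can be compared. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *vars[] = {
        "SLURM_OVERCOMMIT",
        "SLURM_CPUS_PER_TASK",
        "SLURM_TASKS_PER_NODE",
    };
    for (size_t i = 0; i < sizeof(vars) / sizeof(vars[0]); ++i) {
        const char *val = getenv(vars[i]);
        printf("%s=%s\n", vars[i], val ? val : "(unset)");
    }
    return 0;
}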
Sincerely yours
H. Häfner
On Mar 25, 2009, at 11:14 AM, Ralph Castain <r...@lanl.gov> wrote:
The answer is simple: when you set SLURM_OVERCOMMIT=1, the SLURM
environment variables tell us that only one slot per node is available
for your use. This is done via the SLURM_TASKS_PER_NODE envar.
So we can only launch 1 proc/node, as that is all SLURM is allowing us
to do.
Ralph
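
(For reference: SLURM_TASKS_PER_NODE normally uses the compressed form
"count(xreps)" per group of nodes, e.g. "4" or "2(x3),1". The sketch
below only illustrates that format under this assumption - it is not
Open MPI's actual ras/slurm code - and shows how a launcher could
derive slot counts from the variable:)

/* Illustration only - NOT Open MPI's actual SLURM allocation code.
 * Parses a SLURM_TASKS_PER_NODE value such as "2(x3),1" and prints
 * how many slots a launcher would see for each node entry. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *val = getenv("SLURM_TASKS_PER_NODE");
    if (val == NULL) {
        fprintf(stderr, "SLURM_TASKS_PER_NODE is not set\n");
        return 1;
    }
    char *copy = strdup(val);
    for (char *tok = strtok(copy, ","); tok != NULL;
         tok = strtok(NULL, ",")) {
        int slots = atoi(tok);          /* tasks (slots) per node        */
        int nodes = 1;                  /* nodes sharing this slot count */
        char *rep = strstr(tok, "(x");
        if (rep != NULL) {
            nodes = atoi(rep + 2);
        }
        printf("%d node(s) with %d slot(s) each\n", nodes, slots);
    }
    free(copy);
    return 0;
}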
On Mar 25, 2009, at 11:00 AM, Hartmut Häfner wrote:
Dear Support,
there is a problem with Open MPI versions 1.3 and 1.3.1 when using our
batch system SLURM. On our parallel computer there are two queues: one
with exclusive usage of slots (cores) within nodes (SLURM_OVERCOMMIT=0)
and one with shared usage of slots within nodes (SLURM_OVERCOMMIT=1).
Running a simple MPI program (I'll send you this program, mpi_hello.c,
as an attachment) with SLURM_OVERCOMMIT set to 0, the executable works
fine; running it with SLURM_OVERCOMMIT set to 1, it does not work
correctly. Please have a look at the two attached output files.
Not working correctly means that the MPI program runs on only 1
processor, although I started it (for example) on 4 processors (it
fails in this way for any processor count other than 1).
This error only occurs with versions 1.3 and 1.3.1. If I use older
versions of Open MPI, the program works fine.
In the file Job_101442.out the host list (4x icn001) from SLURM is
printed, then the content of the file
/scratch/JMS_tmpdir/Job_101442/tmp.CCaCM22772, then the command line
(mpirun ...), then the stdout of the program mpi_hello.c (it runs on
only 1 processor!!!), and finally the environment.
In the file Job_101440.out the same program is run. The only
difference is that SLURM_OVERCOMMIT isn't set!
Under the hood of job_submit, .... salloc -n4 script is started.
In "script" you will find the command
mpirun --hostfile ....., as you can see in both output files.
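
In case the attachment does not make it to the list: the attached
program is of the usual "MPI hello" kind; a minimal sketch (the actual
attachment may differ in detail) looks roughly like this:

/* mpi_hello.c - minimal sketch of the kind of test program described
 * above (the actual attachment may differ). Each rank reports its
 * rank, the total number of ranks, and the host it runs on. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

With SLURM_OVERCOMMIT unset (Job_101440.out) all 4 ranks report as
expected; with SLURM_OVERCOMMIT=1 (Job_101442.out) only a single rank
appears, as described above.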
Sincerely yours
H. Häfner
--
Hartmut Häfner
Karlsruhe Institute of Technology (KIT)
University Karlsruhe (TH)
Steinbuch Centre for Computing (SCC)
Scientific Computing and Applications (SCA)
Zirkel 2 (Campus Süd, Geb. 20.21, Raum 204)
D-76128 Karlsruhe
Fon +49(0)721 608 4869
Fax +49(0)721 32550
hartmut.haef...@kit.edu
http://www.rz.uni-karlsruhe.de/personen/hartmut.haefner