Dear Support,
the answer seems to be simple, but it also seems to be wrong!
Below you can see the description of how SLURM_OVERCOMMIT should operate.

SLURM_CPUS_PER_TASK
  (default is 1) allows you to assign multiple CPUs to each
  (multithreaded) process in your job to improve performance. SRUN's
  -c (lowercase) option sets this variable. See the SRUN sections of
  the SLURM Reference Manual <https://computing.llnl.gov/LCdocs/slurm>
  for usage details.
SLURM_OVERCOMMIT
  (default is NO) allows you to assign more than one process per CPU
  (the opposite of the previous variable). SRUN's -O (uppercase)
  option sets this variable, which is /not/ intended to facilitate
  pthreads applications. See the SRUN sections of the SLURM Reference
  Manual <https://computing.llnl.gov/LCdocs/slurm> for usage details.
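
As an illustration of the first variable's intended use (this sketch is illustrative only and not part of the quoted documentation): a multithreaded task started with srun -c would typically read SLURM_CPUS_PER_TASK and spawn that many worker threads.

/* cpus_per_task_demo.c - hypothetical sketch: one worker thread per CPU
 * assigned via SLURM_CPUS_PER_TASK (documented default is 1 if unset).
 * Compile: gcc -pthread cpus_per_task_demo.c -o cpus_per_task_demo
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker(void *arg)
{
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    const char *env = getenv("SLURM_CPUS_PER_TASK");
    long ncpus = env ? strtol(env, NULL, 10) : 1;   /* documented default: 1 */
    if (ncpus < 1)
        ncpus = 1;

    pthread_t *threads = malloc(ncpus * sizeof(*threads));
    for (long i = 0; i < ncpus; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncpus; i++)
        pthread_join(threads[i], NULL);

    free(threads);
    return 0;
}

SLURM_OVERCOMMIT, in contrast, does not concern threads at all; per the description above it only permits more than one such process per CPU.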

I don't understand how you can derive what you have written below from the description above! It does not say that only one slot per node is allowed; it says that more than one process per CPU (slot) is allowed!

Remark: In version 1.2.8, SLURM_OVERCOMMIT=1 worked correctly!

Sincerely yours,

H. Häfner

>>>>>>>>>
The answer is simple: when you set SLURM_OVERCOMMIT=1, the SLURM environment variables are telling us that only one slot per node is available for your use. This is done by the SLURM_TASKS_PER_NODE envar.

So we can only launch 1 proc/node as this is all SLURM is allowing us to do.

Ralph
>>>>>>>>>
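
One way to check what mpirun is actually being told is to print the variables in question from inside the allocation. A minimal sketch for diagnosis (illustrative only; the exact value format of SLURM_TASKS_PER_NODE depends on the SLURM configuration):

/* slurm_env_check.c - print the SLURM variables that determine the slot
 * count Open MPI believes it may use.
 * Compile: gcc slurm_env_check.c -o slurm_env_check
 * Run it from the same job script that invokes mpirun.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *names[] = {
        "SLURM_TASKS_PER_NODE",
        "SLURM_CPUS_PER_TASK",
        "SLURM_OVERCOMMIT",
    };
    for (unsigned i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        const char *val = getenv(names[i]);
        printf("%s=%s\n", names[i], val ? val : "(unset)");
    }
    return 0;
}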


On Mar 25, 2009, at 11:00 AM, Hartmut Häfner wrote:

Dear Support,
there is a problem with OpenMPI versions 1.3 and 1.3.1 when using our batch system Slurm. On our parallel computer there are 2 queues: one with exclusive usage of slots (cores) within nodes (SLURM_OVERCOMMIT=0) and one with shared usage of slots within nodes (SLURM_OVERCOMMIT=1). Running a simple MPI program (I'll send you this program, mpi_hello.c, as an attachment) with SLURM_OVERCOMMIT set to 0, the executable works fine; running it with SLURM_OVERCOMMIT set to 1, it does not work correctly. Please have a look at the two output files. "Does not work correctly" means that the MPI program runs on only 1 processor although I started it on, for example, 4 processors (it fails this way for any processor count other than 1).
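
For reference, a minimal MPI hello-world of the kind described might look like the following sketch (illustrative only; the actual attached mpi_hello.c may differ):

/* mpi_hello.c - minimal sketch of an MPI hello-world test.
 * Compile: mpicc mpi_hello.c -o mpi_hello
 * With a correct slot count, "mpirun -np 4 ./mpi_hello" should print
 * four lines, one per rank; in the failing case only rank 0 of 1 appears.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}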

This error only occurs in versions 1.3 and 1.3.1. If I use older versions of OpenMPI, the program works fine.

In the file Job_101442.out the host list (4x icn001) from Slurm is printed, then the content of the file /scratch/JMS_tmpdir/Job_101442/tmp.CCaCM22772, then the command line (mpirun ...), then the stdout of the program mpi_hello.c (it runs on only 1 processor!), and finally the environment.

In the file Job_101440.out the same program is run. The only difference is that SLURM_OVERCOMMIT isn't set!

Under the hood of job_submit ...., "salloc -n4 script" is started. In "script" you will find the command
mpirun --hostfile ....., as you can see in both output files.

Sincerely yours,

H. Häfner

--
Hartmut Häfner
Karlsruhe Institute of Technology (KIT)
University Karlsruhe (TH)
Steinbuch Centre for Computing (SCC)
Scientific Computing and Applications (SCA)
Zirkel 2 (Campus Süd, Geb. 20.21, Raum 204)
D-76128 Karlsruhe

Phone +49(0)721 608 4869
Fax +49(0)721 32550
hartmut.haef...@kit.edu

http://www.rz.uni-karlsruhe.de/personen/hartmut.haefner
