Folks,

This issue is related to the failures reported by MTT on the trunk when
the ibm test suite invokes MPI_Comm_spawn.
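For reference, the failing pattern boils down to a spawn call along these lines (a minimal sketch for illustration only, not the actual ibm intercomm_create source):

```c
/* Minimal MPI_Comm_spawn sketch -- hypothetical reproducer, not the ibm test.
 * The parent spawns 2 copies of itself; children detect their role via
 * MPI_Comm_get_parent. Build with mpicc and launch with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int errcodes[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* parent side: spawn 2 children running this same binary */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, errcodes);
    }
    /* children would communicate with the parent over the intercomm here */

    MPI_Finalize();
    return 0;
}
```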

My test bed consists of 3 (virtual) machines, each with 2 sockets and
8 CPUs per socket.

If I run on one host (without any batch manager):

mpirun -np 16 --host slurm1 --oversubscribe --mca coll ^ml
./intercomm_create

then the test succeeds with the following warning:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        slurm2
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------


Now if I run on three hosts:

mpirun -np 16 --host slurm1,slurm2,slurm3 --oversubscribe --mca coll ^ml
./intercomm_create

then the test succeeds without any warning.


But now, if I run on two hosts:

mpirun -np 16 --host slurm1,slurm2 --oversubscribe --mca coll ^ml
./intercomm_create

then the test fails.

First, I get the same warning as before:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        slurm2
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

followed by a crash:

[slurm1:2482] *** An error occurred in MPI_Comm_spawn
[slurm1:2482] *** reported by process [2068512769,0]
[slurm1:2482] *** on communicator MPI_COMM_WORLD
[slurm1:2482] *** MPI_ERR_SPAWN: could not spawn processes
[slurm1:2482] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[slurm1:2482] ***    and potentially your MPI job)


That being said, the following command works:

mpirun -np 16 --host slurm1,slurm2 --mca coll ^ml --bind-to none
./intercomm_create
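I have not tried the override the warning suggests, but assuming overload-allowed is given as a qualifier to the --bind-to directive, it would presumably be spelled like this (untested on my side):

```shell
# hypothetical: allow more processes than cpus on a resource while still binding to core
mpirun -np 16 --host slurm1,slurm2 --oversubscribe --mca coll ^ml \
    --bind-to core:overload-allowed ./intercomm_create
```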


1) What does the first message mean?
    Is it a warning? /* if yes, why does mpirun on two hosts fail? */
    Is it a fatal error? /* if yes, why does mpirun on one host succeed? */

2) Generally speaking, and assuming the first message is a warning,
should --oversubscribe automatically set overload-allowed?
    /* as far as I am concerned, that would be much more intuitive */

Cheers,

Gilles
