Folks,
this issue is related to the failures reported by mtt on the trunk when
the ibm test suite invokes MPI_Comm_spawn.
my test bed is made of 3 (virtual) machines with 2 sockets and 8 cpus
per socket each.
if i run on one host (without any batch manager)
mpirun -np 16 --host slurm1 --oversubscribe --mca coll ^ml ./intercomm_create
then the test is a success with the following warning :
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: slurm2
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
now if i run on three hosts
mpirun -np 16 --host slurm1,slurm2,slurm3 --oversubscribe --mca coll ^ml ./intercomm_create
then the test is a success without any warning
but now, if i run on two hosts
mpirun -np 16 --host slurm1,slurm2 --oversubscribe --mca coll ^ml ./intercomm_create
then the test is a failure.
first, i get the same warning as in the one host case :
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: slurm2
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
followed by a crash
[slurm1:2482] *** An error occurred in MPI_Comm_spawn
[slurm1:2482] *** reported by process [2068512769,0]
[slurm1:2482] *** on communicator MPI_COMM_WORLD
[slurm1:2482] *** MPI_ERR_SPAWN: could not spawn processes
[slurm1:2482] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[slurm1:2482] *** and potentially your MPI job)
that being said, the following command works :
mpirun -np 16 --host slurm1,slurm2 --mca coll ^ml --bind-to none ./intercomm_create
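for convenience, the four runs above can be replayed with a small wrapper along these lines. this is only a sketch : the run_case helper and the DRY_RUN variable are made up for this mail, and by default the script only prints the mpirun command lines instead of executing them.

```shell
#!/bin/sh
# Sketch: replay the four invocations from this report.
# run_case and DRY_RUN are hypothetical names, not part of Open MPI.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"
TEST=./intercomm_create

run_case() {
    # $1 = comma-separated host list, remaining args = extra mpirun options
    hosts="$1"; shift
    cmd="mpirun -np 16 --host $hosts $* $TEST"
    echo "$cmd"
    if [ "$DRY_RUN" = "0" ]; then
        # report a non-zero exit status but keep going with the next case
        $cmd || echo "FAILED (exit $?): $hosts"
    fi
}

run_case slurm1               --oversubscribe --mca coll '^ml'   # success + warning
run_case slurm1,slurm2,slurm3 --oversubscribe --mca coll '^ml'   # success, no warning
run_case slurm1,slurm2        --oversubscribe --mca coll '^ml'   # warning + MPI_ERR_SPAWN
run_case slurm1,slurm2        --mca coll '^ml' --bind-to none    # success
```

running it with DRY_RUN=0 on the test bed should reproduce the four outcomes noted in the comments, which are simply the results reported above.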
1) what does the first message mean ?
is it a warning ? /* if yes, why does mpirun on two hosts fail ? */
is it a fatal error ? /* if yes, why does mpirun on one host succeed ? */
2) generally speaking, and assuming the first message is a warning,
should --oversubscribe automatically set overload-allowed ?
/* as far as i am concerned, that would be much more intuitive */
Cheers,
Gilles