Folks,

this issue is related to the failures reported by MTT on the trunk when the ibm test suite invokes MPI_Comm_spawn.

my test bed is made of 3 (virtual) machines, each with 2 sockets and 8 cpus per socket.

if i run on one host (without any batch manager):

    mpirun -np 16 --host slurm1 --oversubscribe --mca coll ^ml ./intercomm_create

then the test is a success, with the following warning:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        slurm2
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

now if i run on three hosts:

    mpirun -np 16 --host slurm1,slurm2,slurm3 --oversubscribe --mca coll ^ml ./intercomm_create

then the test is a success without any warning.

but if i run on two hosts:

    mpirun -np 16 --host slurm1,slurm2 --oversubscribe --mca coll ^ml ./intercomm_create

then the test is a failure. first, i get the same warning as above:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        slurm2
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

followed by a crash:

[slurm1:2482] *** An error occurred in MPI_Comm_spawn
[slurm1:2482] *** reported by process [2068512769,0]
[slurm1:2482] *** on communicator MPI_COMM_WORLD
[slurm1:2482] *** MPI_ERR_SPAWN: could not spawn processes
[slurm1:2482] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[slurm1:2482] ***    and potentially your MPI job)

that being said, the following command (no --oversubscribe, and with binding disabled) works:

    mpirun -np 16 --host slurm1,slurm2 --mca coll ^ml --bind-to none ./intercomm_create

1) what does the first message mean?
   is it a warning? /* if yes, why does mpirun on two hosts fail? */
   is it a fatal error? /* if yes, why does mpirun on one host succeed? */

2) generally speaking, and assuming the first message is a warning, should --oversubscribe automatically set overload-allowed?
   /* as far as i am concerned, that would be much more intuitive */

Cheers,

Gilles
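
For reference, here is a sketch of how the override named in the message might be applied to the failing two-host run. It assumes the qualifier is attached to the binding directive as --bind-to core:overload-allowed. That spelling, and the HOSTS, BIND and RUN variables, are assumptions of this sketch, and nothing above shows the command was tried on this test bed. By default the script only prints the command. Set RUN=1 to launch it.

```shell
#!/bin/sh
# Sketch only: the failing two-host run, with the "overload-allowed"
# qualifier added to the binding directive as the message suggests.
# Assumption: the qualifier is spelled "--bind-to core:overload-allowed".
HOSTS="slurm1,slurm2"
BIND="core:overload-allowed"

# Assemble the command as positional parameters so each argument stays intact.
set -- mpirun -np 16 --host "$HOSTS" --oversubscribe \
       --bind-to "$BIND" --mca coll '^ml' ./intercomm_create

# RUN is a hypothetical switch for this sketch: print unless RUN=1.
if [ "${RUN:-0}" = "1" ]; then
    "$@"
else
    echo "$@"
fi
```

If the run succeeds with the qualifier and still fails without it, that would suggest the message is acting as a fatal error in the two-host case rather than as a warning.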