Folks,

on my single-socket, four-core VM (no batch manager), i am running the
intercomm_create test from the ibm test suite.

mpirun -np 1 ./intercomm_create
=> OK

mpirun -np 2 ./intercomm_create
=> HANG :-(

mpirun -np 2 --mca coll ^ml ./intercomm_create
=> OK

basically, the first two tasks call MPI_Comm_spawn (2 tasks) twice,
followed by MPI_Intercomm_merge,
and the 4 spawned tasks call MPI_Intercomm_merge followed by
MPI_Intercomm_create

i dug a bit into this and found two distinct issues:

1) binding :
tasks [0-1] (launched with mpirun) are bound to cores [0-1] => OK
tasks [2-3] (first spawn) are bound to cores [0-1] => ODD, i would have
expected [2-3]
tasks [4-5] (second spawn) are not bound at all => ODD again, though it could
have made sense if tasks [2-3] had been bound to cores [2-3]
i observe the same behaviour with the --oversubscribe mpirun option
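
for anyone who wants to double check the binding from inside each task, here
is a minimal sketch. it assumes Linux (it reads the Cpus_allowed_list field
of /proc/self/status), and read_cpus_allowed() is a hypothetical helper name,
not something from Open MPI or the ibm test suite.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, Linux only: copy the "Cpus_allowed_list" value of the
 * calling process (e.g. "0-1") into buf.
 * Returns 0 on success, -1 if the field could not be read. */
int read_cpus_allowed(char *buf, size_t len)
{
    static const char key[] = "Cpus_allowed_list:";
    char line[4096];
    FILE *fp;
    int rc = -1;

    if (buf == NULL || len == 0)
        return -1;

    fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, key, sizeof(key) - 1) == 0) {
            /* skip the key and the whitespace that follows it */
            const char *val = line + sizeof(key) - 1;
            while (*val == ' ' || *val == '\t')
                val++;
            snprintf(buf, len, "%s", val);
            /* drop the trailing newline, if any */
            buf[strcspn(buf, "\n")] = '\0';
            rc = 0;
            break;
        }
    }

    fclose(fp);
    return rc;
}
```

each task could call this right after MPI_Init and print the result next to
its rank. a task that is not bound at all would presumably report every core
of the VM (something like 0-3 here).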

2) coll/ml
coll/ml hangs with -np 2 (6 tasks in total, 2 of them unbound)
i suspect coll/ml is unable to handle unbound tasks.
if i am correct, should coll/ml detect this and simply disqualify
itself?

Cheers,

Gilles
