Hi all,

there seems to be a timing issue that depends on the ordering of hosts in
the hostfile. It occurs when several processes of a job are placed on the
same node. mpirun aborts the job at MPI_Init() with:

  num local peers failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS

Non-MPI applications launch just fine, such as:

$ mpirun -np 36 --hostfile job_4.machines --mca rmaps seq --map-by node
--bind-to none --mca btl_openib_allow_ib 1  /usr/bin/hostname

The "num local peers failed" error occurs even with a trivial MPI program
that merely calls MPI_Init(), e.g.,

$ mpirun -np 36 --hostfile job_4.machines --mca rmaps seq --map-by node
--bind-to none --mca btl_openib_allow_ib 1  /cluster/testing/mpihelloworld

  num local peers failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
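
For reference, the test program is essentially just the following (a
minimal sketch; the actual mpihelloworld source may differ slightly, but it
does nothing beyond this):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* this is where the launch aborts */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}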

The error message itself gives no hint about the cause, and I have not
found a way to work around it. The error occurs with Open MPI 4.1.1 and
4.0.4; we have not tried other versions yet.

Curiously, if the hostfile is sorted first, MPI_Init() will always succeed,
e.g.,

$ sort job_4.machines  > job_4.machines-sorted
$ mpirun -np 36 --hostfile job_4.machines-sorted --mca rmaps seq --map-by
node --bind-to none --mca btl_openib_allow_ib 1
 /cluster/difx/DiFX-trunk_64/bin/mpihelloworld
 (--> success)

My guess is that when the same node appears more than once in the hostfile,
and those entries are too far apart (with many other nodes listed in
between), MPI_Init() of the earlier-launched process checks for its
expected local peers too soon, before the later instance on that node has
come up?
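
To illustrate what I mean (a made-up fragment; the real hostfile has 36
entries), the failing hostfile for the seq mapper interleaves the repeated
hosts, e.g.:

  node-a
  node-b
  node-c
  node-d
  node-a     <-- second entry for node-a, many lines below the first

whereas in the sorted file both node-a entries end up adjacent, and
MPI_Init() then succeeds.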

Alas, we have a heterogeneous cluster where the rank-to-node mapping is
critical, so with the seq mapper we cannot simply sort the hostfile without
changing which rank lands on which node.

Does Open MPI 4.1 have a "grace time" parameter or similar that would allow
a process to wait a bit longer for the expected other instance(s) to
eventually come up on the same node?

many thanks,
Jan
