Hello, since we updated to the new Slurm version (19.05), every job step launched with mpirun ends with the following error message:

    An ORTE daemon has unexpectedly failed after launch and before
    communicating back to mpirun. This could be caused by a number
    of factors, including an inability to create a connection back
    to mpirun due to lack of common network interfaces and / or no
    route found between them. Please check network connectivity
    (including firewalls and network routing requirements).

This only happens when the job step spans more than one node. If all tasks run on the same node, it works without problems.

We have tested different versions of OpenMPI (2.1.2, 3.1.1, 3.1.3), all compiled with the flags --with-slurm and --with-pmi. In all cases, if the job is launched on nodes with Slurm 18.05, it works with both srun and mpirun. If it is launched on nodes with Slurm 19.05, it works with srun but fails with mpirun.
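
A minimal reproduction sketch (not the original job) for the comparison described above: it writes a two-node batch script that runs the same command first through srun and then through mpirun. The file name repro_job.sh, the job name, and the use of hostname as a stand-in for the real MPI program are assumptions for illustration. On an affected cluster, the mpirun step would be expected to be the one that fails, since it is the one that has to start ORTE daemons on the second node.

```shell
#!/bin/sh
# Write a two-node batch script that runs the same command via srun and mpirun.
# "hostname" stands in for the real MPI program (an assumption for illustration).
cat > repro_job.sh <<'EOF'
#!/bin/sh
#SBATCH --job-name=mpirun-repro
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

echo "== srun =="
srun hostname

echo "== mpirun =="
mpirun hostname
EOF

# Submit only if sbatch is available on this machine.
if command -v sbatch >/dev/null 2>&1; then
    sbatch repro_job.sh
else
    echo "sbatch not found; wrote repro_job.sh without submitting"
fi
```

Changing --nodes=2 to --nodes=1 in the generated script gives the single-node case for comparison.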

Could this be a bug in the new version?

Thank you.


--
Andrés Marín Díaz
Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
ama...@cesvima.upm.es | tel 910679676
www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima
