Re: [OMPI users] MPI_ABORT, indirect execution of executables by mpirun, Open MPI 2.1.1

2017-06-27 Thread Ted Sussman
Hello Ralph, Thanks for your quick reply and bug fix. I have obtained the update and tried it in my simple example, and also in the original program from which the simple example was extracted. The update works as expected :) Sincerely, Ted Sussman On 27 Jun 2017 at 12:13,
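
For readers finding this thread in the archive: the "indirect execution of executables by mpirun" in the subject refers to mpirun launching a wrapper script that in turn starts the real MPI binary, with MPI_ABORT then having to tear down ranks that mpirun only started indirectly. A minimal sketch of that kind of launch, using hypothetical names (wrapper.sh, my_mpi_prog) not taken from the thread:

    #!/bin/sh
    # wrapper.sh - started by mpirun instead of the MPI binary itself, e.g.:
    #   mpirun -np 2 ./wrapper.sh
    # Each rank runs this script, which then execs the real MPI executable.
    # If one rank of my_mpi_prog calls MPI_Abort, the runtime must clean up
    # processes it launched only through this intermediate script, which is
    # the scenario the subject line describes.
    exec ./my_mpi_prog "$@"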

Re: [OMPI users] Node failure handling

2017-06-27 Thread r...@open-mpi.org
Okay, this should fix it - https://github.com/open-mpi/ompi/pull/3771 > On Jun 27, 2017, at 6:31 AM, r...@open-mpi.org wrote: > > Actually, the error message is coming from mpirun to indicate that it lost > connection to one (or more) of its

Re: [OMPI users] Node failure handling

2017-06-27 Thread r...@open-mpi.org
Actually, the error message is coming from mpirun to indicate that it lost connection to one (or more) of its daemons. This happens because slurm only knows about the remote daemons - mpirun was started outside of “srun”, and so slurm doesn’t know it exists. Thus, when slurm kills the job, it
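
To make that launch topology concrete, here is a hedged sketch of a batch job in the shape being described, using hypothetical file names (job.sh, a.out): mpirun is run directly by the batch script rather than through srun, so, per the explanation above, SLURM knows about the remote daemons but not about the mpirun process itself.

    #!/bin/bash
    #SBATCH --nodes=2
    # job.sh - hypothetical batch script for the case discussed above.
    # mpirun is started by the batch script, not via "srun", so SLURM does
    # not know the mpirun process exists; it only knows about the daemons
    # mpirun launches on the allocated nodes. When SLURM kills the job
    # (e.g. after a node failure), mpirun sees its connections to those
    # daemons drop and reports the loss-of-connection error being discussed.
    mpirun ./a.out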

Re: [OMPI users] Node failure handling

2017-06-27 Thread George Bosilca
I would also be interested in having slurm keep the remaining processes around; we have been struggling with this on many of the NERSC machines. That being said, the error message comes from orted, and it suggests that they are giving up because they lose connection to a peer. I was not aware