Ah - you should have told us you are running under slurm. That does indeed make a difference. When we launch the daemons, we do so with "srun --kill-on-bad-exit" - this means that slurm automatically kills the job if any daemon terminates. We take that measure to avoid leaving zombies behind in the event of a failure.
Try adding "-mca plm rsh" to your mpirun cmd line. This will use the rsh launcher instead of the slurm one, which gives you more control.

> On Jun 26, 2017, at 6:59 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>
> Hi Ralph, George,
>
> Thanks very much for getting back to me. Alas, neither of these
> options seems to accomplish the goal. Both in OpenMPI v2.1.1 and on a
> recent master (7002535), with slurm's "--no-kill" and openmpi's
> "--enable-recovery", once the node reboots one gets the following
> error:
>
> ```
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[58323,0],0] on node pnod0330
>   Remote daemon: [[58323,0],1] on node pnod0331
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
> ```
>
> I haven't yet tried the hard reboot case with ULFM (these nodes take
> forever to come back up), but earlier experiments SIGKILLing the orted
> on a compute node led to a very similar message as above, so at this
> point I'm not optimistic...
>
> I think my next step is to try with several separate mpiruns and use
> MPI_Comm_{connect,accept} to plumb everything together before the
> application starts. I notice this is the subject of some recent work
> on ompi master. Even though the mpiruns will all be associated to the
> same ompi-server, do you think this could be sufficient to isolate the
> failures?
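[For readers following along: the multi-mpirun approach Tim describes above might look roughly like the sketch below. This is an untested outline, not a verified recipe - the binary names, process counts, and URI path are placeholders.]

```shell
# Rough sketch of the separate-mpiruns approach, assuming Open MPI is installed.
# Start a standalone ompi-server so the independent jobs can find each other:
ompi-server --report-uri /tmp/ompi-server.uri

# Launch the master and the slaves as independent mpirun instances, all
# pointed at the same ompi-server (./master and ./slave are placeholders):
mpirun --ompi-server file:/tmp/ompi-server.uri -np 1  ./master &
mpirun --ompi-server file:/tmp/ompi-server.uri -np 16 ./slave  &

# Inside the application, the master side would MPI_Open_port and
# MPI_Publish_name, then call MPI_Comm_accept; each slave job looks the
# name up and calls MPI_Comm_connect. Because each job has its own
# runtime, the hope is that losing one slave job does not abort the rest.
```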
>
> Cheers,
> Tim
>
>
> On 10 June 2017 at 00:56, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> It has been awhile since I tested it, but I believe the --enable-recovery
>> option might do what you want.
>>
>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>
>>> Hi!
>>>
>>> So I know from searching the archive that this is a repeated topic of
>>> discussion here, and apologies for that, but since it's been a year or
>>> so I thought I'd double-check whether anything has changed before
>>> really starting to tear my hair out too much.
>>>
>>> Is there a combination of MCA parameters or similar that will prevent
>>> ORTE from aborting a job when it detects a node failure? This is
>>> using the tcp btl, under slurm.
>>>
>>> The application, not written by us and too complicated to re-engineer
>>> at short notice, has a strictly master-slave communication pattern.
>>> The master never blocks on communication from individual slaves, and
>>> apparently can itself detect slaves that have silently disappeared and
>>> reissue the work to those remaining. So from an application
>>> standpoint I believe we should be able to handle this. However, in
>>> all my testing so far the job is aborted as soon as the runtime system
>>> figures out what is going on.
>>>
>>> If not, do any users know of another MPI implementation that might
>>> work for this use case? As far as I can tell, FT-MPI has been pretty
>>> quiet the last couple of years?
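[For readers following along: combining the two suggestions made in this thread on a single command line might look like the sketch below. This is a hypothetical invocation - the application name and process count are placeholders, and as the thread shows, --enable-recovery alone did not survive a node failure in Tim's tests.]

```shell
# Bypass the slurm launcher (so slurm's --kill-on-bad-exit behavior does
# not apply) and ask the runtime to attempt recovery on daemon failure:
mpirun -mca plm rsh --enable-recovery -np 64 ./app
```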
>>>
>>> Thanks in advance,
>>>
>>> Tim

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users