Ah - you should have told us you are running under slurm. That does indeed make a difference. When we launch the daemons, we do so with "srun --kill-on-bad-exit" - this means that slurm automatically kills the job if any daemon terminates. We take that measure to avoid leaving zombies behind in the event of a failure.
Try adding "-mca plm rsh" to your mpirun cmd line. This will use the rsh launcher instead of the slurm one, which gives you more control.

> On Jun 26, 2017, at 6:59 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>
> Hi Ralph, George,
>
> Thanks very much for getting back to me. Alas, neither of these
> options seems to accomplish the goal. Both in OpenMPI v2.1.1 and on a
> recent master (7002535), with slurm's "--no-kill" and openmpi's
> "--enable-recovery", once the node reboots one gets the following
> error:
>
> ```
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[58323,0],0] on node pnod0330
>   Remote daemon: [[58323,0],1] on node pnod0331
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
> ```
>
> I haven't yet tried the hard reboot case with ULFM (these nodes take
> forever to come back up), but earlier experiments SIGKILLing the orted
> on a compute node led to a very similar message as above, so at this
> point I'm not optimistic...
>
> I think my next step is to try with several separate mpiruns and use
> MPI_Comm_{connect,accept} to plumb everything together before the
> application starts. I notice this is the subject of some recent work
> on ompi master. Even though the mpiruns will all be associated to the
> same ompi-server, do you think this could be sufficient to isolate the
> failures?
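[For readers following along: the multi-mpirun approach Tim describes above might look roughly like the sketch below. This is an untested outline, not a verified recipe - the binary names, process counts, and URI path are placeholders.]

```shell
# Rough sketch of the separate-mpiruns approach, assuming Open MPI is installed.
# Start a standalone ompi-server so the independent jobs can find each other:
ompi-server --report-uri /tmp/ompi-server.uri

# Launch the master and the slaves as independent mpirun instances, all
# pointed at the same ompi-server (./master and ./slave are placeholders):
mpirun --ompi-server file:/tmp/ompi-server.uri -np 1  ./master &
mpirun --ompi-server file:/tmp/ompi-server.uri -np 16 ./slave  &

# Inside the application, the master side would MPI_Open_port and
# MPI_Publish_name, then call MPI_Comm_accept; each slave job looks the
# name up and calls MPI_Comm_connect. Because each job has its own
# runtime, the hope is that losing one slave job does not abort the rest.
```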
>
> Cheers,
> Tim
>
>
> On 10 June 2017 at 00:56, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> It has been awhile since I tested it, but I believe the --enable-recovery
>> option might do what you want.
>>
>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>
>>> Hi!
>>>
>>> So I know from searching the archive that this is a repeated topic of
>>> discussion here, and apologies for that, but since it's been a year or
>>> so I thought I'd double-check whether anything has changed before
>>> really starting to tear my hair out too much.
>>>
>>> Is there a combination of MCA parameters or similar that will prevent
>>> ORTE from aborting a job when it detects a node failure? This is
>>> using the tcp btl, under slurm.
>>>
>>> The application, not written by us and too complicated to re-engineer
>>> at short notice, has a strictly master-slave communication pattern.
>>> The master never blocks on communication from individual slaves, and
>>> apparently can itself detect slaves that have silently disappeared and
>>> reissue the work to those remaining. So from an application
>>> standpoint I believe we should be able to handle this. However, in
>>> all my testing so far the job is aborted as soon as the runtime system
>>> figures out what is going on.
>>>
>>> If not, do any users know of another MPI implementation that might
>>> work for this use case? As far as I can tell, FT-MPI has been pretty
>>> quiet the last couple of years?
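[For readers following along: combining the two suggestions made in this thread on a single command line might look like the sketch below. This is a hypothetical invocation - the application name and process count are placeholders, and as the thread shows, --enable-recovery alone did not survive a node failure in Tim's tests.]

```shell
# Bypass the slurm launcher (so slurm's --kill-on-bad-exit behavior does
# not apply) and ask the runtime to attempt recovery on daemon failure:
mpirun -mca plm rsh --enable-recovery -np 64 ./app
```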
>>>
>>> Thanks in advance,
>>>
>>> Tim

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users