I would also be interested in having Slurm keep the remaining processes
around; we have been struggling with this on many of the NERSC machines.
That being said, the error message comes from orted, and it suggests that
the daemons are giving up because they lost connection to a peer. I was not
aware that this capability exists in the master version of ORTE, but if it
does, it makes our life easier.

  George.


On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Let me poke at it a bit tomorrow - we should be able to avoid the abort.
> It’s a bug if we can’t.
>
> > On Jun 26, 2017, at 7:39 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
> >
> > Hi Ralph,
> >
> > Thanks for the quick response.
> >
> > Just tried again not under slurm, but the same result... (though I
> > just did kill -9 orted on the remote node this time)
> >
> > Any ideas?  Do you think my multiple-mpirun idea is worth trying?
> >
> > Cheers,
> > Tim
> >
> >
> > ```
> > [user@bud96 mpi_resilience]$
> > /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
> > --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
> > --debug-daemons $(pwd)/test
> > ( some output from job here )
> > ( I then do kill -9 `pgrep orted`  on pnod0331 )
> > bash: line 1: 161312 Killed
> > /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
> > orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
> > -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> > "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
> > "581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
> > -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> > --------------------------------------------------------------------------
> > ORTE has lost communication with a remote daemon.
> >
> >  HNP daemon   : [[8878,0],0] on node bud96
> >  Remote daemon: [[8878,0],1] on node pnod0331
> >
> > This is usually due to either a failure of the TCP network
> > connection to the node, or possibly an internal failure of
> > the daemon itself. We cannot recover from this failure, and
> > therefore will terminate the job.
> > --------------------------------------------------------------------------
> > [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> > [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
> > ```
> >
> > On 27 June 2017 at 12:19, r...@open-mpi.org <r...@open-mpi.org> wrote:
> >> Ah - you should have told us you are running under slurm. That does
> >> indeed make a difference. When we launch the daemons, we do so with
> >> “srun --kill-on-bad-exit” - this means that slurm automatically kills
> >> the job if any daemon terminates. We take that measure to avoid
> >> leaving zombies behind in the event of a failure.
> >>
> >> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the
> >> rsh launcher instead of the slurm one, which gives you more control.
> >>
> >>> On Jun 26, 2017, at 6:59 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
> >>>
> >>> Hi Ralph, George,
> >>>
> >>> Thanks very much for getting back to me.  Alas, neither of these
> >>> options seems to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
> >>> recent master (7002535), with slurm's "--no-kill" and openmpi's
> >>> "--enable-recovery", once the node reboots one gets the following
> >>> error:
> >>>
> >>> ```
> >>> --------------------------------------------------------------------------
> >>> ORTE has lost communication with a remote daemon.
> >>>
> >>> HNP daemon   : [[58323,0],0] on node pnod0330
> >>> Remote daemon: [[58323,0],1] on node pnod0331
> >>>
> >>> This is usually due to either a failure of the TCP network
> >>> connection to the node, or possibly an internal failure of
> >>> the daemon itself. We cannot recover from this failure, and
> >>> therefore will terminate the job.
> >>> --------------------------------------------------------------------------
> >>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
> >>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
> >>> ```
> >>>
> >>> I haven't yet tried the hard reboot case with ULFM (these nodes take
> >>> forever to come back up), but earlier experiments SIGKILLing the orted
> >>> on a compute node led to a very similar message as above, so at this
> >>> point I'm not optimistic...
> >>>
> >>> I think my next step is to try with several separate mpiruns and use
> >>> mpi_comm_{connect,accept} to plumb everything together before the
> >>> application starts.  I notice this is the subject of some recent work
> >>> on ompi master.  Even though the mpiruns will all be associated to the
> >>> same ompi-server, do you think this could be sufficient to isolate the
> >>> failures?
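> >>>
> >>> For concreteness, the plumbing I have in mind is roughly the sketch
> >>> below (untested, and simplified to one rank per side; every mpirun
> >>> would be started with --ompi-server pointing at the same ompi-server
> >>> so the published name resolves across jobs, and the service name
> >>> "wq-master" is just a placeholder I made up):
> >>>
> >>> ```
> >>> /* Sketch only: one mpirun runs this with "master" as its argument;
> >>>  * each worker group is a separate mpirun without it.  All jobs share
> >>>  * the same ompi-server so the published name is visible across them. */
> >>> #include <mpi.h>
> >>> #include <stdio.h>
> >>> #include <string.h>
> >>>
> >>> int main(int argc, char **argv)
> >>> {
> >>>     char port[MPI_MAX_PORT_NAME];
> >>>     MPI_Comm inter;              /* inter-communicator to the peer job */
> >>>
> >>>     MPI_Init(&argc, &argv);
> >>>
> >>>     if (argc > 1 && strcmp(argv[1], "master") == 0) {
> >>>         /* Master job: open a port, publish it, accept one worker job. */
> >>>         MPI_Open_port(MPI_INFO_NULL, port);
> >>>         MPI_Publish_name("wq-master", MPI_INFO_NULL, port);
> >>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
> >>>         printf("master: a worker job connected\n");
> >>>         MPI_Unpublish_name("wq-master", MPI_INFO_NULL, port);
> >>>         MPI_Close_port(port);
> >>>     } else {
> >>>         /* Worker job (separate mpirun): look the port up and connect. */
> >>>         MPI_Lookup_name("wq-master", MPI_INFO_NULL, port);
> >>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
> >>>         printf("worker: connected to master\n");
> >>>     }
> >>>
> >>>     MPI_Comm_disconnect(&inter);
> >>>     MPI_Finalize();
> >>>     return 0;
> >>> }
> >>> ```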
> >>>
> >>> Cheers,
> >>> Tim
> >>>
> >>>
> >>>
> >>> On 10 June 2017 at 00:56, r...@open-mpi.org <r...@open-mpi.org> wrote:
> >>>> It has been a while since I tested it, but I believe the
> >>>> --enable-recovery option might do what you want.
> >>>>
> >>>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> So I know from searching the archive that this is a repeated topic of
> >>>>> discussion here, and apologies for that, but since it's been a year or
> >>>>> so I thought I'd double-check whether anything has changed before
> >>>>> really starting to tear my hair out too much.
> >>>>>
> >>>>> Is there a combination of MCA parameters or similar that will prevent
> >>>>> ORTE from aborting a job when it detects a node failure?  This is
> >>>>> using the tcp btl, under slurm.
> >>>>>
> >>>>> The application, not written by us and too complicated to re-engineer
> >>>>> at short notice, has a strictly master-slave communication pattern.
> >>>>> The master never blocks on communication from individual slaves, and
> >>>>> apparently can itself detect slaves that have silently disappeared and
> >>>>> reissue the work to those remaining.  So from an application
> >>>>> standpoint I believe we should be able to handle this.  However, in
> >>>>> all my testing so far the job is aborted as soon as the runtime system
> >>>>> figures out what is going on.
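> >>>>>
> >>>>> Roughly, the master-side pattern I mean is sketched below.  This is
> >>>>> my own illustration of the idea, not the actual application's code
> >>>>> (which I don't have access to); the dead-slave detection and work
> >>>>> reissue are only indicated in comments:
> >>>>>
> >>>>> ```
> >>>>> /* Sketch of the pattern, not the real application.  Each slave sends
> >>>>>  * one result; the master polls with MPI_Iprobe so it never blocks on
> >>>>>  * any single slave, and the commented hooks are where a lost slave
> >>>>>  * would be detected and its work handed to another rank. */
> >>>>> #include <mpi.h>
> >>>>> #include <stdio.h>
> >>>>>
> >>>>> int main(int argc, char **argv)
> >>>>> {
> >>>>>     MPI_Init(&argc, &argv);
> >>>>>
> >>>>>     /* Return error codes instead of aborting -- only useful if the
> >>>>>      * runtime keeps the job alive, which is exactly my question. */
> >>>>>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> >>>>>
> >>>>>     int rank, size;
> >>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>>>
> >>>>>     if (rank == 0) {                 /* master */
> >>>>>         int received = 0;
> >>>>>         while (received < size - 1) {
> >>>>>             int flag = 0;
> >>>>>             MPI_Status st;
> >>>>>             if (MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &st)
> >>>>>                     != MPI_SUCCESS)
> >>>>>                 continue;            /* comm error: mark slave dead, reissue */
> >>>>>             if (!flag)
> >>>>>                 continue;            /* nothing yet: check per-slave timeouts */
> >>>>>             double result;
> >>>>>             MPI_Recv(&result, 1, MPI_DOUBLE, st.MPI_SOURCE, 0,
> >>>>>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >>>>>             printf("got %g from rank %d\n", result, st.MPI_SOURCE);
> >>>>>             ++received;
> >>>>>         }
> >>>>>     } else {                         /* slave */
> >>>>>         double result = 42.0 * rank; /* stand-in for real work */
> >>>>>         MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
> >>>>>     }
> >>>>>
> >>>>>     MPI_Finalize();
> >>>>>     return 0;
> >>>>> }
> >>>>> ```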
> >>>>>
> >>>>> If not, do any users know of another MPI implementation that might
> >>>>> work for this use case?  As far as I can tell, FT-MPI has been pretty
> >>>>> quiet the last couple of years?
> >>>>>
> >>>>> Thanks in advance,
> >>>>>
> >>>>> Tim
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
