Okay, this should fix it: https://github.com/open-mpi/ompi/pull/3771
> On Jun 27, 2017, at 6:31 AM, r...@open-mpi.org wrote:
>
> Actually, the error message is coming from mpirun to indicate that it lost
> connection to one (or more) of its daemons. This happens because slurm only
> knows about the remote daemons - mpirun was started outside of "srun", and so
> slurm doesn't know it exists. Thus, when slurm kills the job, it only kills
> the daemons on the compute nodes, not mpirun. As a result, we always see that
> error message.
>
> The capability should exist as an option - it used to, but it has probably
> fallen into disrepair. I'll see if I can bring it back.
>
>> On Jun 27, 2017, at 3:35 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> I would also be interested in having slurm keep the remaining processes
>> around; we have been struggling with this on many of the NERSC machines.
>> That being said, the error message comes from the orteds, and it suggests
>> that they are giving up because they lost connection to a peer. I was not
>> aware that this capability exists in the master version of ORTE, but if it
>> does, then it makes our life easier.
>>
>> George.
>>
>> On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org wrote:
>>
>> Let me poke at it a bit tomorrow - we should be able to avoid the abort.
>> It's a bug if we can't.
>>
>>> On Jun 26, 2017, at 7:39 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>
>>> Hi Ralph,
>>>
>>> Thanks for the quick response.
>>>
>>> Just tried again, not under slurm, but with the same result... (though
>>> this time I just did kill -9 orted on the remote node)
>>>
>>> Any ideas? Do you think my multiple-mpirun idea is worth trying?
>>>
>>> Cheers,
>>> Tim
>>>
>>> ```
>>> [user@bud96 mpi_resilience]$ /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery --debug-daemons $(pwd)/test
>>> ( some output from job here )
>>> ( I then do kill -9 `pgrep orted` on pnod0331 )
>>> bash: line 1: 161312 Killed  /d/home/user/2017/openmpi-master-20170608/bin/orted -mca orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri "581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
>>> --------------------------------------------------------------------------
>>> ORTE has lost communication with a remote daemon.
>>>
>>> HNP daemon   : [[8878,0],0] on node bud96
>>> Remote daemon: [[8878,0],1] on node pnod0331
>>>
>>> This is usually due to either a failure of the TCP network
>>> connection to the node, or possibly an internal failure of
>>> the daemon itself. We cannot recover from this failure, and
>>> therefore will terminate the job.
>>> --------------------------------------------------------------------------
>>> [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
>>> [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
>>> ```
>>>
>>>> On 27 June 2017 at 12:19, r...@open-mpi.org wrote:
>>>>
>>>> Ah - you should have told us you are running under slurm. That does
>>>> indeed make a difference. When we launch the daemons, we do so with
>>>> "srun --kill-on-bad-exit" - this means that slurm automatically kills
>>>> the job if any daemon terminates. We take that measure to avoid leaving
>>>> zombies behind in the event of a failure.
>>>>
>>>> Try adding "-mca plm rsh" to your mpirun cmd line. This will use the
>>>> rsh launcher instead of the slurm one, which gives you more control.
>>>>> On Jun 26, 2017, at 6:59 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>>>
>>>>> Hi Ralph, George,
>>>>>
>>>>> Thanks very much for getting back to me. Alas, neither of these options
>>>>> seems to accomplish the goal. Both with Open MPI v2.1.1 and with a
>>>>> recent master (7002535), using slurm's "--no-kill" and Open MPI's
>>>>> "--enable-recovery", once the node reboots one gets the following error:
>>>>>
>>>>> ```
>>>>> --------------------------------------------------------------------------
>>>>> ORTE has lost communication with a remote daemon.
>>>>>
>>>>> HNP daemon   : [[58323,0],0] on node pnod0330
>>>>> Remote daemon: [[58323,0],1] on node pnod0331
>>>>>
>>>>> This is usually due to either a failure of the TCP network
>>>>> connection to the node, or possibly an internal failure of
>>>>> the daemon itself. We cannot recover from this failure, and
>>>>> therefore will terminate the job.
>>>>> --------------------------------------------------------------------------
>>>>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
>>>>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
>>>>> ```
>>>>>
>>>>> I haven't yet tried the hard-reboot case with ULFM (these nodes take
>>>>> forever to come back up), but earlier experiments SIGKILLing the orted
>>>>> on a compute node led to a message very similar to the one above, so
>>>>> at this point I'm not optimistic...
>>>>>
>>>>> I think my next step is to try several separate mpiruns and use
>>>>> MPI_Comm_{connect,accept} to plumb everything together before the
>>>>> application starts. I notice this is the subject of some recent work
>>>>> on ompi master. Even though the mpiruns will all be associated with
>>>>> the same ompi-server, do you think this could be sufficient to isolate
>>>>> the failures?
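>>>>>
>>>>> Roughly what I have in mind - a minimal sketch, assuming every mpirun
>>>>> is pointed at the same ompi-server so Publish/Lookup can rendezvous;
>>>>> the service name "ft-demo" and the master/worker split via argv are
>>>>> placeholders, not our actual code:
>>>>>
>>>>> ```c
>>>>> #include <mpi.h>
>>>>> #include <string.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     MPI_Init(&argc, &argv);
>>>>>
>>>>>     /* One mpirun is started as the "master" job, the rest as workers,
>>>>>      * e.g.:  mpirun --ompi-server file:uri.txt ... ./a.out master   */
>>>>>     int is_master = (argc > 1 && strcmp(argv[1], "master") == 0);
>>>>>     char port[MPI_MAX_PORT_NAME];
>>>>>     MPI_Comm inter;
>>>>>
>>>>>     if (is_master) {
>>>>>         /* Open a port and publish it under a well-known name. */
>>>>>         MPI_Open_port(MPI_INFO_NULL, port);
>>>>>         MPI_Publish_name("ft-demo", MPI_INFO_NULL, port);
>>>>>         /* Accept a connection from one worker job; a real version
>>>>>          * would loop, accepting one connection per worker mpirun. */
>>>>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>>>>>     } else {
>>>>>         /* Workers find the master's port via the ompi-server. */
>>>>>         MPI_Lookup_name("ft-demo", MPI_INFO_NULL, port);
>>>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>>>>>     }
>>>>>
>>>>>     /* ... application traffic over the intercommunicator ... */
>>>>>
>>>>>     MPI_Comm_disconnect(&inter);
>>>>>     if (is_master) MPI_Unpublish_name("ft-demo", MPI_INFO_NULL, port);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>> ```
>>>>>
>>>>> The hope is that each mpirun then has its own daemons, so losing a node
>>>>> should only take down the job whose daemon died, not the others.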
>>>>> Cheers,
>>>>> Tim
>>>>>
>>>>>> On 10 June 2017 at 00:56, r...@open-mpi.org wrote:
>>>>>>
>>>>>> It has been a while since I tested it, but I believe the
>>>>>> --enable-recovery option might do what you want.
>>>>>>
>>>>>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> So I know from searching the archive that this is a repeated topic of
>>>>>>> discussion here, and apologies for that, but since it's been a year
>>>>>>> or so I thought I'd double-check whether anything has changed before
>>>>>>> really starting to tear my hair out.
>>>>>>>
>>>>>>> Is there a combination of MCA parameters (or similar) that will
>>>>>>> prevent ORTE from aborting a job when it detects a node failure?
>>>>>>> This is using the tcp btl, under slurm.
>>>>>>>
>>>>>>> The application, not written by us and too complicated to re-engineer
>>>>>>> at short notice, has a strictly master-slave communication pattern.
>>>>>>> The master never blocks on communication from individual slaves, and
>>>>>>> apparently can itself detect slaves that have silently disappeared
>>>>>>> and reissue their work to those remaining. So from an application
>>>>>>> standpoint, I believe we should be able to handle this.
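>>>>>>>
>>>>>>> To make the pattern concrete, it is roughly the following - a
>>>>>>> from-memory sketch, not the application's actual code (the tags, the
>>>>>>> 60-second deadline, and the doubling "work" are all made up):
>>>>>>>
>>>>>>> ```c
>>>>>>> #include <mpi.h>
>>>>>>> #include <stdlib.h>
>>>>>>> #include <time.h>
>>>>>>>
>>>>>>> #define TAG_WORK 1
>>>>>>> #define TAG_DONE 2
>>>>>>>
>>>>>>> int main(int argc, char **argv)
>>>>>>> {
>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>     int rank, size;
>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>
>>>>>>>     if (rank == 0) {                      /* master */
>>>>>>>         int n = size - 1;
>>>>>>>         MPI_Request *req   = malloc(n * sizeof *req);
>>>>>>>         double *res        = malloc(n * sizeof *res);
>>>>>>>         time_t *issued     = malloc(n * sizeof *issued);
>>>>>>>
>>>>>>>         for (int s = 0; s < n; s++) {     /* one item per slave */
>>>>>>>             int item = s;
>>>>>>>             MPI_Send(&item, 1, MPI_INT, s + 1, TAG_WORK, MPI_COMM_WORLD);
>>>>>>>             MPI_Irecv(&res[s], 1, MPI_DOUBLE, s + 1, TAG_DONE,
>>>>>>>                       MPI_COMM_WORLD, &req[s]);
>>>>>>>             issued[s] = time(NULL);
>>>>>>>         }
>>>>>>>         for (int done = 0; done < n; ) {  /* poll; never block on one slave */
>>>>>>>             for (int s = 0; s < n; s++) {
>>>>>>>                 int flag;
>>>>>>>                 if (req[s] == MPI_REQUEST_NULL) continue;
>>>>>>>                 MPI_Test(&req[s], &flag, MPI_STATUS_IGNORE);
>>>>>>>                 if (flag) {
>>>>>>>                     done++;
>>>>>>>                 } else if (time(NULL) - issued[s] > 60) {
>>>>>>>                     /* Slave presumed dead: the real code reissues
>>>>>>>                      * this slave's work to a remaining slave here. */
>>>>>>>                 }
>>>>>>>             }
>>>>>>>         }
>>>>>>>         free(req); free(res); free(issued);
>>>>>>>     } else {                              /* slave */
>>>>>>>         int item;
>>>>>>>         MPI_Recv(&item, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
>>>>>>>                  MPI_STATUS_IGNORE);
>>>>>>>         double r = 2.0 * item;            /* stand-in for real work */
>>>>>>>         MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_DONE, MPI_COMM_WORLD);
>>>>>>>     }
>>>>>>>     MPI_Finalize();
>>>>>>>     return 0;
>>>>>>> }
>>>>>>> ```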
>>>>>>> However, in all my testing so far, the job is aborted as soon as the
>>>>>>> runtime system figures out what is going on.
>>>>>>>
>>>>>>> If not, do any users know of another MPI implementation that might
>>>>>>> work for this use case? As far as I can tell, FT-MPI has been pretty
>>>>>>> quiet for the last couple of years?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> Tim

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users