Re: [OMPI users] Node failure handling

2017-06-27 Thread r...@open-mpi.org
Okay, this should fix it - https://github.com/open-mpi/ompi/pull/3771 


> On Jun 27, 2017, at 6:31 AM, r...@open-mpi.org wrote:
> 
> Actually, the error message is coming from mpirun to indicate that it lost 
> connection to one (or more) of its daemons. This happens because slurm only 
> knows about the remote daemons - mpirun was started outside of “srun”, and so 
> slurm doesn’t know it exists. Thus, when slurm kills the job, it only kills 
> the daemons on the compute nodes, not mpirun. As a result, we always see that 
> error message.
> 
> The capability should exist as an option - it used to, but probably has 
> fallen into disrepair. I’ll see if I can bring it back.
> 
>> On Jun 27, 2017, at 3:35 AM, George Bosilca wrote:
>> 
>> I would also be interested in having slurm keep the remaining processes 
>> around; we have been struggling with this on many of the NERSC machines. 
>> That being said, the error message comes from orted, and it suggests that they 
>> are giving up because they lost connection to a peer. I was not aware that 
>> this capability exists in the master version of ORTE, but if it does then it 
>> makes our life easier.
>> 
>>   George.
>> 
>> 
>> On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org wrote:
>> Let me poke at it a bit tomorrow - we should be able to avoid the abort. 
>> It’s a bug if we can’t.
>> 
>> > On Jun 26, 2017, at 7:39 PM, Tim Burgess wrote:
>> >
>> > Hi Ralph,
>> >
>> > Thanks for the quick response.
>> >
>> > Just tried again not under slurm, but the same result... (though I
>> > just did kill -9 orted on the remote node this time)
>> >
>> > Any ideas?  Do you think my multiple-mpirun idea is worth trying?
>> >
>> > Cheers,
>> > Tim
>> >
>> >
>> > ```
>> > [user@bud96 mpi_resilience]$
>> > /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
>> > --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
>> > --debug-daemons $(pwd)/test
>> > ( some output from job here )
>> > ( I then do kill -9 `pgrep orted`  on pnod0331 )
>> > bash: line 1: 161312 Killed
>> > /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
>> > orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
>> > -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
>> > "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
>> > "581828608.0;tcp://172.16.251.96 
>> > ,172.31.1.254:58250 " 
>> > -mca plm "rsh"
>> > -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
>> > --
>> > ORTE has lost communication with a remote daemon.
>> >
>> >  HNP daemon   : [[8878,0],0] on node bud96
>> >  Remote daemon: [[8878,0],1] on node pnod0331
>> >
>> > This is usually due to either a failure of the TCP network
>> > connection to the node, or possibly an internal failure of
>> > the daemon itself. We cannot recover from this failure, and
>> > therefore will terminate the job.
>> > --
>> > [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
>> > [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - 
>> > exiting
>> > ```
>> >
>> > On 27 June 2017 at 12:19, r...@open-mpi.org wrote:
>> >> Ah - you should have told us you are running under slurm. That does 
>> >> indeed make a difference. When we launch the daemons, we do so with "srun 
>> >> --kill-on-bad-exit” - this means that slurm automatically kills the job 
>> >> if any daemon terminates. We take that measure to avoid leaving zombies 
>> >> behind in the event of a failure.
>> >>
>> >> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the rsh 
>> >> launcher instead of the slurm one, which gives you more control.
>> >>
>> >>> On Jun 26, 2017, at 6:59 PM, Tim Burgess wrote:
>> >>>
>> >>> Hi Ralph, George,
>> >>>
>> >>> Thanks very much for getting back to me.  Alas, neither of these
>> >>> options seem to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
>> >>> recent master (7002535), with slurm's "--no-kill" and openmpi's
>> >>> "--enable-recovery", once the node reboots one gets the following
>> >>> error:
>> >>>
>> >>> ```
>> >>> --
>> >>> ORTE has lost communication with a remote daemon.
>> >>>
>> >>> HNP daemon   : [[58323,0],0] on node pnod0330
>> >>> Remote daemon: [[58323,0],1] on node pnod0331
>> >>>
>> >>> This is usually due to either a failure of the TCP network
>> >>> connection to the node, or 

Re: [OMPI users] Node failure handling

2017-06-27 Thread r...@open-mpi.org
Actually, the error message is coming from mpirun to indicate that it lost 
connection to one (or more) of its daemons. This happens because slurm only 
knows about the remote daemons - mpirun was started outside of “srun”, and so 
slurm doesn’t know it exists. Thus, when slurm kills the job, it only kills the 
daemons on the compute nodes, not mpirun. As a result, we always see that error 
message.

The capability should exist as an option - it used to, but probably has fallen 
into disrepair. I’ll see if I can bring it back.

> On Jun 27, 2017, at 3:35 AM, George Bosilca  wrote:
> 
> I would also be interested in having slurm keep the remaining processes 
> around; we have been struggling with this on many of the NERSC machines. That 
> being said, the error message comes from orted, and it suggests that they are 
> giving up because they lost connection to a peer. I was not aware that this 
> capability exists in the master version of ORTE, but if it does then it makes 
> our life easier.
> 
>   George.
> 
> 
> On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org wrote:
> Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s 
> a bug if we can’t.
> 
> > On Jun 26, 2017, at 7:39 PM, Tim Burgess wrote:
> >
> > Hi Ralph,
> >
> > Thanks for the quick response.
> >
> > Just tried again not under slurm, but the same result... (though I
> > just did kill -9 orted on the remote node this time)
> >
> > Any ideas?  Do you think my multiple-mpirun idea is worth trying?
> >
> > Cheers,
> > Tim
> >
> >
> > ```
> > [user@bud96 mpi_resilience]$
> > /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
> > --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
> > --debug-daemons $(pwd)/test
> > ( some output from job here )
> > ( I then do kill -9 `pgrep orted`  on pnod0331 )
> > bash: line 1: 161312 Killed
> > /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
> > orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
> > -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> > "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
> > "581828608.0;tcp://172.16.251.96 ,172.31.1.254:58250 
> > " -mca plm "rsh"
> > -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> > --
> > ORTE has lost communication with a remote daemon.
> >
> >  HNP daemon   : [[8878,0],0] on node bud96
> >  Remote daemon: [[8878,0],1] on node pnod0331
> >
> > This is usually due to either a failure of the TCP network
> > connection to the node, or possibly an internal failure of
> > the daemon itself. We cannot recover from this failure, and
> > therefore will terminate the job.
> > --
> > [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> > [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
> > ```
> >
> > On 27 June 2017 at 12:19, r...@open-mpi.org wrote:
> >> Ah - you should have told us you are running under slurm. That does indeed 
> >> make a difference. When we launch the daemons, we do so with "srun 
> >> --kill-on-bad-exit” - this means that slurm automatically kills the job if 
> >> any daemon terminates. We take that measure to avoid leaving zombies 
> >> behind in the event of a failure.
> >>
> >> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the rsh 
> >> launcher instead of the slurm one, which gives you more control.
> >>
> >>> On Jun 26, 2017, at 6:59 PM, Tim Burgess wrote:
> >>>
> >>> Hi Ralph, George,
> >>>
> >>> Thanks very much for getting back to me.  Alas, neither of these
> >>> options seem to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
> >>> recent master (7002535), with slurm's "--no-kill" and openmpi's
> >>> "--enable-recovery", once the node reboots one gets the following
> >>> error:
> >>>
> >>> ```
> >>> --
> >>> ORTE has lost communication with a remote daemon.
> >>>
> >>> HNP daemon   : [[58323,0],0] on node pnod0330
> >>> Remote daemon: [[58323,0],1] on node pnod0331
> >>>
> >>> This is usually due to either a failure of the TCP network
> >>> connection to the node, or possibly an internal failure of
> >>> the daemon itself. We cannot recover from this failure, and
> >>> therefore will terminate the job.
> >>> --
> >>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
> >>> [pnod0332:56161] [[58323,0],2] orted_cmd: 

Re: [OMPI users] Node failure handling

2017-06-27 Thread George Bosilca
I would also be interested in having slurm keep the remaining processes
around; we have been struggling with this on many of the NERSC machines.
That being said, the error message comes from orted, and it suggests that
they are giving up because they lost connection to a peer. I was not aware
that this capability exists in the master version of ORTE, but if it does
then it makes our life easier.

  George.


On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org  wrote:

> Let me poke at it a bit tomorrow - we should be able to avoid the abort.
> It’s a bug if we can’t.
>
> > On Jun 26, 2017, at 7:39 PM, Tim Burgess wrote:
> >
> > Hi Ralph,
> >
> > Thanks for the quick response.
> >
> > Just tried again not under slurm, but the same result... (though I
> > just did kill -9 orted on the remote node this time)
> >
> > Any ideas?  Do you think my multiple-mpirun idea is worth trying?
> >
> > Cheers,
> > Tim
> >
> >
> > ```
> > [user@bud96 mpi_resilience]$
> > /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
> > --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
> > --debug-daemons $(pwd)/test
> > ( some output from job here )
> > ( I then do kill -9 `pgrep orted`  on pnod0331 )
> > bash: line 1: 161312 Killed
> > /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
> > orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
> > -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> > "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
> > "581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
> > -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> > 
> --
> > ORTE has lost communication with a remote daemon.
> >
> >  HNP daemon   : [[8878,0],0] on node bud96
> >  Remote daemon: [[8878,0],1] on node pnod0331
> >
> > This is usually due to either a failure of the TCP network
> > connection to the node, or possibly an internal failure of
> > the daemon itself. We cannot recover from this failure, and
> > therefore will terminate the job.
> > 
> --
> > [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> > [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone -
> exiting
> > ```
> >
> > On 27 June 2017 at 12:19, r...@open-mpi.org  wrote:
> >> Ah - you should have told us you are running under slurm. That does
> indeed make a difference. When we launch the daemons, we do so with "srun
> --kill-on-bad-exit” - this means that slurm automatically kills the job if
> any daemon terminates. We take that measure to avoid leaving zombies behind
> in the event of a failure.
> >>
> >> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the
> rsh launcher instead of the slurm one, which gives you more control.
> >>
> >>> On Jun 26, 2017, at 6:59 PM, Tim Burgess wrote:
> >>>
> >>> Hi Ralph, George,
> >>>
> >>> Thanks very much for getting back to me.  Alas, neither of these
> >>> options seem to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
> >>> recent master (7002535), with slurm's "--no-kill" and openmpi's
> >>> "--enable-recovery", once the node reboots one gets the following
> >>> error:
> >>>
> >>> ```
> >>> 
> --
> >>> ORTE has lost communication with a remote daemon.
> >>>
> >>> HNP daemon   : [[58323,0],0] on node pnod0330
> >>> Remote daemon: [[58323,0],1] on node pnod0331
> >>>
> >>> This is usually due to either a failure of the TCP network
> >>> connection to the node, or possibly an internal failure of
> >>> the daemon itself. We cannot recover from this failure, and
> >>> therefore will terminate the job.
> >>> 
> --
> >>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
> >>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
> >>> ```
> >>>
> >>> I haven't yet tried the hard reboot case with ULFM (these nodes take
> >>> forever to come back up), but earlier experiments SIGKILLing the orted
> >>> on a compute node led to a very similar message as above, so at this
> >>> point I'm not optimistic...
> >>>
> >>> I think my next step is to try with several separate mpiruns and use
> >>> mpi_comm_{connect,accept} to plumb everything together before the
> >>> application starts.  I notice this is the subject of some recent work
> >>> on ompi master.  Even though the mpiruns will all be associated to the
> >>> same ompi-server, do you think this could be sufficient to isolate the
> >>> failures?
> >>>
> >>> Cheers,
> >>> Tim
> >>>
> >>>
> >>>
> >>> On 10 June 2017 at 00:56, r...@open-mpi.org  wrote:
>  It has been awhile since I tested it, but I believe the
> 

Re: [OMPI users] Node failure handling

2017-06-26 Thread r...@open-mpi.org
Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a 
bug if we can’t.

> On Jun 26, 2017, at 7:39 PM, Tim Burgess  wrote:
> 
> Hi Ralph,
> 
> Thanks for the quick response.
> 
> Just tried again not under slurm, but the same result... (though I
> just did kill -9 orted on the remote node this time)
> 
> Any ideas?  Do you think my multiple-mpirun idea is worth trying?
> 
> Cheers,
> Tim
> 
> 
> ```
> [user@bud96 mpi_resilience]$
> /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
> --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
> --debug-daemons $(pwd)/test
> ( some output from job here )
> ( I then do kill -9 `pgrep orted`  on pnod0331 )
> bash: line 1: 161312 Killed
> /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
> orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
> -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
> "581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
> -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> --
> ORTE has lost communication with a remote daemon.
> 
>  HNP daemon   : [[8878,0],0] on node bud96
>  Remote daemon: [[8878,0],1] on node pnod0331
> 
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --
> [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
> ```
> 
> On 27 June 2017 at 12:19, r...@open-mpi.org  wrote:
>> Ah - you should have told us you are running under slurm. That does indeed 
>> make a difference. When we launch the daemons, we do so with "srun 
>> --kill-on-bad-exit” - this means that slurm automatically kills the job if 
>> any daemon terminates. We take that measure to avoid leaving zombies behind 
>> in the event of a failure.
>> 
>> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the rsh 
>> launcher instead of the slurm one, which gives you more control.
>> 
>>> On Jun 26, 2017, at 6:59 PM, Tim Burgess  wrote:
>>> 
>>> Hi Ralph, George,
>>> 
>>> Thanks very much for getting back to me.  Alas, neither of these
>>> options seem to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
>>> recent master (7002535), with slurm's "--no-kill" and openmpi's
>>> "--enable-recovery", once the node reboots one gets the following
>>> error:
>>> 
>>> ```
>>> --
>>> ORTE has lost communication with a remote daemon.
>>> 
>>> HNP daemon   : [[58323,0],0] on node pnod0330
>>> Remote daemon: [[58323,0],1] on node pnod0331
>>> 
>>> This is usually due to either a failure of the TCP network
>>> connection to the node, or possibly an internal failure of
>>> the daemon itself. We cannot recover from this failure, and
>>> therefore will terminate the job.
>>> --
>>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
>>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
>>> ```
>>> 
>>> I haven't yet tried the hard reboot case with ULFM (these nodes take
>>> forever to come back up), but earlier experiments SIGKILLing the orted
>>> on a compute node led to a very similar message as above, so at this
>>> point I'm not optimistic...
>>> 
>>> I think my next step is to try with several separate mpiruns and use
>>> mpi_comm_{connect,accept} to plumb everything together before the
>>> application starts.  I notice this is the subject of some recent work
>>> on ompi master.  Even though the mpiruns will all be associated to the
>>> same ompi-server, do you think this could be sufficient to isolate the
>>> failures?
>>> 
>>> Cheers,
>>> Tim
>>> 
>>> 
>>> 
>>> On 10 June 2017 at 00:56, r...@open-mpi.org  wrote:
 It has been awhile since I tested it, but I believe the --enable-recovery 
 option might do what you want.
 
> On Jun 8, 2017, at 6:17 AM, Tim Burgess  wrote:
> 
> Hi!
> 
> So I know from searching the archive that this is a repeated topic of
> discussion here, and apologies for that, but since it's been a year or
> so I thought I'd double-check whether anything has changed before
> really starting to tear my hair out too much.
> 
> Is there a combination of MCA parameters or similar that will prevent
> ORTE from aborting a job when it detects a node failure?  This is
> using the tcp btl, under slurm.
> 
> The application, not written by us and 

Re: [OMPI users] Node failure handling

2017-06-26 Thread Tim Burgess
Hi Ralph,

Thanks for the quick response.

Just tried again not under slurm, but the same result... (though I
just did kill -9 orted on the remote node this time)

Any ideas?  Do you think my multiple-mpirun idea is worth trying?

Cheers,
Tim


```
[user@bud96 mpi_resilience]$
/d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
--host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
--debug-daemons $(pwd)/test
( some output from job here )
( I then do kill -9 `pgrep orted`  on pnod0331 )
bash: line 1: 161312 Killed
/d/home/user/2017/openmpi-master-20170608/bin/orted -mca
orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
-mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
"bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
"581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
-mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
--
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[8878,0],0] on node bud96
  Remote daemon: [[8878,0],1] on node pnod0331

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--
[bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
[bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
```

On 27 June 2017 at 12:19, r...@open-mpi.org  wrote:
> Ah - you should have told us you are running under slurm. That does indeed 
> make a difference. When we launch the daemons, we do so with "srun 
> --kill-on-bad-exit” - this means that slurm automatically kills the job if 
> any daemon terminates. We take that measure to avoid leaving zombies behind 
> in the event of a failure.
>
> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the rsh 
> launcher instead of the slurm one, which gives you more control.
>
>> On Jun 26, 2017, at 6:59 PM, Tim Burgess  wrote:
>>
>> Hi Ralph, George,
>>
>> Thanks very much for getting back to me.  Alas, neither of these
>> options seem to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
>> recent master (7002535), with slurm's "--no-kill" and openmpi's
>> "--enable-recovery", once the node reboots one gets the following
>> error:
>>
>> ```
>> --
>> ORTE has lost communication with a remote daemon.
>>
>>  HNP daemon   : [[58323,0],0] on node pnod0330
>>  Remote daemon: [[58323,0],1] on node pnod0331
>>
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --
>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
>> ```
>>
>> I haven't yet tried the hard reboot case with ULFM (these nodes take
>> forever to come back up), but earlier experiments SIGKILLing the orted
>> on a compute node led to a very similar message as above, so at this
>> point I'm not optimistic...
>>
>> I think my next step is to try with several separate mpiruns and use
>> mpi_comm_{connect,accept} to plumb everything together before the
>> application starts.  I notice this is the subject of some recent work
>> on ompi master.  Even though the mpiruns will all be associated to the
>> same ompi-server, do you think this could be sufficient to isolate the
>> failures?
>>
>> Cheers,
>> Tim
>>
>>
>>
>> On 10 June 2017 at 00:56, r...@open-mpi.org  wrote:
>>> It has been awhile since I tested it, but I believe the --enable-recovery 
>>> option might do what you want.
>>>
 On Jun 8, 2017, at 6:17 AM, Tim Burgess  wrote:

 Hi!

 So I know from searching the archive that this is a repeated topic of
 discussion here, and apologies for that, but since it's been a year or
 so I thought I'd double-check whether anything has changed before
 really starting to tear my hair out too much.

 Is there a combination of MCA parameters or similar that will prevent
 ORTE from aborting a job when it detects a node failure?  This is
 using the tcp btl, under slurm.

 The application, not written by us and too complicated to re-engineer
 at short notice, has a strictly master-slave communication pattern.
 The master never blocks on communication from individual slaves, and
 apparently can itself detect slaves that have silently disappeared and
 reissue the work to those remaining.  So from an application
 standpoint I believe we 

Re: [OMPI users] Node failure handling

2017-06-26 Thread Tim Burgess
Hi Ralph, George,

Thanks very much for getting back to me.  Alas, neither of these
options seem to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
recent master (7002535), with slurm's "--no-kill" and openmpi's
"--enable-recovery", once the node reboots one gets the following
error:

```
--
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[58323,0],0] on node pnod0330
  Remote daemon: [[58323,0],1] on node pnod0331

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--
[pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
[pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
```

I haven't yet tried the hard reboot case with ULFM (these nodes take
forever to come back up), but earlier experiments SIGKILLing the orted
on a compute node led to a very similar message as above, so at this
point I'm not optimistic...

I think my next step is to try with several separate mpiruns and use
mpi_comm_{connect,accept} to plumb everything together before the
application starts.  I notice this is the subject of some recent work
on ompi master.  Even though the mpiruns will all be associated to the
same ompi-server, do you think this could be sufficient to isolate the
failures?
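
A minimal sketch of that plumbing, for the record. It assumes the master side is a
single-rank mpirun, every mpirun is started with "--ompi-server <uri>" pointing at
the same ompi-server so Publish/Lookup can find each other, and the service name
and argv convention are invented for illustration — this is not the application's
actual code:

```
/* Connect/accept sketch: each worker group is launched by its own mpirun,
 * wired together through a shared ompi-server.  The service name
 * "resilience-master" is hypothetical. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;                 /* master <-> worker-group intercommunicator */
    int is_master = (argc > 1 && strcmp(argv[1], "master") == 0);

    MPI_Init(&argc, &argv);

    if (is_master) {                /* assumed to be a single-rank mpirun */
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("resilience-master", MPI_INFO_NULL, port);
        /* one accept per worker mpirun; a real code would loop here */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    } else {
        MPI_Lookup_name("resilience-master", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* ... work flows over "inter"; if one worker job dies, only that
     * intercommunicator is affected, not the other mpiruns ... */

    MPI_Comm_disconnect(&inter);
    if (is_master) {
        MPI_Unpublish_name("resilience-master", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    }
    MPI_Finalize();
    return 0;
}
```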

Cheers,
Tim



On 10 June 2017 at 00:56, r...@open-mpi.org  wrote:
> It has been awhile since I tested it, but I believe the --enable-recovery 
> option might do what you want.
>
>> On Jun 8, 2017, at 6:17 AM, Tim Burgess  wrote:
>>
>> Hi!
>>
>> So I know from searching the archive that this is a repeated topic of
>> discussion here, and apologies for that, but since it's been a year or
>> so I thought I'd double-check whether anything has changed before
>> really starting to tear my hair out too much.
>>
>> Is there a combination of MCA parameters or similar that will prevent
>> ORTE from aborting a job when it detects a node failure?  This is
>> using the tcp btl, under slurm.
>>
>> The application, not written by us and too complicated to re-engineer
>> at short notice, has a strictly master-slave communication pattern.
>> The master never blocks on communication from individual slaves, and
>> apparently can itself detect slaves that have silently disappeared and
>> reissue the work to those remaining.  So from an application
>> standpoint I believe we should be able to handle this.  However, in
>> all my testing so far the job is aborted as soon as the runtime system
>> figures out what is going on.
>>
>> If not, do any users know of another MPI implementation that might
>> work for this use case?  As far as I can tell, FT-MPI has been pretty
>> quiet the last couple of years?
>>
>> Thanks in advance,
>>
>> Tim
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Node failure handling

2017-06-09 Thread George Bosilca
Tim,

FT-MPI is gone, but the ideas it put forward have been refined, and the
software algorithms behind them improved, in a newer (and supported) project,
ULFM. It features a smaller API and a much more flexible approach. You
can find more information about it at http://fault-tolerance.org/. The
corresponding implementation (based on the older Open MPI 1.6 series) is
available at https://bitbucket.org/icldistcomp/ulfm.
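
As an illustration only, here is a minimal sketch of the ULFM style of surviving a
dead peer. It assumes the ULFM-enabled build, where the MPIX_* extensions and error
classes exist (they are not in stock Open MPI releases), and it is not a drop-in fix
for the application discussed in this thread:

```
/* ULFM sketch: keep going after a peer dies by shrinking the communicator. */
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ERR_PROC_FAILED, MPIX_Comm_shrink, ... */

int main(int argc, char **argv)
{
    MPI_Comm work = MPI_COMM_WORLD;
    int rc, eclass;

    MPI_Init(&argc, &argv);
    /* report failures as error codes instead of aborting the job */
    MPI_Comm_set_errhandler(work, MPI_ERRORS_RETURN);

    rc = MPI_Barrier(work);            /* stands in for any MPI call */
    if (rc != MPI_SUCCESS) {
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            MPI_Comm shrunk;
            MPIX_Comm_revoke(work);           /* make sure every rank notices */
            MPIX_Comm_shrink(work, &shrunk);  /* new comm without the dead ranks */
            work = shrunk;
            /* ... reissue the lost work and continue on "work" ... */
        }
    }

    MPI_Finalize();
    return 0;
}
```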

  George.



On Thu, Jun 8, 2017 at 9:17 AM, Tim Burgess wrote:

> Hi!
>
> So I know from searching the archive that this is a repeated topic of
> discussion here, and apologies for that, but since it's been a year or
> so I thought I'd double-check whether anything has changed before
> really starting to tear my hair out too much.
>
> Is there a combination of MCA parameters or similar that will prevent
> ORTE from aborting a job when it detects a node failure?  This is
> using the tcp btl, under slurm.
>
> The application, not written by us and too complicated to re-engineer
> at short notice, has a strictly master-slave communication pattern.
> The master never blocks on communication from individual slaves, and
> apparently can itself detect slaves that have silently disappeared and
> reissue the work to those remaining.  So from an application
> standpoint I believe we should be able to handle this.  However, in
> all my testing so far the job is aborted as soon as the runtime system
> figures out what is going on.
>
> If not, do any users know of another MPI implementation that might
> work for this use case?  As far as I can tell, FT-MPI has been pretty
> quiet the last couple of years?
>
> Thanks in advance,
>
> Tim
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
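
For reference, a minimal sketch of the kind of master loop described in the quoted
message: the master keeps one non-blocking receive outstanding per worker and never
blocks on any single rank, so a worker that silently disappears simply stops
returning results. The tags, work-item accounting, and "pretend" computation below
are invented for illustration, and the application-level timeout/reissue logic is
only indicated in comments:

```
/* Hypothetical master/worker skeleton (not the application's code).
 * Rank 0 hands out NITEMS work items; the master only ever uses
 * MPI_Test, never a blocking wait on a particular worker. */
#include <mpi.h>
#include <stdlib.h>

enum { TAG_WORK = 1, TAG_RESULT = 2, NITEMS = 1000 };

int main(int argc, char **argv)
{
    int nprocs, rank, i, flag;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                                   /* master */
        int next = 0, stopped = 0;
        int *buf = calloc(nprocs, sizeof(int));
        MPI_Request *req = malloc(nprocs * sizeof(MPI_Request));

        for (i = 1; i < nprocs; i++)                   /* one recv per worker */
            MPI_Irecv(&buf[i], 1, MPI_INT, i, TAG_RESULT,
                      MPI_COMM_WORLD, &req[i]);

        while (stopped < nprocs - 1) {
            for (i = 1; i < nprocs; i++) {
                if (req[i] == MPI_REQUEST_NULL)
                    continue;                          /* already told to stop */
                MPI_Test(&req[i], &flag, MPI_STATUS_IGNORE);
                if (!flag)       /* still working -- or silently dead; a real
                                    master would time this out and reissue */
                    continue;
                int item = (next < NITEMS) ? next++ : -1;
                MPI_Send(&item, 1, MPI_INT, i, TAG_WORK, MPI_COMM_WORLD);
                if (item >= 0)                         /* expect another result */
                    MPI_Irecv(&buf[i], 1, MPI_INT, i, TAG_RESULT,
                              MPI_COMM_WORLD, &req[i]);
                else
                    stopped++;                         /* -1 == no more work */
            }
        }
        free(buf); free(req);
    } else {                                           /* worker */
        int item, result = 0;
        for (;;) {
            MPI_Send(&result, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
            MPI_Recv(&item, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (item < 0)
                break;
            result = item * 2;                         /* pretend to compute */
        }
    }

    MPI_Finalize();
    return 0;
}
```
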
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Node failure handling

2017-06-09 Thread r...@open-mpi.org
It has been awhile since I tested it, but I believe the --enable-recovery 
option might do what you want.

> On Jun 8, 2017, at 6:17 AM, Tim Burgess  wrote:
> 
> Hi!
> 
> So I know from searching the archive that this is a repeated topic of
> discussion here, and apologies for that, but since it's been a year or
> so I thought I'd double-check whether anything has changed before
> really starting to tear my hair out too much.
> 
> Is there a combination of MCA parameters or similar that will prevent
> ORTE from aborting a job when it detects a node failure?  This is
> using the tcp btl, under slurm.
> 
> The application, not written by us and too complicated to re-engineer
> at short notice, has a strictly master-slave communication pattern.
> The master never blocks on communication from individual slaves, and
> apparently can itself detect slaves that have silently disappeared and
> reissue the work to those remaining.  So from an application
> standpoint I believe we should be able to handle this.  However, in
> all my testing so far the job is aborted as soon as the runtime system
> figures out what is going on.
> 
> If not, do any users know of another MPI implementation that might
> work for this use case?  As far as I can tell, FT-MPI has been pretty
> quiet the last couple of years?
> 
> Thanks in advance,
> 
> Tim
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users