Re: [OMPI users] Node failure handling

2017-06-26 Thread r...@open-mpi.org
Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a 
bug if we can’t.

> On Jun 26, 2017, at 7:39 PM, Tim Burgess  wrote:
> 
> Hi Ralph,
> 
> Thanks for the quick response.
> 
> Just tried again, not under slurm, but got the same result... (though I
> just did a kill -9 on orted on the remote node this time)
> 
> Any ideas?  Do you think my multiple-mpirun idea is worth trying?
> 
> Cheers,
> Tim
> 
> 
> ```
> [user@bud96 mpi_resilience]$
> /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
> --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
> --debug-daemons $(pwd)/test
> ( some output from job here )
> ( I then do kill -9 `pgrep orted`  on pnod0331 )
> bash: line 1: 161312 Killed
> /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
> orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
> -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
> "581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
> -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
> 
>  HNP daemon   : [[8878,0],0] on node bud96
>  Remote daemon: [[8878,0],1] on node pnod0331
> 
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
> [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
> ```
> 
> On 27 June 2017 at 12:19, r...@open-mpi.org  wrote:
>> Ah - you should have told us you are running under slurm. That does indeed 
>> make a difference. When we launch the daemons, we do so with "srun 
>> --kill-on-bad-exit" - this means that slurm automatically kills the job if 
>> any daemon terminates. We take that measure to avoid leaving zombies behind 
>> in the event of a failure.
>> 
>> Try adding "-mca plm rsh" to your mpirun cmd line. This will use the rsh 
>> launcher instead of the slurm one, which gives you more control.
>> 
>>> On Jun 26, 2017, at 6:59 PM, Tim Burgess  wrote:
>>> 
>>> Hi Ralph, George,
>>> 
>>> Thanks very much for getting back to me.  Alas, neither of these
>>> options seems to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
>>> recent master (7002535), with slurm's "--no-kill" and openmpi's
>>> "--enable-recovery", once the node reboots one gets the following
>>> error:
>>> 
>>> ```
>>> --------------------------------------------------------------------------
>>> ORTE has lost communication with a remote daemon.
>>> 
>>> HNP daemon   : [[58323,0],0] on node pnod0330
>>> Remote daemon: [[58323,0],1] on node pnod0331
>>> 
>>> This is usually due to either a failure of the TCP network
>>> connection to the node, or possibly an internal failure of
>>> the daemon itself. We cannot recover from this failure, and
>>> therefore will terminate the job.
>>> --------------------------------------------------------------------------
>>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
>>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
>>> ```
>>> 
>>> I haven't yet tried the hard reboot case with ULFM (these nodes take
>>> forever to come back up), but earlier experiments SIGKILLing the orted
>>> on a compute node led to a very similar message to the one above, so at this
>>> point I'm not optimistic...
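>>>
>>> (For what it's worth, the kind of recovery I'd be hoping ULFM makes
>>> possible is roughly the untested sketch below, using the MPIX_ calls as
>>> I understand them: revoke the communicator once a receive fails with a
>>> process-failure error, shrink it to the survivors, requeue that work.)
>>>
>>> ```
>>> /* Untested sketch; assumes a ULFM build where <mpi-ext.h> declares the
>>>  * MPIX_ fault-tolerance calls and the communicator's error handler has
>>>  * been set to MPI_ERRORS_RETURN so failures are reported, not fatal.  */
>>> #include <mpi.h>
>>> #include <mpi-ext.h>
>>>
>>> /* Replace a communicator that has lost a process with one containing
>>>  * only the surviving ranks.                                           */
>>> static void shrink_after_failure(MPI_Comm *comm)
>>> {
>>>     MPI_Comm survivors;
>>>     MPIX_Comm_revoke(*comm);              /* fail pending ops everywhere */
>>>     MPIX_Comm_shrink(*comm, &survivors);  /* collective over survivors   */
>>>     *comm = survivors;                    /* (old, revoked comm leaked)  */
>>> }
>>>
>>> /* In the master's receive path:
>>>  *   int rc = MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
>>>  *                     comm, &status);
>>>  *   int cls;  MPI_Error_class(rc, &cls);
>>>  *   if (cls == MPIX_ERR_PROC_FAILED || cls == MPIX_ERR_REVOKED)
>>>  *       shrink_after_failure(&comm);     // then requeue that task
>>>  */
>>> ```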
>>> 
>>> I think my next step is to try with several separate mpiruns and use
>>> mpi_comm_{connect,accept} to plumb everything together before the
>>> application starts.  I notice this is the subject of some recent work
>>> on ompi master.  Even though the mpiruns will all be associated to the
>>> same ompi-server, do you think this could be sufficient to isolate the
>>> failures?
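>>>
>>> Roughly the plumbing I have in mind, as a sketch (the service name
>>> "resilience-test" and the paths in the comment are made up; each mpirun
>>> would be pointed at the same ompi-server so the published name is
>>> visible across jobs):
>>>
>>> ```
>>> /* Sketch of wiring separate mpiruns together.  Launch would be roughly:
>>>  *   ompi-server --report-uri /shared/ompi-server.uri
>>>  *   mpirun --ompi-server file:/shared/ompi-server.uri -np 1 ./plumb server
>>>  *   mpirun --ompi-server file:/shared/ompi-server.uri -np 8 ./plumb
>>>  * (With several client mpiruns the server side would loop on accept.)  */
>>> #include <mpi.h>
>>> #include <string.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int rank;
>>>     char port[MPI_MAX_PORT_NAME];
>>>     MPI_Comm inter;
>>>     int is_server = (argc > 1 && strcmp(argv[1], "server") == 0);
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     if (is_server) {
>>>         if (rank == 0) {
>>>             MPI_Open_port(MPI_INFO_NULL, port);
>>>             MPI_Publish_name("resilience-test", MPI_INFO_NULL, port);
>>>         }
>>>         /* the port string is only significant at the root (rank 0) */
>>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>>>     } else {
>>>         if (rank == 0)   /* real code would retry until the name appears */
>>>             MPI_Lookup_name("resilience-test", MPI_INFO_NULL, port);
>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>>>     }
>>>
>>>     /* ... master/slave traffic over the intercommunicator ... */
>>>
>>>     MPI_Comm_disconnect(&inter);
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>> ```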
>>> 
>>> Cheers,
>>> Tim
>>> 
>>> 
>>> 
>>> On 10 June 2017 at 00:56, r...@open-mpi.org  wrote:
>>>> It has been awhile since I tested it, but I believe the --enable-recovery
>>>> option might do what you want.
>>>>
>>>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess  wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> So I know from searching the archive that this is a repeated topic of
>>>>> discussion here, and apologies for that, but since it's been a year or
>>>>> so I thought I'd double-check whether anything has changed before
>>>>> really starting to tear my hair out too much.
>>>>>
>>>>> Is there a combination of MCA parameters or similar that will prevent
>>>>> ORTE from aborting a job when it detects a node failure?  This is
>>>>> using the tcp btl, under slurm.
>>>>>
>>>>> The application, not written by us and too complicated to re-engineer
>>>>> at short notice, has a strictly master-slave communication pattern.
>>>>> The master never blocks on communication from individual slaves, and
>>>>> apparently can itself detect slaves that have silently disappeared and
>>>>> reissue the work to those remaining.  So from an application
>>>>> standpoint I believe we should be able to handle this.  However, in
>>>>> all my testing so far the job is aborted as soon as the runtime system
>>>>> figures out what is going on.
>>>>>
>>>>> If not, do any users know of another MPI implementation that might
>>>>> work for this use case?  As far as I can tell, FT-MPI has been pretty
>>>>> quiet the last couple of years?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Tim
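>>>>>
>>>>> P.S. For concreteness, this is roughly the shape of the pattern I mean
>>>>> (a simplified sketch, not the actual application code, which isn't ours):
>>>>>
>>>>> ```
>>>>> /* The master polls with a nonblocking receive, so a slave that
>>>>>  * vanishes only costs it a stale task to requeue, never a hang.  */
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> void master_poll(MPI_Comm comm)
>>>>> {
>>>>>     double      result;
>>>>>     MPI_Request req;
>>>>>     MPI_Status  st;
>>>>>     int         done = 0;
>>>>>
>>>>>     MPI_Irecv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, comm, &req);
>>>>>     while (!done) {
>>>>>         MPI_Test(&req, &done, &st);
>>>>>         if (!done) {
>>>>>             /* do other work; a slave that has been silent too long
>>>>>              * gets marked dead and its task reissued elsewhere    */
>>>>>         }
>>>>>     }
>>>>>     printf("result %g from slave %d\n", result, st.MPI_SOURCE);
>>>>> }
>>>>> ```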
