[MXNet Forum] [Discussion] One node failure but other nodes hang in multi-node distributed training

Hwu via MXNet Forum Tue, 17 Nov 2020 23:00:15 -0800


We implemented multi-node training with MXNet on AWS Batch, using the ps-lite
parameter server. Occasionally one node fails due to a run-time error or
exception, but the other nodes keep running or waiting. Is there any way to
terminate the whole multi-node training job when one node fails? For example,
is there any way for the failed node to signal or message the other nodes so
that they terminate as well? Thanks.
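
To make the request concrete, here is a minimal sketch of the kind of mechanism being asked about. It is not an MXNet or ps-lite feature. It assumes every node can see a common storage location (the marker path is hypothetical) and that the training command is started through a small wrapper. The function name `run_with_watchdog` and its parameters are made up for illustration.

```python
import subprocess
import sys
import time
from pathlib import Path


def run_with_watchdog(cmd, marker, poll_seconds=1.0):
    """Run cmd and stop it early if another node has reported a failure.

    `marker` is a hypothetical file on storage that every node can see.
    Returns the exit code of the local process.
    """
    marker = Path(marker)
    proc = subprocess.Popen(cmd)
    while True:
        code = proc.poll()
        if code is not None:
            if code != 0:
                # Local failure: leave a marker so the other nodes stop too.
                marker.write_text(f"exit code {code}\n")
            return code
        if marker.exists():
            # Another node failed: stop the local process instead of hanging.
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
            return proc.returncode
        time.sleep(poll_seconds)
```

Each node would call something like `run_with_watchdog([sys.executable, "train.py"], "/shared/job/FAILED")`, where both the script name and the path are placeholders. Polling a file is the simplest option to reason about, but it only helps if the failing process actually exits. A node that hangs without exiting would need a separate timeout or heartbeat.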






---
[Visit Topic](https://discuss.mxnet.apache.org/t/one-node-failure-but-other-nodes-hang-in-mulit-node-distributed-training/6732/1)
or reply to this email to respond.

