We implemented multi-node training with MXNet on AWS Batch, using the ps-lite parameter server. Occasionally one node fails because of a run-time error or exception, while the other nodes keep running or waiting. Is there any way to terminate the whole multi-node training job when one node fails? For example, can the failed node signal or message the other nodes so that they terminate as well? Thanks.
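
To make the question concrete, here is a minimal sketch of one possible workaround, not a confirmed MXNet or ps-lite feature: wrap the training command on every node and, if it exits with a non-zero code, ask AWS Batch to terminate the whole job. It assumes `boto3` is installed in the container, that the job role is allowed to call `batch:TerminateJob`, that `AWS_BATCH_JOB_ID` is set and may carry a `#<node index>` suffix on child nodes, and that terminating the parent job ID stops all nodes. The function names are hypothetical.

```python
import os
import subprocess
import sys


def parent_job_id(job_id):
    """Drop a '#<node index>' suffix, if present (assumed child-node ID format)."""
    return job_id.split("#", 1)[0]


def run_and_report(cmd, on_failure):
    """Run cmd as a child process; call on_failure(exit_code) if it exits non-zero."""
    exit_code = subprocess.call(cmd)
    if exit_code != 0:
        on_failure(exit_code)
    return exit_code


def terminate_batch_job(exit_code):
    """Ask AWS Batch to terminate the whole job (assumes boto3 and IAM permission)."""
    import boto3  # imported lazily so the wrapper has no hard dependency on boto3

    job_id = parent_job_id(os.environ["AWS_BATCH_JOB_ID"])
    boto3.client("batch").terminate_job(
        jobId=job_id,
        reason="node exited with code %d" % exit_code,
    )


def main(argv):
    """argv is the training command, e.g. ['python', 'train.py']."""
    return run_and_report(argv, terminate_batch_job)
```

The container entry point would then call `sys.exit(main(sys.argv[1:]))` with the real training command as arguments. This only covers failures where the training process actually exits; a node that hangs without exiting would need a separate timeout or heartbeat check.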
--- [Visit Topic](https://discuss.mxnet.apache.org/t/one-node-failure-but-other-nodes-hang-in-mulit-node-distributed-training/6732/1)
