happylee524 opened a new issue, #21005: URL: https://github.com/apache/incubator-mxnet/issues/21005
## Description

I run a distributed training program with `launch.py`, but after the training task finishes, the program does not exit or return.

### Error Message

This is the logging (not included in this message).

This is my code (not included in this message).

### Steps to reproduce

1. Start command: `python /usr/local/lib/python3.8/dist-packages/mxnet/tools/launch.py -n 2 -H /tmp/algorithm/Host --sync-dst-dir /tmp/algorithm/mnist_sync --launcher ssh "python /tmp/algorithm/image_classification.py --dataset mnist --model alexnet --epochs 3 --gpus 0,1"`
2. I run my code in two k8s pods with the image `mxnet/python:2.0.0beta1_gpu_cu110_py3`.
3. MXNet version: 2.0.0
4. My code: `image_classification.py`, copied from the MXNet GitHub repository.
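For anyone trying to reproduce the hang from a script, the start command in step 1 can be expressed as an argv list and run under a timeout, so that "the program does not exit" becomes a detectable result instead of a stuck terminal. This is a minimal sketch, not part of MXNet: `build_launch_argv` and `run_with_timeout` are hypothetical helpers, the default paths and flags are copied verbatim from the reported command, and it assumes `launch.py` is invoked exactly as shown there.

```python
import shlex
import subprocess

# Paths and flags copied from the reported start command (hypothetical defaults).
LAUNCH_PY = "/usr/local/lib/python3.8/dist-packages/mxnet/tools/launch.py"
TRAIN_CMD = ("python /tmp/algorithm/image_classification.py "
             "--dataset mnist --model alexnet --epochs 3 --gpus 0,1")


def build_launch_argv(num_workers=2, hostfile="/tmp/algorithm/Host",
                      sync_dir="/tmp/algorithm/mnist_sync", launcher="ssh",
                      train_cmd=TRAIN_CMD, interpreter="python"):
    """Return an argv list equivalent to the shell command in the issue."""
    return [interpreter, LAUNCH_PY,
            "-n", str(num_workers),
            "-H", hostfile,
            "--sync-dst-dir", sync_dir,
            "--launcher", launcher,
            train_cmd]  # the training command stays one argument, as when quoted in the shell


def run_with_timeout(argv, timeout_s):
    """Run argv and return its exit code, or None if it did not exit in time."""
    try:
        return subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the direct child at this point.
        return None


# Usage (not executed here): print the shell form, then run it with a deadline.
#   argv = build_launch_argv()
#   print(shlex.join(argv))
#   print(run_with_timeout(argv, timeout_s=3600))  # None means it never exited
```

On timeout, `subprocess.run` kills only the local `launch.py` process; the ssh sessions and the worker processes on the other pod may keep running and need to be stopped separately.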
