Davdi edited a comment on issue #13526: distributed training van.cc Check failed
URL: https://github.com/apache/incubator-mxnet/issues/13526#issuecomment-444727742

> As I see it, there are 3 different issues here:
>
> ```
> File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in check_call
>     raise MXNetError(py_str(LIB.MXGetLastError()))
> mxnet.base.MXNetError: [08:54:25] src/van.cc:291: Check failed: (my_node.port) != (-1) bind failed
> ```
> 1. Host file -
> If you say -n 2, there will be 2 workers and 2 servers. If you have only one line with host and port, all of the processes will try to launch on the same port.
> So the workaround is the same as what I suggested earlier: please use only the host and let mxnet choose the port.
> If you want to choose the ports yourself, find 4 different ports which are not in use and put 4 entries in the host file.
>
> Ideally you should have multiple hosts for distributed training.
>
> ```
> Traceback (most recent call last):
>   File "/userhome/incubator-mxnet/tools/launch.py", line 128, in
>     main()
>   File "/userhome/incubator-mxnet/tools/launch.py", line 109, in main
>     raise RuntimeError('Unknown submission cluster type %s' % args.cluster)
> RuntimeError: Unknown submission cluster type ssh
> ```
> This seems like a launch script issue.
> Can you try not giving the --launcher option on the command line, and use the full host file path in the -H option?
>
> ```
> usage: image_classification.py [-h] [--dataset DATASET] [--data-dir DATA_DIR]
>                                [--num-worker NUM_WORKERS]
>                                [--batch-size BATCH_SIZE] [--gpus GPUS]
>                                [--epochs EPOCHS] [--lr LR]
>                                [--momentum MOMENTUM] [--wd WD] [--seed SEED]
>                                [--mode MODE] --model MODEL [--use_thumbnail]
>                                [--batch-norm] [--use-pretrained]
>                                [--prefix PREFIX] [--start-epoch START_EPOCH]
>                                [--resume RESUME] [--lr-factor LR_FACTOR]
>                                [--lr-steps LR_STEPS] [--dtype DTYPE]
>                                [--save-frequency SAVE_FREQUENCY]
>                                [--kvstore KVSTORE]
>                                [--log-interval LOG_INTERVAL] [--profile]
>                                [--builtin-profiler BUILTIN_PROFILER]
> image_classification.py: error: unrecognized arguments: epochs 1
> ```
> This is a problem with the training code. If it is coming from the examples, this needs to be fixed.

Thanks. I modified the hosts file, and its content is now:

    ps-0
    worker-0
    worker-1

These are the addresses of the ps and the workers, and /root/.ssh/config contains:

    Host ps-0
      HostName 192.168.113.227
      Port 10015
      User root
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
      IdentityFile /root/.ssh/application_1544059068811_0001
    Host worker-0
      HostName 192.168.113.227
      Port 10016
      User root
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
      IdentityFile /root/.ssh/application_1544059068811_0001
    Host worker-1
      HostName 192.168.113.226
      Port 10023
      User root
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
      IdentityFile /root/.ssh/application_1544059068811_0001

But when I run the command
` ../../tools/launch.py -n 2 -H hosts --launcher ssh python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync `
it shows this error:

    Warning: Permanently added '192.168.113.227' (ECDSA) to the list of known hosts.
    Warning: Permanently added '192.168.113.227' (ECDSA) to the list of known hosts.
    Warning: Permanently added '192.168.113.227' (ECDSA) to the list of known hosts.
    Warning: Permanently added '192.168.113.226' (ECDSA) to the list of known hosts.
    root@192.168.113.227's password: root@192.168.113.227's password: root@192.168.113.227's password: root@192.168.113.226's password:

It seems that a password is required, but I am using a private key with no password, and when I run `ssh worker-0` it logs in successfully.
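
One way to narrow this down might be to try each hosts-file entry the way a launcher would, with no terminal available to type a password. The sketch below is only a diagnostic under stated assumptions and is not part of `launch.py`: the function names are hypothetical, the hosts file is assumed to hold one host per line, and `BatchMode` and `ConnectTimeout` are standard OpenSSH options. With `BatchMode=yes`, ssh exits with an error instead of prompting for a password.

```python
import subprocess


def batch_ssh_command(host, remote_cmd="true"):
    """Build an ssh command that fails instead of prompting for a password."""
    # BatchMode=yes disables password/passphrase prompts, so a host that would
    # fall back to password authentication exits non-zero rather than hanging.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            host, remote_cmd]


def read_hosts(path):
    """Return the non-empty lines of a hosts file (assumed: one host per line)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def check_hosts(hosts):
    """Map each host to True if a non-interactive ssh login succeeded."""
    results = {}
    for host in hosts:
        proc = subprocess.run(batch_ssh_command(host),
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        # A non-zero exit code covers any failure: an unreachable host,
        # a rejected key, or a login that would have needed a password.
        results[host] = proc.returncode == 0
    return results


# Example usage (assumes a file named "hosts" in the current directory):
#   for host, ok in check_hosts(read_hosts("hosts")).items():
#       print(host, "ok" if ok else "FAILED")
```

If an alias such as `worker-0` passes here but the launcher still asks for a password, comparing the host names the launcher actually connects to against the `Host` entries in the ssh config could be a reasonable next step.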