Davdi edited a comment on issue #13526: distributed training van.cc Check failed
URL: https://github.com/apache/incubator-mxnet/issues/13526#issuecomment-444727742
 
 
   > As I see there are 3 different issues here:
   > 
   > ```
   > File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in check_call
   > raise MXNetError(py_str(LIB.MXGetLastError()))
   > mxnet.base.MXNetError: [08:54:25] src/van.cc:291: Check failed: (my_node.port) != (-1) bind failed
   > ```
   > 1. Host file -
   >    If you pass -n 2, there will be 2 workers and 2 servers. If you have only one line with a host and port, all of the processes will try to launch on the same port.
   >    So the workaround is the same as what I suggested earlier: please specify only the host and let mxnet choose the port.
   >    If you want to choose the ports yourself, find 4 different ports that are not in use and put 4 entries in the host file.
   > 
   > Ideally you should have multiple hosts for distributed training.
   > 
   > ```
   > Traceback (most recent call last):
   > File "/userhome/incubator-mxnet/tools/launch.py", line 128, in
   > main()
   > File "/userhome/incubator-mxnet/tools/launch.py", line 109, in main
   > raise RuntimeError('Unknown submission cluster type %s' % args.cluster)
   > RuntimeError: Unknown submission cluster type ssh
   > ```
   > This seems like a launch script issue. Can you try omitting the --launcher option on the command line and using the full host file path in the -H option?
   > 
   > ```
   > usage: image_classification.py [-h] [--dataset DATASET] [--data-dir DATA_DIR]
   > [--num-worker NUM_WORKERS]
   > [--batch-size BATCH_SIZE] [--gpus GPUS]
   > [--epochs EPOCHS] [--lr LR]
   > [--momentum MOMENTUM] [--wd WD] [--seed SEED]
   > [--mode MODE] --model MODEL [--use_thumbnail]
   > [--batch-norm] [--use-pretrained]
   > [--prefix PREFIX] [--start-epoch START_EPOCH]
   > [--resume RESUME] [--lr-factor LR_FACTOR]
   > [--lr-steps LR_STEPS] [--dtype DTYPE]
   > [--save-frequency SAVE_FREQUENCY]
   > [--kvstore KVSTORE]
   > [--log-interval LOG_INTERVAL] [--profile]
   > [--builtin-profiler BUILTIN_PROFILER]
   > image_classification.py: error: unrecognized arguments: epochs 1
   > ```
   > This is a problem with the training code. If it comes from the examples, it needs to be fixed.
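
As a rough illustration of the host-file point in item 1 of the quoted comment, the check below flags a host file that pins ports but has fewer distinct entries than processes. It is only a sketch: the `host:port` entry syntax, the function name `check_host_entries`, and the default of one server per worker (taken from the "-n 2 gives 2 workers and 2 servers" remark) are assumptions, not something read from launch.py.

```python
def check_host_entries(entries, num_workers, num_servers=None):
    """Return a list of problems found in host-file entries (empty if none).

    Hypothetical helper: assumes one entry per line, written either as
    "host" or as "host:port".
    """
    if num_servers is None:
        # Assumption from the quoted comment: -n N starts N workers and N servers.
        num_servers = num_workers
    needed = num_workers + num_servers

    # Drop blank lines and surrounding whitespace.
    cleaned = [e.strip() for e in entries if e.strip()]
    # Entries that pin a port explicitly.
    pinned = [e for e in cleaned if ":" in e]

    problems = []
    # Only complain when ports are pinned; host-only entries leave the
    # port choice to the framework.
    if pinned and len(set(cleaned)) < needed:
        problems.append(
            "%d processes but only %d distinct entries; pinned ports may collide"
            % (needed, len(set(cleaned)))
        )
    return problems
```

With a host-only file such as `ps-0`, `worker-0`, `worker-1`, the check reports nothing, which matches the suggested workaround of leaving the port choice to mxnet.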
   
   
   Thanks, I modified the hosts file. Its content is now:
   ps-0
   worker-0
   worker-1
   
   These entries stand for the IPs of the ps and the workers, and /root/.ssh/config contains:
   
   > 
   Host ps-0
     HostName 192.168.113.227
     Port 10015
     User root
     StrictHostKeyChecking no
     UserKnownHostsFile /dev/null
     IdentityFile /root/.ssh/application_1544059068811_0001
   Host worker-0
     HostName 192.168.113.227
     Port 10016
     User root
     StrictHostKeyChecking no
     UserKnownHostsFile /dev/null
     IdentityFile /root/.ssh/application_1544059068811_0001
   Host worker-1
     HostName 192.168.113.226
     Port 10023
     User root
     StrictHostKeyChecking no
     UserKnownHostsFile /dev/null
     IdentityFile /root/.ssh/application_1544059068811_0001
   > 
   
   
   
   But when I run the command
   
   `../../tools/launch.py -n 2 -H hosts --launcher ssh python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync`
   
   it shows this error:
   
   Warning: Permanently added '192.168.113.227' (ECDSA) to the list of known hosts.
   Warning: Permanently added '192.168.113.227' (ECDSA) to the list of known hosts.
   Warning: Permanently added '192.168.113.227' (ECDSA) to the list of known hosts.
   Warning: Permanently added '192.168.113.226' (ECDSA) to the list of known hosts.
   root@192.168.113.227's password: root@192.168.113.227's password: root@192.168.113.227's password: root@192.168.113.226's password:
   
   
   It seems a password is being requested, but I use a private key with no password, and when I run `ssh worker-0` it logs in successfully.
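
One way to check whether key-based login works without any prompt for each alias is to run ssh in batch mode, where it fails instead of asking for a password. The sketch below is only a diagnostic aid under stated assumptions: the helper names are hypothetical, the host aliases are the ones from the hosts file above, and it does not reproduce the exact ssh command that launch.py builds.

```python
import subprocess


def batch_ssh_command(host):
    """Build an ssh command that never prompts (hypothetical helper)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # fail instead of prompting for a password
        "-o", "ConnectTimeout=5",   # do not hang on unreachable hosts
        host,
        "true",                     # trivial remote command
    ]


def can_login_without_password(host):
    """Return True if ssh to `host` succeeds without any interactive prompt."""
    result = subprocess.run(
        batch_ssh_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Aliases taken from the hosts file in this comment.
    for host in ("ps-0", "worker-0", "worker-1"):
        print(host, can_login_without_password(host))
```

If an alias passes here but the launcher still prompts, the launcher is probably connecting with different options (user, port, or identity file) than the ones in /root/.ssh/config.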
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
