Vikas89 commented on issue #13526: distributed training van.cc Check failed
URL: https://github.com/apache/incubator-mxnet/issues/13526#issuecomment-444379161

Are you running in containers? You said "the env is 1 ps and 1 worker and should i follow the instruction on all the 2 container ?" but the hosts file has only one entry. Shouldn't it have 2 entries?

Can you try these steps:
1) Make sure that no python process is running on any of the hosts.
2) Make sure that, for every entry in the hosts file, you can ssh to that host from the master node (the node where you ran launch.py).
3) If you are using EC2 instances, try using the private IPs in the hosts file.
4) Launch distributed training without the port: 192.168.113.223. MXNet will automatically choose a port for the workers.

If the error persists:
5) Paste the output of: echo $env
6) Paste the output of: cat hosts
7) For each entry in the hosts file, ssh to the host and paste the output of: ps -efl | grep python
8) Paste the launch command and the entire log that you get after running launch.py.
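The host checks above can be scripted. The following is only a rough sketch, not part of MXNet or launch.py: it assumes the hosts file has one entry per line, optionally with a trailing `:port`, and that key-based ssh is already set up. The helper names `parse_hosts`, `ssh_command`, and `check_hosts` are made up for this illustration.

```python
import subprocess
import sys


def parse_hosts(text):
    """Return the host part of each non-blank line, dropping any ':port'.

    Assumption: entries are host names or IPv4 addresses, one per line.
    """
    hosts = []
    for line in text.splitlines():
        entry = line.strip()
        if entry:
            hosts.append(entry.split(":", 1)[0])
    return hosts


def ssh_command(host, remote_cmd):
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            host, remote_cmd]


def check_hosts(path):
    with open(path) as f:
        hosts = parse_hosts(f.read())
    print(f"{len(hosts)} entries in {path}: {hosts}")
    for host in hosts:
        result = subprocess.run(
            ssh_command(host, "ps -efl | grep python"),
            capture_output=True, text=True,
        )
        # ssh itself exits with 255 when the connection fails.
        if result.returncode == 255:
            print(f"[{host}] ssh failed: {result.stderr.strip()}")
        else:
            print(f"[{host}] python processes:\n{result.stdout}")


if __name__ == "__main__" and len(sys.argv) == 2:
    check_hosts(sys.argv[1])
```

Run it from the master node with the hosts file as the only argument, for example `python check_hosts.py hosts` (the script name is arbitrary). The entry count it prints should match the number of machines or containers you expect, and the `grep` process itself may show up in the per-host output.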
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services