Vikas89 commented on issue #13526: distributed training  van.cc Check failed
URL: 
https://github.com/apache/incubator-mxnet/issues/13526#issuecomment-444379161
 
 
   Are you running in containers?
   You said "the env is 1 ps and 1 worker and should i follow the instruction 
on all the 2 container ?"
   but the hosts file has only one entry. Shouldn't it have 2 entries?
   
   Can you try these steps:
   1) Make sure that no python process is running on any of the hosts.
   2) Make sure that, for every entry in the hosts file, you can ssh to that 
host from the master node (the node where you ran launch.py).
   3) If you are using EC2 instances, try using private IPs in the hosts file.
   4) Launch distributed training without specifying a port (just 
192.168.113.223); MXNet will automatically choose a port for the workers.
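
   The checks above can be sketched as a few shell helpers. This is only a 
rough sketch, not part of MXNet: the function names are made up for 
illustration, and it assumes the hosts file holds one host per line, with 
blank lines and lines starting with "#" ignored.

```shell
# Hypothetical helpers (not part of MXNet or launch.py).
# Assumption: the hosts file lists one host per line.

# Print the non-empty, non-comment entries of a hosts file.
list_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Count the entries; 1 server and 1 worker in separate containers
# would normally mean 2 entries.
count_hosts() {
    list_hosts "$1" | wc -l | tr -d '[:space:]'
}

# Print entries that carry an explicit ":port" suffix.
hosts_with_port() {
    list_hosts "$1" | grep -E ':[0-9]+[[:space:]]*$'
}

# Try a non-interactive ssh login to every entry.
# Run this from the master node (the node where launch.py is started).
check_ssh() {
    for h in $(list_hosts "$1"); do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$h" true 2>/dev/null; then
            echo "ok   $h"
        else
            echo "FAIL $h"
        fi
    done
}
```

   For example, after sourcing the helpers, "count_hosts hosts" prints the 
number of entries, "hosts_with_port hosts" lists entries that still carry a 
port, and "check_ssh hosts" reports which hosts accept a passwordless login. 
An entry that still has a port suffix is expected to show up as FAIL in 
check_ssh, because ssh treats the whole string as the host name.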
   
   
   If the error persists:
   5) Paste the output of:
   echo $env
   6) Paste the output of:
   cat hosts
   7) For each entry in the hosts file, ssh to the host and paste the output of:
   ps -efl | grep python
   8) Paste the launch command and the entire log that you get after running 
launch.py.
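
   If it helps, the information requested above could be gathered in one go 
with something like the following. It is a sketch under assumptions: 
collect_report is a made-up name, the hosts file is assumed to hold one 
plain host name or IP per line, and the environment is filtered to the 
DMLC_, PS_ and MXNET_ prefixes on the assumption that those are the 
variables relevant here (drop the filter to paste everything).

```shell
# Hypothetical helper (not part of MXNet): print the environment,
# the hosts file, and the python processes on every listed host.
collect_report() {
    hosts_file=$1

    echo "== environment =="
    # Assumption: only DMLC_*, PS_* and MXNET_* variables matter here.
    env | grep -E '^(DMLC|PS|MXNET)_' || true

    echo "== hosts =="
    cat "$hosts_file"

    while read -r h; do
        [ -n "$h" ] || continue          # skip blank lines
        echo "== python processes on $h =="
        # '[p]ython' keeps grep from matching its own command line.
        # -n stops ssh from consuming the rest of the hosts file on stdin.
        ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$h" \
            "ps -efl | grep '[p]ython'" || echo "(nothing found or host unreachable)"
    done < "$hosts_file"
}
```

   Running "collect_report hosts > report.txt" from the master node would 
then give a single file to paste, next to the launch command and the full 
launch.py log.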
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
