ptrendx commented on issue #7455: Distributed training is slow
URL: https://github.com/apache/incubator-mxnet/issues/7455#issuecomment-322538407
 
 
   @szha That is not entirely true. I agree with @solin319 that this 
WaitToRead should not be necessary: the actual communication is done in the 
lambda pushed to the engine, which has send_buf as a read dependency, so it 
will wait for send_buf to be ready. What is more, the blocking WaitToRead 
delays scheduling the GPU-to-CPU copies for subsequent communications, which 
limits scaling.
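   To illustrate the dependency argument, here is a minimal sketch of a toy 
dependency engine in Python. It is not MXNet's engine or its API: ToyEngine, 
push, copy_to_cpu, send and the send_buf_<key> variable names are all 
hypothetical, and only read-after-write ordering is modelled. It only shows 
the idea that an operation declaring a buffer as a read dependency runs after 
the pending write to that buffer, without the scheduling thread blocking.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyEngine:
    """Toy dependency engine (an illustration, not MXNet's implementation)."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._last_write = {}  # variable name -> Future of its latest writer

    def push(self, fn, reads=(), writes=()):
        # Snapshot the pending writers of every variable this op touches.
        deps = [self._last_write[v] for v in (*reads, *writes)
                if v in self._last_write]

        def run():
            for dep in deps:
                dep.result()  # the wait happens in a worker, not in the caller
            return fn()

        future = self._pool.submit(run)
        for v in writes:
            self._last_write[v] = future
        return future


def demo(copy_seconds=0.2):
    engine = ToyEngine()
    buffers, log = {}, []

    def copy_to_cpu(key):
        def op():
            time.sleep(copy_seconds)  # stand-in for a GPU-to-CPU copy
            buffers[key] = f"grad_{key}"
        return op

    def send(key):
        def op():
            log.append(("sent", key, buffers[key]))
        return op

    start = time.perf_counter()
    sends = []
    for key in range(3):
        engine.push(copy_to_cpu(key), writes=[f"send_buf_{key}"])
        # No blocking wait here: the read dependency orders send after copy.
        sends.append(engine.push(send(key), reads=[f"send_buf_{key}"]))
    schedule_seconds = time.perf_counter() - start

    for future in sends:
        future.result()
    return log, schedule_seconds
```

   In this toy model every send sees its buffer filled even though the loop 
never waits, and the loop finishes scheduling before the first copy 
completes. Adding a blocking wait after each copy would serialize the loop, 
which is the scaling concern raised above.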
   The PR introducing that line mentions crashes when using kvstore in 
imperative mode. I am not really familiar with how much the imperative path 
differs from the symbolic one as far as the engine is concerned, but I don't 
think it should differ so much that the dependencies stop working. This is 
definitely a bug.
 
