ptrendx commented on issue #7455: Distributed training is slow
URL: https://github.com/apache/incubator-mxnet/issues/7455#issuecomment-322538407

@szha That is not entirely true. I agree with @solin319 that this WaitToRead should not be necessary: the actual communication is done in the lambda pushed to the engine, which has send_buf as a read dependency, so it will wait for send_buf to be ready. What is more, the call delays scheduling of the GPU-to-CPU copies for subsequent communications, which limits scaling.

The PR that introduced that line mentions crashes when using kvstore in imperative mode. I'm not really familiar with how much the imperative path differs from the symbolic one as far as the engine is concerned, but I don't think it should differ so much that the dependencies stop working. This is definitely a bug.
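
To illustrate the scheduling argument, here is a toy virtual-time model of an asynchronous dependency engine. It is only a sketch and not MXNet's engine: `ToyEngine`, `schedule_pushes`, the variable names, and the durations (1 time unit per copy, 2 per send) are all hypothetical, and it assumes that operations on independent variables can overlap freely. Under those assumptions, the read dependency alone keeps each send from starting before its copy finishes, while a blocking wait only postpones the scheduling of later copies.

```python
class ToyEngine:
    """Virtual-time model of an asynchronous dependency engine (not MXNet's)."""

    def __init__(self):
        self.now = 0.0       # virtual clock of the thread that schedules work
        self.ready_at = {}   # variable name -> time its pending write completes

    def push(self, duration, reads=(), writes=()):
        """Schedule an op; it starts once every variable it touches is ready."""
        deps = [self.ready_at.get(v, 0.0) for v in (*reads, *writes)]
        start = max([self.now, *deps])
        finish = start + duration
        for v in writes:
            self.ready_at[v] = finish
        return start, finish

    def wait_to_read(self, var):
        """Block the scheduling thread until `var` has been written."""
        self.now = max(self.now, self.ready_at.get(var, 0.0))


def schedule_pushes(n_keys, block_before_send, copy_time=1.0, send_time=2.0):
    """Return (start time of each copy, time at which the last send finishes)."""
    engine = ToyEngine()
    copy_starts, makespan = [], 0.0
    for key in range(n_keys):
        send_buf = f"send_buf[{key}]"
        # GPU-to-CPU copy writes the send buffer.
        start, _ = engine.push(copy_time, writes=[send_buf])
        copy_starts.append(start)
        if block_before_send:
            # Hypothetical stand-in for the blocking wait under discussion.
            engine.wait_to_read(send_buf)
        # The send reads the buffer, so the engine orders it after the copy.
        _, finish = engine.push(send_time, reads=[send_buf])
        makespan = max(makespan, finish)
    return copy_starts, makespan


if __name__ == "__main__":
    print(schedule_pushes(4, block_before_send=True))   # ([0.0, 1.0, 2.0, 3.0], 6.0)
    print(schedule_pushes(4, block_before_send=False))  # ([0.0, 0.0, 0.0, 0.0], 3.0)
```

In this model the blocking variant starts each copy one time unit after the previous one, whereas the dependency-only variant schedules all copies immediately. Real engines have finite worker threads and bandwidth, so the actual gap would depend on the hardware.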