YouhuiBai commented on issue #15674: Straggler in latest mxnet when training with distributed parameter server
URL: https://github.com/apache/incubator-mxnet/issues/15674#issuecomment-515666507

The straggler is caused by the reuse of pageable memory. KVStoreDist (kvstore_dist.h) keeps an unordered map from int to NDArray, comm_buf_. When compression is not active, this buffer holds the data for both pull and push; that is, push and pull share the same buffer.

When the distributed kvstore starts, the first worker (rank 0) pushes all keys of the model to the server(s) and initializes comm_buf_ with pageable memory (in the push_ function of KVStoreDist). All workers then pull the keys from the server(s) once the first worker's push operations have completed, and the remaining workers initialize comm_buf_ with newly allocated pinned (page-locked) memory. As a result, the comm_buf_ of the first worker differs from everyone else's, and host-to-GPU memory copies are much slower from pageable memory than from pinned memory; the details are in https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/.

This explains all the strange behavior above. If you allocate new pinned memory for the comm_buf_ of the first worker as well, the straggler is eliminated.