YouhuiBai commented on issue #15674: Straggler in latest mxnet when training with distributed parameter server
URL: https://github.com/apache/incubator-mxnet/issues/15674#issuecomment-515666507
 
 
   The reason for the straggler is the reuse of pageable memory.
   
   There is an unordered map from int to NDArray, comm_buf_ (class KVStoreDist, 
in kvstore_dist.h). When compression isn't active, this buffer holds the data 
for both push and pull; that is, push and pull share the buffer. When the 
distributed kvstore starts, the first worker (rank 0) pushes all keys of the 
model to the server(s) and initializes its comm_buf_ with pageable memory (in 
the push_ function of KVStoreDist). All workers then pull the keys from the 
server(s) once the first worker's push operations have finished, and they 
initialize their comm_buf_ with newly allocated pinned (page-locked) memory. As 
a result, the comm_buf_ of the first worker differs from the others', and 
memory copies between CPU and GPU perform very differently depending on whether 
the host memory is pinned or pageable; the details are in 
https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/. This 
discovery explains all the strange points above: if you allocate new pinned 
memory for the first worker's comm_buf_, the straggler is eliminated.
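
   The gap between the two host allocations can be seen with a small CUDA 
microbenchmark (a standalone sketch, not MXNet code; the 64 MiB buffer size 
and event-based timing are my own choices, and it needs a CUDA-capable GPU):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Compare host-to-device copy bandwidth from pageable memory (malloc)
// vs pinned, page-locked memory (cudaMallocHost).
int main() {
    const size_t bytes = 64UL << 20;  // 64 MiB, an arbitrary test size

    float *pageable = (float *)malloc(bytes);
    float *pinned = nullptr;
    cudaMallocHost(&pinned, bytes);   // page-locked host allocation

    float *device = nullptr;
    cudaMalloc(&device, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Pageable copy: the driver must first stage the data through an
    // internal pinned buffer, so this path is slower.
    cudaEventRecord(start);
    cudaMemcpy(device, pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable H2D: %.2f GB/s\n", bytes / (ms * 1e6));

    // Pinned copy: DMA transfers directly from the page-locked buffer.
    cudaEventRecord(start);
    cudaMemcpy(device, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned   H2D: %.2f GB/s\n", bytes / (ms * 1e6));

    cudaFree(device);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```

   Because rank 0's comm_buf_ ends up on the pageable path while every other 
worker is on the pinned path, rank 0's GPU<->CPU copies are the slow ones, 
which is exactly the straggler pattern.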

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
