ymjiang opened a new issue #14485: Any suggestion to accelerate parameter 
update for distributed training?
URL: https://github.com/apache/incubator-mxnet/issues/14485
 
 
   In distributed training, I find that parameter updates (triggered by push 
requests) dominate the communication time. Since the update runs on the CPU, 
are there any suggestions for speeding up the CPU-side parameter update?
   
   For simplicity, I use one worker (one GPU) and one parameter server (PS), 
placed on different machines, with `kvstore=dist_sync`. Sending a key takes 
only about 200 us, while updating it takes about 780 us.
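
To put the two figures side by side, here is a small worked calculation. It assumes the send and the update are the only two costs per key and that they do not overlap, which the measurements above do not confirm.

```python
# Per-key timings reported above, in microseconds.
send_us = 200.0
update_us = 780.0

# Assumption: send and update run back to back, with no overlap.
ratio = update_us / send_us
update_share = update_us / (send_us + update_us)

print(f"update is {ratio:.1f}x the send time")        # update is 3.9x the send time
print(f"update share of send + update: {update_share:.1%}")  # 79.6%
```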
   
   The update time is measured as follows:
   - Before [sending the apply request to the engine](https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_dist_server.h#L353), get `timestamp_1`.
   - After `stored.WaitToRead();`, get `timestamp_2`.
   - `update_time = timestamp_2 - timestamp_1`
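
The measurement pattern above can be sketched in Python as follows. This is only an illustration, not the MXNet server code: a `ThreadPoolExecutor` stands in for the engine, `future.result()` stands in for `stored.WaitToRead();`, and `apply_update` is a hypothetical plain SGD step.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def apply_update(stored, received, lr=0.1):
    # Hypothetical updater: an in-place SGD step on plain Python lists.
    for i, grad in enumerate(received):
        stored[i] -= lr * grad


def timed_update(executor, stored, received):
    timestamp_1 = time.perf_counter()   # before handing the work to the "engine"
    future = executor.submit(apply_update, stored, received)
    future.result()                     # block until the update has finished
    timestamp_2 = time.perf_counter()   # after the blocking wait returns
    return timestamp_2 - timestamp_1


with ThreadPoolExecutor(max_workers=1) as executor:
    stored = [1.0] * 1000
    received = [0.5] * 1000
    update_time = timed_update(executor, stored, received)

print(f"update_time = {update_time * 1e6:.0f} us")
```

In this sketch the measured interval covers both the time the task waits to be scheduled and the time the update itself takes. Timing inside `apply_update` as well would separate the two, which may help show where the 780 us is spent.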
   
