ymjiang opened a new issue #14485: Any suggestion to accelerate parameter update for distributed training?
URL: https://github.com/apache/incubator-mxnet/issues/14485

When doing distributed training, I find that the parameter update (push requests) dominates the communication time. Since the update happens on the CPU, are there any suggestions for speeding up the parameter update on the CPU?

For simplicity, I use 1 worker (one GPU) and 1 PS, placed on different machines, with `kvstore=dist_sync`. Sending a key takes only about 200 us, while updating it takes about 780 us.

The update time is measured as follows:
- Before [sending the apply request to the engine](https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_dist_server.h#L353), get `timestamp_1`.
- After `stored.WaitToRead();`, get `timestamp_2`.
- `update_time = timestamp_2 - timestamp_1`
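The measurement steps above can be sketched in a few lines of Python. This is only an illustration of the timing pattern, not MXNet code: the `ThreadPoolExecutor` stands in for the asynchronous engine, `future.result()` plays the role of `stored.WaitToRead()`, and `apply_update` and `timed_update` are hypothetical names with a plain SGD step assumed as the updater.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def apply_update(stored, grad, lr=0.1):
    """Hypothetical stand-in for the server-side updater: in-place SGD on lists."""
    for i, g in enumerate(grad):
        stored[i] -= lr * g
    return stored


def timed_update(engine, stored, grad):
    """Return (update_time_us, result), timed from submission to completion."""
    timestamp_1 = time.perf_counter()   # before handing the request to the engine
    future = engine.submit(apply_update, stored, grad)
    result = future.result()            # blocks until the update has finished
    timestamp_2 = time.perf_counter()   # after the updated value is readable
    return (timestamp_2 - timestamp_1) * 1e6, result


if __name__ == "__main__":
    # A single worker thread plays the role of the asynchronous engine (assumption).
    with ThreadPoolExecutor(max_workers=1) as engine:
        stored = [1.0] * 1024
        grad = [0.5] * 1024
        update_time_us, _ = timed_update(engine, stored, grad)
        print(f"update_time = {update_time_us:.0f} us")
```

In this sketch the interval covers the hand-off to the worker thread and the wake-up of the waiting thread as well as the arithmetic itself. The same may hold for a measurement taken between the engine push and `WaitToRead()`, so part of the 780 us could be scheduling overhead rather than the update computation.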