rahul003 commented on issue #9152: tutorial for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/9152#issuecomment-371059065

@TaoLv Sorry I missed your comment.
You can profile a worker process the same way as in the single-machine case.

```
# kv is the distributed KVStore; kv.rank is this worker's rank
mx.profiler.profiler_set_config(mode='all', filename=str(kv.rank) + 'profile_output.json')
mx.profiler.profiler_set_state('run')

# Code to be profiled goes here...

mx.profiler.profiler_set_state('stop')
```

Note the use of the rank above: it ensures that each worker saves its profile to a different path. In the profile output you can look for the KVStore Push/Pull operators to see the time taken for communication.

I'll add a proper section on profiling to the tutorial once https://github.com/apache/incubator-mxnet/pull/9933 is merged. That PR makes it easy to profile the server processes too.
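
To pull the communication time out of a worker's profile file, something like the following could work. This is only a rough sketch and not part of MXNet: it assumes the profiler output is a Chrome-trace style JSON object with a `traceEvents` list, and that the KVStore events have "KVStore" somewhere in their name. Both are assumptions you should check against your own output file. The helper names `load_trace` and `communication_time_us` are made up for this example.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Read a profile file written by the profiler (assumed to be JSON)."""
    with open(path) as f:
        return json.load(f)


def communication_time_us(trace, keyword="kvstore"):
    """Sum durations per event name for events whose name contains `keyword`.

    Assumes Chrome-trace style events: either begin/end pairs
    (ph == "B" / "E" with a "ts" timestamp) or complete events
    (ph == "X" with a "dur" field). Units are whatever the trace uses,
    typically microseconds.
    """
    totals = defaultdict(float)
    open_events = {}  # (pid, tid, name) -> stack of begin timestamps

    for event in trace.get("traceEvents", []):
        name = event.get("name", "")
        if keyword.lower() not in name.lower():
            continue  # not a communication event under our naming assumption

        phase = event.get("ph")
        key = (event.get("pid"), event.get("tid"), name)
        if phase == "X":
            totals[name] += event.get("dur", 0)
        elif phase == "B":
            open_events.setdefault(key, []).append(event["ts"])
        elif phase == "E" and open_events.get(key):
            # Match this end event with the most recent unmatched begin.
            totals[name] += event["ts"] - open_events[key].pop()

    return dict(totals)
```

For worker 0 in the snippet above, the call would be `communication_time_us(load_trace('0profile_output.json'))`. If the event names in your file differ, pass a different `keyword`.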
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services