[GitHub] LakeCarrot opened a new issue #7341: Usage of Tensorboard in Distributed MXNet
LakeCarrot opened a new issue #7341: Usage of Tensorboard in Distributed MXNet URL: https://github.com/apache/incubator-mxnet/issues/7341

Hi all, I tried to use TensorBoard to visualize my model training process. In single-node training mode, using TensorBoard is straightforward, but things are different in distributed training mode. Suppose I have 2 servers and 4 workers in my cluster; how can I use TensorBoard to track the overall training process? As far as I can tell, there will be 4 different sets of log files, one on each worker, and I would need 4 separate TensorBoard processes to visualize the whole run. After some research, I found the following question on StackOverflow, which says that in TensorFlow only one of the workers needs to write the log: https://stackoverflow.com/questions/37411005/unable-to-use-tensorboard-in-distributed-tensorflow

I wonder what the intended usage of TensorBoard in distributed MXNet is. My main concern with writing summaries on only one of the workers is whether the log from a single worker is a good representation of the overall learning process.

@zihaolucky Thanks a lot for your work bringing TensorBoard to MXNet. Do you have any thoughts on this question? Thanks in advance! Bo

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
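The single-writer pattern the StackOverflow answer describes can be sketched as follows. This is a minimal, self-contained illustration, not MXNet's official recommendation: in a real distributed MXNet job the rank would come from the kvstore (e.g. `kv = mx.kvstore.create('dist_sync'); rank = kv.rank`), and the `log` callable stands in for a real `SummaryWriter.add_scalar` call from the dmlc/tensorboard package. The `RANK` environment variable used here is an assumption made so the sketch runs without a cluster.

```python
import os


def is_chief(rank):
    """Return True only for the worker that should write TensorBoard logs.

    By convention the worker with rank 0 is the "chief"; all other
    workers skip summary writing entirely.
    """
    return rank == 0


def maybe_log(rank, step, metric_value, log=print):
    """Write a summary only on the chief worker; no-op elsewhere.

    `log` is a stand-in for a real summary writer call, e.g.
    SummaryWriter.add_scalar from dmlc/tensorboard (hypothetical here).
    Returns True if a summary was written, False otherwise.
    """
    if is_chief(rank):
        log("step %d: accuracy=%.4f" % (step, metric_value))
        return True
    return False


# Stubbed rank for the sketch; a real job would use kv.rank instead.
rank = int(os.environ.get("RANK", "0"))
maybe_log(rank, 100, 0.8125)
```

Note the caveat raised in the question still applies: with `dist_sync` all workers hold the same parameters after each update, so the chief's evaluation metrics are reasonably representative, but per-worker training loss computed on different data shards can still differ between workers.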