[GitHub] leoxiaobin commented on issue #7455: Distributed training is slow
leoxiaobin commented on issue #7455: Distributed training is slow
URL: https://github.com/apache/incubator-mxnet/issues/7455#issuecomment-322390365
@starimpact, I tried using 4 servers per machine and got almost the same result.
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
With regards, Apache Git Services
[GitHub] leoxiaobin commented on issue #7455: Distributed training is slow
leoxiaobin commented on issue #7455: Distributed training is slow
URL: https://github.com/apache/incubator-mxnet/issues/7455#issuecomment-322390219
@szha, each server has 8 Titan Xp GPUs and 2 Intel Xeon E5-2650 v2 CPUs @ 2.60GHz. The two servers are connected via InfiniBand (IB) cards. The test runs with --benchmark set to 1, so there is no disk I/O.
[GitHub] leoxiaobin commented on issue #7455: Distributed training is slow
leoxiaobin commented on issue #7455: Distributed training is slow
URL: https://github.com/apache/incubator-mxnet/issues/7455#issuecomment-322365618
@szha, I tried dist_sync_device and got almost the same result. As for dist_async, it uses asynchronous SGD, so I don't think its results are comparable.