Hi MXNet owners/developers,

As you know, AllReduce and Parameter Server are two very popular distributed training modes in DL.
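For context, in AllReduce mode every worker contributes its local gradient and receives the same reduced result, with no central server; a pure-Python sketch of the semantics (illustrative only, not the MXNet or MPI API):

```python
def allreduce_average(worker_grads):
    """Return the element-wise average that every worker ends up with.

    In MPI AllReduce, each of the N workers contributes its local
    gradient and all workers receive the reduced (here: averaged)
    result -- no separate parameter server nodes are involved.
    This is a simulation of the semantics, not real MPI code.
    """
    num_workers = len(worker_grads)
    length = len(worker_grads[0])
    summed = [sum(g[i] for g in worker_grads) for i in range(length)]
    return [s / num_workers for s in summed]

# Example: 4 workers, each holding a local 3-element gradient.
grads = [[1.0, 2.0, 3.0],
         [2.0, 4.0, 6.0],
         [3.0, 6.0, 9.0],
         [4.0, 8.0, 12.0]]

# Every worker would now apply the same averaged gradient.
avg = allreduce_average(grads)  # [2.5, 5.0, 7.5]
```

A real implementation would call MPI_Allreduce (or an optimized ring-allreduce) on the gradient buffers instead of gathering them on one node.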
Currently, MXNet only supports the parameter server mode and lacks an AllReduce mode. Other frameworks, such as TensorFlow, PyTorch, and Caffe, can work with AllReduce. Based on our analysis and experiments, AllReduce mode achieves better scalability and higher efficiency. So, we propose to extend MXNet distributed training with an MPI AllReduce mode.

We have implemented an AllReduce prototype in MXNet and the results are very positive: AllReduce mode reaches 94.7% scaling efficiency with 8 compute nodes for VGG16, while the parameter server mode requires 16 nodes in total (8 compute nodes + 8 parameter servers) to reach 93.2%.

The whole proposal is available in the MXNet wiki; any feedback is highly appreciated:
https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce

Thanks in advance.

BR,
--Patric