Hi MXNET owners/developers,

As you know, AllReduce and Parameter Server are two very popular 
distributed training modes in DL.

Currently, MXNET only supports parameter server mode and lacks an AllReduce 
mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc., can work with 
AllReduce.
Based on our analysis and experiments, AllReduce mode achieves better 
scalability and efficiency.

So, we propose to extend MXNET distributed training with an MPI AllReduce mode.
We have implemented an AllReduce prototype in MXNET and the results are very 
positive.
AllReduce mode reaches 94.7% scaling efficiency with 8 compute nodes for VGG16, 
while Parameter Server mode requires 16 nodes in total (8 compute nodes + 8 
parameter servers) to reach 93.2%.

The full proposal is available on the MXNET wiki:
https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
Any feedback is highly appreciated.

Thanks in advance.

BR,

--Patric
