Hi, Patric

This is pretty nice work!

A question:

How would the future code structure look when this AllReduce
module is added as a submodule? Would we then have two communication submodules?

Is there any plan to provide a unified abstraction for communication so that
a single communication submodule is possible?
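
To make the question more concrete, below is a rough sketch (my own illustration, not MXNet's actual API) of the kind of unified abstraction I have in mind: one interface that both a parameter-server backend and an MPI AllReduce backend could implement. All class and method names are hypothetical; the AllReduce backend uses mpi4py purely as an example.

    import numpy as np

    class Communicator:
        """Hypothetical unified interface for gradient aggregation."""
        def aggregate(self, grad):
            raise NotImplementedError

    class AllReduceCommunicator(Communicator):
        """Example backend: sum gradients across MPI ranks, then average."""
        def __init__(self):
            from mpi4py import MPI          # assumes mpi4py is available
            self.MPI = MPI
            self.comm = MPI.COMM_WORLD

        def aggregate(self, grad):
            out = np.empty_like(grad)
            self.comm.Allreduce(grad, out, op=self.MPI.SUM)
            return out / self.comm.Get_size()   # average across workers

    class ParameterServerCommunicator(Communicator):
        """Example backend: push/pull through a KVStore-style server (stubbed)."""
        def aggregate(self, grad):
            # push(grad) to the servers, then pull the aggregated result back
            ...

With something like this, the training loop would only ever see one communication submodule, and AllReduce vs. parameter server becomes a backend selection.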

Best,

Nan


On Mon, Mar 26, 2018 at 7:20 PM, Chris Olivier <cjolivie...@gmail.com>
wrote:

> great! nice work!
>
> On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric <patric.z...@intel.com>
> wrote:
>
> > Hi MXNET owners/developers,
> >
> > As you know, AllReduce and Parameter Server are two very popular
> > distributed training modes in DL.
> >
> > Currently, MXNET only supports the parameter server mode and lacks an
> > AllReduce mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc.,
> > can work with AllReduce.
> > Based on our analysis and experiments, AllReduce mode achieves better
> > scalability and higher efficiency.
> >
> > So, we propose to extend MXNET distributed training with MPI AllReduce
> > mode.
> > We have implemented an AllReduce prototype in MXNET, and the results are
> > very positive.
> > AllReduce mode achieves 94.7% scaling efficiency with 8 compute nodes for
> > VGG16, while the Parameter Server mode requires 16 nodes in total (8 compute
> > nodes + 8 parameter servers) to reach 93.2%.
> >
> > The whole proposal is available on the MXNET wiki. Any feedback is
> > highly appreciated.
> >
> > https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
> >
> > Thanks in advance.
> >
> > BR,
> >
> > --Patric
> >
> >
>
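
One more note on the numbers above, mainly to check that I read them correctly: assuming the usual definition of scaling efficiency (speedup over a single node divided by the number of compute nodes), 94.7% on 8 compute nodes corresponds to roughly a 7.6x speedup over one node, while the parameter-server setup needs 8 additional server machines to reach 93.2%. A tiny illustration of that (assumed) definition:

    # Assumed definition: scaling efficiency = speedup / number of compute nodes.
    def scaling_efficiency(speedup, n_compute_nodes):
        return speedup / n_compute_nodes

    # AllReduce on 8 compute nodes: 0.947 * 8 ~= 7.58x over a single node.
    print(scaling_efficiency(7.58, 8))   # ~0.947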
