RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
For our current POC we chose option (b): add mpi.kvstore in the Python layer.
It depends on a new MXNet submodule, mpi_collectives (a C++ library that
depends on MXNet), and adds a new type of kvstore at the Python layer.

mpi_collectives doesn't need to be a separate C++ library; its source code can
be compiled directly into libmxnet.so.
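
For context, the collectives that mpi_collectives wraps are just the standard
MPI ones. Below is a minimal, standalone mpi4py sketch (an illustration only,
not our POC code) of the allreduce-and-average that a fused gradient
aggregation performs:

    # Illustration only: the MPI allreduce underneath a fused gradient aggregation.
    # Requires mpi4py and numpy; run with: mpiexec -n 4 python allreduce_sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each worker holds its own local gradient for one parameter tensor.
    local_grad = np.full(8, float(rank), dtype=np.float32)

    # Sum the gradients across all workers; every rank receives the result.
    summed = np.empty_like(local_grad)
    comm.Allreduce(local_grad, summed, op=MPI.SUM)

    # Average so that every worker applies the same update.
    averaged = summed / size
    print("rank %d: %s" % (rank, averaged[:4]))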



RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
You can check the mpi.kvstore API spec in our design doc.

For example, we add pushpull and broadcast interfaces and disable the original
push and pull in the new kvstore.
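
As a rough sketch of what that interface shape looks like (method names and
signatures here are placeholders, not the spec in the design doc), a toy
MPI-backed store could be written as:

    # Toy sketch of the proposed interface shape: pushpull + broadcast provided,
    # the original push/pull disabled. Names/signatures are placeholders only;
    # the authoritative spec is in the design doc. Requires mpi4py and numpy.
    import numpy as np
    from mpi4py import MPI

    class MPIKVStoreSketch(object):
        def __init__(self, comm=MPI.COMM_WORLD):
            self.comm = comm

        def broadcast(self, key, value, root=0):
            # Make every worker start from the root rank's copy of the parameter.
            self.comm.Bcast(value, root=root)
            return value

        def pushpull(self, key, value, out):
            # One fused collective replaces the old push-then-pull round trip.
            self.comm.Allreduce(value, out, op=MPI.SUM)
            out /= self.comm.Get_size()
            return out

        def push(self, key, value):
            raise NotImplementedError("push is disabled; use pushpull")

        def pull(self, key, out):
            raise NotImplementedError("pull is disabled; use pushpull")

    if __name__ == "__main__":
        kv = MPIKVStoreSketch()
        grad = np.ones(4, dtype=np.float32) * MPI.COMM_WORLD.Get_rank()
        agg = np.empty_like(grad)
        kv.pushpull(key=0, value=grad, out=agg)
        print(agg)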




RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Ye, Zhouhai
Hi Nan Zhu,

As we described in our design doc, there are two possible code structures
(implementations); our POC currently implements the second:


a.  Implement mpi.kvstore at the same level as the current kvstores (in C++
under src/kvstore), adhering to the original kvstore factory pattern.


b.  Add mpi.kvstore in the Python layer. It depends on a new MXNet submodule,
mpi_collectives (a C++ library that depends on MXNet), and adds a new type of
kvstore at the Python layer. (A sketch of this selection is shown below.)


For your second question, I think making a single communication submodule is
feasible (as in option a), but a unified abstraction covering both parameter
server and AllReduce would be very hard.
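
To make option (b) concrete, here is a minimal sketch of the Python-layer
selection. The type name 'dist_sync_mpi' and the mpi_kvstore/MPIKVStore names
are made up for illustration; every other type still goes through the existing
factory:

    # Sketch of option (b): the new kvstore type is selected in the Python layer,
    # while all existing types still go through the stock factory.
    # 'dist_sync_mpi', mpi_kvstore and MPIKVStore are hypothetical names.
    import mxnet as mx

    def create(name='local'):
        if name == 'dist_sync_mpi':
            # Python-layer kvstore backed by the mpi_collectives code
            # (compiled into libmxnet.so or kept as a submodule, per the design doc).
            from mpi_kvstore import MPIKVStore   # hypothetical module
            return MPIKVStore()
        # Unchanged path: the existing C++ kvstores via the original factory,
        # which is also where option (a) would register an mpi.kvstore directly.
        return mx.kvstore.create(name)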





RE: Extend MXNET distributed training with MPI AllReduce

2018-03-29 Thread Zhao, Patric
Actually, the current design structure is very similar to kvstore_nccl, as
shown in the attached picture.

I have also updated the proposal in a Google doc, where it is easier to add
comments and make modifications:

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094
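
Roughly speaking, kvstore_nccl sits behind the same KVStore interface shown
below and does its aggregation with NCCL collectives among local GPUs; the
MPI-based design takes a similar position in the stack, but across nodes. For
reference, standard KVStore usage today looks like this ('device' is just an
example type; this snippet uses only the existing, unchanged API):

    # Standard (existing) KVStore usage, for comparison with the proposed design.
    import mxnet as mx

    kv = mx.kv.create('device')            # other types include 'local'; kvstore_nccl has its own type string
    shape = (2, 3)
    kv.init(3, mx.nd.ones(shape))          # register key 3 with an initial value

    kv.push(3, mx.nd.ones(shape) * 2)      # contribute a value for key 3
    out = mx.nd.zeros(shape)
    kv.pull(3, out=out)                    # read back the aggregated value
    print(out.asnumpy())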

Thanks,

--Patric



Re: Extend MXNET distributed training with MPI AllReduce

2018-03-26 Thread Nan Zhu
Hi, Patric

It's pretty nice work!

A question:

How would the future code structure look when this allreduce module is added
as a submodule? Would we then have two communication submodules?

Is there any plan to provide a unified abstraction for communication, so that
a single communication submodule becomes possible?

Best,

Nan




Re: Extend MXNET distributed training with MPI AllReduce

2018-03-26 Thread Chris Olivier
great! nice work!

On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric  wrote:

> Hi MXNET owners/developers,
>
> As you know, AllReduce and Parameter Server are two very popular
> distributed training modes in DL.
>
> Currently, MXNET only supports parameter server mode and lacks an
> AllReduce mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc.,
> can work with AllReduce.
> Based on our analysis and experiments, AllReduce mode achieves better
> scalability and higher efficiency.
>
> So, we propose to extend MXNET distributed training with an MPI AllReduce
> mode.
> We have implemented an AllReduce prototype in MXNET and the results are
> very positive.
> AllReduce mode reaches 94.7% scaling efficiency with 8 compute nodes for
> VGG16, while the parameter server mode requires 16 nodes in total (8 compute
> nodes + 8 parameter servers) to reach 93.2%.
>
> The whole proposal is available in the MXNET wiki. Any feedback is highly
> appreciated.
>
> https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
>
> Thanks in advance.
>
> BR,
>
> --Patric
>
>
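
For reference, the scaling-efficiency figures quoted above follow the usual
definition: throughput on N nodes divided by N times the single-node
throughput. A tiny sketch of the arithmetic with placeholder numbers (not
measured data):

    # Scaling efficiency = throughput on N nodes / (N * throughput on 1 node).
    # The throughput values below are placeholders chosen only to show the arithmetic.
    def scale_efficiency(throughput_n, throughput_1, n_nodes):
        return throughput_n / (n_nodes * throughput_1)

    # e.g. if one node trains 100 img/s and 8 nodes together reach 758 img/s:
    print(scale_efficiency(758.0, 100.0, 8))   # 0.9475 -> ~94.7%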