@szha I checked some docs and projects about distributed training. Horovod is a project from the Uber team, and Gloo is a project from the Facebook team. The basic idea is to borrow a trick from the HPC field, ring-allreduce, which is more efficient than the traditional parameter server (sketched below): http://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/?from=timeline

Horovod is built on top of OpenMPI, but I found OpenMPI too difficult to configure and use. I also checked Gloo, which seems to use Redis in place of OpenMPI (for the initial rendezvous between workers). I strongly suggest not using Horovod directly, since it depends on OpenMPI, which is too complex and old.
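For reference, here is a minimal single-process simulation of the ring-allreduce idea from the blog post above, assuming each "worker" holds one NumPy gradient array. The function name `ring_allreduce` and the loop structure are just illustrative, not Horovod's or Gloo's actual API:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring-allreduce across n workers: a reduce-scatter phase
    followed by an all-gather phase. Each worker transfers only
    2*(n-1)/n of the data, instead of every worker pushing its whole
    gradient to a central parameter server."""
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [np.array_split(g, n) for g in grads]

    # Phase 1: reduce-scatter. At each step every worker passes one chunk
    # to its right neighbour, which adds it to its own partial sum.
    for step in range(n - 1):
        sends = [chunks[i][(i - step) % n] for i in range(n)]
        for i in range(n):
            left = (i - 1) % n
            idx = (left - step) % n          # chunk arriving at worker i
            chunks[i][idx] = chunks[i][idx] + sends[left]

    # Phase 2: all-gather. Each worker now owns one fully reduced chunk
    # and circulates it around the ring until everyone has all of them.
    for step in range(n - 1):
        sends = [chunks[i][(i + 1 - step) % n] for i in range(n)]
        for i in range(n):
            left = (i - 1) % n
            idx = (left + 1 - step) % n      # fully reduced chunk arriving
            chunks[i][idx] = sends[left]

    return [np.concatenate(c) for c in chunks]

# Toy check: 4 workers, each starting with a different gradient.
grads = [np.full(8, float(i)) for i in range(4)]
for result in ring_allreduce(grads):
    assert np.allclose(result, sum(grads))
```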
Maybe we could figure out some way to do distributed training directly over Redis?
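On doing it over Redis: as far as I can tell, Gloo only uses Redis as a key-value store for rendezvous (so workers can find each other), while the actual tensor traffic goes over TCP or InfiniBand. Still, here is a naive sketch of exchanging gradients entirely through Redis with the `redis-py` client. The key names, the `INCR`-based barrier, and the function `redis_allreduce` are all invented for illustration, and note this pattern is really a parameter-server-style exchange, not a true ring-allreduce:

```python
import pickle
import time
import numpy as np
import redis

def redis_allreduce(r, rank, world_size, grad, step):
    """Sum `grad` across `world_size` processes via a shared Redis server.
    Every worker publishes its gradient, waits until all have arrived,
    then reads everything back and sums locally."""
    # Publish this worker's gradient under a per-step, per-rank key.
    r.set(f"grad:{step}:{rank}", pickle.dumps(grad))
    # Count arrivals; the atomic INCR acts as a simple barrier.
    r.incr(f"arrived:{step}")
    while int(r.get(f"arrived:{step}")) < world_size:
        time.sleep(0.01)  # poll; a real system would use BLPOP or pub/sub
    # Read all gradients back and sum them locally.
    return sum(pickle.loads(r.get(f"grad:{step}:{i}"))
               for i in range(world_size))

# Hypothetical usage: each of the world_size processes runs
#   r = redis.Redis(host="redis-host", port=6379)
#   summed = redis_allreduce(r, rank, world_size, local_grad, step)
```

The obvious downside is that every gradient passes through one Redis instance, so it has the same central bottleneck as a parameter server; that is presumably why Gloo keeps Redis out of the data path and only uses it for setup.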