@szha I checked some docs and projects about distributed training. Horovod is a project from the Uber team, and Gloo is a project from the Facebook team. The basic idea is to borrow a trick from the HPC field, ring-allreduce, which is more efficient than the traditional parameter server (sketched below): http://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/?from=timeline

Horovod is built on top of OpenMPI, but I found OpenMPI too difficult to configure and use. I also checked Gloo, which seems to use Redis in place of OpenMPI (for the initial rendezvous between workers). I strongly suggest not using Horovod directly, since it depends on OpenMPI, which is too complex and old.
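For reference, here is a minimal single-process simulation of the ring-allreduce idea from the blog post above, assuming each "worker" holds one NumPy gradient array. The function name `ring_allreduce` and the loop structure are just illustrative, not Horovod's or Gloo's actual API:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring-allreduce across n workers: a reduce-scatter phase
    followed by an all-gather phase. Each worker transfers only
    2*(n-1)/n of the data, instead of every worker pushing its whole
    gradient to a central parameter server."""
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [np.array_split(g, n) for g in grads]

    # Phase 1: reduce-scatter. At each step every worker passes one chunk
    # to its right neighbour, which adds it to its own partial sum.
    for step in range(n - 1):
        sends = [chunks[i][(i - step) % n] for i in range(n)]
        for i in range(n):
            left = (i - 1) % n
            idx = (left - step) % n          # chunk arriving at worker i
            chunks[i][idx] = chunks[i][idx] + sends[left]

    # Phase 2: all-gather. Each worker now owns one fully reduced chunk
    # and circulates it around the ring until everyone has all of them.
    for step in range(n - 1):
        sends = [chunks[i][(i + 1 - step) % n] for i in range(n)]
        for i in range(n):
            left = (i - 1) % n
            idx = (left + 1 - step) % n      # fully reduced chunk arriving
            chunks[i][idx] = sends[left]

    return [np.concatenate(c) for c in chunks]

# Toy check: 4 workers, each starting with a different gradient.
grads = [np.full(8, float(i)) for i in range(4)]
for result in ring_allreduce(grads):
    assert np.allclose(result, sum(grads))
```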
Maybe we could figure out some way to do distributed training directly over Redis?
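On doing it over Redis: as far as I can tell, Gloo only uses Redis as a key-value store for rendezvous (so workers can find each other), while the actual tensor traffic goes over TCP or InfiniBand. Still, here is a naive sketch of exchanging gradients entirely through Redis with the `redis-py` client. The key names, the `INCR`-based barrier, and the function `redis_allreduce` are all invented for illustration, and note this pattern is really a parameter-server-style exchange, not a true ring-allreduce:

```python
import pickle
import time
import numpy as np
import redis

def redis_allreduce(r, rank, world_size, grad, step):
    """Sum `grad` across `world_size` processes via a shared Redis server.
    Every worker publishes its gradient, waits until all have arrived,
    then reads everything back and sums locally."""
    # Publish this worker's gradient under a per-step, per-rank key.
    r.set(f"grad:{step}:{rank}", pickle.dumps(grad))
    # Count arrivals; the atomic INCR acts as a simple barrier.
    r.incr(f"arrived:{step}")
    while int(r.get(f"arrived:{step}")) < world_size:
        time.sleep(0.01)  # poll; a real system would use BLPOP or pub/sub
    # Read all gradients back and sum them locally.
    return sum(pickle.loads(r.get(f"grad:{step}:{i}"))
               for i in range(world_size))

# Hypothetical usage: each of the world_size processes runs
#   r = redis.Redis(host="redis-host", port=6379)
#   summed = redis_allreduce(r, rank, world_size, local_grad, step)
```

The obvious downside is that every gradient passes through one Redis instance, so it has the same central bottleneck as a parameter server; that is presumably why Gloo keeps Redis out of the data path and only uses it for setup.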