[
https://issues.apache.org/jira/browse/MXNET-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lin Yuan updated MXNET-1112:
----------------------------
Description:
-Error message: *** An error occurred in MPI_Allreduce: the reduction
operation MPI_SUM is not defined for non-intrinsic datatypes
-Suspected reason: Horovod's MPI path on CPU (i.e. non-CUDA-aware MPI) does
not support MPI_SUM for float16
-We are running into an issue with `update_multi_precision` in
https://github.com/ctcyang/horovod/blob/0a0240113fe5a24ec2c772fd7309840ba179562a/horovod/mxnet/__init__.py#L47
We don't yet have a way of hooking into SGD's `update_multi_precision` to run
`hvd.allreduce` after the gradient is cast to float32 and before the weight
update. As written, `hvd.allreduce` all-reduces in `float16`, which does not
presently support hierarchical allreduce
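For illustration, a minimal pure-Python sketch of the hook we would want (fix 2 below). All names here are hypothetical, plain floats stand in for NDArrays, and the allreduce is simulated as an in-process average rather than a real `hvd.allreduce` call:

```python
# Hypothetical sketch of fix 2: cast the float16 gradient to float32 first,
# then allreduce, then update the float32 master weight. Horovod's MXNet
# wrapper does not expose this hook today; this only shows the ordering.

def allreduce_mean(grads_fp32):
    # Stand-in for hvd.allreduce(..., average=True) over float32 buffers,
    # which is the path where hierarchical allreduce is available.
    return sum(grads_fp32) / len(grads_fp32)

def update_multi_precision(weight_fp32, lr, worker_grads_fp16):
    # 1) Each worker casts its float16 gradient to float32 ...
    grads_fp32 = [float(g) for g in worker_grads_fp16]
    # 2) ... the allreduce then runs on the float32 copies ...
    avg_grad = allreduce_mean(grads_fp32)
    # 3) ... and the SGD step is taken against the float32 master weight.
    return weight_fp32 - lr * avg_grad
```

The key point is only the ordering: cast before allreduce, allreduce before update.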
-This is an issue because our scalability experiments on 256 GPUs in float32
mode show 68% scaling efficiency with HOROVOD_HIERARCHICAL_ALLREDUCE=0 and
92.1% with HOROVOD_HIERARCHICAL_ALLREDUCE=1. If the same holds for float16,
hierarchical allreduce will be a necessity for getting good scalability
-It is a good idea to rebase on Horovod `master` if you haven't done so
already, to take advantage of this new performance-improving feature
-Three possible fixes:
1) Add an MPI_SUM op for float16 and all-reduce gradients in float16 (the
model may be difficult to converge)
2) Hook into `update_multi_precision` after the gradient is cast to float32
and before the weight update
3) Hardcode `hvd.allreduce` here. Problem: it might not be possible for
MXNet to import Horovod
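To illustrate the convergence concern in fix 1, a small NumPy sketch (not Horovod code) of why accumulating gradient sums directly in float16 is risky: float16 has only an 11-bit significand, so sums saturate once the accumulator outgrows the addends, while float32 does not at this scale:

```python
import numpy as np

# float16 cannot represent 2049, so the +1 contribution is lost entirely.
acc16 = np.float16(2048.0)
assert acc16 + np.float16(1.0) == acc16

# The same sum in float32 is exact.
acc32 = np.float32(2048.0)
assert acc32 + np.float32(1.0) == np.float32(2049.0)
```

This is why fix 2 (reduce in float32, keep float32 master weights) is the more conservative choice for convergence.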
> float16 HIERARCHICAL_ALLREDUCE not working
> ------------------------------------------
>
> Key: MXNET-1112
> URL: https://issues.apache.org/jira/browse/MXNET-1112
> Project: Apache MXNet
> Issue Type: Improvement
> Components: Apache MXNet Backend
> Reporter: Lin Yuan
> Priority: Major
>