KexinFeng edited a comment on issue #20293:
URL:
https://github.com/apache/incubator-mxnet/issues/20293#issuecomment-866448263
**Analysis:**
The issue was tracked down mainly by comparing how `elemwise_add` and
`elemwise_mul` are registered. @barry-jin pointed out an important discrepancy
between these two operators today: they register different functions for
creating the operator's node in the backward pass:
https://github.com/apache/incubator-mxnet/blob/da4ff3a4dc0bd6a54af3d75c492021d18ba1867b/src/operator/tensor/elemwise_binary_op_basic.cc#L111
https://github.com/apache/incubator-mxnet/blob/da4ff3a4dc0bd6a54af3d75c492021d18ba1867b/src/operator/tensor/elemwise_binary_op_basic.cc#L238
`elemwise_mul` uses `ElemwiseGradUseIn`, since the calculation of
`_backward_mul` depends on the input values. `elemwise_add`, however, uses
`CloneGradient`; the reason for this is not clear yet. `ElemwiseGradUseNone`
is another valid option for `elemwise_add`, so I tested it by replacing
`CloneGradient` with `ElemwiseGradUseNone`.
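To make the difference between the two gradient rules concrete, here is a minimal scalar sketch. The names `BackwardAdd`, `BackwardMul` and `SanityCheck` are hypothetical and are not MXNet code; they only illustrate why the backward pass of multiplication needs the forward inputs while that of addition does not.

```cpp
#include <cassert>
#include <utility>

// Hypothetical scalar stand-ins for the two backward rules; not MXNet code.

// d(a+b)/da = d(a+b)/db = 1, so the head gradient is passed through
// unchanged and no forward input is needed.
std::pair<double, double> BackwardAdd(double out_grad) {
  return {out_grad, out_grad};
}

// d(a*b)/da = b and d(a*b)/db = a, so the forward inputs are required.
std::pair<double, double> BackwardMul(double out_grad, double a, double b) {
  return {out_grad * b, out_grad * a};
}

// Small self-check of both rules.
inline void SanityCheck() {
  assert(BackwardAdd(2021.0).first == 2021.0);
  assert(BackwardAdd(2021.0).second == 2021.0);
  assert(BackwardMul(2.0, 3.0, 5.0).first == 10.0);  // out_grad * b
  assert(BackwardMul(2.0, 3.0, 5.0).second == 6.0);  // out_grad * a
}
```

Since the add rule never looks at the inputs, a gradient function that uses neither inputs nor outputs is sufficient for it, which is consistent with `ElemwiseGradUseNone` being a valid choice.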
**Result:**
**The issue is solved!** `./demo2 0 1` specifies the gradient requests as
follows, and 2021 is the hard-coded head gradient.
```
uint32_t grad_req_1[1] = {0};
uint32_t grad_req_2[1] = {1};
```
```
$~/mxnet/issue2/build2$ ./demo2 0 1
[19:45:23] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[19:45:23] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[19:45:23] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[19:45:23] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 2021
GRAD 1: 0
GRAD 2: 2021
$~/mxnet/issue2/build2$ ./demo2 1 0
[19:49:59] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[19:49:59] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[19:49:59] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[19:49:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 2021
GRAD 1: 2021
GRAD 2: 0
$~/mxnet/issue2/build2$ ./demo2 1 1
[19:50:11] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[19:50:11] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[19:50:11] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[19:50:11] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 2021
GRAD 1: 2021
GRAD 2: 2021
```
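The three runs above follow a simple pattern: an input whose request is 0 gets a zero gradient, and an input whose request is 1 gets the head gradient. A minimal sketch of that expectation is given below. `ExpectedAddGrads` is a hypothetical helper, not part of the demo or of MXNet, and it assumes the request codes map to MXNet's `OpReqType` values `kNullOp` (0) and `kWriteTo` (1).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the expected output; not the demo's actual code.
// Assumed request codes (MXNet OpReqType): 0 = kNullOp, 1 = kWriteTo.
std::vector<double> ExpectedAddGrads(const std::vector<std::uint32_t>& grad_req,
                                     double head_grad) {
  std::vector<double> grads;
  grads.reserve(grad_req.size());
  for (std::uint32_t req : grad_req) {
    assert(req == 0 || req == 1);  // only the two codes used by the demo
    // For addition, each requested input gradient equals the head gradient;
    // a null request leaves the gradient at zero.
    grads.push_back(req == 1 ? head_grad : 0.0);
  }
  return grads;
}
```

For example, `ExpectedAddGrads({0, 1}, 2021.0)` models the `./demo2 0 1` run, where GRAD 1 is 0 and GRAD 2 is 2021.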
**Further investigation:**
This result suggests that `CloneGradient` is the most likely cause of
this issue. The remaining checks are:
1. In the bug report, the symbol is built from:
```
{
  "nodes": [
    {
      "op": "null",
      "name": ".Inputs.Input1",
      "inputs": []
    },
    {
      "op": "null",
      "name": ".Inputs.Input2",
      "inputs": []
    },
    {
      "op": "elemwise_add",
      "name": ".$0",
      "inputs": [[0,0,0],[1,0,0]]
    },
    {
      "op": "_copy",
      "name": ".Outputs.Output",
      "inputs": [[2,0,0]]
    }
  ],
  "arg_nodes": [0,1],
  "heads": [[3,0,0]]
}
```
It is possible that the `_copy` operation triggers this issue in
`CloneGradient`. Checking this would help explain why it fails.
2. Other usages of `CloneGradient`, in `src/operator/contrib/stes_op.cc`:
```
CloneGradient{"_backward_round_ste"}
CloneGradient{"_backward_sign_ste"}
```
It is possible that they have a similar bug.
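As a reference point for the first check, the gradient flow that the graph from the bug report should produce can be sketched with a toy reverse pass. This is a hypothetical miniature (`Node`, `Backward` and `BugReportGraph` are illustrative names), not MXNet's executor or its gradient pass; it assumes nodes are stored in topological order and that both `elemwise_add` and `_copy` forward the incoming gradient unchanged.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical miniature of the graph in the bug report; not MXNet code.
struct Node {
  std::string op;           // "null", "elemwise_add", or "_copy"
  std::vector<int> inputs;  // indices of producer nodes
};

// Propagate a head gradient from the last node back to every node.
// Assumes topological order. Each node forwards its gradient unchanged
// to all of its inputs, accumulating when a node has several consumers.
std::vector<double> Backward(const std::vector<Node>& nodes, double head_grad) {
  assert(!nodes.empty());
  std::vector<double> grad(nodes.size(), 0.0);
  grad.back() = head_grad;
  for (int i = static_cast<int>(nodes.size()) - 1; i >= 0; --i) {
    for (int src : nodes[i].inputs) {
      grad[src] += grad[i];
    }
  }
  return grad;
}

// Same topology as the JSON: two inputs -> elemwise_add -> _copy.
std::vector<Node> BugReportGraph() {
  return {{"null", {}}, {"null", {}}, {"elemwise_add", {0, 1}}, {"_copy", {2}}};
}
```

With a head gradient of 2021, this toy pass assigns 2021 to both inputs, matching the `./demo2 1 1` output after the fix. Any deviation from that in the real graph would point at how the `_copy` node interacts with the gradient function of `elemwise_add`.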