KexinFeng edited a comment on issue #20293:
URL: 
https://github.com/apache/incubator-mxnet/issues/20293#issuecomment-866448263


   **Analysis:**
   
   The issue was tracked down mainly by comparing how elemwise_add and
   elemwise_mul are registered. @barry-jin pointed out an important discrepancy
   between these two operators today: they register different functions for
   creating the operator's node in the backward pass:
    
https://github.com/apache/incubator-mxnet/blob/da4ff3a4dc0bd6a54af3d75c492021d18ba1867b/src/operator/tensor/elemwise_binary_op_basic.cc#L111
 
https://github.com/apache/incubator-mxnet/blob/da4ff3a4dc0bd6a54af3d75c492021d18ba1867b/src/operator/tensor/elemwise_binary_op_basic.cc#L238
 
   
   elemwise_mul uses ElemwiseGradUseIn, since the calculation of _backward_mul
   depends on the input values. elemwise_add, however, uses CloneGradient, and
   the reason for that choice is not clear yet. ElemwiseGradUseNone is another
   valid option for elemwise_add, so I tested it by replacing CloneGradient with
   ElemwiseGradUseNone.
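
To make the difference concrete, here is a minimal C++ sketch of the math only. It is not MXNet code, and the names BinaryGrads, backward_add and backward_mul are hypothetical. It shows why a backward pass for multiplication needs the forward inputs while one for addition can work from the head gradient alone.

```cpp
#include <cassert>

// Hypothetical container for the gradients of a binary operator's two inputs.
struct BinaryGrads {
  float lhs;
  float rhs;
};

// d(a+b)/da = d(a+b)/db = 1: only the head gradient is needed,
// which is why a "use none" style gradient is sufficient for addition.
BinaryGrads backward_add(float head_grad) {
  return {head_grad, head_grad};
}

// d(a*b)/da = b and d(a*b)/db = a: the forward inputs are required,
// which is why multiplication needs a "use inputs" style gradient.
BinaryGrads backward_mul(float head_grad, float a, float b) {
  return {head_grad * b, head_grad * a};
}
```

With a head gradient of 2021, backward_add returns 2021 for both inputs regardless of their values, which matches the GRAD values expected from the demo.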
   
   **Result:**
   
   **The issue is solved!** For example, ./demo2 0 1 sets the gradient
   requests as follows, and 2021 is the hard-coded head gradient.
   ```
    uint32_t grad_req_1[1] = {0};
    uint32_t grad_req_2[1] = {1};
   ```
   ```
   $~/mxnet/issue2/build2$ ./demo2 0 1
   [19:45:23] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
   [19:45:23] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
   [19:45:23] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
   [19:45:23] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
   INPUT 1: 0.4
   INPUT 2: 0.5
   OUTPUT: 0.9
   OUTGRAD: 2021
   GRAD 1: 0
   GRAD 2: 2021
   
   $~/mxnet/issue2/build2$ ./demo2 1 0
   [19:49:59] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
   [19:49:59] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
   [19:49:59] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
   [19:49:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
   INPUT 1: 0.4
   INPUT 2: 0.5
   OUTPUT: 0.9
   OUTGRAD: 2021
   GRAD 1: 2021
   GRAD 2: 0
   
   $~/mxnet/issue2/build2$ ./demo2 1 1
   [19:50:11] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
   [19:50:11] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
   [19:50:11] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
   [19:50:11] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
   INPUT 1: 0.4
   INPUT 2: 0.5
   OUTPUT: 0.9
   OUTGRAD: 2021
   GRAD 1: 2021
   GRAD 2: 2021
   ```
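
The three runs above can be summarised by a small C++ sketch of the expected behaviour. It is an illustration only: AddGrads and expected_add_grads are hypothetical names, and it reads a request of 0 as "no gradient requested" and 1 as "write the gradient", which is what the output suggests. It also assumes the gradient buffers start at zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the two input gradients printed by the demo.
struct AddGrads {
  float grad1;
  float grad2;
};

// Expected gradients of out = in1 + in2 for the given requests.
// Only the request codes 0 and 1 used by the demo are handled here.
AddGrads expected_add_grads(std::uint32_t req1, std::uint32_t req2,
                            float head_grad) {
  assert(req1 <= 1 && req2 <= 1);
  AddGrads g{0.0f, 0.0f};  // assumption: buffers are zero-initialised
  if (req1 == 1) g.grad1 = head_grad;  // d(out)/d(in1) = 1
  if (req2 == 1) g.grad2 = head_grad;  // d(out)/d(in2) = 1
  return g;
}
```

Calling expected_add_grads(0, 1, 2021.0f) gives grad1 = 0 and grad2 = 2021, the same values the first run prints.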
   **Further investigation:**
   
   This result suggests that CloneGradient is the most likely cause of the
   issue. Two things remain to be checked:
   
   1. In the bug report, the symbol is built from:
   ```
   {
       "nodes": [
           {
               "op": "null",
               "name": ".Inputs.Input1",
               "inputs": []
           },
           {
               "op": "null",
               "name": ".Inputs.Input2",
               "inputs": []
           },
           {
               "op": "elemwise_add",
               "name": ".$0",
               "inputs": [[0,0,0],[1,0,0]]
           },
           {
               "op": "_copy",
               "name": ".Outputs.Output",
               "inputs": [[2,0,0]]
           }
       ],
       "arg_nodes": [0,1],
       "heads": [[3,0,0]]
   }
   ```
   It is possible that the "_copy" operation triggers this issue in
   CloneGradient. Checking this would help explain why it fails.
   
   2. Other usages of CloneGradient, in mxnet\src\operator\contrib\stes_op.cc:
   ``` 
   CloneGradient{"_backward_round_ste"}
   CloneGradient{"_backward_sign_ste"}
   ```
   It is possible that they have a similar bug.
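
As a reference point for the first check, the following C++ sketch walks the four-node graph from the bug report forward and backward. It is a toy model rather than MXNet's executor: Node, make_graph, forward and backward are hypothetical names, and it relies on both elemwise_add and _copy having a local derivative of 1 with respect to each input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified node: operator name plus producer node indices.
struct Node {
  std::string op;           // "null", "elemwise_add" or "_copy"
  std::vector<int> inputs;  // indices into the node list
};

// The four nodes of the symbol, in the same order as in the JSON.
std::vector<Node> make_graph() {
  return {
      {"null", {}},              // .Inputs.Input1
      {"null", {}},              // .Inputs.Input2
      {"elemwise_add", {0, 1}},  // .$0
      {"_copy", {2}},            // .Outputs.Output
  };
}

// Forward pass in node order; values for "null" nodes must be set by the caller.
void forward(const std::vector<Node>& g, std::vector<float>& values) {
  assert(values.size() == g.size());
  for (std::size_t i = 0; i < g.size(); ++i) {
    const Node& n = g[i];
    if (n.op == "elemwise_add") {
      values[i] = values[n.inputs[0]] + values[n.inputs[1]];
    } else if (n.op == "_copy") {
      values[i] = values[n.inputs[0]];
    }
  }
}

// Reverse pass: seeds the head node with head_grad and visits nodes in
// reverse order. Each node adds its own gradient to every one of its inputs,
// which is valid here because every local derivative is 1.
std::vector<float> backward(const std::vector<Node>& g, int head,
                            float head_grad) {
  std::vector<float> grads(g.size(), 0.0f);
  grads[head] = head_grad;
  for (int i = static_cast<int>(g.size()) - 1; i >= 0; --i) {
    for (int src : g[i].inputs) {
      grads[src] += grads[i];
    }
  }
  return grads;
}
```

Seeding node 3 with 2021 yields a gradient of 2021 at both input nodes in this model, so any deviation seen with CloneGradient comes from how the real backward graph is built or executed, not from the math of the symbol itself.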
   

