ceisenach opened a new issue #19817:
URL: https://github.com/apache/incubator-mxnet/issues/19817


   ## Description
   The backward pass of `F.take` computes an incorrect gradient when the operator is used after a sequence of transpose -> convolution -> transpose. Any trainable parameter that receives its gradient through the `F.take` operator ends up with an incorrect value. Equivalent implementations that use slice operators produce correct results.
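
   For context, with contiguous indices `F.take` and `F.slice_axis` select exactly the same elements on the forward pass, which is why the two implementations below are interchangeable. A minimal standalone sketch (the shapes here are illustrative, not taken from the repro script):

   ```py
   import mxnet as mx

   # Contiguous take vs. slice along the last axis of an N x T x C tensor.
   X = mx.nd.random.uniform(shape=(2, 5, 8))
   a = mx.nd.take(X, mx.nd.array([1, 2, 3]), axis=-1)  # gather channels 1..3
   b = mx.nd.slice_axis(X, axis=-1, begin=1, end=4)    # slice the same channels
   print((a - b).abs().sum())                          # expected: 0.0
   ```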
   
   ### Other Details
   I have been unable to find any other scenario in which this happens (for example, if one replaces the Conv1D layers in the example below with a linear layer, there is no issue with the gradient computation).
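
   For reference, a minimal sketch of that variant, assuming the conv stack in `conv_layer` below is replaced by a single `Dense` layer (`linear_layer` is my name for it, not something from the repro script):

   ```py
   from mxnet.gluon.nn import Dense, HybridSequential

   def linear_layer(num_channels):
       # Drop-in replacement for conv_layer() below: Dense with flatten=False
       # acts on the last (channel) axis of an N x T x C input, so no
       # transposes are needed. With this variant the gradients flowing
       # through F.take looked fine.
       layer = HybridSequential()
       layer.add(Dense(num_channels, flatten=False, activation='tanh'))
       return layer
   ```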
   
   I also encountered the bug on MXNet 1.5 and 1.6 (I have not tested earlier versions).
   
   ## To Reproduce
   Below I provide an example of a simple model with two implementations -- one 
that uses `F.take` (Model A) and one that uses `F.slice_axis` (Model B) instead.
   
   ```py
   from mxnet import nd
   from mxnet.gluon import HybridBlock
   from mxnet.gluon.nn import Conv1D, Dense, HybridLambda, HybridSequential


   def conv_layer(atrous_rates, num_channels):
       convs = HybridSequential()
       convs.add(HybridLambda(lambda F, x: F.transpose(x, (0, 2, 1))))
       for rate in atrous_rates:
           convs.add(Conv1D(num_channels, 3, padding=rate, dilation=rate, activation='tanh'))
       convs.add(HybridLambda(lambda F, x: F.transpose(x, (0, 2, 1))))
       return convs
   
   
   class Model(HybridBlock):
       """
       Model takes tensors of shape N x T x C and produces predictions with shape N x T
       """
   
       def __init__(self, conv_units, atrous_rates, use_take=False, **kwargs):
           super().__init__(prefix=kwargs.get('prefix', None), params=kwargs.get('params', None))
           self.use_take = use_take
           with self.name_scope():
               self.convs = conv_layer(atrous_rates, conv_units)
               self.dense_out = Dense(1, flatten=False, activation='tanh')
   
       def hybrid_forward(self, F, X):
           X1 = X
           X2 = self.convs(X1)
           if self.use_take:
               X3 = F.take(X2, nd.array([1, 2, 3]), axis=-1)
           else:
               X3 = F.slice_axis(X2, begin=1, end=4, axis=-1)
           X4 = self.dense_out(X3)
           X4 = F.squeeze(X4, axis=-1)
           return X4
   ```
   
   The script provided below instantiates both implementations with the same initial weights, computes an L2 loss, and prints the gradients from both models. A random seed is set, so the output should be deterministic (and it is for Model B).
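
   For reference, the comparison boils down to roughly the following (a simplified sketch, not the gist verbatim; the shapes and hyperparameters here are made up):

   ```py
   import mxnet as mx
   from mxnet import autograd, gluon, nd

   N, T, C = 4, 16, 8                          # illustrative shapes
   X = nd.random.uniform(shape=(N, T, C))
   Y = nd.random.uniform(shape=(N, T))

   def grad_norms(use_take):
       mx.random.seed(0)                       # same seed -> same initial weights
       model = Model(conv_units=C, atrous_rates=[1, 2, 4], use_take=use_take)
       model.initialize(mx.init.Xavier())      # deferred; weights drawn on first forward
       with autograd.record():
           loss = gluon.loss.L2Loss()(model(X), Y)
       loss.backward()
       return {name: nd.norm(p.grad()).asscalar()
               for name, p in model.collect_params().items()}

   print(grad_norms(use_take=True))            # Model A: F.take
   print(grad_norms(use_take=False))           # Model B: F.slice_axis
   ```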
   
   ### Steps to reproduce
   1. Download this script: https://gist.github.com/ceisenach/9ffed8343e5576748ec7d5623ffe6c46
   2. Run the script (`python take_bug.py`)
   
   
   ### Result
   1. As expected, the output of the forward pass is the same for both models.
   2. Gradients (Model A): the gradients of parameters that receive them through `F.take` are on the order of 1e28 (or in some cases infinite), and the results are non-deterministic.
   3. Gradients (Model B): gradient values look reasonable and are deterministic (the same results each time).
   
   Example output from the script I provided:
   
   ```
   ||g_param||_2: INF | Param: model0_conv0_weight
   ||g_param||_2: 7.21E+18 | Param: model0_conv0_bias
   ||g_param||_2: INF | Param: model0_conv1_weight
   ||g_param||_2: INF | Param: model0_conv1_bias
   ||g_param||_2: INF | Param: model0_conv2_weight
   ||g_param||_2: INF | Param: model0_conv2_bias
   ||g_param||_2: 1.38E-04 | Param: model0_dense0_weight
   ||g_param||_2: 1.06E-02 | Param: model0_dense0_bias
   
       -------------------------------------------
       -------  Grad Info
       *  ||g||_2: INF
       *  ||g||_1: 1.77E+21
       *  ||g||_inf: 5.79E+20
   
       
   ||g_param||_2: 2.37E-04 | Param: model1_conv0_weight
   ||g_param||_2: 2.29E-05 | Param: model1_conv0_bias
   ||g_param||_2: 2.23E-04 | Param: model1_conv1_weight
   ||g_param||_2: 1.50E-04 | Param: model1_conv1_bias
   ||g_param||_2: 4.26E-04 | Param: model1_conv2_weight
   ||g_param||_2: 7.02E-04 | Param: model1_conv2_bias
   ||g_param||_2: 1.38E-04 | Param: model1_dense0_weight
   ||g_param||_2: 1.06E-02 | Param: model1_dense0_bias
   
       -------------------------------------------
       -------  Grad Info
       *  ||g||_2: 1.06E-02
       *  ||g||_1: 1.75E-02
       *  ||g||_inf: 1.06E-02
   
       
   ==== Same outputs?
   Y_hat1 - Yhat2 = 0.0000
   ```
   
   It appears that either there is an out-of-bounds memory access, or some values involved in the calculation are not initialized before they are used. I haven't attempted to track down the root cause.
   
   
   ## What have you tried to solve it?
   
   In many cases, the bug can be worked around by using one of the slice operators instead; they do not appear to suffer from the same problem.
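
   A minimal sketch of the substitution, assuming the taken indices are contiguous as in the example above (`select_contiguous` is a hypothetical helper name, not part of the MXNet API):

   ```py
   def select_contiguous(F, X, first, last, axis=-1):
       # Selects the same elements as F.take(X, nd.array(range(first, last)),
       # axis=axis) on the forward pass, but, per the results above, its
       # backward pass produces sensible gradients.
       return F.slice_axis(X, axis=axis, begin=first, end=last)

   # In Model A's hybrid_forward, the line
   #     X3 = F.take(X2, nd.array([1, 2, 3]), axis=-1)
   # can then be replaced with
   #     X3 = select_contiguous(F, X2, 1, 4, axis=-1)
   ```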
   
   ## Environment
   
   OS: Ubuntu 18.04
   Python: 3.8.5
   pip: 20.2.3
   mxnet: 1.7.0 (Commit Hash: 64f737cdd59fe88d2c5b479f25d011c5156b6a8a)

