matteosal opened a new issue, #21143: URL: https://github.com/apache/incubator-mxnet/issues/21143
This script creates an `RNN` operator and computes its input gradient 5 times, for sequence lengths 1, 2, 3, 4 and 5. It then prints each gradient element at a fixed sequence position across all the computed sequence lengths:
```
import mxnet as mx
from mxnet import autograd
import numpy as np

batch_size = 1
data_len = 5
input_size = 2
output_size = 3

param_shapes = {
    'wx': [output_size, input_size],
    'ws': [output_size, output_size],
    'bx': [output_size],
    'bs': [output_size]
}
fused_param_len = np.sum(
    [np.prod(v) for v in param_shapes.values()]
)
shapes = {
    'data': [data_len, batch_size, input_size],
    'par': [fused_param_len],
    's0': [1, batch_size, output_size]
}

sym = mx.symbol.RNN(
    *[mx.symbol.Variable(name) for name in shapes.keys()],
    state_size=output_size,
    num_layers=1,
    mode='rnn_tanh'
)
op = mx.ndarray.CachedOp(sym)

args = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]

def get_grad(seq_len):
    input_data = args[0][:seq_len]
    with autograd.record(train_mode=True):
        input_data.attach_grad()
        output = op(input_data, args[1], args[2], default_ctx=mx.cpu())
        autograd.backward(
            output,
            head_grads=mx.np.ones([data_len, batch_size, output_size], ctx=mx.cpu())
        )
    return input_data.grad

results = []
for i in range(1, 6):
    print('**************')
    print('Input gradient for sequence length = ' + str(i) + '\n')
    results.append(get_grad(i))
    print(results[-1])
    print('\n')

for i in range(4):
    print('++++++++++++++')
    print('Element #' + str(i) + ' of all input gradients')
    for j in range(i, 5):
        print('sequence length: ' + str(j+1) + ': ' + str(results[j][i]))
    # [print('sequence length: ' + str(i+1) + ': ' + str(grad[i])) for grad in results[i:]]
    print('\n')
```
The output is:
```
**************
Input gradient for sequence length = 1
[[[0.14385478 0.05408207]]]

**************
Input gradient for sequence length = 2
[[[0.14385478 0.05408207]]
 [[0.01706791 0.00660894]]]

**************
Input gradient for sequence length = 3
[[[0.14385478 0.05408207]]
 [[0.01706791 0.00660894]]
 [[0.0178871 0.00672178]]]

**************
Input gradient for sequence length = 4
[[[0.14385478 0.05408207]]
 [[0.01706791 0.00660894]]
 [[0.0178871 0.00672178]]
 [[0.01958952 0.00729937]]]

**************
Input gradient for sequence length = 5
[[[0.14385478 0.05408207]]
 [[0.01706791 0.00660894]]
 [[0.0178871 0.00672178]]
 [[0.01958952 0.00729937]]
 [[0.02612576 0.00999804]]]

++++++++++++++
Element #0 of all input gradients
sequence length: 1: [[0.14385478 0.05408207]]
sequence length: 2: [[0.14385478 0.05408207]]
sequence length: 3: [[0.14385478 0.05408207]]
sequence length: 4: [[0.14385478 0.05408207]]
sequence length: 5: [[0.14385478 0.05408207]]

++++++++++++++
Element #1 of all input gradients
sequence length: 2: [[0.01706791 0.00660894]]
sequence length: 3: [[0.01706791 0.00660894]]
sequence length: 4: [[0.01706791 0.00660894]]
sequence length: 5: [[0.01706791 0.00660894]]

++++++++++++++
Element #2 of all input gradients
sequence length: 3: [[0.0178871 0.00672178]]
sequence length: 4: [[0.0178871 0.00672178]]
sequence length: 5: [[0.0178871 0.00672178]]

++++++++++++++
Element #3 of all input gradients
sequence length: 4: [[0.01958952 0.00729937]]
sequence length: 5: [[0.01958952 0.00729937]]
```
The last 4 sections, starting with `++++++++++++++`, show that gradient elements at the same sequence position are identical across all 5 computations with sequence lengths 1 to 5, whenever the computed gradient is long enough to contain that element (the gradient for sequence length 2 obviously has no element #3, for example).
This means that `RNN` behaves as if the presence of later elements in the sequence did not affect the gradient for earlier elements. That is clearly wrong: by the nature of recurrent computations, earlier elements in the sequence DO affect later ones, so gradient elements at the same sequence position should change when the sequence length changes. When the input sequence is extended by one element, the gradient of every earlier element should pick up an additional contribution from the new element, changing its value. This is not a direct comparison against a manual computation of the gradient, but this observation alone is enough to conclude that the gradients computed by this op are wrong.

I should also point out that this happens for every other setting of the operator's `mode` parameter, not only `mode='rnn_tanh'`.
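To illustrate what the expected behavior looks like, here is a minimal NumPy sketch of a hand-rolled single-layer tanh RNN with manual backpropagation, using a head gradient of ones on every output, as in the script above. The names (`Wx`, `Wh`, `b`, `h0`, `input_grads`) are made up for this sketch and the parameters are drawn independently, so the numbers are not comparable to the output above; the point is only the qualitative behavior.
```
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 2, 3
Wx = rng.uniform(size=(hidden_size, input_size))
Wh = rng.uniform(size=(hidden_size, hidden_size))
b = rng.uniform(size=hidden_size)
h0 = rng.uniform(size=hidden_size)
xs = rng.uniform(size=(5, input_size))  # full input sequence, truncated below

def input_grads(seq_len):
    """d(sum of all outputs)/dx_t for a tanh RNN run on xs[:seq_len]."""
    # forward pass: h_t = tanh(Wx x_t + Wh h_{t-1} + b)
    hs, h = [], h0
    for t in range(seq_len):
        h = np.tanh(Wx @ xs[t] + Wh @ h + b)
        hs.append(h)
    # backward pass with head gradient = ones on every output h_t
    dh = np.ones(hidden_size)                  # dL/dh_{T-1}
    dxs = [None] * seq_len
    for t in reversed(range(seq_len)):
        da = dh * (1.0 - hs[t] ** 2)           # backprop through tanh
        dxs[t] = Wx.T @ da                     # gradient w.r.t. x_t
        dh = Wh.T @ da + np.ones(hidden_size)  # recurrence + direct output term
    return dxs

for T in range(1, 6):
    print('sequence length ' + str(T) + ': dL/dx_0 = ' + str(input_grads(T)[0]))
```
Here `dL/dx_0` comes out different for every sequence length, because each extra time step sends an additional contribution back to `x_0` through the recurrent weights; the fused `RNN` op shows no such dependence.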