matteosal opened a new issue, #21111:
URL: https://github.com/apache/incubator-mxnet/issues/21111
The following script creates a BatchNorm operator and runs it 3 times:
1) A first test-mode evaluation
2) A dummy training-mode evaluation
3) A second test-mode evaluation
The outputs of (1) and (3) are compared under various circumstances: CPU vs GPU, cuDNN batchnorm ON vs OFF, and evaluation (2) performed with vs without a backward pass.
```
import mxnet as mx
import numpy as np
from mxnet import autograd

def testStateChange(backward, device, cudnn):
    print()
    print('backward: ' + str(backward) + ', device: ' + str(device) + ', cudnn: ' + str(cudnn))
    sym = mx.symbol.BatchNorm(
        *[mx.symbol.Variable(name) for name in shapes.keys()],
        eps=0.001,
        fix_gamma=False,
        use_global_stats=False,
        axis=1,
        cudnn_off=not(cudnn)
    )
    op = mx.ndarray.CachedOp(sym)
    if(device == mx.cpu()):
        arguments = args_cpu
    else:
        arguments = args_gpu

    # First evaluation in test mode
    out1 = op(*arguments, default_ctx=device)

    # Dummy evaluation in training mode, with or without backward
    if(backward):
        with autograd.record(train_mode=True):
            [arg.attach_grad() for arg in arguments]
            dummy = op(*arguments, default_ctx=device)
        autograd.backward(dummy, head_grads=mx.np.ones([1, 2, 3], ctx=device))
    else:
        with autograd.train_mode():
            op(*arguments, default_ctx=device)

    # Second evaluation in test mode
    out2 = op(*arguments, default_ctx=device)

    if(np.isnan(np.sum(out1.asnumpy()))):
        print('out1 has nans!')
    if(np.isnan(np.sum(out2.asnumpy()))):
        print('out2 has nans!')

    # Check if the dummy evaluation in training mode has changed the state of the
    # batchnorm. If out1 and out2 are different, the state was changed
    print(mx.np.max(mx.np.abs(out1 - out2)))


print("**** cudnn batchnorm inconsistency")
shapes = {'input': [1, 2, 3], 'gamma': [2], 'beta': [2], 'mean': [2], 'var': [2]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]
testStateChange(False, mx.cpu(), False)
testStateChange(True, mx.cpu(), False)
testStateChange(False, mx.gpu(), False)
testStateChange(True, mx.gpu(), False)
testStateChange(False, mx.gpu(), True)
testStateChange(True, mx.gpu(), True)

print("\n\n**** cudnn batchnorm nan")
shapes = {'input': [1, 6], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]
testStateChange(False, mx.gpu(), True)
```
I get this output from the above script:
```
**** cudnn batchnorm inconsistency
backward: False, device: cpu(0), cudnn: False
0.0
backward: True, device: cpu(0), cudnn: False
0.045242727
backward: False, device: gpu(0), cudnn: False
0.0
backward: True, device: gpu(0), cudnn: False
0.045242667
backward: False, device: gpu(0), cudnn: True
0.044606388
backward: True, device: gpu(0), cudnn: True
0.043622255
**** cudnn batchnorm nan
backward: False, device: gpu(0), cudnn: True
out2 has nans!
nan
```
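The non-zero differences indicate that the second test-mode evaluation no longer sees the same moving statistics as the first one. For reference, below is a minimal NumPy sketch of the running-average update that a training-mode BatchNorm pass is expected to apply, and of how any change to the running statistics propagates into the test-mode output. It assumes the operator's default momentum of 0.9 (not overridden in the script) and omits the gamma/beta scale and shift for brevity:
```
import numpy as np

momentum = 0.9    # assumed default momentum, not overridden in the script above
eps = 0.001

x = np.random.uniform(size=(1, 2, 3))     # same shape as the 'input' argument above
batch_mean = x.mean(axis=(0, 2))          # per-channel statistics (axis=1 is the channel axis)
batch_var = x.var(axis=(0, 2))

moving_mean = np.random.uniform(size=2)   # stand-ins for the 'mean' / 'var' arguments
moving_var = np.random.uniform(size=2)

# Running-average update expected from a training-mode pass
new_mean = momentum * moving_mean + (1 - momentum) * batch_mean
new_var = momentum * moving_var + (1 - momentum) * batch_var

# Test-mode normalization (gamma/beta omitted) before vs after the update:
# any change to the running statistics changes the result, which is exactly
# the difference the script prints.
out_before = (x - moving_mean[None, :, None]) / np.sqrt(moving_var[None, :, None] + eps)
out_after = (x - new_mean[None, :, None]) / np.sqrt(new_var[None, :, None] + eps)
print(np.max(np.abs(out_before - out_after)))
```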
The output above shows 2 problems:
1) The dummy training-mode evaluation can change the values of the moving mean and variance, making out1 and out2 differ, but it does so inconsistently. The "cudnn batchnorm inconsistency" output shows that the moving arrays are normally changed only when a BACKWARD pass in training mode is performed, yet on GPU + cudnn they are changed by the FORWARD pass alone (case `backward: False, device: gpu(0), cudnn: True`). A direct check of this is sketched after this list.
2) The "cudnn batchnorm nan" output shows that the cudnn batchnorm can also output nan when alternating training-mode and test-mode evaluations with certain input shapes.
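As referenced in (1), one way to confirm which pass mutates the running statistics is to snapshot the 'mean' and 'var' arrays around a forward-only training-mode evaluation. Below is a minimal sketch, assuming the 'mean'/'var' entries passed to the CachedOp (indices 3 and 4 in the argument list, following the order of the shapes dict) are the buffers that BatchNorm updates in place:
```
import mxnet as mx
from mxnet import autograd

# Rebuild the GPU + cuDNN configuration from the first shape set above
shapes = {'input': [1, 2, 3], 'gamma': [2], 'beta': [2], 'mean': [2], 'var': [2]}
sym = mx.symbol.BatchNorm(
    *[mx.symbol.Variable(name) for name in shapes.keys()],
    eps=0.001, fix_gamma=False, use_global_stats=False, axis=1, cudnn_off=False
)
op = mx.ndarray.CachedOp(sym)
args_gpu = [mx.np.random.uniform(size=shape, ctx=mx.gpu()) for shape in shapes.values()]

# Copy the moving statistics before the pass ('mean' is args_gpu[3], 'var' is args_gpu[4])
mean_before = mx.np.array(args_gpu[3], ctx=mx.gpu())
var_before = mx.np.array(args_gpu[4], ctx=mx.gpu())

with autograd.train_mode():
    op(*args_gpu, default_ctx=mx.gpu())   # forward only, no backward pass

# Non-zero differences mean the forward pass alone rewrote the running statistics;
# with cudnn_off=True they would be expected to stay zero until a backward pass runs.
print(mx.np.max(mx.np.abs(args_gpu[3] - mean_before)))
print(mx.np.max(mx.np.abs(args_gpu[4] - var_before)))
```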