matteosal opened a new issue, #21111:
URL: https://github.com/apache/incubator-mxnet/issues/21111
The following script creates a BatchNorm operator and runs it 3 times:
1) A first test-mode evaluation
2) A dummy training-mode evaluation
3) A second test-mode evaluation
The outputs of (1) and (3) are compared under various circumstances: CPU vs GPU, cuDNN batchnorm ON vs OFF, and evaluation (2) performed with vs without a backward pass.
```
import mxnet as mx
import numpy as np
from mxnet import autograd

def testStateChange(backward, device, cudnn):
    print()
    print('backward: ' + str(backward) + ', device: ' + str(device) + ', cudnn: ' + str(cudnn))
    sym = mx.symbol.BatchNorm(
        *[mx.symbol.Variable(name) for name in shapes.keys()],
        eps=0.001,
        fix_gamma=False,
        use_global_stats=False,
        axis=1,
        cudnn_off=not(cudnn)
    )
    op = mx.ndarray.CachedOp(sym)
    if(device == mx.cpu()):
        arguments = args_cpu
    else:
        arguments = args_gpu

    # First evaluation in test mode
    out1 = op(*arguments, default_ctx=device)

    # Dummy evaluation in training mode, with or without backward
    if(backward):
        with autograd.record(train_mode=True):
            [arg.attach_grad() for arg in arguments]
            dummy = op(*arguments, default_ctx=device)
        autograd.backward(dummy, head_grads=mx.np.ones([1, 2, 3], ctx=device))
    else:
        with autograd.train_mode():
            op(*arguments, default_ctx=device)

    # Second evaluation in test mode
    out2 = op(*arguments, default_ctx=device)

    if(np.isnan(np.sum(out1.asnumpy()))):
        print('out1 has nans!')
    if(np.isnan(np.sum(out2.asnumpy()))):
        print('out2 has nans!')

    # Check if the dummy evaluation in training mode has changed the state of the
    # batchnorm. If out1 and out2 are different, the state was changed
    print(mx.np.max(mx.np.abs(out1 - out2)))


print("**** cudnn batchnorm inconsistency")
shapes = {'input': [1, 2, 3], 'gamma': [2], 'beta': [2], 'mean': [2], 'var': [2]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]
testStateChange(False, mx.cpu(), False)
testStateChange(True, mx.cpu(), False)
testStateChange(False, mx.gpu(), False)
testStateChange(True, mx.gpu(), False)
testStateChange(False, mx.gpu(), True)
testStateChange(True, mx.gpu(), True)

print("\n\n**** cudnn batchnorm nan")
shapes = {'input': [1, 6], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]
testStateChange(False, mx.gpu(), True)
```
I get this output from the above script:
```
**** cudnn batchnorm inconsistency
backward: False, device: cpu(0), cudnn: False
0.0
backward: True, device: cpu(0), cudnn: False
0.045242727
backward: False, device: gpu(0), cudnn: False
0.0
backward: True, device: gpu(0), cudnn: False
0.045242667
backward: False, device: gpu(0), cudnn: True
0.044606388
backward: True, device: gpu(0), cudnn: True
0.043622255
**** cudnn batchnorm nan
backward: False, device: gpu(0), cudnn: True
out2 has nans!
nan
```
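The non-zero differences indicate that the second test-mode evaluation no longer sees the same moving statistics as the first one. For reference, below is a minimal NumPy sketch of the running-average update that a training-mode BatchNorm pass is expected to apply, and of how any change to the running statistics propagates into the test-mode output. It assumes the operator's default momentum of 0.9 (not overridden in the script) and omits the gamma/beta scale and shift for brevity:
```
import numpy as np

momentum = 0.9    # assumed default momentum, not overridden in the script above
eps = 0.001

x = np.random.uniform(size=(1, 2, 3))     # same shape as the 'input' argument above
batch_mean = x.mean(axis=(0, 2))          # per-channel statistics (axis=1 is the channel axis)
batch_var = x.var(axis=(0, 2))

moving_mean = np.random.uniform(size=2)   # stand-ins for the 'mean' / 'var' arguments
moving_var = np.random.uniform(size=2)

# Running-average update expected from a training-mode pass
new_mean = momentum * moving_mean + (1 - momentum) * batch_mean
new_var = momentum * moving_var + (1 - momentum) * batch_var

# Test-mode normalization (gamma/beta omitted) before vs after the update:
# any change to the running statistics changes the result, which is exactly
# the difference the script prints.
out_before = (x - moving_mean[None, :, None]) / np.sqrt(moving_var[None, :, None] + eps)
out_after = (x - new_mean[None, :, None]) / np.sqrt(new_var[None, :, None] + eps)
print(np.max(np.abs(out_before - out_after)))
```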
The output above shows 2 problems:
1) The dummy training-mode evaluation can change the values of the moving mean and variance, making out1 and out2 differ, but it does so inconsistently. The "cudnn batchnorm inconsistency" output shows that the moving arrays are normally changed only when a BACKWARD pass in training mode is performed, yet on GPU + cudnn they are changed by the FORWARD pass alone (case `backward: False, device: gpu(0), cudnn: True`). A direct check of this is sketched after this list.
2) The "cudnn batchnorm nan" output shows that the cudnn batchnorm can also output nan when alternating training-mode and test-mode evaluations with certain input shapes.
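As referenced in (1), one way to confirm which pass mutates the running statistics is to snapshot the 'mean' and 'var' arrays around a forward-only training-mode evaluation. Below is a minimal sketch, assuming the 'mean'/'var' entries passed to the CachedOp (indices 3 and 4 in the argument list, following the order of the shapes dict) are the buffers that BatchNorm updates in place:
```
import mxnet as mx
from mxnet import autograd

# Rebuild the GPU + cuDNN configuration from the first shape set above
shapes = {'input': [1, 2, 3], 'gamma': [2], 'beta': [2], 'mean': [2], 'var': [2]}
sym = mx.symbol.BatchNorm(
    *[mx.symbol.Variable(name) for name in shapes.keys()],
    eps=0.001, fix_gamma=False, use_global_stats=False, axis=1, cudnn_off=False
)
op = mx.ndarray.CachedOp(sym)
args_gpu = [mx.np.random.uniform(size=shape, ctx=mx.gpu()) for shape in shapes.values()]

# Copy the moving statistics before the pass ('mean' is args_gpu[3], 'var' is args_gpu[4])
mean_before = mx.np.array(args_gpu[3], ctx=mx.gpu())
var_before = mx.np.array(args_gpu[4], ctx=mx.gpu())

with autograd.train_mode():
    op(*args_gpu, default_ctx=mx.gpu())   # forward only, no backward pass

# Non-zero differences mean the forward pass alone rewrote the running statistics;
# with cudnn_off=True they would be expected to stay zero until a backward pass runs.
print(mx.np.max(mx.np.abs(args_gpu[3] - mean_before)))
print(mx.np.max(mx.np.abs(args_gpu[4] - var_before)))
```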