matteosal commented on issue #21111:
URL: 
https://github.com/apache/incubator-mxnet/issues/21111#issuecomment-1235710738

   @DickJC123 you were in fact right about biased vs. unbiased variance computation. The script below tests this claim by letting a non-cudnn batchnorm and a cudnn batchnorm update their moving variances, checking that the two updates differ, and verifying that they correspond to the biased (non-cudnn) and unbiased (cudnn) computations, respectively:
   ```python
   import mxnet as mx
   import numpy as np
   from mxnet import autograd
   
   print("**** cudnn batchnorm variance")
   
    shapes = {'input': [1, 6, 5], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}
   
   # Define batchnorms with identical specs except cudnn_off
    # Note that momentum is 0, so the moving arrays are overwritten every time
    # with the latest batch statistics
   sym1 = mx.symbol.BatchNorm(
        *[mx.symbol.Variable(name) for name in shapes.keys()],
        eps=0.001,
        momentum=0,
        fix_gamma=False,
        use_global_stats=False,
        axis=1,
        cudnn_off=True
   )
   sym2 = mx.symbol.BatchNorm(
        *[mx.symbol.Variable(name) for name in shapes.keys()],
        eps=0.001,
        momentum=0,
        fix_gamma=False,
        use_global_stats=False,
        axis=1,
        cudnn_off=False
   )
   op1 = mx.ndarray.CachedOp(sym1)
   op2 = mx.ndarray.CachedOp(sym2)
   
    # Define arrays for op1 and op2
    # They are identical now, but they will be changed differently by the ops
    args1 = [mx.np.random.uniform(size=shape, ctx=mx.gpu()) for shape in shapes.values()]
    args2 = [mx.np.array(array, ctx=mx.gpu()) for array in args1]
   args2 = [mx.np.array(array, ctx=mx.gpu()) for array in args1]
   
   data, gamma, beta, mean, var = args1
   
   # Evaluation in training mode with backward that rewrites moving mean and var
   with autograd.record(train_mode=True):
        [arg.attach_grad() for arg in args1]
        [arg.attach_grad() for arg in args2]
        dummy1 = op1(*args1, default_ctx=mx.gpu())
        dummy2 = op2(*args2, default_ctx=mx.gpu())
    autograd.backward(dummy1, head_grads=mx.np.ones(shapes['input'], ctx=mx.gpu()))
    autograd.backward(dummy2, head_grads=mx.np.ones(shapes['input'], ctx=mx.gpu()))
   
   # Check that outputs are the same
   print()
   print("difference between training mode outputs")
   print(mx.np.max(mx.np.abs(dummy1 - dummy2))) 
   
   # Check updated moving vars and observe they are different
   print()
   print("variance updated by the non-cudnn batchnorm")
   print(args1[-1])
   print("variance updated by the cudnn batchnorm")
   print(args2[-1])
   
   # Manually compute biased and unbiased variance
   data_mean = mx.np.mean(data, axis=(-1))
   data_zeromean = data - data_mean[:, :, np.newaxis]
   var1 = mx.np.mean((data_zeromean * data_zeromean), axis=(-1))
   var2 = var1 * shapes['input'][-1] / (shapes['input'][-1] - 1)
   
   print()
   print("manual biased variance")
   print(var1)
   print("manual unbiased variance")
   print(var2)
   ```
   The output is:
   ```
   **** cudnn batchnorm variance
   
   difference between training mode outputs
   2.3841858e-07
   
   variance updated by the non-cudnn batchnorm
   [0.12171984 0.03338415 0.03920404 0.04988261 0.02153183 0.02420242] @gpu(0)
   variance updated by the cudnn batchnorm
   [0.15214981 0.04173018 0.04900505 0.06235326 0.02691478 0.03025302] @gpu(0)
   
   manual biased variance
   [[0.12171984 0.03338414 0.03920404 0.04988261 0.02153182 0.02420242]] @gpu(0)
   manual unbiased variance
   [[0.1521498  0.04173018 0.04900505 0.06235326 0.02691478 0.03025302]] @gpu(0)
   ```
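
   As a quick sanity check of the factor relating the two results, the cudnn moving variance equals the non-cudnn one rescaled by n / (n - 1), where n = 5 is the spatial size of the `[1, 6, 5]` input (a standalone check in plain Python, using the first channel's printed value):

   ```python
   # Biased variance of the first channel, as printed by the non-cudnn batchnorm
   biased = 0.12171984
   n = 5  # spatial size of the [1, 6, 5] input

   # Bessel's correction: unbiased = biased * n / (n - 1)
   unbiased = biased * n / (n - 1)
   print(unbiased)  # ≈ 0.1521498, matching the cudnn moving variance
   ```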
   
   So this shows that:
   1) The training-mode output is the same for the non-cudnn and cudnn implementations ("difference between training mode outputs"), so they compute the data variance in the same way at this step. It can be checked manually that their result corresponds to using the biased variance.
   2) However, the way they end up updating their moving variance differs: the non-cudnn case uses the biased variance as before, but the cudnn case uses the unbiased variance this time. Note that momentum is set to 0 for both ops, which means the moving arrays are simply replaced with the latest batch statistics; this makes the results easy to check.
   3) This explains the numerical error found in my original report. For a spatial size of 1, the biased variance (which is 0 for a single sample) is rescaled by the correction factor **n / (n - 1) = 1 / 0**, producing **0 / 0 = nan**, and a subsequent evaluation then fails in the cudnn case.
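
   The nan failure mode for spatial size 1 can be reproduced directly with numpy's `ddof` parameter (a standalone sketch; the `[1, 6, 1]` shape is just an assumed analogue of the input in the original report):

   ```python
   import numpy as np

   # Batch 1, 6 channels, spatial size 1 (assumed shape for illustration)
   x = np.random.uniform(size=(1, 6, 1))

   biased = x.var(axis=-1)  # ddof=0: the variance of a single sample is 0
   with np.errstate(invalid='ignore', divide='ignore'):
       unbiased = x.var(axis=-1, ddof=1)  # divides by n - 1 = 0 -> nan

   print(biased)    # all zeros
   print(unbiased)  # all nan
   ```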


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

