FCInter opened a new issue #13902: Loss becomes NaN when setting use_global_stats=True for BatchNorm
URL: https://github.com/apache/incubator-mxnet/issues/13902

## Description

I trained a model and used it to perform prediction. While building the predictor, if I set the argument `for_training=False`, the prediction result is bad, as bad as predictions from a randomly initialized model.

## Environment info (Required)

```
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Dec 4 2017 14:50:18'))
('Arch :', ('64bit', ''))
------------Pip Info-----------
('Version :', '18.1')
('Directory :', '/path/to/mx_env/local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version :', '1.3.0')
('Directory :', '/path/to/mx_env/local/lib/python2.7/site-packages/mxnet')
('Commit Hash :', 'b3be92f4a48bce62a5a8424271871c2f81c8f7f1')
----------System Info----------
('Platform :', 'Linux-4.4.0-87-generic-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'B22-C09-G5500-01-GPU')
('release :', '4.4.0-87-generic')
('version :', '#110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                88
On-line CPU(s) list:   0-87
Thread(s) per core:    2
Core(s) per socket:    22
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
Stepping:              1
CPU MHz:               2400.093
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4801.21
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K
NUMA node0 CPU(s):     0-21,44-65
NUMA node1 CPU(s):     22-43,66-87
```

Package used (Python/R/Scala/Julia): Python

## Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609

Build config: installed via pip.

## Error Message

The training log:

```
Epoch[0] Batch [178] Speed: 16.56 samples/sec Train-RPNAcc=0.870976, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.990496, RCNNLogLoss=nan, RCNNL1Loss=nan,
Epoch[0] Batch [179] Speed: 14.71 samples/sec Train-RPNAcc=0.871275, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.990522, RCNNLogLoss=nan, RCNNL1Loss=nan,
```

## Minimum reproducible example

(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide a link to an existing example.)

This is how I build a ResNet-50 model:

```python
def residual_unit(self, data, num_filter, stride, dim_match, name, bottle_neck=True,
                  bn_mom=0.9, workspace=256, memonger=False):
    """Return ResNet Unit symbol for building ResNet

    Parameters
    ----------
    data : str
        Input data
    num_filter : int
        Number of output channels
    bnf : int
        Bottleneck channel factor with regard to num_filter
    stride : tuple
        Stride used in convolution
    dim_match : bool
        True means the channel numbers of input and output are the same, otherwise they differ
    name : str
        Base name of the operators
    workspace : int
        Workspace used in the convolution operator
    """
    if bottle_neck:
        # the same as https://github.com/facebook/fb.resnet.torch#notes, slightly different from the original paper
        bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn1', use_global_stats=self.use_global_stats)
        act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
        conv1 = mx.sym.Convolution(data=act1, num_filter=int(num_filter*0.25), kernel=(1,1),
                                   stride=(1,1), pad=(0,0), no_bias=True, workspace=workspace,
                                   name=name + '_conv1')
        bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn2', use_global_stats=self.use_global_stats)
        act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
        conv2 = mx.sym.Convolution(data=act2, num_filter=int(num_filter*0.25), kernel=(3,3),
                                   stride=stride, pad=(1,1), no_bias=True, workspace=workspace,
                                   name=name + '_conv2')
        bn3 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn3', use_global_stats=self.use_global_stats)
        act3 = mx.sym.Activation(data=bn3, act_type='relu', name=name + '_relu3')
        conv3 = mx.sym.Convolution(data=act3, num_filter=num_filter, kernel=(1,1), stride=(1,1),
                                   pad=(0,0), no_bias=True, workspace=workspace,
                                   name=name + '_conv3')
        if dim_match:
            shortcut = data
        else:
            shortcut = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(1,1),
                                          stride=stride, no_bias=True, workspace=workspace,
                                          name=name + '_sc')
        if memonger:
            shortcut._set_attr(mirror_stage='True')
        return conv3 + shortcut
    else:
        bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, momentum=bn_mom, eps=self.eps,
                               name=name + '_bn1', use_global_stats=self.use_global_stats)
        act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
        conv1 = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(3,3), stride=stride,
                                   pad=(1,1), no_bias=True, workspace=workspace,
                                   name=name + '_conv1')
        bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, momentum=bn_mom, eps=self.eps,
                               name=name + '_bn2', use_global_stats=self.use_global_stats)
        act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
        conv2 = mx.sym.Convolution(data=act2, num_filter=num_filter, kernel=(3,3), stride=(1,1),
                                   pad=(1,1), no_bias=True, workspace=workspace,
                                   name=name + '_conv2')
        if dim_match:
            shortcut = data
        else:
            shortcut = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(1,1),
                                          stride=stride, no_bias=True, workspace=workspace,
                                          name=name + '_sc')
        if memonger:
            shortcut._set_attr(mirror_stage='True')
        return conv2 + shortcut

def resnet(self, data, units, num_stages, filter_list, num_classes, bottle_neck=True,
           bn_mom=0.9, workspace=256, dtype='float32', memonger=False):
    """Return ResNet symbol

    Parameters
    ----------
    units : list
        Number of units in each stage
    num_stages : int
        Number of stages
    filter_list : list
        Channel size of each stage
    num_classes : int
        Output size of symbol
    dataset : str
        Dataset type, only cifar10 and imagenet are supported
    workspace : int
        Workspace used in the convolution operator
    dtype : str
        Precision (float32 or float16)
    """
    num_unit = len(units)
    assert(num_unit == num_stages)
    body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(7, 7), stride=(2,2),
                              pad=(3, 3), no_bias=True, name="conv0", workspace=workspace)
    body = mx.sym.BatchNorm(data=body, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                            name='bn0', use_global_stats=self.use_global_stats)
    body = mx.sym.Activation(data=body, act_type='relu', name='relu0')
    body = mx.sym.Pooling(data=body, kernel=(3, 3), stride=(2,2), pad=(1,1), pool_type='max')
    for i in range(num_stages):
        stride = (2, 2)
        if i == num_stages - 1 or i == 0:
            stride = (1, 1)
        body = self.residual_unit(body, filter_list[i+1], stride, False,
                                  name='stage%d_unit%d' % (i + 1, 1), bottle_neck=bottle_neck,
                                  workspace=workspace, memonger=memonger)
        for j in range(units[i]-1):
            body = self.residual_unit(body, filter_list[i+1], (1,1), True,
                                      name='stage%d_unit%d' % (i + 1, j + 2),
                                      bottle_neck=bottle_neck, workspace=workspace,
                                      memonger=memonger)
    feat_conv_3x3 = mx.sym.Convolution(data=body, kernel=(3, 3), pad=(6, 6), dilate=(6, 6),
                                       num_filter=1024, name="feat_conv_3x3")
    feat_conv_3x3_relu = mx.sym.Activation(data=feat_conv_3x3, act_type="relu",
                                           name="feat_conv_3x3_relu")
    # ('feat_conv_3x3_relu.shape', [(1L, 1024L, 38L, 50L)])
    return feat_conv_3x3_relu

def get_resnet_symbol(self, data, num_classes=2, dtype='float32'):
    """
    Adapted from https://github.com/tornadomeet/ResNet/blob/master/train_resnet.py
    Original author Wei Wu
    """
    num_layers = self.num_layers
    if num_layers >= 50:
        filter_list = [64, 256, 512, 1024, 2048]
        bottle_neck = True
    else:
        filter_list = [64, 64, 128, 256, 512]
        bottle_neck = False
    num_stages = 4
    if num_layers == 18:
        units = [2, 2, 2, 2]
    elif num_layers == 34:
        units = [3, 4, 6, 3]
    elif num_layers == 50:
        units = [3, 4, 6, 3]
    elif num_layers == 101:
        units = [3, 4, 23, 3]
    elif num_layers == 152:
        units = [3, 8, 36, 3]
    elif num_layers == 200:
        units = [3, 24, 36, 3]
    elif num_layers == 269:
        units = [3, 30, 48, 8]
    else:
        raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))
    return self.resnet(data=data, units=units, num_stages=num_stages, filter_list=filter_list,
                       num_classes=num_classes, bottle_neck=bottle_neck,
                       workspace=self.workspace, dtype=dtype)
```

## Steps to reproduce

1. For all the `BatchNorm` layers, if `self.use_global_stats` is `False`, everything goes fine: the training loss keeps going down and the training accuracy increases. However, if `self.use_global_stats` is `True`, the training loss becomes `NaN`, as shown in the error message.
2. I loaded a pretrained ResNet-50 checkpoint, downloaded from [here](http://data.dmlc.ml/mxnet/models/imagenet/resnet/50-layers/).

What's wrong with my code? Thank you all for helping me!
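For context on what the flag changes, the two BatchNorm modes differ only in which statistics are used to normalize. Below is a minimal NumPy sketch of the forward pass in both modes (my own illustration, not the MXNet implementation; the function name and shapes are made up). It shows that when the stored moving statistics do not match the actual data distribution, normalizing with them leaves the activations far from zero mean / unit variance, which can destabilize downstream layers during training:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, moving_mean, moving_var,
                      eps=1e-5, use_global_stats=False):
    """Per-channel batch norm forward over an (N, C) batch.

    use_global_stats=False: normalize with the current mini-batch mean/var
    (training-mode behavior).
    use_global_stats=True: normalize with the stored moving mean/var
    (inference behavior, or training with use_global_stats set).
    """
    if use_global_stats:
        mean, var = moving_mean, moving_var
    else:
        mean = x.mean(axis=0)
        var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.RandomState(0)
x = 5.0 + 3.0 * rng.randn(64, 4)          # batch whose true stats are mean 5, var 9
gamma, beta = np.ones(4), np.zeros(4)

# Batch statistics: output is ~zero-mean, unit-variance by construction.
y_batch = batchnorm_forward(x, gamma, beta, None, None, use_global_stats=False)

# Global statistics that do NOT match the data (moving_mean=0, moving_var=1,
# i.e. the default initialization): the output keeps the input's offset and
# scale, so downstream layers see a very different distribution.
y_global = batchnorm_forward(x, gamma, beta, np.zeros(4), np.ones(4),
                             use_global_stats=True)
```

Note also that in MXNet, `use_global_stats=True` treats the stored mean/variance as constants in the backward pass as well, so gradients differ from the `use_global_stats=False` case even when the forward outputs are similar.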