FCInter opened a new issue #13902: Loss becomes NaN when setting use_global_stats=True for BatchNorm
URL: https://github.com/apache/incubator-mxnet/issues/13902

## Description

I trained a model and used it to perform prediction. While building the predictor, if I set the argument `for_training=False`, the prediction result is bad, as bad as predictions from a randomly initialized model.

## Environment info (Required)

```
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Dec 4 2017 14:50:18'))
('Arch :', ('64bit', ''))
------------Pip Info-----------
('Version :', '18.1')
('Directory :', '/path/to/mx_env/local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version :', '1.3.0')
('Directory :', '/path/to/mx_env/local/lib/python2.7/site-packages/mxnet')
('Commit Hash :', 'b3be92f4a48bce62a5a8424271871c2f81c8f7f1')
----------System Info----------
('Platform :', 'Linux-4.4.0-87-generic-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'B22-C09-G5500-01-GPU')
('release :', '4.4.0-87-generic')
('version :', '#110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                88
On-line CPU(s) list:   0-87
Thread(s) per core:    2
Core(s) per socket:    22
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
Stepping:              1
CPU MHz:               2400.093
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4801.21
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K
NUMA node0 CPU(s):     0-21,44-65
NUMA node1 CPU(s):     22-43,66-87
```

Package used (Python/R/Scala/Julia): Python

## Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609

Build config: installed via pip.

## Error Message

The training log:

```
Epoch[0] Batch [178] Speed: 16.56 samples/sec Train-RPNAcc=0.870976, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.990496, RCNNLogLoss=nan, RCNNL1Loss=nan,
Epoch[0] Batch [179] Speed: 14.71 samples/sec Train-RPNAcc=0.871275, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.990522, RCNNLogLoss=nan, RCNNL1Loss=nan,
```

## Minimum reproducible example

(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide a link to an existing example.)

This is how I build a ResNet-50 model:

```python
def residual_unit(self, data, num_filter, stride, dim_match, name, bottle_neck=True,
                  bn_mom=0.9, workspace=256, memonger=False):
    """Return ResNet Unit symbol for building ResNet

    Parameters
    ----------
    data : str
        Input data
    num_filter : int
        Number of output channels
    bnf : int
        Bottleneck channel factor with regard to num_filter
    stride : tuple
        Stride used in convolution
    dim_match : bool
        True means the channel numbers of input and output are the same, otherwise they differ
    name : str
        Base name of the operators
    workspace : int
        Workspace used in the convolution operator
    """
    if bottle_neck:
        # the same as https://github.com/facebook/fb.resnet.torch#notes, slightly different from the original paper
        bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn1', use_global_stats=self.use_global_stats)
        act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
        conv1 = mx.sym.Convolution(data=act1, num_filter=int(num_filter*0.25), kernel=(1,1),
                                   stride=(1,1), pad=(0,0), no_bias=True, workspace=workspace,
                                   name=name + '_conv1')
        bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn2', use_global_stats=self.use_global_stats)
        act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
        conv2 = mx.sym.Convolution(data=act2, num_filter=int(num_filter*0.25), kernel=(3,3),
                                   stride=stride, pad=(1,1), no_bias=True, workspace=workspace,
                                   name=name + '_conv2')
        bn3 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                               name=name + '_bn3', use_global_stats=self.use_global_stats)
        act3 = mx.sym.Activation(data=bn3, act_type='relu', name=name + '_relu3')
        conv3 = mx.sym.Convolution(data=act3, num_filter=num_filter, kernel=(1,1), stride=(1,1),
                                   pad=(0,0), no_bias=True, workspace=workspace,
                                   name=name + '_conv3')
        if dim_match:
            shortcut = data
        else:
            shortcut = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(1,1),
                                          stride=stride, no_bias=True, workspace=workspace,
                                          name=name + '_sc')
        if memonger:
            shortcut._set_attr(mirror_stage='True')
        return conv3 + shortcut
    else:
        bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, momentum=bn_mom, eps=self.eps,
                               name=name + '_bn1', use_global_stats=self.use_global_stats)
        act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
        conv1 = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(3,3), stride=stride,
                                   pad=(1,1), no_bias=True, workspace=workspace,
                                   name=name + '_conv1')
        bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, momentum=bn_mom, eps=self.eps,
                               name=name + '_bn2', use_global_stats=self.use_global_stats)
        act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
        conv2 = mx.sym.Convolution(data=act2, num_filter=num_filter, kernel=(3,3), stride=(1,1),
                                   pad=(1,1), no_bias=True, workspace=workspace,
                                   name=name + '_conv2')
        if dim_match:
            shortcut = data
        else:
            shortcut = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(1,1),
                                          stride=stride, no_bias=True, workspace=workspace,
                                          name=name + '_sc')
        if memonger:
            shortcut._set_attr(mirror_stage='True')
        return conv2 + shortcut

def resnet(self, data, units, num_stages, filter_list, num_classes, bottle_neck=True,
           bn_mom=0.9, workspace=256, dtype='float32', memonger=False):
    """Return ResNet symbol

    Parameters
    ----------
    units : list
        Number of units in each stage
    num_stages : int
        Number of stages
    filter_list : list
        Channel size of each stage
    num_classes : int
        Output size of symbol
    dataset : str
        Dataset type, only cifar10 and imagenet are supported
    workspace : int
        Workspace used in the convolution operator
    dtype : str
        Precision (float32 or float16)
    """
    num_unit = len(units)
    assert(num_unit == num_stages)
    body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(7, 7), stride=(2,2),
                              pad=(3, 3), no_bias=True, name="conv0", workspace=workspace)
    body = mx.sym.BatchNorm(data=body, fix_gamma=False, eps=self.eps, momentum=bn_mom,
                            name='bn0', use_global_stats=self.use_global_stats)
    body = mx.sym.Activation(data=body, act_type='relu', name='relu0')
    body = mx.sym.Pooling(data=body, kernel=(3, 3), stride=(2,2), pad=(1,1), pool_type='max')
    for i in range(num_stages):
        stride = (2, 2)
        if i == num_stages - 1 or i == 0:
            stride = (1, 1)
        body = self.residual_unit(body, filter_list[i+1], stride, False,
                                  name='stage%d_unit%d' % (i + 1, 1), bottle_neck=bottle_neck,
                                  workspace=workspace, memonger=memonger)
        for j in range(units[i]-1):
            body = self.residual_unit(body, filter_list[i+1], (1,1), True,
                                      name='stage%d_unit%d' % (i + 1, j + 2),
                                      bottle_neck=bottle_neck, workspace=workspace,
                                      memonger=memonger)
    feat_conv_3x3 = mx.sym.Convolution(data=body, kernel=(3, 3), pad=(6, 6), dilate=(6, 6),
                                       num_filter=1024, name="feat_conv_3x3")
    feat_conv_3x3_relu = mx.sym.Activation(data=feat_conv_3x3, act_type="relu",
                                           name="feat_conv_3x3_relu")
    # ('feat_conv_3x3_relu.shape', [(1L, 1024L, 38L, 50L)])
    return feat_conv_3x3_relu

def get_resnet_symbol(self, data, num_classes=2, dtype='float32'):
    """
    Adapted from https://github.com/tornadomeet/ResNet/blob/master/train_resnet.py
    Original author Wei Wu
    """
    num_layers = self.num_layers
    if num_layers >= 50:
        filter_list = [64, 256, 512, 1024, 2048]
        bottle_neck = True
    else:
        filter_list = [64, 64, 128, 256, 512]
        bottle_neck = False
    num_stages = 4
    if num_layers == 18:
        units = [2, 2, 2, 2]
    elif num_layers == 34:
        units = [3, 4, 6, 3]
    elif num_layers == 50:
        units = [3, 4, 6, 3]
    elif num_layers == 101:
        units = [3, 4, 23, 3]
    elif num_layers == 152:
        units = [3, 8, 36, 3]
    elif num_layers == 200:
        units = [3, 24, 36, 3]
    elif num_layers == 269:
        units = [3, 30, 48, 8]
    else:
        raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))
    return self.resnet(data=data, units=units, num_stages=num_stages, filter_list=filter_list,
                       num_classes=num_classes, bottle_neck=bottle_neck,
                       workspace=self.workspace, dtype=dtype)
```

## Steps to reproduce

1. For all the `BatchNorm` layers, if `self.use_global_stats` is `False`, everything goes fine: the training loss keeps going down and the training accuracy increases. However, if `self.use_global_stats` is `True`, the training loss becomes `NaN`, as shown in the error message.
2. I loaded a pretrained ResNet-50 checkpoint, downloaded from [here](http://data.dmlc.ml/mxnet/models/imagenet/resnet/50-layers/).

What's wrong with my code? Thank you all for helping me!
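For context on what the flag changes, the two BatchNorm modes differ only in which statistics are used to normalize. Below is a minimal NumPy sketch of the forward pass in both modes (my own illustration, not the MXNet implementation; the function name and shapes are made up). It shows that when the stored moving statistics do not match the actual data distribution, normalizing with them leaves the activations far from zero mean / unit variance, which can destabilize downstream layers during training:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, moving_mean, moving_var,
                      eps=1e-5, use_global_stats=False):
    """Per-channel batch norm forward over an (N, C) batch.

    use_global_stats=False: normalize with the current mini-batch mean/var
    (training-mode behavior).
    use_global_stats=True: normalize with the stored moving mean/var
    (inference behavior, or training with use_global_stats set).
    """
    if use_global_stats:
        mean, var = moving_mean, moving_var
    else:
        mean = x.mean(axis=0)
        var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.RandomState(0)
x = 5.0 + 3.0 * rng.randn(64, 4)          # batch whose true stats are mean 5, var 9
gamma, beta = np.ones(4), np.zeros(4)

# Batch statistics: output is ~zero-mean, unit-variance by construction.
y_batch = batchnorm_forward(x, gamma, beta, None, None, use_global_stats=False)

# Global statistics that do NOT match the data (moving_mean=0, moving_var=1,
# i.e. the default initialization): the output keeps the input's offset and
# scale, so downstream layers see a very different distribution.
y_global = batchnorm_forward(x, gamma, beta, np.zeros(4), np.ones(4),
                             use_global_stats=True)
```

Note also that in MXNet, `use_global_stats=True` treats the stored mean/variance as constants in the backward pass as well, so gradients differ from the `use_global_stats=False` case even when the forward outputs are similar.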