[GitHub] [incubator-mxnet] karan6181 commented on issue #19631: MXNetError: unknown type for MKLDNN :2 when training Mask RCNN with mxnet-cu101==1.7.0

GitBox Tue, 15 Dec 2020 15:35:14 -0800


karan6181 commented on issue #19631:
URL: https://github.com/apache/incubator-mxnet/issues/19631#issuecomment-745633774



   I also tried running the Mask RCNN script on a single node using `mxnet-cu101mkl==1.6.0.post0` with `gluoncv==0.8.0`, and it ran successfully without any issues.
   
   Below is the output from the run:
   
   ```
   python gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py --gpus 0,1,2,3,4,5,6,7 --num-workers 4 --amp --lr-decay-epoch 8,10 --epochs 6 --log-interval 10 --val-interval 12 --batch-size 8 --use-fpn --lr 0.01 --lr-warmup-factor 0.001 --lr-warmup 1600 --static-alloc --clip-gradient 1.5 --use-ext --seed 987
   /shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py:1389: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
        data: None
     input_sym_arg_type = in_param.infer_type()[0]
   [23:15:28] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
   [23:15:31] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
   [23:15:34] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
   [23:15:36] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
   [23:15:39] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
   [23:15:41] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
   [23:15:44] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
   [23:15:46] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
   loading annotations into memory...
   Done (t=14.20s)
   creating index...
   index created!
   loading annotations into memory...
   Done (t=0.39s)
   creating index...
   index created!
   creating index...
   /shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator0_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
     warnings.warn('Constant parameter "{}" does not support '
   /shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator1_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
     warnings.warn('Constant parameter "{}" does not support '
   /shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator2_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
     warnings.warn('Constant parameter "{}" does not support '
   /shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator3_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
     warnings.warn('Constant parameter "{}" does not support '
   /shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701: UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator4_anchor_" does not support grad_req other than "null", and new value "write" is ignored.
     warnings.warn('Constant parameter "{}" does not support '
   INFO:root:Namespace(amp=True, batch_size=8, clip_gradient=1.5, custom_model=None, dataset='coco', disable_hybridization=False, epochs=6, executor_threads=1, gpus='0,1,2,3,4,5,6,7', horovod=False, kv_store='device', log_interval=10, lr=0.01, lr_decay=0.1, lr_decay_epoch='8,10', lr_warmup='1600', lr_warmup_factor=0.001, momentum=0.9, network='resnet50_v1b', norm_layer=None, num_workers=4, rcnn_smoothl1_rho=1.0, resume='', rpn_smoothl1_rho=0.1111111111111111, save_interval=1, save_prefix='mask_rcnn_fpn_resnet50_v1b_coco', seed=987, start_epoch=0, static_alloc=True, train_datapath='/scratch/data/mask_rcnn/mxnet/', use_ext=True, use_fpn=True, val_datapath='/scratch/data/mask_rcnn/mxnet/', val_interval=12, verbose=False, wd=0.0001)
   INFO:root:Start training from [Epoch 0]
   INFO:root:[Epoch 0 Iteration 0] Set learning rate to 1e-05
   [23:16:40] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
   [23:16:40] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
   [23:16:41] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
   [23:16:42] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
   [23:16:42] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
   [23:16:43] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
   [23:16:43] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
   [23:16:44] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
   [23:16:47] src/kvstore/././comm.h:744: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
   [23:16:47] src/kvstore/././comm.h:753: .vvvv...
   [23:16:47] src/kvstore/././comm.h:753: v.vv.v..
   [23:16:47] src/kvstore/././comm.h:753: vv.v..v.
   [23:16:47] src/kvstore/././comm.h:753: vvv....v
   [23:16:47] src/kvstore/././comm.h:753: v....vvv
   [23:16:47] src/kvstore/././comm.h:753: .v..v.vv
   [23:16:47] src/kvstore/././comm.h:753: ..v.vv.v
   [23:16:47] src/kvstore/././comm.h:753: ...vvvv.
   INFO:root:AMP: decreasing loss scale to 32768.000000
   INFO:root:AMP: decreasing loss scale to 16384.000000
   INFO:root:AMP: decreasing loss scale to 8192.000000
   INFO:root:AMP: decreasing loss scale to 4096.000000
   INFO:root:AMP: decreasing loss scale to 2048.000000
   INFO:root:AMP: decreasing loss scale to 1024.000000
   INFO:root:[Epoch 0][Batch 9], Speed: 4.880 samples/sec, RPN_Conf=0.606,RPN_SmoothL1=0.156,RCNN_CrossEntropy=4.487,RCNN_SmoothL1=0.021,RCNN_Mask=1.882,RPNAcc=0.751,RPNL1Loss=1.384,RCNNAcc=0.004,RCNNL1Loss=0.947,MaskAcc=0.518,MaskFGAcc=0.522
   INFO:root:[Epoch 0 Iteration 10] Set learning rate to 7.24375e-05
   INFO:root:[Epoch 0][Batch 19], Speed: 15.230 samples/sec, RPN_Conf=0.577,RPN_SmoothL1=0.151,RCNN_CrossEntropy=4.018,RCNN_SmoothL1=0.019,RCNN_Mask=1.803,RPNAcc=0.797,RPNL1Loss=1.402,RCNNAcc=0.314,RCNNL1Loss=0.879,MaskAcc=0.513,MaskFGAcc=0.523
   INFO:root:[Epoch 0 Iteration 20] Set learning rate to 0.000134875
   INFO:root:[Epoch 0][Batch 29], Speed: 11.308 samples/sec, RPN_Conf=0.526,RPN_SmoothL1=0.143,RCNN_CrossEntropy=3.149,RCNN_SmoothL1=0.020,RCNN_Mask=1.635,RPNAcc=0.828,RPNL1Loss=1.323,RCNNAcc=0.535,RCNNL1Loss=0.927,MaskAcc=0.514,MaskFGAcc=0.525
   INFO:root:[Epoch 0 Iteration 30] Set learning rate to 0.0001973125
   INFO:root:[Epoch 0][Batch 39], Speed: 16.033 samples/sec, RPN_Conf=0.479,RPN_SmoothL1=0.139,RCNN_CrossEntropy=2.477,RCNN_SmoothL1=0.022,RCNN_Mask=1.504,RPNAcc=0.842,RPNL1Loss=1.277,RCNNAcc=0.645,RCNNL1Loss=0.998,MaskAcc=0.514,MaskFGAcc=0.527
   INFO:root:[Epoch 0 Iteration 40] Set learning rate to 0.00025975
   INFO:root:[Epoch 0][Batch 49], Speed: 13.002 samples/sec, RPN_Conf=0.435,RPN_SmoothL1=0.129,RCNN_CrossEntropy=2.047,RCNN_SmoothL1=0.026,RCNN_Mask=1.390,RPNAcc=0.856,RPNL1Loss=1.229,RCNNAcc=0.711,RCNNL1Loss=1.112,MaskAcc=0.518,MaskFGAcc=0.529
   INFO:root:[Epoch 0 Iteration 50] Set learning rate to 0.0003221875
   INFO:root:[Epoch 0][Batch 59], Speed: 13.902 samples/sec, RPN_Conf=0.423,RPN_SmoothL1=0.127,RCNN_CrossEntropy=1.780,RCNN_SmoothL1=0.032,RCNN_Mask=1.303,RPNAcc=0.858,RPNL1Loss=1.156,RCNNAcc=0.753,RCNNL1Loss=1.227,MaskAcc=0.516,MaskFGAcc=0.532
   INFO:root:[Epoch 0 Iteration 60] Set learning rate to 0.000384625
   INFO:root:[Epoch 0][Batch 69], Speed: 13.738 samples/sec, RPN_Conf=0.402,RPN_SmoothL1=0.122,RCNN_CrossEntropy=1.582,RCNN_SmoothL1=0.039,RCNN_Mask=1.230,RPNAcc=0.862,RPNL1Loss=1.104,RCNNAcc=0.782,RCNNL1Loss=1.369,MaskAcc=0.517,MaskFGAcc=0.534
   INFO:root:[Epoch 0 Iteration 70] Set learning rate to 0.0004470625
   INFO:root:[Epoch 0][Batch 79], Speed: 12.123 samples/sec, RPN_Conf=0.385,RPN_SmoothL1=0.116,RCNN_CrossEntropy=1.440,RCNN_SmoothL1=0.048,RCNN_Mask=1.172,RPNAcc=0.865,RPNL1Loss=1.055,RCNNAcc=0.802,RCNNL1Loss=1.537,MaskAcc=0.517,MaskFGAcc=0.536
   ```
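
As a sanity check on the log above, the "Set learning rate" values are consistent with a linear warmup driven by `--lr 0.01`, `--lr-warmup 1600` and `--lr-warmup-factor 0.001`. The sketch below is only an illustration of that schedule, not code taken from the training script; the function name `warmup_lr` and its parameter names are hypothetical.

```python
def warmup_lr(iteration, base_lr=0.01, warmup_iters=1600, warmup_factor=0.001):
    """Linear warmup from base_lr * warmup_factor up to base_lr.

    Hypothetical helper: the defaults mirror the --lr, --lr-warmup and
    --lr-warmup-factor values from the command line in the log.
    """
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters  # fraction of the warmup completed
    return base_lr * (warmup_factor * (1 - alpha) + alpha)


# Print the schedule at the iterations where the log reports a new rate.
for it in (0, 10, 20, 30, 40, 50, 60, 70):
    print(it, warmup_lr(it))
```

Up to floating-point rounding, the printed values agree with the logged ones (1e-05 at iteration 0, 7.24375e-05 at iteration 10, and so on), which suggests the 1.6.0 run was progressing through warmup normally.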
   
   However, running the same script with `mxnet-cu101==1.7.0` and `gluoncv==0.8.0` fails with:
   
   ```
   Traceback (most recent call last):
     File "/shared/mx_oob_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
       result = (True, func(*args, **kwds))
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataloader.py", line 429, in _worker_fn
       batch = batchify_fn([_worker_dataset[i] for i in samples])
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataloader.py", line 429, in <listcomp>
       batch = batchify_fn([_worker_dataset[i] for i in samples])
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataset.py", line 219, in __getitem__
       return self._fn(*item)
     File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/data/transforms/presets/rcnn.py", line 407, in __call__
       cls_target, box_target, box_mask = self._target_generator(
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 682, in __call__
       out = self.forward(*args)
     File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/model_zoo/rcnn/rpn/rpn_target.py", line 157, in forward
       ious = mx.nd.contrib.box_iou(anchor, bbox, format='corner').asnumpy()
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/ndarray/ndarray.py", line 2563, in asnumpy
       check_call(_LIB.MXNDArraySyncCopyToCPU(
     File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
       raise get_last_ffi_error()
   mxnet.base.MXNetError: Traceback (most recent call last):
     File "src/ndarray/./../operator/tensor/.././../common/../operator/nn/mkldnn/mkldnn_base-inl.h", line 246
   MXNetError: unknown type for MKLDNN :2
   ```
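
To make the error message easier to read: the trailing `2` appears to be an mshadow dtype flag. The sketch below decodes it using the flag table as I understand it from mshadow's `TypeFlag` enum (assumed unchanged in 1.7.0). The helper name `decode_mkldnn_type_error` and the regular expression are hypothetical and only for illustration.

```python
import re

# mshadow TypeFlag values as I understand them (assumption: same in 1.7.0).
MSHADOW_TYPE_FLAGS = {
    0: "float32",
    1: "float64",
    2: "float16",
    3: "uint8",
    4: "int32",
    5: "int8",
    6: "int64",
}


def decode_mkldnn_type_error(message):
    """Return the dtype name for an 'unknown type for MKLDNN :N' message.

    Hypothetical helper: returns None if the message does not match.
    """
    match = re.search(r"unknown type for MKLDNN\s*:\s*(-?\d+)", message)
    if match is None:
        return None
    flag = int(match.group(1))
    return MSHADOW_TYPE_FLAGS.get(flag, f"unrecognized flag {flag}")


print(decode_mkldnn_type_error("MXNetError: unknown type for MKLDNN :2"))
```

Under that assumption the flag decodes to `float16`, i.e. a float16 array reached an MKLDNN code path on the CPU inside the data loader worker. That would be consistent with the run using `--amp`, but I have not confirmed where the float16 array comes from.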

