karan6181 commented on issue #19631:
URL:
https://github.com/apache/incubator-mxnet/issues/19631#issuecomment-745633774
I also tried running the Mask R-CNN script on a single node using
`mxnet-cu101mkl==1.6.0.post0` with `gluoncv==0.8.0`, and it ran
successfully without any issues.
Below is the output from the run:
```
python gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py \
    --gpus 0,1,2,3,4,5,6,7 --num-workers 4 --amp --lr-decay-epoch 8,10 \
    --epochs 6 --log-interval 10 --val-interval 12 --batch-size 8 --use-fpn \
    --lr 0.01 --lr-warmup-factor 0.001 --lr-warmup 1600 --static-alloc \
    --clip-gradient 1.5 --use-ext --seed 987
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py:1389:
UserWarning: Cannot decide type for the following arguments. Consider providing
them as input:
data: None
input_sym_arg_type = in_param.infer_type()[0]
[23:15:28] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:31] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:34] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:36] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:39] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:41] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:44] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[23:15:46] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
loading annotations into memory...
Done (t=14.20s)
creating index...
index created!
loading annotations into memory...
Done (t=0.39s)
creating index...
index created!
creating index...
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701:
UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator0_anchor_"
does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701:
UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator1_anchor_"
does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701:
UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator2_anchor_"
does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701:
UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator3_anchor_"
does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/parameter.py:701:
UserWarning: Constant parameter "maskrcnn0_rpn0_rpnanchorgenerator4_anchor_"
does not support grad_req other than "null", and new value "write" is ignored.
warnings.warn('Constant parameter "{}" does not support '
INFO:root:Namespace(amp=True, batch_size=8, clip_gradient=1.5,
custom_model=None, dataset='coco', disable_hybridization=False, epochs=6,
executor_threads=1, gpus='0,1,2,3,4,5,6,7', horovod=False, kv_store='device',
log_interval=10, lr=0.01, lr_decay=0.1, lr_decay_epoch='8,10',
lr_warmup='1600', lr_warmup_factor=0.001, momentum=0.9, network='resnet50_v1b',
norm_layer=None, num_workers=4, rcnn_smoothl1_rho=1.0, resume='',
rpn_smoothl1_rho=0.1111111111111111, save_interval=1,
save_prefix='mask_rcnn_fpn_resnet50_v1b_coco', seed=987, start_epoch=0,
static_alloc=True, train_datapath='/scratch/data/mask_rcnn/mxnet/',
use_ext=True, use_fpn=True, val_datapath='/scratch/data/mask_rcnn/mxnet/',
val_interval=12, verbose=False, wd=0.0001)
INFO:root:Start training from [Epoch 0]
INFO:root:[Epoch 0 Iteration 0] Set learning rate to 1e-05
[23:16:40] src/imperative/cached_op.cc:192: Disabling fusion due to altered
topological order of inputs.
[23:16:40] src/imperative/cached_op.cc:192: Disabling fusion due to altered
topological order of inputs.
[23:16:41] src/imperative/cached_op.cc:192: Disabling fusion due to altered
topological order of inputs.
[23:16:42] src/imperative/cached_op.cc:192: Disabling fusion due to altered
topological order of inputs.
[23:16:42] src/imperative/cached_op.cc:192: Disabling fusion due to altered
topological order of inputs.
[23:16:43] src/imperative/cached_op.cc:192: Disabling fusion due to altered
topological order of inputs.
[23:16:43] src/imperative/cached_op.cc:192: Disabling fusion due to altered
topological order of inputs.
[23:16:44] src/imperative/cached_op.cc:192: Disabling fusion due to altered
topological order of inputs.
[23:16:47] src/kvstore/././comm.h:744: only 32 out of 56 GPU pairs are
enabled direct access. It may affect the performance. You can set
MXNET_ENABLE_GPU_P2P=0 to turn it off
[23:16:47] src/kvstore/././comm.h:753: .vvvv...
[23:16:47] src/kvstore/././comm.h:753: v.vv.v..
[23:16:47] src/kvstore/././comm.h:753: vv.v..v.
[23:16:47] src/kvstore/././comm.h:753: vvv....v
[23:16:47] src/kvstore/././comm.h:753: v....vvv
[23:16:47] src/kvstore/././comm.h:753: .v..v.vv
[23:16:47] src/kvstore/././comm.h:753: ..v.vv.v
[23:16:47] src/kvstore/././comm.h:753: ...vvvv.
INFO:root:AMP: decreasing loss scale to 32768.000000
INFO:root:AMP: decreasing loss scale to 16384.000000
INFO:root:AMP: decreasing loss scale to 8192.000000
INFO:root:AMP: decreasing loss scale to 4096.000000
INFO:root:AMP: decreasing loss scale to 2048.000000
INFO:root:AMP: decreasing loss scale to 1024.000000
INFO:root:[Epoch 0][Batch 9], Speed: 4.880 samples/sec,
RPN_Conf=0.606,RPN_SmoothL1=0.156,RCNN_CrossEntropy=4.487,RCNN_SmoothL1=0.021,RCNN_Mask=1.882,RPNAcc=0.751,RPNL1Loss=1.384,RCNNAcc=0.004,RCNNL1Loss=0.947,MaskAcc=0.518,MaskFGAcc=0.522
INFO:root:[Epoch 0 Iteration 10] Set learning rate to 7.24375e-05
INFO:root:[Epoch 0][Batch 19], Speed: 15.230 samples/sec,
RPN_Conf=0.577,RPN_SmoothL1=0.151,RCNN_CrossEntropy=4.018,RCNN_SmoothL1=0.019,RCNN_Mask=1.803,RPNAcc=0.797,RPNL1Loss=1.402,RCNNAcc=0.314,RCNNL1Loss=0.879,MaskAcc=0.513,MaskFGAcc=0.523
INFO:root:[Epoch 0 Iteration 20] Set learning rate to 0.000134875
INFO:root:[Epoch 0][Batch 29], Speed: 11.308 samples/sec,
RPN_Conf=0.526,RPN_SmoothL1=0.143,RCNN_CrossEntropy=3.149,RCNN_SmoothL1=0.020,RCNN_Mask=1.635,RPNAcc=0.828,RPNL1Loss=1.323,RCNNAcc=0.535,RCNNL1Loss=0.927,MaskAcc=0.514,MaskFGAcc=0.525
INFO:root:[Epoch 0 Iteration 30] Set learning rate to 0.0001973125
INFO:root:[Epoch 0][Batch 39], Speed: 16.033 samples/sec,
RPN_Conf=0.479,RPN_SmoothL1=0.139,RCNN_CrossEntropy=2.477,RCNN_SmoothL1=0.022,RCNN_Mask=1.504,RPNAcc=0.842,RPNL1Loss=1.277,RCNNAcc=0.645,RCNNL1Loss=0.998,MaskAcc=0.514,MaskFGAcc=0.527
INFO:root:[Epoch 0 Iteration 40] Set learning rate to 0.00025975
INFO:root:[Epoch 0][Batch 49], Speed: 13.002 samples/sec,
RPN_Conf=0.435,RPN_SmoothL1=0.129,RCNN_CrossEntropy=2.047,RCNN_SmoothL1=0.026,RCNN_Mask=1.390,RPNAcc=0.856,RPNL1Loss=1.229,RCNNAcc=0.711,RCNNL1Loss=1.112,MaskAcc=0.518,MaskFGAcc=0.529
INFO:root:[Epoch 0 Iteration 50] Set learning rate to 0.0003221875
INFO:root:[Epoch 0][Batch 59], Speed: 13.902 samples/sec,
RPN_Conf=0.423,RPN_SmoothL1=0.127,RCNN_CrossEntropy=1.780,RCNN_SmoothL1=0.032,RCNN_Mask=1.303,RPNAcc=0.858,RPNL1Loss=1.156,RCNNAcc=0.753,RCNNL1Loss=1.227,MaskAcc=0.516,MaskFGAcc=0.532
INFO:root:[Epoch 0 Iteration 60] Set learning rate to 0.000384625
INFO:root:[Epoch 0][Batch 69], Speed: 13.738 samples/sec,
RPN_Conf=0.402,RPN_SmoothL1=0.122,RCNN_CrossEntropy=1.582,RCNN_SmoothL1=0.039,RCNN_Mask=1.230,RPNAcc=0.862,RPNL1Loss=1.104,RCNNAcc=0.782,RCNNL1Loss=1.369,MaskAcc=0.517,MaskFGAcc=0.534
INFO:root:[Epoch 0 Iteration 70] Set learning rate to 0.0004470625
INFO:root:[Epoch 0][Batch 79], Speed: 12.123 samples/sec,
RPN_Conf=0.385,RPN_SmoothL1=0.116,RCNN_CrossEntropy=1.440,RCNN_SmoothL1=0.048,RCNN_Mask=1.172,RPNAcc=0.865,RPNL1Loss=1.055,RCNNAcc=0.802,RCNNL1Loss=1.537,MaskAcc=0.517,MaskFGAcc=0.536
```
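As a sanity check on the log above, the "Set learning rate" values are consistent with a plain linear warmup computed from the command-line arguments (`--lr 0.01`, `--lr-warmup 1600`, `--lr-warmup-factor 0.001`). The sketch below is an assumption about the schedule, not code taken from the training script; the function name `warmup_lr` is hypothetical.

```python
def warmup_lr(iteration, base_lr=0.01, warmup_iters=1600, warmup_factor=0.001):
    """Assumed linear warmup: ramp from base_lr * warmup_factor up to base_lr."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    # Interpolate the multiplier between warmup_factor (alpha=0) and 1 (alpha=1).
    return base_lr * (warmup_factor * (1 - alpha) + alpha)


# Compare against the logged values at iterations 0, 10, 20 and 70.
for it in (0, 10, 20, 70):
    print(it, warmup_lr(it))
```

Up to floating-point rounding, this gives 1e-05, 7.24375e-05, 0.000134875 and 0.0004470625, which match the logged learning rates at those iterations.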
However, running the same script with `mxnet-cu101==1.7.0` and `gluoncv==0.8.0`
fails with the following error:
```
Traceback (most recent call last):
  File "/shared/mx_oob_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataloader.py", line 429, in _worker_fn
    batch = batchify_fn([_worker_dataset[i] for i in samples])
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataloader.py", line 429, in <listcomp>
    batch = batchify_fn([_worker_dataset[i] for i in samples])
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/data/dataset.py", line 219, in __getitem__
    return self._fn(*item)
  File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/data/transforms/presets/rcnn.py", line 407, in __call__
    cls_target, box_target, box_mask = self._target_generator(
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/gluon/block.py", line 682, in __call__
    out = self.forward(*args)
  File "/shared/mx_oob_env/lib/python3.8/site-packages/gluoncv/model_zoo/rcnn/rpn/rpn_target.py", line 157, in forward
    ious = mx.nd.contrib.box_iou(anchor, bbox, format='corner').asnumpy()
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/ndarray/ndarray.py", line 2563, in asnumpy
    check_call(_LIB.MXNDArraySyncCopyToCPU(
  File "/shared/mx_oob_env/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "src/ndarray/./../operator/tensor/.././../common/../operator/nn/mkldnn/mkldnn_base-inl.h", line 246
MXNetError: unknown type for MKLDNN :2
```
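Since the only stated difference between the two runs is the installed MXNet package, it may help to record the exact package versions in each environment when comparing them. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are the ones mentioned in this comment.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


print(installed_versions(["mxnet-cu101", "mxnet-cu101mkl", "gluoncv"]))
```

Running this in both environments gives a quick side-by-side record of which MXNet build and GluonCV version each run actually used.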