chengyuz opened a new issue #18743:
URL: https://github.com/apache/incubator-mxnet/issues/18743


   ## Description
   I followed this tutorial (https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/amp.html) to enable AMP in my project, but training fails with the following error:
   
   INFO:root:----------------------------------------------------------------------------------------------------
   INFO:root:Using AMP
   INFO:root:Features in transition 1: 96 -> 96
   INFO:root:Features in transition 2: 192 -> 192
   INFO:root:Features in transition 3: 448 -> 448
   [11:43:40] /media/apache-mxnet-src-1.6.0-incubating/src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./dataset/imagenet200/rec/train.rec, use 30 threads for decoding..
   [11:43:42] /media/apache-mxnet-src-1.6.0-incubating/src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./dataset/imagenet200/rec/val.rec, use 30 threads for decoding..
   [11:44:05] /media/apache-mxnet-src-1.6.0-incubating/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [11:44:10] /media/apache-mxnet-src-1.6.0-incubating/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:744: only 0 out of 2 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
   [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:753: ..
   [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:753: ..
   Traceback (most recent call last):
     File "scripts/train_imagenet.py", line 807, in <module>
       main()
     File "scripts/train_imagenet.py", line 803, in main
       train(context)
     File "scripts/train_imagenet.py", line 736, in train
       trainer.step(batch_size)
     File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/gluon/trainer.py", line 334, in step
       self._allreduce_grads()
     File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/gluon/trainer.py", line 364, in _allreduce_grads
       self._kvstore.push(i, param.list_grad(), priority=-i)
     File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/kvstore.py", line 234, in push
       self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
     File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/base.py", line 255, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/storage/./pooled_storage_manager.h:164: cudaMalloc failed: an illegal memory access was encountered
   Stack trace:
     [bt] (0) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7f500e8f9493]
     [bt] (1) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x245) [0x7f50113b6775]
     [bt] (2) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x59) [0x7f50113b8c79]
     [bt] (3) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x52b) [0x7f500e91272b]
     [bt] (4) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::CommDevice::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x277) [0x7f500ebb5eb7]
     [bt] (5) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x11d) [0x7f500ebb9f5d]
     [bt] (6) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(MXKVStorePush+0x105) [0x7f500e903845]
     [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f504603fdae]
     [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f504603f71f]
   
   ## Environment
   
   MXNet 1.6.0 built from source, GTX 2080, Python 3.6.9
   

