chengyuz opened a new issue #18743: URL: https://github.com/apache/incubator-mxnet/issues/18743
## Description
I followed this tutorial (https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/amp.html) to enable AMP in my project, but training fails with the following error:

INFO:root:----------------------------------------------------------------------------------------------------
INFO:root:Using AMP
INFO:root:Features in transition 1: 96 -> 96
INFO:root:Features in transition 2: 192 -> 192
INFO:root:Features in transition 3: 448 -> 448
[11:43:40] /media/apache-mxnet-src-1.6.0-incubating/src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./dataset/imagenet200/rec/train.rec, use 30 threads for decoding..
[11:43:42] /media/apache-mxnet-src-1.6.0-incubating/src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./dataset/imagenet200/rec/val.rec, use 30 threads for decoding..
[11:44:05] /media/apache-mxnet-src-1.6.0-incubating/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[11:44:10] /media/apache-mxnet-src-1.6.0-incubating/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:744: only 0 out of 2 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:753: ..
[11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:753: ..
Traceback (most recent call last):
  File "scripts/train_imagenet.py", line 807, in <module>
    main()
  File "scripts/train_imagenet.py", line 803, in main
    train(context)
  File "scripts/train_imagenet.py", line 736, in train
    trainer.step(batch_size)
  File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/gluon/trainer.py", line 334, in step
    self._allreduce_grads()
  File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/gluon/trainer.py", line 364, in _allreduce_grads
    self._kvstore.push(i, param.list_grad(), priority=-i)
  File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/kvstore.py", line 234, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/storage/./pooled_storage_manager.h:164: cudaMalloc failed: an illegal memory access was encountered
Stack trace:
  [bt] (0) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7f500e8f9493]
  [bt] (1) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x245) [0x7f50113b6775]
  [bt] (2) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x59) [0x7f50113b8c79]
  [bt] (3) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x52b) [0x7f500e91272b]
  [bt] (4) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::CommDevice::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x277) [0x7f500ebb5eb7]
  [bt] (5) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x11d) [0x7f500ebb9f5d]
  [bt] (6) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(MXKVStorePush+0x105) [0x7f500e903845]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f504603fdae]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f504603f71f]

## Environment
MXNet 1.6.0 built from source, GTX 2080, Python 3.6.9

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
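
The log in the report names two environment variables, `MXNET_CUDNN_AUTOTUNE_DEFAULT` and `MXNET_ENABLE_GPU_P2P`. One possible way to narrow down where the illegal memory access comes from is to set both before MXNet is imported and rerun the script. The sketch below does that in Python. The helper `apply_debug_env` and the `DEBUG_ENV` table are hypothetical names, and `CUDA_LAUNCH_BLOCKING=1` is an addition that does not appear in the report: it is a CUDA runtime setting that makes kernel launches synchronous, so an error tends to surface at the call that caused it instead of at a later `cudaMalloc`. This is a debugging aid under those assumptions, not a fix for the issue.

```python
import os

# Variables taken from the log messages in the report, plus CUDA_LAUNCH_BLOCKING,
# which is an assumption added for debugging and is not mentioned in the report.
DEBUG_ENV = {
    "MXNET_CUDNN_AUTOTUNE_DEFAULT": "0",  # skip the cuDNN convolution autotuning pass
    "MXNET_ENABLE_GPU_P2P": "0",          # named in the comm.h log message
    "CUDA_LAUNCH_BLOCKING": "1",          # synchronous kernel launches (slower)
}


def apply_debug_env(env=os.environ):
    """Set each debug variable unless it is already set; return what was applied."""
    applied = {}
    for key, value in DEBUG_ENV.items():
        if key not in env:
            env[key] = value
            applied[key] = value
    return applied


if __name__ == "__main__":
    for key, value in apply_debug_env().items():
        print(f"{key}={value}")
    # Import mxnet and start training only after this point, so that the
    # variables are already in place when the library is loaded.
```

Values that are already set in the shell are left untouched, so an explicit `export` on the command line still takes precedence over the defaults in the table.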