Zha0q1 opened a new issue #19929: URL: https://github.com/apache/incubator-mxnet/issues/19929
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1530/pipeline/435

```
[2021-02-18T21:59:21.985Z] what(): [21:59:18] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:126: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
[2021-02-18T21:59:21.985Z] Stack trace:
[2021-02-18T21:59:21.985Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27ed308) [0x7fb4b07e3308]
[2021-02-18T21:59:21.985Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77e1879) [0x7fb4b57d7879]
[2021-02-18T21:59:21.985Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77e1e36) [0x7fb4b57d7e36]
[2021-02-18T21:59:21.985Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)1>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x1c7) [0x7fb4b57f7097]
[2021-02-18T21:59:21.985Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7fb4b57f7346]
[2021-02-18T21:59:21.985Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77f92b4) [0x7fb4b57ef2b4]
[2021-02-18T21:59:21.985Z]   [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fb555711c80]
[2021-02-18T21:59:21.985Z]   [bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fb55d8886ba]
[2021-02-18T21:59:21.985Z]   [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb55ca6b4dd]
```

This has been happening for a while now.
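For reference, the `(7 vs. 0)` in the failed check is the raw `cublasStatus_t` value returned when destroying the handle, compared against `CUBLAS_STATUS_SUCCESS`. Below is a minimal sketch for decoding it from a log line. The helper name `decode_cublas_check` and the regex are hypothetical and not part of MXNet or mshadow; the numeric values follow the `cublasStatus_t` enum in NVIDIA's `cublas_api.h`.

```python
import re
from typing import Optional

# cublasStatus_t values as declared in NVIDIA's cublas_api.h header.
CUBLAS_STATUS_NAMES = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}

# Assumption: the check prints "(<actual> vs. <expected>)", as in the log above.
_CHECK_RE = re.compile(r"err == CUBLAS_STATUS_SUCCESS \((\d+) vs\. (\d+)\)")


def decode_cublas_check(line: str) -> Optional[str]:
    """Hypothetical helper: name the failing cuBLAS status in a log line.

    Returns None when the line contains no failed cuBLAS status check.
    """
    match = _CHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))  # first number is the status actually returned
    return CUBLAS_STATUS_NAMES.get(code, f"unknown cublasStatus_t value {code}")


# Shortened copy of the failing line from the CD job log.
LOG_LINE = (
    "stream_gpu-inl.h:126: Check failed: err == CUBLAS_STATUS_SUCCESS "
    "(7 vs. 0) : Destory cublas handle failed"
)

if __name__ == "__main__":
    print(decode_cublas_check(LOG_LINE))
```

Status 7 decodes to `CUBLAS_STATUS_INVALID_VALUE`. That only names the code; on its own it neither confirms nor rules out a library version mismatch.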
https://github.com/apache/incubator-mxnet/pull/19506 attempted to fix it, but the error either persisted or came back. I think this is most likely a CUDA/cuDNN/cuBLAS version mismatch. I have created a branch in which the following section of https://github.com/apache/incubator-mxnet/blob/v1.x/ci/docker/Dockerfile.build.ubuntu_gpu_cu102

```
ENV CUDA_VERSION=10.2.89
ENV CUDNN_VERSION=8.0.4.30
COPY install/ubuntu_cudnn.sh /work/
RUN /work/ubuntu_cudnn.sh
```

is removed altogether, and I have kicked off a run on that branch to see whether this solves the issue.