Zha0q1 opened a new issue #19929:
URL: https://github.com/apache/incubator-mxnet/issues/19929


   
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1530/pipeline/435
   
   ```
   [2021-02-18T21:59:21.985Z]   what():  [21:59:18] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:126: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
   [2021-02-18T21:59:21.985Z] Stack trace:
   [2021-02-18T21:59:21.985Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27ed308) [0x7fb4b07e3308]
   [2021-02-18T21:59:21.985Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77e1879) [0x7fb4b57d7879]
   [2021-02-18T21:59:21.985Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77e1e36) [0x7fb4b57d7e36]
   [2021-02-18T21:59:21.985Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)1>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x1c7) [0x7fb4b57f7097]
   [2021-02-18T21:59:21.985Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7fb4b57f7346]
   [2021-02-18T21:59:21.985Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77f92b4) [0x7fb4b57ef2b4]
   [2021-02-18T21:59:21.985Z]   [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fb555711c80]
   [2021-02-18T21:59:21.985Z]   [bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fb55d8886ba]
   [2021-02-18T21:59:21.985Z]   [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb55ca6b4dd]
   ```
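
For anyone reading the log above: the `(7 vs. 0)` in the check failure is the raw `cublasStatus_t` value returned when the handle was destroyed. A minimal sketch for decoding it, assuming the enum values declared in `cublas_api.h`; the helper name `decode_cublas_check` and the regular expression are mine and are not part of MXNet or mshadow:

```python
import re

# cublasStatus_t values as declared in cublas_api.h.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}

# Matches the "Check failed" message format seen in the log above.
# The pattern is an assumption based on that single line.
_CHECK_RE = re.compile(
    r"Check failed: err\s*==\s*CUBLAS_STATUS_SUCCESS \((\d+) vs\. (\d+)\)"
)


def decode_cublas_check(line):
    """Return the cuBLAS status name for a failed check line, or None."""
    match = _CHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return CUBLAS_STATUS.get(code, "unknown cuBLAS status %d" % code)


if __name__ == "__main__":
    log_line = (
        "stream_gpu-inl.h:126: Check failed: err == CUBLAS_STATUS_SUCCESS "
        "(7 vs. 0) : Destory cublas handle failed"
    )
    print(decode_cublas_check(log_line))
```

Under that mapping, status 7 is `CUBLAS_STATUS_INVALID_VALUE`, not an architecture mismatch (which would be 8). That alone does not confirm or rule out any particular root cause.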
   
   This has been happening for a while now.
   https://github.com/apache/incubator-mxnet/pull/19506 attempted to fix it, but
   the error persisted or came back. I think this is most likely a
   CUDA/cuDNN/cuBLAS version mismatch. I have created a branch in which the
   following section of
   https://github.com/apache/incubator-mxnet/blob/v1.x/ci/docker/Dockerfile.build.ubuntu_gpu_cu102
   is removed altogether:
   ```
   ENV CUDA_VERSION=10.2.89
   ENV CUDNN_VERSION=8.0.4.30
   COPY install/ubuntu_cudnn.sh /work/
   RUN /work/ubuntu_cudnn.sh
   ```
   I have kicked off a run on that branch to see whether this solves the issue.




