nttstar opened a new issue #20346:
URL: https://github.com/apache/incubator-mxnet/issues/20346


   ## Description
   Both the pip-installed package and a build from source (1.8.0 tag) fail while training [ArcFace](https://github.com/deepinsight/insightface/tree/master/recognition/ArcFace).
   
   ### Error Message
   
   With ``pip install mxnet-cu112``, the training process hangs after logging:
   
   ```
   [15:05:53] ../src/base.cc:80: cuDNN lib mismatch: linked-against version 8101 != compiled-against version 8100.  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
   [15:06:39] ../src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   ```
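
For reference, the two integers in the first log line can be decoded into dotted version strings. This is a minimal sketch, assuming the cuDNN 8.x encoding `major*1000 + minor*100 + patch`; `decode_cudnn_version` is a hypothetical helper name, not part of MXNet or cuDNN.

```python
def decode_cudnn_version(version: int) -> str:
    """Decode a cuDNN 8.x version integer (assumed major*1000 + minor*100 + patch)."""
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


# Values copied from the "cuDNN lib mismatch" log line.
linked = decode_cudnn_version(8101)    # version reported as "linked-against"
compiled = decode_cudnn_version(8100)  # version reported as "compiled-against"
print(f"linked-against {linked}, compiled-against {compiled}")
```

Under that assumption, the log line reports 8.1.1 as the linked-against version and 8.1.0 as the compiled-against version.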
   
   When building from source:
   
   First, I changed ``cmake/upstream/select_compute_arch.cmake`` to add the Ampere architecture:
   
   ```
   if(CUDA_VERSION VERSION_GREATER_EQUAL "11.0")
     list(APPEND CUDA_KNOWN_GPU_ARCHITECTURES "Ampere")
     list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "8.0")
     list(APPEND CUDA_ALL_GPU_ARCHITECTURES "8.0")
   
     set(_CUDA_MAX_COMMON_ARCHITECTURE "8.0+PTX")
     set(CUDA_LIMIT_GPU_ARCHITECTURE "8.6")
   
     list(REMOVE_ITEM CUDA_COMMON_GPU_ARCHITECTURES "3.5" "5.0")
     list(REMOVE_ITEM CUDA_ALL_GPU_ARCHITECTURES "3.0" "3.2")
   endif()
   
   if(CUDA_VERSION VERSION_GREATER_EQUAL "11.1")
     list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "8.6")
     list(APPEND CUDA_ALL_GPU_ARCHITECTURES "8.6")
   
     set(_CUDA_MAX_COMMON_ARCHITECTURE "8.6+PTX")
     set(CUDA_LIMIT_GPU_ARCHITECTURE "9.0")
   endif()
   ```
   
   and
   
   ```
   elseif(${arch_name} STREQUAL "Ampere")
       set(arch_bin 8.0)
       set(arch_ptx 8.0)
   ```
   
   The training process then exits with the error:
   ```
   cuDNN: Check failed: e == CUDNN_STATUS_SUCCESS (14 vs. 0) : CUDNN_STATUS_VERSION_MISMATCH
   ```
   (My cuDNN version is 8.1.0.)
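
One way to check which cuDNN version the dynamic loader actually picks up inside the container is to call `cudnnGetVersion()` through `ctypes`. This is only a sketch: the library name `libcudnn.so.8` is an assumption about the install, and `loaded_cudnn_version` is a hypothetical helper that returns `None` when the library cannot be loaded.

```python
import ctypes
from typing import Optional


def loaded_cudnn_version(lib_name: str = "libcudnn.so.8") -> Optional[int]:
    """Return cudnnGetVersion() from the named library, or None if it cannot be loaded."""
    try:
        lib = ctypes.CDLL(lib_name)  # resolved through the normal loader search path
    except OSError:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t  # declared as size_t cudnnGetVersion(void)
    return int(lib.cudnnGetVersion())


version = loaded_cudnn_version()
if version is None:
    print("cuDNN library not found on the loader path")
else:
    print(f"cuDNN runtime version: {version}")
```

If the printed integer differs from the version the MXNet build was compiled against, that would be consistent with the `CUDNN_STATUS_VERSION_MISMATCH` error.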
   
   Can anyone give me some advice?
   
   
   
   ## Environment
   
   CUDA 11.2 (in Docker)
   cuDNN 8.1.0
   mxnet-cu112==1.8.0
   mxnet source on 1.8.0 tag
   

