I've created an issue to track this problem:
https://github.com/apache/incubator-mxnet/issues/14652

On Tue, Apr 9, 2019 at 9:07 AM Per da Silva <perdasi...@gmail.com> wrote:

> Dear MXNet community,
>
> I've been trying to update the CI GPU images to CUDA 10, but the tests are
> failing. I'm not sure why and would really appreciate some help =D
>
> I've managed, at least, to narrow down the problem to the cuDNN version.
> The current CUDA 10 images use cuDNN version 7.5.0.56 (
> https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile
> ).
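>
> To double-check what an image actually ships, something like the sketch
> below should work (plain ctypes against libcudnn, nothing MXNet-specific;
> the library name libcudnn.so.7 is my assumption about what the image
> provides, so adjust it if needed):
>
>     import ctypes
>
>     # cudnnGetVersion() encodes the version as major*1000 + minor*100 + patch,
>     # so 7.5.0 reports 7500 and 7.3.1 reports 7301.
>     libcudnn = ctypes.CDLL("libcudnn.so.7")
>     libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
>     print("cuDNN version:", libcudnn.cudnnGetVersion())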
>
> I noticed that the binary in the python packages we release uses cuDNN
> 7.3.1.20 (
> https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34),
> so I decided to create a PR with CI updated to CUDA 10 with cuDNN 7.3.1.20
> and sure enough the tests passed (
> https://github.com/apache/incubator-mxnet/pull/14513).
>
> After talking with another contributor, we decided that I would try to
> create a PR with CUDA 10 and cuDNN 7.5 and just disable the failing tests
> (to be fixed later). But it seems the problem is a bit more heinous: I
> disable one test, and another one fails... So it might make sense to reach
> out now and see if we can find the root cause and fix it.
>
> Some things I've sanity checked:
>
> We run the tests on g3.8xlarge instances. These instances contain Tesla
> M60 GPUs. The Tesla M60s have a compute capability of 5.2. CUDA 10 supports
> compute capabilities of 3.0 - 7.5 (https://en.wikipedia.org/wiki/CUDA).
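>
> For anyone who wants to verify the compute capability directly on the
> instance, a sketch like this should do it (illustrative only; the enum
> values 75/76 are cudaDevAttrComputeCapabilityMajor/Minor from the CUDA
> runtime headers, and libcudart.so is assumed to be on the loader path):
>
>     import ctypes
>
>     # Query the compute capability of device 0 via the CUDA runtime API.
>     cudart = ctypes.CDLL("libcudart.so")
>     major, minor = ctypes.c_int(), ctypes.c_int()
>     cudart.cudaDeviceGetAttribute(ctypes.byref(major), 75, 0)  # cudaDevAttrComputeCapabilityMajor
>     cudart.cudaDeviceGetAttribute(ctypes.byref(minor), 76, 0)  # cudaDevAttrComputeCapabilityMinor
>     print("compute capability: %d.%d" % (major.value, minor.value))  # expect 5.2 on an M60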
>
> According to the cuDNN support matrix (
> https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html),
> cuDNN 7.5 is compatible with the GPU and CUDA 10, and requires driver r410.48
> (which I assume means that version or newer).
>
> The AMIs running on the g3.8xlarge have CUDA 10 and driver 410.73.
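>
> (If anyone wants to confirm the driver version on the AMI rather than trust
> its description, something along these lines should work; the NVML calls are
> the standard driver-provided ones, but I haven't run this exact snippet on
> the CI hosts:)
>
>     import ctypes
>
>     # Ask NVML (the library behind nvidia-smi) for the installed driver version.
>     nvml = ctypes.CDLL("libnvidia-ml.so.1")
>     nvml.nvmlInit_v2()
>     buf = ctypes.create_string_buffer(80)
>     nvml.nvmlSystemGetDriverVersion(buf, 80)
>     print("driver version:", buf.value.decode())  # expect 410.73 on these AMIs
>     nvml.nvmlShutdown()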
>
> So, as best I can tell, our environment ought to support cuDNN 7.5, which
> leads me to conclude that maybe there's something wrong in the code.
>
> The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check failed:
> e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH".
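>
> For reference, that check lives in the cuDNN RNN operator, so a minimal
> script along these lines should hit the same code path (a sketch only; the
> shapes and sizes here are arbitrary, not the ones from the failing tests):
>
>     import mxnet as mx
>
>     ctx = mx.gpu(0)
>     # Gluon's LSTM dispatches to the cuDNN RNN implementation on GPU,
>     # which is where the CUDNN_STATUS_ARCH_MISMATCH check fires.
>     rnn = mx.gluon.rnn.LSTM(hidden_size=128, num_layers=2)
>     rnn.initialize(ctx=ctx)
>     x = mx.nd.random.uniform(shape=(10, 32, 64), ctx=ctx)  # (seq_len, batch, feature)
>     y = rnn(x)
>     y.wait_to_read()  # force execution so any cuDNN error surfaces here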
>
> According to the cuDNN user guide (
> https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
> ):
>
> CUDNN_STATUS_ARCH_MISMATCH
>
> The function requires a feature absent from the current GPU device. Note
> that cuDNN only supports devices with compute capabilities greater than or
> equal to 3.0.
>
> To correct: compile and run the application on a device with appropriate
> compute capability.
>
> But, as we've seen, our environment seems to support this version of cuDNN
> and other versions go through CI w/o any problem...
>
> You can see some logs here:
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/
>
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/
>
> I have about 13 runs of this pipeline. The errors for the different runs can
> be seen by changing the number before /pipeline (e.g.
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
> for the 2nd run, etc.).
>
> Thanks in advance for the help!
>
> You can reach me here or on Slack if you have any questions =D
>
> Cheers,
>
> Per
>
> P.S. I'm attaching some instructions on how to reproduce the issue at home
> (or at least on a g3.8xlarge instance running Ubuntu 16.04).
>
