perdasilva opened a new issue #14502: [Test Failure] GPU Test failures across different CUDA versions
URL: https://github.com/apache/incubator-mxnet/issues/14502

## Description

I am testing the mxnet library compiled for the Python distribution against different versions of CUDA, and I am getting the same strange failure on every CUDA version. The tests are run on a g3.8xlarge instance, inside a Docker container based on nvidia/cuda:XXX-cudnn7-devel-ubuntu16.04 (where XXX is the particular CUDA version).

```
======================================================================
ERROR: test_gluon_gpu.test_lstmp
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 124, in test_lstmp
    check_rnn_layer_forward(gluon.rnn.LSTM(10, 2, projection_size=5), mx.nd.ones((8, 3, 20)))
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 441, in check_rnn_layer_forward
    out = layer(inputs)
  File "/work/mxnet/python/mxnet/gluon/block.py", line 540, in __call__
    out = self.forward(*args)
  File "/work/mxnet/python/mxnet/gluon/block.py", line 917, in forward
    return self.hybrid_forward(ndarray, x, *args, **params)
  File "/work/mxnet/python/mxnet/gluon/rnn/rnn_layer.py", line 239, in hybrid_forward
    out = self._forward_kernel(F, inputs, states, **kwargs)
  File "/work/mxnet/python/mxnet/gluon/rnn/rnn_layer.py", line 270, in _forward_kernel
    lstm_state_clip_nan=self._lstm_state_clip_nan)
  File "<string>", line 145, in RNN
  File "/work/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/work/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [14:22:01] src/operator/./rnn-inl.h:385: hidden layer projection is only supported for GPU with CuDNN later than 7.1.1

Stack trace returned 10 entries:
[bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x42c70a) [0x7fcb9e25170a]
[bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x42cd31) [0x7fcb9e251d31]
[bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x3495ea8) [0x7fcba12baea8]
[bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x349612e) [0x7fcba12bb12e]
[bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x30ea87f) [0x7fcba0f0f87f]
[bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x75f9c5) [0x7fcb9e5849c5]
[bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::InvokeOp(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode, mxnet::OpStatePtr)+0xb35) [0x7fcba0cdcc45]
[bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x38c) [0x7fcba0cdd1cc]
[bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2db2d09) [0x7fcba0bd7d09]
[bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x6f) [0x7fcba0bd82ff]

-------------------- >> begin captured stdout << ---------------------
checking gradient for lstm0_l0_h2h_bias
checking gradient for lstm0_l0_h2h_weight
checking gradient for lstm0_l0_i2h_weight
checking gradient for lstm0_l0_i2h_bias
checking gradient for lstm0_l0_h2r_weight
--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1414687138 to reproduce.
--------------------- >> end captured logging << ---------------------
```

I have not yet tried to reproduce this outside of Docker on a GPU machine using the current pip package for 1.4.0. I find it strange that the PRs aren't breaking, since they seem to be based on the same Docker image I'm using and run on the same instance type.
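For reference, the failing call can presumably be reproduced outside the test harness with a few lines. This is a minimal sketch based on the arguments shown in the traceback; the explicit `ctx` placement is my assumption, since the test suite handles context selection itself:

```python
import mxnet as mx
from mxnet import gluon

# Minimal sketch of the failing call from test_lstmp, using the same
# layer arguments and input shape as tests/python/gpu/test_gluon_gpu.py.
ctx = mx.gpu(0)  # assumption: first GPU, as the GPU test suite uses
layer = gluon.rnn.LSTM(10, 2, projection_size=5)
layer.initialize(ctx=ctx)
out = layer(mx.nd.ones((8, 3, 20), ctx=ctx))
out.wait_to_read()  # force execution so the MXNetError surfaces here
print(out.shape)
```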
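Since the error message points at the cuDNN version check in src/operator/./rnn-inl.h:385, it may also help to confirm which cuDNN runtime the container actually ships. Below is a hedged sketch using ctypes and the public cudnnGetVersion() entry point; note that the operator's check is against the cuDNN headers libmxnet.so was compiled with, so the runtime version is only part of the story:

```python
import ctypes

# cudnnGetVersion() is part of the public cuDNN API and returns an
# integer such as 7102 for cuDNN 7.1.2. The soname assumes the
# cudnn7 base images named in the description above.
libcudnn = ctypes.CDLL("libcudnn.so.7")
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN runtime version:", libcudnn.cudnnGetVersion())
```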