DickJC123 opened a new issue #9812: Unrepeatable test_gluon_model_zoo_gpu.py:test_training CI failures seen. URL: https://github.com/apache/incubator-mxnet/issues/9812

During development of the ci_test_randomness3 PR #9791, failures were seen in test_gluon_model_zoo_gpu.py:test_training. The first failure occurred on the Python2: MKLDNN-GPU CI runner before the @with_seed() decorator had been added, so no RNG seed information was recorded. After the @with_seed() decorator was added, a second failure (produced by seed 1521019752) was seen on the same runner. Once that seed was hard-coded for the test, the test passed on all other nodes. This suggests the problem is not data-related and is perhaps tied to the Python2 MKLDNN GPU implementation.

First test failure output:

```
test_gluon_model_zoo_gpu.test_training ... [04:13:14] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
testing resnet18_v1
[04:13:15] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
[04:13:16] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
testing densenet121
[04:13:23] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
FAIL

FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 150, in test_training
    assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
  File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 60.428631 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(6, 649), a=0.584048, b=-0.051143
 a: array([[-0.40942043, -0.5455089 , -0.26064384, ...,  0.33553356,
         0.5314904 , -0.15903676],
       [-0.6971618 , -0.3223077 , -0.7059576 , ...,  0.7106416 ,...
 b: array([[-0.40580893, -0.63151675, -0.37356558, ...,  0.36654586,
         0.43078798, -0.19291902],
       [-0.51749593, -0.26392186, -0.66467005, ...,  0.794114  ,...
```

Second test failure output:

```
test_gluon_model_zoo_gpu.test_training ... [09:40:45] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
testing resnet18_v1
[09:40:46] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
[09:40:47] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
testing densenet121
[09:40:54] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1521019752 to reproduce.
FAIL

FAIL: test_gluon_model_zoo_gpu.test_training
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/workspace/tests/python/gpu/../unittest/common.py", line 155, in test_new
    orig_test(*args, **kwargs)
  File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 156, in test_training
    assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
  File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 47.864170 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(6, 311), a=-0.094315, b=0.737164
 a: array([[ 0.19677146,  0.15249339, -0.14161389, ..., -0.6827745 ,
         0.12698895, -0.08247809],
       [-0.01026695,  0.31750488, -0.14363009, ..., -0.7834535 ,...
 b: array([[ 0.05424175,  0.04719666, -0.09091276, ..., -0.7888349 ,
         0.11255977, -0.13169175],
       [ 0.0638914 ,  0.34906954, -0.02986413, ..., -0.7855257 ,...
```

Git hash with the test hardcoded to the bad seed: 2ea19a2
Git hash of the PR with the test open to random seeding (the seed will be printed if the test fails): daceaca

Note that a related test, test_gluon_model_zoo_gpu.py:test_inference, is marked as skipped:

```
@unittest.skip("test fails intermittently. temporarily disabled.")
```

Should this test be disabled as well?
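For context on the seed-capture mechanism discussed above, the following is a simplified, hypothetical sketch in the spirit of `tests/python/unittest/common.py:with_seed()`; the real decorator also seeds MXNet's own RNG and has more elaborate logging, so the names and fallback logic here are assumptions, not the actual implementation:

```python
import functools
import os
import random

import numpy as np


def with_seed(seed=None):
    """Seed the Python and NumPy RNGs before a test runs.

    Priority (an assumed simplification of the real decorator):
    a hard-coded seed argument, then the MXNET_TEST_SEED env var,
    then a freshly drawn random seed.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            env_seed = os.getenv('MXNET_TEST_SEED')
            if seed is not None:
                this_seed = seed
            elif env_seed is not None:
                this_seed = int(env_seed)
            else:
                this_seed = random.randint(0, 2**31 - 1)
            random.seed(this_seed)
            np.random.seed(this_seed)
            # mx.random.seed(this_seed)  # the real decorator seeds MXNet too
            try:
                return test_fn(*args, **kwargs)
            except Exception:
                # Print the seed so an intermittent failure can be reproduced.
                print('Setting test np/mx/python random seeds, use '
                      'MXNET_TEST_SEED=%d to reproduce.' % this_seed)
                raise
        return wrapper
    return decorator
```

A seed reported by a failing run can then be pinned when re-running the test, for example via the environment variable the log message names: `MXNET_TEST_SEED=1521019752 nosetests tests/python/gpu/test_gluon_model_zoo_gpu.py:test_training`.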