DickJC123 opened a new issue #9812: Unrepeatable test_gluon_model_zoo_gpu.py:test_training CI failures seen.
URL: https://github.com/apache/incubator-mxnet/issues/9812
 
 
   
   During development of the ci_test_randomness3 PR (#9791), failures were seen in test_gluon_model_zoo_gpu.py:test_training.  The first failure occurred on the Python2: MKLDNN-GPU CI runner before the @with_seed() decorator had been added, so no RNG seed information was recorded.  After the @with_seed() decorator was added, a second failure (produced by seed 1521019752) was seen on the same runner.  Once that seed was hard-coded into the test, the test passed on all nodes.  This suggests the problem is not data-related and is perhaps also tied to the Python2 MKLDNN GPU implementation.
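   
   For context, here is a minimal sketch of what the @with_seed() decorator (from tests/python/unittest/common.py) is understood to do; the names and details below are simplified assumptions rather than the actual implementation, and the real decorator may differ in when it emits its log line: seed the python/numpy/mxnet RNGs from an explicit argument or the MXNET_TEST_SEED environment variable, falling back to a random seed, and log that seed so a failure can be reproduced or hard-coded.
   ```
   # Simplified sketch of the seeding decorator (assumed, not the real common.py code).
   import functools
   import os
   import random

   import mxnet as mx
   import numpy as np


   def with_seed(seed=None):
       def decorator(orig_test):
           @functools.wraps(orig_test)
           def test_new(*args, **kwargs):
               # Priority: explicit seed argument > MXNET_TEST_SEED env var > random seed.
               this_seed = seed
               if this_seed is None:
                   this_seed = int(os.getenv('MXNET_TEST_SEED',
                                             np.random.randint(0, 2 ** 31)))
               random.seed(this_seed)
               np.random.seed(this_seed)
               mx.random.seed(this_seed)
               print('[INFO] Setting test np/mx/python random seeds, '
                     'use MXNET_TEST_SEED=%d to reproduce.' % this_seed)
               return orig_test(*args, **kwargs)
           return test_new
       return decorator
   ```
   Under that assumption, hard-coding the bad seed amounts to decorating the test with @with_seed(1521019752).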
   
   First test failure output:
   ```
   test_gluon_model_zoo_gpu.test_training ... [04:13:14] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
   testing resnet18_v1
   [04:13:15] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
   [04:13:16] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   testing densenet121
   [04:13:23] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   FAIL
   
   FAIL: test_gluon_model_zoo_gpu.test_training
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
       self.test(*self.arg)
     File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 150, in test_training
       assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
     File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
       raise AssertionError(msg)
   AssertionError: 
   Items are not equal:
   Error 60.428631 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(6, 649), a=0.584048, b=-0.051143
    a: array([[-0.40942043, -0.5455089 , -0.26064384, ...,  0.33553356,
            0.5314904 , -0.15903676],
          [-0.6971618 , -0.3223077 , -0.7059576 , ...,  0.7106416 ,...
    b: array([[-0.40580893, -0.63151675, -0.37356558, ...,  0.36654586,
            0.43078798, -0.19291902],
          [-0.51749593, -0.26392186, -0.66467005, ...,  0.794114  ,...
   ```
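   
   The failing assertion is the CPU-versus-GPU output comparison at test_gluon_model_zoo_gpu.py line 150 (line 156 once the decorator was added). Roughly, the check amounts to something like the sketch below; this is an assumed simplification of the real test body (which also exercises a training step), not a copy of it:
   ```
   # Assumed sketch of the CPU-vs-GPU output check that fails in test_training.
   import mxnet as mx
   from mxnet.gluon.model_zoo import vision
   from mxnet.test_utils import assert_almost_equal

   # Same input batch for both contexts.
   data = mx.nd.random.uniform(shape=(2, 3, 224, 224))

   # Build one network and initialize its parameters once on CPU ...
   net = vision.resnet18_v1(pretrained=False)
   net.initialize(mx.init.Xavier())
   cpu_out = net(data)

   # ... then move the identical parameters to the GPU and rerun the same input.
   net.collect_params().reset_ctx(mx.gpu())
   gpu_out = net(data.as_in_context(mx.gpu()))

   # The failing check: the two backends must agree to within roughly 1%.
   assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
   ```
   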
   Second test failure output:
   ```
   test_gluon_model_zoo_gpu.test_training ... [09:40:45] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/val-5k-256.rec, use 1 threads for decoding..
   testing resnet18_v1
   [09:40:46] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 8028160 bytes with malloc directly
   [09:40:47] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   testing densenet121
   [09:40:54] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1521019752 to reproduce.
   FAIL
   
   FAIL: test_gluon_model_zoo_gpu.test_training
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
       self.test(*self.arg)
     File "/workspace/tests/python/gpu/../unittest/common.py", line 155, in test_new
       orig_test(*args, **kwargs)
     File "/workspace/tests/python/gpu/test_gluon_model_zoo_gpu.py", line 156, in test_training
       assert_almost_equal(cpu_out.asnumpy(), gpu_out.asnumpy(), rtol=1e-2, atol=1e-2)
     File "/workspace/python/mxnet/test_utils.py", line 493, in assert_almost_equal
       raise AssertionError(msg)
   AssertionError: 
   Items are not equal:
   Error 47.864170 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(6, 311), a=-0.094315, b=0.737164
    a: array([[ 0.19677146,  0.15249339, -0.14161389, ..., -0.6827745 ,
            0.12698895, -0.08247809],
          [-0.01026695,  0.31750488, -0.14363009, ..., -0.7834535 ,...
    b: array([[ 0.05424175,  0.04719666, -0.09091276, ..., -0.7888349 ,
            0.11255977, -0.13169175],
          [ 0.0638914 ,  0.34906954, -0.02986413, ..., -0.7855257 ,...
   ```
   
   Git hash with the test hard-coded to the bad seed: 2ea19a2
   Git hash of the PR with the test open to random seeding (the seed will be printed if the test fails): daceaca
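   
   Since the seeded failure prints the MXNET_TEST_SEED hint, the quickest way to chase it without hard-coding anything should be to pin the seed through the environment. A hypothetical reproduction helper (the nose invocation and paths are assumptions based on the CI output above):
   ```
   # Hypothetical helper: pin the failing seed via MXNET_TEST_SEED and run only this test.
   import os
   import subprocess

   env = dict(os.environ, MXNET_TEST_SEED='1521019752')
   subprocess.call(
       ['nosetests', '-v',
        'tests/python/gpu/test_gluon_model_zoo_gpu.py:test_training'],
       env=env)
   ```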
   
   Note that a related test, test_gluon_model_zoo_gpu.py:test_inference, is marked as 'skip':
   @unittest.skip("test fails intermittently. temporarily disabled.")
   
   Should this test be disabled as well?
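   
   If the answer is yes, the change would presumably mirror the existing annotation, something like:
   ```
   # Rough sketch of disabling test_training the same way as test_inference.
   import unittest

   @unittest.skip("test fails intermittently. temporarily disabled.")
   def test_training():
       ...  # existing test body, left untouched
   ```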
   
   
   
