DickJC123 opened a new pull request #18762:
URL: https://github.com/apache/incubator-mxnet/pull/18762


   ## Description ##
   I recently ran into a CI failure in 
test_numpy_interoperability.py::test_np_array_function_protocol: 
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-cpu/detail/PR-18694/14/pipeline.
  I could not reproduce the failure using the reported seed.  I 
investigated why, and this PR supplies the fix: reported seeds can now 
be used to reproduce failures.  I then used this facility to identify 
which tests needed looser tolerances for better test robustness, and have 
included those changes as well.  I estimate the failure rate is now 
around 1 in 10,000.
   
   As a reminder, the robustness of a test can be explored with:
   ```
   MXNET_TEST_COUNT=10000 pytest --verbose -s --log-cli-level=DEBUG <my_test>
   <see a failure, note failure seed NNN>
   MXNET_TEST_SEED=NNN pytest --verbose -s <my_test>
   ```
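
To make the workflow above concrete, here is a minimal sketch of how a test harness could honor the two environment variables. The decorator name `seeded_trials` is hypothetical and this is not MXNet's actual implementation; it also assumes that a fixed seed implies a single trial and that only Python's `random` module needs seeding.

```python
import functools
import logging
import os
import random


def seeded_trials(test_fn):
    """Hypothetical decorator: repeat a test with fresh seeds, or replay one seed."""

    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        fixed_seed = os.environ.get("MXNET_TEST_SEED")
        # Assumption: when a seed is given explicitly, one trial is enough.
        count = 1 if fixed_seed is not None else int(
            os.environ.get("MXNET_TEST_COUNT", "1"))
        for _ in range(count):
            if fixed_seed is not None:
                seed = int(fixed_seed)
            else:
                # Draw the seed from the OS so trials are independent.
                seed = random.SystemRandom().randrange(2**31)
            random.seed(seed)
            try:
                test_fn(*args, **kwargs)
            except Exception:
                # Report the seed so the failing trial can be replayed.
                logging.warning(
                    "use MXNET_TEST_SEED=%d to reproduce.", seed)
                raise

    return wrapper
```

The key property is that every random value the test consumes is drawn after `random.seed(seed)`, so replaying the logged seed replays the same inputs.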
   The issue with test_numpy_interoperability.py was that it created its 
test workload at file import time using unseeded random values.  The fix 
regenerates the workload for each test at test runtime, so that it 
depends on the seed of the test.
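
The difference between the two approaches can be sketched as follows. This is an illustration only: `build_workload` and `run_test_with_seed` are hypothetical names, and Python's `random` module stands in for the random number generators the real tests use.

```python
import random

# Problematic pattern: built once at import time, before any test seed is
# applied, so a reported seed cannot recreate these values.
IMPORT_TIME_WORKLOAD = [random.random() for _ in range(4)]


def build_workload(n=4):
    """Hypothetical helper: draw inputs from the currently seeded RNG."""
    return [random.random() for _ in range(n)]


def run_test_with_seed(seed):
    """Seed first, then generate the inputs, so the seed determines them."""
    random.seed(seed)
    return build_workload()


# Replaying the same seed yields the same workload.
first = run_test_with_seed(801992040)
second = run_test_with_seed(801992040)
print(first == second)
```

With the workload built inside the seeded region, a failure seen under one seed can be replayed exactly, which is not possible for `IMPORT_TIME_WORKLOAD`.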
   
   The two tests that required loosened tolerances were linalg.tensorinv and 
linalg.solve.  With the settings as I left them, I saw 1 failure in 10K 
trials.  Rather than loosening the tolerances further, I will leave it to 
the code owners to diagnose the situation and propose a fix if they see 
fit.  Besides loosening the tolerances further, other approaches could 
involve changing the scale or other properties of the input data.  The 
remaining failure can (after the PR is merged) be reproduced with:
   ```
   MXNET_TEST_SEED=801992040 pytest --verbose -s tests/python/unittest/test_numpy_interoperability.py::test_np_array_function_protocol
   ```
   A curious property of the remaining failure is that many of the values 
are consistently about 1.9% smaller in magnitude than the golden copy:
   ```
   Dispatch test: linalg.tensorinv
   
   *** Maximum errors for vector of size 3600:  rtol=0.01, atol=0.005
   
     1: Error 1.934343  Location of error: (1, 1, 0, 10, 4), a=128.42663574, b=130.96971130
     2: Error 1.933410  Location of error: (2, 0, 2, 6, 0), a=80.68855286, b=82.28920746
     3: Error 1.933032  Location of error: (2, 0, 2, 8, 3), a=61.98265076, b=63.21426773
     4: Error 1.931998  Location of error: (1, 2, 2, 4, 4), a=-151.11050415, b=-154.09732056
     5: Error 1.931560  Location of error: (1, 1, 0, 4, 4), a=-97.56709290, b=-99.49862671
     6: Error 1.931458  Location of error: (0, 0, 2, 10, 4), a=343.97329712, b=350.75769043
     7: Error 1.931435  Location of error: (1, 2, 2, 10, 4), a=199.16923523, b=203.10166931
     8: Error 1.931303  Location of error: (1, 2, 2, 9, 0), a=116.00872803, b=118.30317688
     9: Error 1.931238  Location of error: (1, 2, 0, 4, 2), a=1058.37841797, b=1079.23059082
    10: Error 1.931191  Location of error: (1, 1, 1, 10, 4), a=702.60571289, b=716.45141602
   [WARNING] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=801992040 to reproduce.
   ```
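
For readers interpreting the log above, the sketch below recomputes the `Error` column from the printed `a` and `b` values. It assumes the column is the tolerance-violation ratio |a - b| / (atol + rtol * |b|), with values above 1 counting as failures; that formula is inferred from the numbers and is not stated in this PR. The function names are hypothetical.

```python
def violation_ratio(a, b, rtol, atol):
    """Assumed definition of the logged 'Error': how far |a - b| exceeds tolerance."""
    return abs(a - b) / (atol + rtol * abs(b))


def relative_shortfall(a, b):
    """Fraction by which |a| falls short of |b|."""
    return 1.0 - abs(a) / abs(b)


# (a, b, Error as logged), copied from rows 1, 2 and 4 of the output above.
rows = [
    (128.42663574, 130.96971130, 1.934343),
    (80.68855286, 82.28920746, 1.933410),
    (-151.11050415, -154.09732056, 1.931998),
]

for a, b, logged in rows:
    ratio = violation_ratio(a, b, rtol=0.01, atol=0.005)
    print(f"logged={logged:.6f} recomputed={ratio:.6f} "
          f"shortfall={100 * relative_shortfall(a, b):.2f}%")
```

Because rtol is 0.01 and atol is small relative to these values, a ratio near 1.93 corresponds to a relative difference near 1.9%, which is why the two figures look alike.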
   
   [This PR may have additional fixes to other tests if I can't get a clean CI]
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [X] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to 
the relevant [JIRA issue](https://issues.apache.org/jira/projects/MXNET/issues) 
created (except PRs with tiny changes)
   - [X] Changes are complete (i.e. I finished coding on this PR)
   - [X] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding 
a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing 
distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a 
new build option with NCCL)
   - [X] Code is well-documented: 
   - For user-facing API changes, API doc string has been updated. 
   - For new C++ functions in header files, their functionalities and arguments 
are documented. 
   - For new examples, README.md is added to explain what the example does, 
the source of the dataset, expected performance on test set and reference to 
the original paper if applicable
   - Check the API doc at 
https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
   - [X] To the best of my knowledge, examples are either not affected by this 
change, or have been fixed to be compatible with this change
   
   

