On Mon, Nov 2, 2020 at 9:14 AM Peter Maydell <peter.mayd...@linaro.org> wrote:
>
> On Mon, 2 Nov 2020 at 16:50, Havard Skinnemoen <hskinnem...@google.com> wrote:
> > But none of this is really specific to the RNG test, so I can remove
> > it if you prefer for consistency.
>
> I would prefer us to be consistent. If you want to propose
> don't-stop-on-asserts then we should set that consistently
> at some higher level of the test suite, not in every
> individual test.
OK, I will remove g_test_set_nonfatal_assertions from all the tests
I've submitted.

As for the randomness failures, I did find a bug in the runs test, but
fixing it doesn't make much of an impact on the flakiness. I've seen
failures from both monobit and runs, and from both first_bit and
continuous, so it doesn't look like there's a systematic problem
anywhere.

I dumped the random data on failure and put it through the
https://github.com/Honno/coinflip/ implementation of monobit and runs,
and after fixing the runs bug, they consistently produce the same P
value. Here's one example that failed the qtest with P=0.000180106965:

$ ./randomness_test.py e1 ec cc 55 29 5d c9 ac 85 45 ed 4b b6 96 56 ab
Monobit test: normalised diff 0.53 p-value 0.596
┌───────┬───────┐
│ value │ count │
├───────┼───────┤
│     1 │    67 │
│     0 │    61 │
└───────┴───────┘
Runs test: no. of runs 85 p-value 0.0

You can find the script at
https://gist.github.com/hskinnemoen/41f7513ca228c2bac959c3b14e87025f

Apparently, successive bits are toggling too often, producing too many
runs of 1s or 0s. This will of course happen from time to time, since
the input is random. Similarly, the monobit test fails if there are
too many or too few 1s compared to 0s, which also can't really be
prevented.

While we can always tune the tests to fail less often, e.g. by
lowering the P-value threshold or by running them multiple times, we
can never guarantee that a randomness test won't fail. So I suspect we
should keep these tests disabled by default, but keep them available
so that we can easily run a randomness test if there's any suspicion
that the emulated RNG device produces bad data.

Does that make sense?

Havard
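P.S. For anyone who wants to check the numbers without the gist: both
statistics follow the NIST SP 800-22 frequency (monobit) and runs
formulas. The sketch below is my own, not the gist's or coinflip's
code; the least-significant-bit-first extraction is an assumption,
chosen because it reproduces the 67/61 split and 85 runs shown above
for this sample.

```python
import math

def bits_lsb_first(data):
    """Expand bytes into bits, least-significant bit first (assumed order)."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]

def monobit_p(bits):
    """NIST SP 800-22 frequency (monobit) test p-value."""
    n = len(bits)
    # Sum of +1/-1 per bit; large |S| means too many 1s or 0s.
    s = abs(sum(1 if b else -1 for b in bits))
    return math.erfc(s / math.sqrt(n) / math.sqrt(2))

def runs_p(bits):
    """NIST SP 800-22 runs test p-value."""
    n = len(bits)
    pi = sum(bits) / n
    # Number of runs = 1 + number of adjacent bit transitions.
    v = 1 + sum(a != b for a, b in zip(bits, bits[1:]))
    num = abs(v - 2 * n * pi * (1 - pi))
    den = 2 * math.sqrt(2 * n) * pi * (1 - pi)
    return math.erfc(num / den)

# The failing sample from the qtest run quoted above.
data = bytes.fromhex("e1eccc55295dc9ac8545ed4bb69656ab")
b = bits_lsb_first(data)
print(f"monobit p = {monobit_p(b):.3f}")  # 0.596, as in the table above
print(f"runs p = {runs_p(b):.12f}")       # matches P=0.000180106965
```

Note that the runs p-value is tiny because the transition count (84
out of 127 adjacent pairs) is well above the ~64 expected for a
balanced 128-bit sequence, which is exactly the "toggling too often"
behaviour described above.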