Nathan Hartman wrote:
On Wed, Jan 8, 2020 at 9:24 AM Daniel Shahaf <d...@daniel.shahaf.name> wrote:
> Julian Foad wrote on Wed, Jan 08, 2020 at 10:14:42 +0000:
> > The suggestion was that we should prefer the regression test suite to be
> > deterministic, [...]
> But unlike nearly all of the test suite, we don't actually know what the
> sequence of inputs _is_,
When developing/debugging, I made the test print each set of inputs,
sometimes for every test case and sometimes just when a case failed.
Some of that debug print code is still there, commented out.
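For illustration only (made-up names, not the actual test code), the
arrangement is roughly of this shape:

  /* Sketch: keep the per-case prints behind a flag instead of deleting
   * them, so they can be switched back on when debugging. */
  #include <stdio.h>

  #define VERBOSE_CASES 0  /* set to 1 to print every generated case */

  static void
  report_case(const char *case_str, int failed)
  {
    if (VERBOSE_CASES || failed)
      fprintf(stderr, "case: %s%s\n", case_str, failed ? "  <== FAILED" : "");
  }

  int main(void)
  {
    report_case("merge '1-5' into '3-7'", 0);  /* silent unless verbose */
    report_case("merge '5-1' into '4-4'", 1);  /* always printed: it failed */
    return 0;
  }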
> and we have no reason to believe it provides good
> coverage of edge cases and rare combinations of inputs. For all we know, the
> 90000 iterations could all be testing the exact same, most common codepath
> over and over again.
I inspected the debug prints whizzing by, and noted that they included
each of the different kinds of input cases I expected (e.g. empty
ranges, reversed ranges, duplicate ranges, overlapping ranges, etc.) and
that the results included examples of the failure modes I expected.
Based on that observation, I altered the parameters such as the sizes of
the generated revision ranges and the repetition counts, until it
provided good coverage without excessive repetition. I aimed for "a few"
(more than one, fewer than hundreds) occurrences of the same failure mode.
That doesn't guarantee coverage of every possible edge case combination,
of course, but it gives me confidence that it's an effective test. I
admit it's an issue that this coverage is not apparent to an observer. However,
anyone needing to debug again can enable the debug prints and see it.
> My earlier thinking (though I didn't spell it out) was to move the
> PRNG code to a separate program, use it to generate a self-documenting
> C array of unique test cases [...] If we discover new failing inputs in the
> future, they can be added to the test suite. Then there is no guessing
> about what inputs are being tested.
Capturing and storing generated cases into a fixed, explicit list could make
some tasks slightly easier, perhaps tasks such as reaching the required
state in a debugger, or extracting a particular case into a separate test.
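To be concrete about what I understand by a fixed, explicit list, it
would be something of roughly this shape (struct fields and values are
made up for illustration, not a proposal):

  #include <stdio.h>

  /* Hypothetical generated table: each captured case written out as
   * source, so the inputs and expected outputs are visible to a reviewer. */
  struct rangelist_case {
    const char *input_a;   /* first rangelist, e.g. "1-5,8-10" */
    const char *input_b;   /* second rangelist */
    const char *expected;  /* expected result of merging them */
  };

  static const struct rangelist_case generated_cases[] = {
    { "1-5",     "3-7", "1-7" },  /* overlapping ranges */
    { "4-4,4-4", "4-4", "4-4" },  /* duplicate ranges */
    /* ... one entry per captured case ... */
  };

  int main(void)
  {
    printf("first case: merge '%s' into '%s', expecting '%s'\n",
           generated_cases[0].input_b, generated_cases[0].input_a,
           generated_cases[0].expected);
    return 0;
  }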
What cases would you extract? One example of each currently known
failure mode (and success mode) is probably useful; that's a start. I
already made a separate test case for one failure mode, and it would be
reasonable to do so for the others.
But what about capturing a sufficient set of input combinations that
should be tested because they *might* in future result in different
behaviours? That's hard. We can try to guess some cases that look like
they would be worth testing, which is what we do when writing
traditional fixed tests. But the idea of "random" input generation is
to overcome the tendency to fail to guess in advance all the
combinations that might reveal a problem. Pseudo-random inputs are good
at that.
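As a sketch of the idea (made-up code, not the test itself): drawing
both endpoints of each range independently from a fixed-seed generator
produces reversed, duplicate, overlapping and degenerate ranges without
anyone having to enumerate them in advance:

  #include <stdio.h>
  #include <stdlib.h>

  typedef struct { long start, end; } range_t;

  /* Fill RANGES with COUNT ranges whose bounds are drawn from [0, MAX_REV).
   * Reversed, duplicate and overlapping ranges are allowed on purpose,
   * since those are exactly the edge cases worth hitting. */
  static void
  make_random_rangelist(range_t *ranges, int count, long max_rev)
  {
    int i;
    for (i = 0; i < count; i++)
      {
        ranges[i].start = rand() % max_rev;
        ranges[i].end = rand() % max_rev;  /* may be <= start: an edge case */
      }
  }

  int main(void)
  {
    range_t ranges[4];
    int i;

    srand(42);  /* fixed seed: the same cases on every run */
    make_random_rangelist(ranges, 4, 50);
    for (i = 0; i < 4; i++)
      printf("%ld-%ld\n", ranges[i].start, ranges[i].end);
    return 0;
  }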
A good, dedicated random-testing infrastructure would make it easy to
generate a set of fixed tests based on observing random testing combined
with, say, code coverage information. If we had this test
infrastructure, sure, but if we don't, I'm not seeing why doing that
extraction and analysis manually would be worth the effort.
The point is, we get a wide coverage of input combinations. The
coverage is neither theoretically nor empirically proven to be complete,
but I demonstrated it is wide enough to be useful.
> Using a PRNG makes the test code harder to read and to maintain, makes it
> more difficult to diagnose FAILs, and increases the risk of a bug in the test
> code. What benefits are we getting that offset these risks and costs?
>
> The way in which the PRNG algorithm is used is outside the PRNG's API promises.
> PRNG's make *no* promises on their outputs when used with a fixed seed. In
> light of this, what reason do we have to believe that the test achieves good
> coverage?
Used for test case generation, the PRNG only needs to generate a (fixed)
sequence of numbers with no significant correlation to the pattern
of usage in generating test cases. We could reasonably call it a
"useful sequence generator" instead of "PRNG", to remove the stigma
attached to falling short of strong randomness guarantees.
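To illustrate what I mean (a sketch, not the generator the test
actually uses): even a bare-bones linear congruential generator meets
that description, because a fixed seed always yields the same,
reasonably well-spread sequence:

  #include <stdio.h>
  #include <stdint.h>

  /* A minimal "useful sequence generator": deterministic for a given
   * seed, varied enough to pick range bounds from. */
  static uint64_t state = 42;  /* the fixed seed */

  static uint32_t
  next_number(void)
  {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(state >> 33);  /* the high bits are better mixed */
  }

  int main(void)
  {
    int i;

    for (i = 0; i < 5; i++)
      printf("%u\n", next_number() % 100);  /* e.g. pick a bound in [0, 100) */
    return 0;
  }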
It sounds like you feel there should be a much higher quality of
randomness. I addressed coverage above.
> It's not necessary to use a PRNG to test a large/diverse set of inputs. You
> can achieve with plain old combinatorics (e.g., define 300 rangelist variables
> and test various combinations of them). I'd actually expect that to have
> better coverage than a PRNG with a fixed seed.
I'd be mildly interested to see the result of that analysis. In
practice, however, what we have now is good enough.
> In other words, replace the PRNG and leave all else as-is.
> Generating all permutations when picking 2 items is easy:
I don't mind if someone wants to write the test code to generate cases a
different way. I'm not seeing the need, myself.
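For concreteness, the combinatorial alternative as I understand it
would look roughly like this (the fixture values are placeholders, not
a proposed list):

  #include <stdio.h>

  /* A fixed table of "interesting" rangelists, exercised in every
   * ordered pair instead of being generated pseudo-randomly. */
  static const char *fixtures[] = {
    "",         /* empty rangelist */
    "1-5",
    "5-1",      /* reversed range */
    "1-5,3-7",  /* overlapping ranges */
    "4-4,4-4",  /* duplicate ranges */
  };
  #define N_FIXTURES (sizeof(fixtures) / sizeof(fixtures[0]))

  int main(void)
  {
    unsigned int i, j;

    for (i = 0; i < N_FIXTURES; i++)
      for (j = 0; j < N_FIXTURES; j++)
        printf("case: merge '%s' into '%s'\n", fixtures[j], fixtures[i]);
    return 0;
  }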
If there is something simple I can do that would satisfy your concerns,
I would be happy to do so. That could be something like setting the
seed to a time-based value instead of a fixed number, or replacing the
PRNG implementation with another simple drop-in replacement, or some
such. Let me know if there is something of that magnitude that would
resolve the concerns. So far, I have not been able to work out whether
there is, or what it would be.
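(If the time-based seed were the preferred option, I would pair it with
printing the seed, so a failing run can still be reproduced; roughly
like this, sketched with standard library calls only:)

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void)
  {
    /* Seed from the clock so different runs try different cases, but
     * print the seed so a failing run can be replayed exactly. */
    unsigned int seed = (unsigned int)time(NULL);

    printf("random seed: %u (reuse this value to reproduce a failure)\n", seed);
    srand(seed);
    printf("first value: %d\n", rand() % 100);
    return 0;
  }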
- Julian