[issue42109] Use hypothesis for testing the standard library, falling back to stubs

Paul Ganssle Tue, 25 May 2021 09:40:51 -0700

Paul Ganssle <p.gans...@gmail.com> added the comment:

> I use hypothesis during development, but don't have a need for in the the 
> standard library.  By the time code lands there, we normally have a specific 
> idea of what edge cases needs to be in the tests.


The suggestion I've made here is that we use @example decorators to take the 
hypothesis tests you would have written already and turn them in to what are 
essentially parameterized tests. For anyone who doesn't want to explicitly run 
the hypothesis test suite, the tests you are apparently already writing would 
simply turn into normal tests for just the edge cases.

One added benefit of keeping the tests around in the form of property tests is 
that you can run these same tests through hypothesis to find regressions in 
bugfixes that are implemented after landing (e.g. "Oh we can add a fast path 
here", which introduces a new edge case). The segfault bug from bpo-34454, for 
example, would have been found if I had been able to carry over the 
hypothesis-based tests I was using during the initial implementation of 
fromisoformat into later stages of the development. (Either because I didn't 
run it long enough to hit that particular edge case or because that edge case 
didn't come up until after I had moved the development locus into the CPython 
repo, I'm not sure).

Another benefit of keeping them around is that they become fuzz targets, 
meaning people like oss-fuzz or anyone who wants to throw some fuzzing 
resources at CPython have an existing body of tests that are expected to pass 
on *any* input, to find especially obscure bugs.

> For the most part, hypothesis has not turned up anything useful for the 
> standard library.  Most of the reports that we've gotten reflected a 
> misunderstanding by the person running hypothesis rather than an actual bug. 
> [...]

I don't really think it's a good argument to say that it hasn't turned up 
useful bugs. Most of the bugs in a bit of code will be found during development 
or during the early stages of adoption, and we have very wide adoption. I've 
found a number of bugs in zoneinfo using hypothesis tests, and I'd love to 
continue using them in CPython rather than throwing them away or maintaining 
them in a separate repo.

I also think it is very useful for us to write tests about the properties of 
our systems for re-use in PyPy (which does use hypothesis, by the way) and 
other implementations of Python. This kind of "define the contract and maintain 
tests to enforce that" is very helpful for alternate implementations.

> For numeric code, hypothesis is difficult to use and requires many 
> restrictions on the bounds of variables and error tolerances.  [...]

I do not think that we need to make hypothesis tests mandatory. They can be 
used when someone finds them useful.

> The main area where hypothesis seems easy to use and gives some comfort is in 
> simple roundtrips:  assert zlib.decompress(zlib.compress(s)) == s.  However, 
> that is only a small fraction of our test cases.

Even if this were the only time that hypothesis were useful (I don't think it 
is), some of these round-trips can be among the trickiest and important code to 
test, even if it's a small fraction of the tests. We have a bunch of functions 
that are basically "Load this file format" and "Dump this file format", usually 
implemented in C, which are a magnet for CVEs and often the target for fuzz 
testing for that reason. Having a small library of maintained tests for round 
tripping file formats seems like it would be very useful for people who want to 
donate compute time to fuzz test CPython (or other implementations!)

> Speed is another issue.  During development, it doesn't matter much if 
> Hypothesis takes a lot of time exercising one function.  But in the standard 
> library tests already run slow enough to impact development.  If hypothesis 
> were to run everytime we run a test suite, it would make the situation worse.

As mentioned in the initial ticket, the current plan I'm suggesting is to have 
fallback stubs which turn your property tests into parameterized tests when 
hypothesis is not installed. If we're good about adding `@example` decorators 
(and certainly doing so is easier than writing new ad hoc tests for every edge 
case we can think of when we already have property tests written!), then I 
don't see any particular reason to run the full test suite against a full 
hypothesis run on every CI run.

My suggestion is:

1. By default, run hypothesis in "stubs" mode, where the property tests are 
simply parameterized tests.
2. Have one or two CI jobs that runs *only* the hypothesis tests, generating 
new examples — since this is just for edge case detection, it doesn't 
necessarily need to run on every combination of architecture and platform and 
configuration in our testing matrix, just the ones where it could plausibly 
make a difference.
3. Ask users who are adding new hypothesis tests or who find new bugs to add 
@example decorators when they fix a failing test case.

With the stubs, these tests won't be any slower than other tests, and I think a 
single "hypothesis-only" CI job done in parallel with the other jobs won't 
break the compute bank. Notably, we can almost certainly disable coverage 
detection on the edge-case-only job as well, which I believe adds quite a bit 
of overhead.

> There is also a learning curve issue.  We're adding more and more things that 
> newcomer's have to learn before they can contribute (how to run blurb, how to 
> use the argument clinic, etc).  Using Hypothesis is a learned skill and takes 
> time.

Again I don't think hypothesis tests are *mandatory* or that we plan to replace 
our existing testing framework with hypothesis tests. The median newcomer won't 
need to know anything about hypothesis testing any more than they need to know 
the details of the grammar or of the ceval loop to make a meaningful 
contribution. I don't even expect them to make up a large fraction of our 
tests. I'm a pretty big fan of hypothesis and used it extensively for zoneinfo, 
and yet only 13% of the tests (by line count) are property tests:

$ wc -l tests/test_zoneinfo*
  325 tests/test_zoneinfo_property.py
 2138 tests/test_zoneinfo.py
 2463 total

I will not deny that there *are* costs to bringing this into the standard 
library, but I that they can largely be mitigated. The main concern is really 
ensuring that we can bring it on board without making our CI system more 
fragile and without shifting much (if any) concern about fixing bitrot 
associated with the hypothesis tests onto the buildbot maintainers and release 
managers.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue42109>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue42109] Use hypothesis for testing the standard library, falling back to stubs

Reply via email to