Hi everyone!

Dan Smith started a discussion about shaking out more flaky DUnit tests.
That's a great effort and I am happy it's happening.

As a corollary to that conversation, I wonder what the criteria should be
for a test to no longer be considered flaky and to have the category
removed.

In general, the bar should be fairly high. Even if a test fails only ~1 in
500 runs, that's still a problem given how many tests we have.
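To put a rough number on that, here is a back-of-the-envelope sketch. It
assumes every flaky test fails independently at the same ~1-in-500 rate,
which is a simplification, and the test counts are made-up values for
illustration, not numbers from our suite.

```python
def p_any_failure(p_fail: float, n_tests: int) -> float:
    """Chance that at least one of n_tests independent tests fails in one run."""
    return 1.0 - (1.0 - p_fail) ** n_tests


P_FAIL = 1 / 500  # the ~1-in-500 rate mentioned above

# Hypothetical counts of equally flaky tests, for illustration only.
for n in (10, 50, 200):
    share = p_any_failure(P_FAIL, n)
    print(f"{n:4d} flaky tests -> {share:.1%} of full runs see a failure")
```

Under those assumptions, a couple hundred tests at that rate would already
turn roughly a third of all full runs red.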

I see two ends of the spectrum:
1. We have a good understanding of why the test was flaky and think we
fixed it.
2. We have a hard time reproducing the flaky behavior and have no good
theory as to why the test might have shown it.

In the first case, I'd suggest running the test ~100 times to get a little
more confidence that we fixed the flaky behavior, and then removing the
category.

The second case is a lot more problematic. How many times do we want to run
a test like that before we decide that it might have been fixed since we
last saw it fail? Is there anything else we could or should do to verify
that the test deserves our trust again?
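As one possible way to frame the "how many runs" question, here is a small
sketch. It assumes each run is independent and that the failure rate stays
constant, which real flaky tests often violate (load, timing, and machine
differences all matter), so treat the result as a lower bound on effort
rather than a guarantee. The function names are my own.

```python
import math


def p_all_pass(p_fail: float, n_runs: int) -> float:
    """Chance that a test failing at rate p_fail passes n_runs times in a row."""
    return (1.0 - p_fail) ** n_runs


def runs_needed(p_fail: float, confidence: float) -> int:
    """Smallest number of runs in which a test failing at rate p_fail
    would fail at least once with probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


P_FAIL = 1 / 500  # the ~1-in-500 rate used as the example above

print(f"100 clean runs despite still being flaky: {p_all_pass(P_FAIL, 100):.0%}")
print(f"runs needed for 95% confidence: {runs_needed(P_FAIL, 0.95)}")
```

In other words, a test that still fails ~1 in 500 times would sail through
100 runs most of the time, and it would take on the order of 1,500 clean
runs to rule out that rate with 95% confidence.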
