[dev-servo] Intermittent failures: state of the world

Josh Matthews Thu, 22 Dec 2016 16:55:32 -0800

Executive summary:

If your PR encounters a new test failure that you believe is not caused by the changes in the PR, please follow these steps:


* perform a try build
* if the failure does not occur in the try build
  - file a new issue with I-intermittent and the test failure output
  - retry the original merge using `@bors-servo: try- retry`
* if the failure does occur in the try build
  - if you _still_ think it's unrelated, do some more try builds and make a convincing argument
  - otherwise, fix your PR to not cause the test failure

I explain below why these steps are necessary.
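
For anyone who prefers to read the steps above as code, here is a minimal sketch of the decision. The function and parameter names are hypothetical and exist only for illustration; nothing like this is part of our tooling.

```python
def next_steps(reproduces_in_try_build, still_believed_unrelated=False):
    """Return the recommended actions after a new, suspicious test failure.

    Both arguments are judgments made by the PR author after running a try
    build; this function only encodes the checklist from the summary.
    """
    if not reproduces_in_try_build:
        # The failure looks intermittent: record it, then retry the merge.
        return [
            "file a new issue with I-intermittent and the test failure output",
            "retry the original merge using `@bors-servo: try- retry`",
        ]
    if still_believed_unrelated:
        # The failure reproduced, so the burden of proof is on the author.
        return ["do some more try builds and make a convincing argument"]
    # The failure reproduced and the PR is the likely cause.
    return ["fix your PR to not cause the test failure"]


if __name__ == "__main__":
    for step in next_steps(reproduces_in_try_build=False):
        print("*", step)
```

Note that every path starts from a try build; the sketch has no branch that retries the merge first.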

---

Hi everyone! You may have noticed that there is much less manual retrying of PRs occurring these days. This is not because we fixed the problems causing them - instead we added a step to the CI that checks whether the tests that failed can be found in the list of known intermittent failures, and allows the merge to proceed if there are no surprises present.
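
Roughly speaking, the check described above amounts to a set difference. The following is a sketch of the idea only, not the actual CI code; the function names and the test paths in the example are made up for illustration.

```python
def unexpected_failures(failed_tests, known_intermittents):
    """Return the failed tests that are NOT in the known-intermittent list."""
    known = set(known_intermittents)
    return sorted(test for test in failed_tests if test not in known)


def merge_may_proceed(failed_tests, known_intermittents):
    """The merge goes ahead only if every failure is a known intermittent."""
    return not unexpected_failures(failed_tests, known_intermittents)


if __name__ == "__main__":
    # Hypothetical test paths, for illustration only.
    known = ["/css/flaky-a.html", "/dom/flaky-b.html"]
    failed = ["/css/flaky-a.html", "/html/new-failure.html"]
    print(unexpected_failures(failed, known))
    print(merge_may_proceed(failed, known))
```

A failure that is not in the list is a "surprise" and blocks the merge, which is the case the rest of this message is about.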

While this has generally made the process of merging PRs much less frustrating for project members and less confusing for new contributors, there is one tricky case that could use optimization. In the old world, if an attempted merge exposed a new test failure that was unlikely to be caused by the changes in the PR, we would file an issue capturing the failure, mark it I-intermittent, and retry the PR. If the failure turned out to be consistent, suggesting that the PR's changes were at fault, the PR would remain unmerged. I have always strongly encouraged eagerly filing issues for new test failures, rather than the "retry and see if it's intermittent" method, since the latter would let the PR merge if the failure did not reappear, and the intermittent failure would likely go unfiled.


In the new world, if a new failure appears we face a decision. Do we:

* retry the PR to see if the failure was intermittent, and file an issue based on the result
* file an issue, mark it I-intermittent, and retry the PR
* perform a try build to see if the failure reproduces consistently

The danger of using our old system (file an issue and mark it intermittent) is that we no longer have insight into whether the failure is actually a new perma-failure. Our intermittent tracking tools are not yet smart enough to look at failure rates over time, so this is an easy avenue to introduce real regressions by ignoring the warning signs. My concern with the retry-first choice is unchanged - I believe it will end up being common to forget to file new intermittent failures, which will lead to developers repeatedly being confused. This leaves us with performing a try build, which unfortunately makes the merging process significantly longer - first we have at least one try build, then we still need to retry the original merge if we decide the failure is indeed unrelated to the changes.

I'm totally open to discussing ways to make this situation better - the process I'm proposing optimizes for:

* avoiding introducing real test failures into master
* making one person deal with new intermittent failures, rather than a potentially unbounded number of people

Simon suggested that one way to improve would be to reduce the time it takes to run a try build, by limiting it to a particular platform or set of tests. If you've got other ideas, or wish to argue that we should be optimizing for a different set of constraints, I'd love to hear about it!


Cheers,
Josh
