Executive summary:
If your PR encounters a new test failure that you believe is not caused by the changes in the PR, please follow these steps:

* perform a try build
* if the failure does not occur in the try build
  - file a new issue with I-intermittent and the test failure output
  - retry the original merge using `@bors-servo: try- retry`
* if the failure does occur in the try build
- if you _still_ think it's unrelated, do some more try builds and make a convincing argument
  - otherwise, fix your PR to not cause the test failure

I explain below why these steps are necessary.

---

Hi everyone! You may have noticed that there is much less manual retrying of PRs occurring these days. This is not because we fixed the problems causing them - instead we added a step to the CI that checks whether the tests that failed can be found in the list of known intermittent failures, and allow the merge to proceed if there are no surprises present.

While this has generally made the process of merging PRs much less frustrating for project members and less confusing for new contributors, there is one tricky case that could use optimization. In the old world, if an attempted merge exposed a new test failure that was unlikely to be caused by the changes in the PR we would file an issue capturing the failure, mark it I-intermittent, and retry the PR. If the failure turned out to be consistent, suggesting that the PR's changes were at fault, the PR would remain unmerged. I have always strongly encouraged eagerly filing issues for new test failures, rather than the "retry and see if it's intermittent method", since that would cause the PR to merge if the failure did not reappear and the intermittent failure would likely go unfiled.

In the new world, if a new failure appears we face a decision. Do we:
* retry the PR to see if the failure was intermittent, and file an issue based on the result
* file an issue, mark it I-intermittent, and retry the PR
* perform a try build to see if the failure reproduces consistently

The danger of using our old system (file an issue and mark it intermittent) is that we no longer have insight into whether the failure is actually a new perma-failure. Our intermittent tracking tools are not yet smart enough to look at failure rates over time, so this is an easy avenue to introduce real regressions by ignoring the warning signs. My concern with the retry-first choice is unchanged - I believe it will end up being common to forget to file new intermittent failures, which will lead to developers repeatedly being confused. This leaves us with performing a try build, which unfortunately makes the merging process significantly longer - first we have at least one try build, then we still need to retry the original merge if we decide the failure is indeed unrelated to the changes.

I'm totally open to discussing ways to make this situation better - the process I'm proposing optimizes for:
* avoiding introducing real test failures into master
* making one person deal with new intermittent failures, rather than a potentially unbounded number of people

Simon suggested that one way to improve would be to reduce the time it takes to run a try build, by limiting it to a particular platform or set of tests. If you've got other ideas, or wish to argue that we should be optimizing for a different set of constraints, I'd love to hear about it!

Cheers,
Josh
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo

Reply via email to