Executive summary:
If your PR encounters a new test failure that you believe is not caused
by the changes in the PR, please follow these steps:
* perform a try build
* if the failure does not occur in the try build
- file a new issue with I-intermittent and the test failure output
- retry the original merge using `@bors-servo: try- retry`
* if the failure does occur in the try build
- if you _still_ think it's unrelated, do some more try builds and
make a convincing argument
- otherwise, fix your PR to not cause the test failure
I explain below why these steps are necessary.
---
Hi everyone! You may have noticed that there is much less manual
retrying of PRs occurring these days. This is not because we fixed the
problems causing them - instead we added a step to the CI that checks
whether the tests that failed can be found in the list of known
intermittent failures, and allow the merge to proceed if there are no
surprises present.
While this has generally made the process of merging PRs much less
frustrating for project members and less confusing for new contributors,
there is one tricky case that could use optimization. In the old world,
if an attempted merge exposed a new test failure that was unlikely to be
caused by the changes in the PR we would file an issue capturing the
failure, mark it I-intermittent, and retry the PR. If the failure turned
out to be consistent, suggesting that the PR's changes were at fault,
the PR would remain unmerged. I have always strongly encouraged eagerly
filing issues for new test failures, rather than the "retry and see if
it's intermittent method", since that would cause the PR to merge if the
failure did not reappear and the intermittent failure would likely go
unfiled.
In the new world, if a new failure appears we face a decision. Do we:
* retry the PR to see if the failure was intermittent, and file an issue
based on the result
* file an issue, mark it I-intermittent, and retry the PR
* perform a try build to see if the failure reproduces consistently
The danger of using our old system (file an issue and mark it
intermittent) is that we no longer have insight into whether the failure
is actually a new perma-failure. Our intermittent tracking tools are not
yet smart enough to look at failure rates over time, so this is an easy
avenue to introduce real regressions by ignoring the warning signs. My
concern with the retry-first choice is unchanged - I believe it will end
up being common to forget to file new intermittent failures, which will
lead to developers repeatedly being confused. This leaves us with
performing a try build, which unfortunately makes the merging process
significantly longer - first we have at least one try build, then we
still need to retry the original merge if we decide the failure is
indeed unrelated to the changes.
I'm totally open to discussing ways to make this situation better - the
process I'm proposing optimizes for:
* avoiding introducing real test failures into master
* making one person deal with new intermittent failures, rather than a
potentially unbounded number of people
Simon suggested that one way to improve would be to reduce the time it
takes to run a try build, by limiting it to a particular platform or set
of tests. If you've got other ideas, or wish to argue that we should be
optimizing for a different set of constraints, I'd love to hear about it!
Cheers,
Josh
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo