On Wed, Sep 6, 2017 at 2:10 PM, <jma...@mozilla.com> wrote:

> Over the last 9 months a few of us have tracked intermittent test
> failures almost daily, pestering people about them and fixing many
> ourselves. While over 420 bugs have been fixed since the beginning of the
> year, half that many (211+) have been disabled in some form (including
> turning off the jobs entirely).
>
> We don't like disabling tests and have been fairly conservative about
> recommending it.  Overall we have tried to adhere to this policy:
> * >=30 failures/week: ask the owner to look at the failure and fix it; if
> this persists for a few weeks with no real traction, we recommend
> disabling the test.
> * >=75 failures/week: ask for a fix on a shorter time frame and recommend
> disabling the test within a week or so.
> * >=150 failures/week: often just disable the test.
>
> This is confusing and hard to manage.  Since adopting it we have adjusted
> our triage queries, and some teams are now doing their own triage, so we
> skip those bugs (they are already being prioritized properly).
>
> Starting this month, we plan to adopt a simpler policy:
> * any bug with >=200 failure instances in the last 30 days will have its
> test disabled
> ** this will be a manual process, so it will happen a couple of times/week
>
> We expect the outcome to be a similar amount of disabling, just an easier
> method for getting there.  It is very possible we will recommend disabling
> a test before it hits the threshold; keep in mind a disabled test is easy
> to re-enable (so feel free to disable for a single platform until you have
> time to look at fixing it).
>
> To be clear, we (and some component owners) will continue triaging bugs
> and trying to get fixes in place as often as possible; we prefer a fix,
> not a disabled test!
>
> Please raise any concerns; otherwise we will move forward with this in the
> coming weeks.
>

A few replies on this thread revolve around how to determine if a test is
disabled.

Our canonical source for which tests run where is the in-tree test
manifests, e.g. a browser.ini file. These are referenced by moz.build
files, and the build system produces a master list of all tests. The
scheduling logic (Taskgraph) turns these into tasks in CI. There is also
the out-of-tree SETA database that can influence test scheduling.
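
For illustration, here is a minimal sketch of how those pieces fit
together. The file contents and test name are hypothetical, but the shape
matches real browser.ini manifests and the real BROWSER_CHROME_MANIFESTS
moz.build variable:

    # browser.ini (hypothetical)
    [DEFAULT]
    support-files = head.js

    [browser_example.js]
    skip-if = os == "win"  # Bug 1393121 - intermittent timeouts

    # moz.build (hypothetical)
    BROWSER_CHROME_MANIFESTS += ['browser.ini']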

Our mechanism for disabling tests tends to be adding something like
`skip-if = os == "win" # Bug 1393121` to a test manifest. This is useful
for filtering at the test scheduling/execution layer. And humans can read
the annotation ("# Bug XXX"), other comments, and commit history to get a
feel for why a test was disabled. But the metadata isn't rich enough to
build useful queries or dashboards: we can find all tests being skipped,
but the manifests themselves lack structured metadata around things like
why tests were disabled. This limits our ability to answer various
questions.
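
To make "we can find all tests being skipped" concrete, here is a rough
sketch of the kind of scan that is possible today. It is plain Python
rather than the real manifestparser module the build system uses, and the
root path is an assumption:

    #!/usr/bin/env python
    # Walk a tree and report every skip-if / fail-if annotation found in
    # .ini test manifests. Illustrative only; it does not evaluate the
    # conditions, it just collects the raw text.
    import os
    import re

    ANNOTATION = re.compile(r'^\s*(skip-if|fail-if)\s*=\s*(.*)$')

    def find_annotations(root):
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if not name.endswith('.ini'):
                    continue
                path = os.path.join(dirpath, name)
                test = None
                with open(path) as fh:
                    for line in fh:
                        line = line.rstrip()
                        if line.startswith('['):
                            test = line.strip('[]')
                            continue
                        match = ANNOTATION.match(line)
                        if match:
                            yield path, test, match.group(1), match.group(2)

    if __name__ == '__main__':
        for path, test, kind, condition in find_annotations('testing'):
            print('%s [%s]: %s = %s' % (path, test, kind, condition))

Notice that everything after the condition itself ("# Bug 1393121") comes
back as free-form text: there is nothing structured to group or count by.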

I know we've discussed this topic in the past but I can't recall the
outcome. Is it worthwhile to define and use a richer test manifest "schema"
that would facilitate querying and building dashboards, so we have better
visibility into disabled tests?
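
As a strawman, purely hypothetical sketch of what "richer" could mean, the
free-form comment might become structured keys (none of these keys exist
today):

    [browser_example.js]
    skip-if = os == "win"
    disabled-bug = 1393121
    disabled-reason = intermittent
    disabled-date = 2017-09-06

With something like that in place, the scan above could aggregate by
reason, bug, or age instead of scraping comments.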
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
