Re: Policy for disabling tests which run on TBPL

2014-04-15 Thread jmaher
I want to express my thanks to everyone who contributed to this thread.  We 
have a lot of passionate and smart people who care about this topic- thanks 
again for weighing in so far.

Below is a slightly updated policy from the original, and following that is an 
attempt to summarize the thread and turn what makes sense into actionable items.

= Policy for handling intermittent oranges = 

This policy will define an escalation path for when a single test case is 
identified to be leaking or failing and is causing enough disruption on the 
trees. Disruption is defined as:
1) Test case is on the list of top 20 intermittent failures on Orange Factor 
(http://brasstacks.mozilla.com/orangefactor/index.html)
2) It is causing oranges >=8% of the time
3) We have 100 instances of this failure in the bug in the last 30 days

Escalation is a responsibility of all developers, although the majority will 
fall on the sheriffs.

Escalation path:
1) Ensure we have a bug on file, with the test author, reviewer, module owner, 
and any other interested parties, links to logs, etc.
2) We will needinfo? the responsible party and expect a response within 2 
business days; this expectation should be made clear in a comment.
3) If we don't get a response, needinfo? the module owner, with the 
expectation of a response and someone taking action within 2 days.
4) If another 2 days pass with no response from the module owner, we will 
disable the test.

Ideally we will work with the test author to either get the test fixed or 
disabled depending on available time or difficulty in fixing the test.  If a 
bug has activity and work is being done to address the issue, it is reasonable 
to expect the test will not be disabled.  Inactivity in the bug is the main 
cause for escalation.

This is intended to respect the time of the original test authors by not 
throwing emergencies in their lap, but also strike a balance with keeping the 
trees manageable.

Exceptions:
1) If this test has landed (or been modified) in the last 48 hours, we will 
most likely back out the patch with the test
2) If a test is failing at least 30% of the time, we will file a bug and 
disable the test first
3) When we are bringing a new platform online (Android 2.3, b2g, etc.), many 
tests will need to be disabled prior to getting the suite running on TBPL.
4) In the rare case we are disabling the majority of the tests (either at once 
or slowly over time) for a given feature, we need to get the module owner to 
sign off on the current state of the tests.


= Documentation =
We have thousands of tests disabled, many are disabled for different build 
configurations or platforms. This can be dangerous as we slowly reduce our 
coverage. By running a daily report (bug 996183) that compares the total tests 
available against those run on each configuration (b2g, debug, osx, e10s, 
etc.), we can bring visibility to the state of each platform and to whether we 
are disabling more tests than we fix.

We need to have a clear guide on how to run the tests, how to write a test, 
how to debug a test, and how to use metadata to indicate whether we have 
looked at a given test and when.

When an intermittent bug is filed, we need to clearly outline what information 
will aid the most in reproducing and fixing this bug.  Without a documented 
process for fixing oranges, this falls on the shoulders of the original test 
authors and a few determined hackers.


= General Policy =
I have adjusted the above policy to mention backing out new tests which are not 
stable, working to identify a regression in the code or tests, and adding 
protection so we do not disable coverage for a specific feature completely. In 
addition, I added a clearer definition of what is a disruptive test and 
clarified the expectations around communicating in the bug vs escalating.

What is more important is the culture we have around committing patches to 
Mozilla repositories. We need to decide as an organization if we care about 
zero oranges (or whatever percentage we deem acceptable). We also need to 
decide what acceptable coverage levels are and what our general policy is for 
test reviews (at checkin time and in the future). These need to be answered 
outside of this policy- but the sooner we answer these questions, the better 
we can all move forward towards the same goal.


= Tools =
Much of the discussion was around tools. As a member of the Automation and 
Tools team, I should perhaps be advocating for more tools, but in this case I 
am leaning towards fewer tools and better process.

One common problem is dealing with the noise around infrastructure and changing 
environments and test harnesses. Is this documented? How can we filter it out? 
Having our tools support ways to detect this and annotate changes unrelated to 
tests or builds will go a long way.  Related is updating our harnesses and the 
way we run tests so they are more repeatable.  I have filed bug 996504 to 
track work on this.

Another problem we can look at with tooling is annotating the expected outcome 

Re: Policy for disabling tests which run on TBPL

2014-04-15 Thread Kyle Huey
On Tue, Apr 15, 2014 at 6:21 AM, jmaher joel.ma...@gmail.com wrote:
 This policy will define an escalation path for when a single test case is 
 identified to be leaking or failing and is causing enough disruption on the 
 trees. Disruption is defined as:
 1) Test case is on the list of top 20 intermittent failures on Orange Factor 
 (http://brasstacks.mozilla.com/orangefactor/index.html)
 2) It is causing oranges >=8% of the time
 3) We have 100 instances of this failure in the bug in the last 30 days

Are these conditions joined by 'and' or by 'or'?  If 'or', there will
always be at least 20 tests meeting this set of criteria ...

- Kyle
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Policy for disabling tests which run on TBPL

2014-04-15 Thread jmaher
On Tuesday, April 15, 2014 9:42:25 AM UTC-4, Kyle Huey wrote:
 On Tue, Apr 15, 2014 at 6:21 AM, jmaher joel.ma...@gmail.com wrote:
 
  This policy will define an escalation path for when a single test case is 
  identified to be leaking or failing and is causing enough disruption on the 
  trees. Disruption is defined as:
 
  1) Test case is on the list of top 20 intermittent failures on Orange 
  Factor (http://brasstacks.mozilla.com/orangefactor/index.html)
 
  2) It is causing oranges >=8% of the time
 
  3) We have 100 instances of this failure in the bug in the last 30 days
 
 
 
 Are these conditions joined by 'and' or by 'or'?  If 'or', there will
 
 always be at least 20 tests meeting this set of criteria ...
 
 
 
 - Kyle

Great question Kyle-

The top 20 doesn't always include specific tests- sometimes the failures are 
related to infrastructure, hardware/VMs, or the test harnesses (mozharness, 
etc.). If a test meets any of the above criteria and is escalated, then we 
should expect to follow some basic criteria about either working on fixing it 
or disabling it, as spelled out in the escalation path.

For the large majority of the cases, bugs filed for specific test cases will 
meet all 3 conditions.  We have had some cases with thousands of stars 
accumulated over years where the bug isn't on the top 20 list all the time. 
Likewise, when we have 10 infra bugs, a frequent orange on the trees won't 
make the top 20.

-Joel


Re: Policy for disabling tests which run on TBPL

2014-04-15 Thread Karl Tomlinson
Thank you for putting this together.  It is important.

jmaher writes:

 This policy will define an escalation path for when a single test case is
 identified to be leaking or failing and is causing enough disruption on the
 trees.

 Exceptions:
 1) If this test has landed (or been modified) in the last 48 hours, we will
 most likely back out the patch with the test
 2) If a test is failing at least 30% of the time, we will file a bug and
 disable the test first

 I have adjusted the above policy to mention backing out new
 tests which are not stable, working to identify a regression in
 the code or tests,

I see the exception for regressions from test changes, but I
didn't notice mention of regressions from code.

If a test has started failing intermittently and is failing at
least 30% of the time, then I expect it is not difficult to
identify the trigger for the regression.  It is much more
difficult to identify an increase in frequency of failure.

Could we make it clear that the preferred solution is to back out the
cause of the regression?  Either something in the opening
paragraph, or perhaps add the exception:

If a regressing push or series of pushes can be identified then
the changesets in those pushes are reverted.

We won't always have a single push because other errors preventing
tests from running are often not backed out until after several
subsequent pushes.


Re: Policy for disabling tests which run on TBPL

2014-04-09 Thread Ehsan Akhgari

On 2014-04-08, 6:10 PM, Karl Tomlinson wrote:

I wonder whether the real problem here is that we have too many
bad tests that report false negatives, and these bad tests are
reducing the value of our testsuite in general.  Tests also need
to be well documented so that people can understand what a
negative report really means.  This is probably what is leading to
assumptions that disabling a test is the solution to a new
failure.


I think this is a great point!  Yes, we absolutely have a lot of bad 
tests which are written in a way that is prone to intermittent failures. 
 This means that changes to something entirely irrelevant 
to the test in question can alter its environment to the point where a 
latent intermittent failure may suddenly start to show up. 
Sometimes, for some of these tests that are not fixed in time, the 
situation is even more confusing: a change in the environment causes the 
test to start to fail, then another change causes it to stop failing, and 
then a while later yet another change may cause it to start failing again.


I think it would be very valuable to detect some of the known conditions 
[1] that may result in a test intermittently failing because of timing 
issues in our test harnesses, to prevent those badly written tests 
from getting checked in in the first place.  I've previously tried to do 
some of this work [2], and it would be fantastic if someone could pick up 
the ball there.


[1] 
https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges

[2] https://bugzilla.mozilla.org/show_bug.cgi?id=649012

Cheers,
Ehsan


Re: Policy for disabling tests which run on TBPL

2014-04-09 Thread Gregory Szorc

On 4/8/14, 6:51 AM, James Graham wrote:

On 08/04/14 14:43, Andrew Halberstadt wrote:

On 07/04/14 11:49 AM, Aryeh Gregor wrote:

On Mon, Apr 7, 2014 at 6:12 PM, Ted Mielczarek t...@mielczarek.org
wrote:

If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.


To some degree, yes, marking a test as expected intermittent causes it
to lose value.  If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so.  That should be their
decision, not something forced on them by infrastructure issues
(everyone else will suffer if you don't find the cause for this
failure in your test).  Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.


I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.


So, what's the minimum level of infrastructure that you think would be
needed to go ahead with this plan? To me it seems like the current
system already isn't working very well, so the bar for moving forward
with a plan that would increase the amount of data we had available to
diagnose problems with intermittents, and reduce the amount of manual
labour needed in marking them, should be quite low.


The simple solution is to have a separate in-tree manifest annotation 
for intermittents. Put another way, we can describe exactly why we are 
not running a test. This is kinda/sorta the realm of bug 922581.
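A sketch of what such an annotation might look like in a manifestparser-style 
.ini manifest. The `disabled` key mirrors existing practice; the 
`intermittent` key and its value syntax here are purely illustrative, not the 
actual proposal in bug 922581:

```ini
[test_foo.html]
# Existing-style annotation: the test does not run at all.
disabled = bug 123456 - permafails on Windows debug

[test_bar.html]
# Hypothetical annotation: the test still runs, but a known intermittent
# failure (tracked in a bug) is tolerated some of the time.
intermittent = bug 654321 - occasional timeout under heavy load
```

The point of the separate key is exactly the distinction described above: the 
harness can keep running an `intermittent` test (and gathering data on it) 
instead of losing it entirely.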


The harder solution is to have some service (like orange factor) keep 
track of the state of every test. We can have a feedback loop whereby 
test automation queries that service to see what tests should run and 
what the expected result is. Of course, we will want that integration to 
work locally so we have consistent test execution between automation and 
developer machines.


I see us inevitably deploying the harder solution. We'll eventually get 
to a point where we're able to do crazy things such as intelligently 
running only the subset of tests impacted by a check-in, or attempting to 
run disabled tests to see if they magically started working again. I 
think we'll eventually realize that tracking this in a central service 
makes more sense than doing it in-tree (mainly because of the amount of 
data required to make some advanced determinations).


For the short term, I think we should enumerate the reasons we don't run 
a test (distinguishing between "test isn't compatible" and "test isn't 
working" is important) and annotate these separately in our test
manifests. We can then modify our test automation to treat things 
differently. For example, we could:


1) Run failed tests multiple times. If it is intermittent but not marked 
as such, we fail the test run.
2) Run marked intermittent tests multiple times. If it works all 25 
times, fail the test run for inconsistent metadata.

3) Integrate intermittent failures into TBPL/Orange Factor better.
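A minimal sketch of ideas 1) and 2) above, assuming a hypothetical `run_test` 
callable and a 25-run budget; neither reflects any real harness API:

```python
# Sketch of the rerun policy described above. `run_test` is a hypothetical
# callable returning True on pass; `runs` is the placeholder 25 from the list.

def classify_test(run_test, is_marked_intermittent, runs=25):
    """Re-run a test and compare observed behavior against its annotation."""
    results = [run_test() for _ in range(runs)]
    always_passed = all(results)
    always_failed = not any(results)

    if is_marked_intermittent:
        # Idea 2): a test marked intermittent that passes every time has
        # inconsistent metadata -- fail the run so the annotation is updated.
        if always_passed:
            return "fail: stale intermittent annotation"
        return "ok: behaves as annotated"

    # Idea 1): an unmarked test that passes only sometimes is an
    # unannotated intermittent -- fail the run.
    if not always_passed and not always_failed:
        return "fail: unmarked intermittent"
    return "ok" if always_passed else "fail: permanent failure"
```

This keeps the annotation honest in both directions: tests can neither become 
intermittent silently nor stay marked intermittent after they are fixed.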

To address David Baron's concern about silently passing intermittently 
failing tests: yes, silently passing is wrong. But I would argue it is 
the lesser evil compared to disabling tests outright.


I think we can all agree that the current approach of disabling failing 
tests (the equivalent of sweeping dust under the rug) isn't sustainable. 
But if it's the sheriffs' job to keep the trees green and their only 
available recourse is to disable tests, well, they are going to disable 
tests. We need more metadata and tooling around disabled tests and we 
needed it months ago.



Re: Policy for disabling tests which run on TBPL

2014-04-09 Thread L. David Baron
On Wednesday 2014-04-09 11:00 -0700, Gregory Szorc wrote:
 The simple solution is to have a separate in-tree manifest
 annotation for intermittents. Put another way, we can describe
 exactly why we are not running a test. This is kinda/sorta the realm
 of bug 922581.
 
 The harder solution is to have some service (like orange factor)
 keep track of the state of every test. We can have a feedback loop
 whereby test automation queries that service to see what tests
 should run and what the expected result is. Of course, we will want
 that integration to work locally so we have consistent test
 execution between automation and developer machines.

I think both of these are bad.

It should be visible near the tests whether they are running or not,
rather than out of band, so that module owners and those working on
the code are aware of the testing coverage.

Annotating something as intermittent is halfway to disabling, and
should thus be reviewed by test authors / module owners just like
disabling should be; it sounds like you're proposing changing that
as well.

The latter solution also breaks being able to describe a test run
with reference to a revision in the VCS repository.

-David

-- 
L. David Baron                       http://dbaron.org/
Mozilla                        https://www.mozilla.org/
 Before I built a wall I'd ask to know
 What I was walling in or walling out,
 And to whom I was like to give offense.
   - Robert Frost, Mending Wall (1914)




Re: Policy for disabling tests which run on TBPL

2014-04-09 Thread Gregory Szorc

On 4/9/14, 11:29 AM, L. David Baron wrote:

On Wednesday 2014-04-09 11:00 -0700, Gregory Szorc wrote:

The simple solution is to have a separate in-tree manifest
annotation for intermittents. Put another way, we can describe
exactly why we are not running a test. This is kinda/sorta the realm
of bug 922581.

The harder solution is to have some service (like orange factor)
keep track of the state of every test. We can have a feedback loop
whereby test automation queries that service to see what tests
should run and what the expected result is. Of course, we will want
that integration to work locally so we have consistent test
execution between automation and developer machines.


I think both of these are bad.

It should be visible near the tests whether they are running or not,
rather than out of band, so that module owners and those working on
the code are aware of the testing coverage.


I would love this all to be in band and part of version control (in the 
test manifests). That's why I propose putting things in test manifests 
today and for the foreseeable future.


I just think we'll ultimately reach a stage where we want to leverage 
big data for more intelligently running tests. We'll cross that bridge 
when we get to it, I reckon.



Annotating something as intermittent is halfway to disabling, and
should thus be reviewed by test authors / module owners just like
disabling should be; it sounds like you're proposing changing that
as well.


Absolutely not! I am very disappointed with the current dynamic between 
sheriffs and module owners and test authors because disabling tests is 
leading to worse test coverage and opening ourselves up to all kinds of 
risks. I'd like to think test authors and module owners should have the 
last word. But we have been doing a pretty crappy job of fixing our 
broken tests. I feel a lot of people just shrug shoulders and allow the 
test to be disabled (I'm guilty of it as much as anyone). From my 
perspective, it's difficult to convince the powers at be that fixing 
intermittent failures (that have been successfully swept under a rug and 
are out of sight and out of mind) is more important than implementing 
some shiny new feature (that shows up on an official goals list). I feel 
we all need to treat failing tests with more urgency. The engineering 
culture is not currently favoring that.



The latter solution also breaks being able to describe a test run
with reference to a revision in the VCS repository.


Not necessarily. It would add complexity to the service, but if captured 
as a requirement, it is certainly doable.




Re: Policy for disabling tests which run on TBPL

2014-04-09 Thread Karl Tomlinson
Gregory Szorc writes:

 2) Run marked intermittent tests multiple times. If it works all
 25 times, fail the test run for inconsistent metadata.

We need to consider intermittently failing tests as failed, and we
need to only test things that always pass.

We can't rely on statistics to tell us about changes in test
results without huge expense.  If we have many intermittent tests
running many times, then sometimes they will fail in all 25 runs
and sometimes they will pass in all 25 runs.
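Under an independence assumption (which real intermittents often violate, 
since failures correlate with machine load and test ordering), the arithmetic 
behind that observation is straightforward; a small sketch using the 25-run 
figure from upthread:

```python
# Probability that a test with a given per-run failure rate is uniformly
# green, or uniformly orange, across N reruns, assuming independent runs.
# (The independence assumption is an idealization; treat these numbers as
# rough intuition, not exact rates.)

def all_pass_prob(failure_rate, runs=25):
    return (1.0 - failure_rate) ** runs

def all_fail_prob(failure_rate, runs=25):
    return failure_rate ** runs

# A test that fails 10% of the time still goes 25-for-25 green about 7%
# of the time, so across hundreds of intermittent tests, batches that look
# consistently green (or consistently orange) are routine, as Karl says.
```

With a 10% failure rate, `all_pass_prob(0.10)` is roughly 0.07, so a single 
25-run batch cannot reliably distinguish "fixed" from "still intermittent" 
without a much larger, and more expensive, number of runs.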



Re: Policy for disabling tests which run on TBPL

2014-04-09 Thread Gregory Szorc

On 4/9/14, 2:07 PM, Karl Tomlinson wrote:

Gregory Szorc writes:


2) Run marked intermittent tests multiple times. If it works all
25 times, fail the test run for inconsistent metadata.


We need to consider intermittently failing tests as failed, and we
need to only test things that always pass.


I strongly reject the second part of that sentence. That world leads to 
a state where we know less and don't test as much. It throws away a huge 
amount of valuable data.



We can't rely on statistics to tell us about changes in test
results without huge expense.  If we have many intermittent tests
running many times, then sometimes they will fail in all 25 runs
and sometimes they will pass in all 25 runs.


I threw out 25 as a placeholder. We could have metadata differentiate 
between tests that are expected to fail 1% or 50% and treat them 
appropriately.


If you are concerned about costs of testing, I submit to you Microsoft's 
costs for not adequately testing the browser choice screen: a €561 
million fine [1]. I don't want to be the person explaining to the 
lawyers that we disabled a test that was verifying our legal compliance 
with a law or contract because it failed intermittently 2% of the time.


[1] http://europa.eu/rapid/press-release_IP-13-196_en.htm


Re: Policy for disabling tests which run on TBPL

2014-04-09 Thread Chris Peterson

On 4/9/14, 11:48 AM, Gregory Szorc wrote:

I feel a lot of people just shrug shoulders and allow the test to be
disabled (I'm guilty of it as much as anyone). From my perspective, it's
difficult to convince the powers that be that fixing intermittent failures
(that have been successfully swept under a rug and are out of sight and
out of mind) is more important than implementing some shiny new feature
(that shows up on an official goals list). I feel we all need to treat
failing tests with more urgency. The engineering culture is not
currently favoring that.


How can we realign our testing priorities, short of some management 
decree? Sheriffs have had to disable entire test suites, but even that 
has not motivated test owners to fix their intermittent tests. Part of 
the problem, I understand, is that many tests have no owner, so the 
responsibility for even debugging test failures is not clear.


How much effort do we want to spend creating tools to manage and rerun 
known intermittent tests compared to diagnosing the root problems? 
Earlier in this thread, Ehsan said that, once upon a time, we had no 
intermittent tests. Someone else suggested (half-jokingly, I assume :) 
that requiring a 100% green TBPL before landing on inbound would be an 
effective motivator to fix intermittent tests.



chris



Re: Policy for disabling tests which run on TBPL

2014-04-09 Thread Ehsan Akhgari

On 2014-04-09, 6:46 PM, Chris Peterson wrote:

On 4/9/14, 11:48 AM, Gregory Szorc wrote:

I feel a lot of people just shrug shoulders and allow the test to be
disabled (I'm guilty of it as much as anyone). From my perspective, it's
difficult to convince the powers that be that fixing intermittent failures
(that have been successfully swept under a rug and are out of sight and
out of mind) is more important than implementing some shiny new feature
(that shows up on an official goals list). I feel we all need to treat
failing tests with more urgency. The engineering culture is not
currently favoring that.


How can we realign our testing priorities, short of some management
decree?


Speaking from my experience on this over the years, I don't think we can 
(and we're working on that!)



Sheriffs have had to disable entire test suites, but even that
has not motivated test owners to fix their intermittent tests. Part of
the problem, I understand, is that many tests have no owner, so the
responsibility for even debugging test failures is not clear.

How much effort do we want to spend creating tools to manage and rerun
known intermittent tests compared to diagnosing the root problems?
Earlier in this thread, Ehsan said that, once upon a time, we had no
intermittent tests. Someone else suggested (half-jokingly, I assume :)
that requiring a 100% green TBPL before landing on inbound would be an
effective motivator to fix intermittent tests.


I am not convinced we need many additional tools at this point.  All of 
the tools in the world won't help unless we have people using them! 
Without that, I think discussing what tools to build might be premature.


Cheers,
Ehsan


Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread Aryeh Gregor
On Tue, Apr 8, 2014 at 2:41 AM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:
 What you're saying above is true *if* someone investigates the intermittent
 test failure and determines that the bug is not important.  But in my
 experience, that's not what happens at all.  I think many people treat
 intermittent test failures as a category of unimportant problems, and
 therefore some bugs are never investigated.  The fact of the matter is that
 most of these bugs are bugs in our tests, which of course will not impact
 our users directly, but I have occasionally come across bugs in our code
 code which are exposed as intermittent failures.  The real issue is that the
 work of identifying where the root of the problem is often time is the
 majority of work needed to fix the intermittent test failure, so unless
 someone is willing to investigate the bug we cannot say whether or not it
 impacts our users.

The same is true for many bugs.  The reported symptom might indicate a
much more extensive underlying problem.  The fact is, though,
thoroughly investigating every bug would take a ton of resources, and
is almost certainly not the best use of our manpower.  There are many
bugs that are *known* to affect many users that don't get fixed in a
timely fashion.  Things that probably won't affect a single user ever
at all, and which are likely to be a pain to track down (because
they're intermittent), should be prioritized relatively low.

 The thing that really makes me care about these intermittent failures a lot
 is that ultimately they make us have to trade either disabling a whole bunch
 of tests with being unable to manage our tree.  As more and more tests get
 disabled, we lose more and more test coverage, and that can have a much more
 severe impact on the health of our products than every individual
 intermittent test failure.

I think you hit the nail on the head, but I think there's a third
solution: automatically ignore known intermittent failures, in as
fine-grained a way as possible.  This means the test is still almost
as useful -- I think the vast majority of our tests will fail
consistently if the thing they're testing breaks, not fail
intermittently.  But it doesn't get in the way of managing the tree.
Yes, it reduces some tests' value slightly relative to fixing them,
but it's not a good use of our resources to try tracking down most
intermittent failures.  The status quo reduces those tests' value just
as much as automatic ignoring (because people will star known failure
patterns consistently), but imposes a large manual labor cost.


Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread Andrew Halberstadt

On 07/04/14 11:49 AM, Aryeh Gregor wrote:

On Mon, Apr 7, 2014 at 6:12 PM, Ted Mielczarek t...@mielczarek.org wrote:

If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.


To some degree, yes, marking a test as expected intermittent causes it
to lose value.  If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so.  That should be their
decision, not something forced on them by infrastructure issues
(everyone else will suffer if you don't find the cause for this
failure in your test).  Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.


I think this proposal would make more sense if the state of our 
infrastructure and tooling was able to handle it properly. Right now, 
automatically marking known intermittents would cause the test to lose 
*all* value. It's sad, but the only data we have about intermittents 
comes from the sheriffs manually starring them. There is also currently 
no way to mark a test KNOWN-RANDOM and automatically detect if it starts 
failing permanently. This means the failures can't be starred and become 
nearly impossible to discover, let alone diagnose.


As I mentioned in another post in this thread, we need better data and 
easier ways to drill through it. All I'm saying here is that I think 
things are probably worse than you picture them, and I think there is a 
lot of groundwork needed before it even makes sense to consider this.


-Andrew


Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread James Graham

On 08/04/14 14:43, Andrew Halberstadt wrote:

On 07/04/14 11:49 AM, Aryeh Gregor wrote:

On Mon, Apr 7, 2014 at 6:12 PM, Ted Mielczarek t...@mielczarek.org
wrote:

If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.


To some degree, yes, marking a test as expected intermittent causes it
to lose value.  If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so.  That should be their
decision, not something forced on them by infrastructure issues
(everyone else will suffer if you don't find the cause for this
failure in your test).  Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.


I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.


So, what's the minimum level of infrastructure that you think would be 
needed to go ahead with this plan? To me it seems like the current 
system already isn't working very well, so the bar for moving forward 
with a plan that would increase the amount of data we had available to 
diagnose problems with intermittents, and reduce the amount of manual 
labour needed in marking them, should be quite low.
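[The KNOWN-RANDOM gap Andrew describes above — no automatic detection when an intermittent turns into a permanent failure — could be closed with fairly small tooling. A sketch, not existing harness code; the class name and window size are invented for illustration:]

```python
from collections import deque

class KnownRandomTracker:
    """Track results of a test marked KNOWN-RANDOM and raise a flag when
    intermittent failure turns into permanent failure (a streak of
    consecutive failures too long to be plausible for an intermittent)."""

    def __init__(self, window=10):
        # Only the last `window` results matter.
        self.recent = deque(maxlen=window)

    def record(self, passed):
        """Record one run; return True if the test now looks permafail."""
        self.recent.append(passed)
        # A full window of failures is almost certainly a real regression.
        return len(self.recent) == self.recent.maxlen and not any(self.recent)

t = KnownRandomTracker(window=5)
flags = [t.record(p) for p in [True, False, False, False, False, False]]
print(flags[-1])  # True: five consecutive failures, time to file a bug
```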




Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread Ehsan Akhgari

On 2014-04-08, 9:51 AM, James Graham wrote:

On 08/04/14 14:43, Andrew Halberstadt wrote:

On 07/04/14 11:49 AM, Aryeh Gregor wrote:

On Mon, Apr 7, 2014 at 6:12 PM, Ted Mielczarek t...@mielczarek.org
wrote:

If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.


To some degree, yes, marking a test as expected intermittent causes it
to lose value.  If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so.  That should be their
decision, not something forced on them by infrastructure issues
(everyone else will suffer if you don't find the cause for this
failure in your test).  Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.


I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.


So, what's the minimum level of infrastructure that you think would be
needed to go ahead with this plan? To me it seems like the current
system already isn't working very well, so the bar for moving forward
with a plan that would increase the amount of data we had available to
diagnose problems with intermittents, and reduce the amount of manual
labour needed in marking them, should be quite low.


dbaron raised the point that there are tests which are supposed to fail 
intermittently if they detect a bug.  With that in mind, the tests 
cannot be marked as intermittently failing by the sheriffs, much less in 
an automated way (see the discussion in bug 918921).


I'm still not convinced that this idea will be worse than the status 
quo, but the fact that dbaron doesn't agree makes me hesitate.


But to answer your question, I think this is something which can be done 
in the test harness itself so we don't need any special infra support 
for it.  Note that I don't think that automatically marking such tests 
is a good idea either way.


Cheers,
Ehsan



Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread Ehsan Akhgari

On 2014-04-08, 8:15 AM, Aryeh Gregor wrote:

On Tue, Apr 8, 2014 at 2:41 AM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:

What you're saying above is true *if* someone investigates the intermittent
test failure and determines that the bug is not important.  But in my
experience, that's not what happens at all.  I think many people treat
intermittent test failures as a category of unimportant problems, and
therefore some bugs are never investigated.  The fact of the matter is that
most of these bugs are bugs in our tests, which of course will not impact
our users directly, but I have occasionally come across bugs in our code 
which are exposed as intermittent failures.  The real issue is that the 
work of identifying the root of the problem is often the majority of the 
work needed to fix the intermittent test failure, so unless
someone is willing to investigate the bug we cannot say whether or not it
impacts our users.


The same is true for many bugs.  The reported symptom might indicate a
much more extensive underlying problem.  The fact is, though,
thoroughly investigating every bug would take a ton of resources, and
is almost certainly not the best use of our manpower.  There are many
bugs that are *known* to affect many users that don't get fixed in a
timely fashion.  Things that probably won't affect a single user ever
at all, and which are likely to be a pain to track down (because
they're intermittent), should be prioritized relatively low.


I don't think that an analogy with normal bugs is accurate here.  These 
intermittent failure bugs are categorically treated differently than all 
other incoming bugs in my experience.



The thing that really makes me care about these intermittent failures a lot
is that ultimately they force us to choose between disabling a whole bunch
of tests and being unable to manage our tree.  As more and more tests get
disabled, we lose more and more test coverage, and that can have a much more
severe impact on the health of our products than every individual
intermittent test failure.


I think you hit the nail on the head, but I think there's a third
solution: automatically ignore known intermittent failures, in as
fine-grained a way as possible.  This means the test is still almost
as useful -- I think the vast majority of our tests will fail
consistently if the thing they're testing breaks, not fail
intermittently.  But it doesn't get in the way of managing the tree.
Yes, it reduces some tests' value slightly relative to fixing them,
but it's not a good use of our resources to try tracking down most
intermittent failures.  The status quo reduces those tests' value just
as much as automatic ignoring (because people will star known failure
patterns consistently), but imposes a large manual labor cost.


I agree that automatically ignoring known intermittent failures that 
have been marked as such by a human is a good idea.
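[A minimal sketch of that human-in-the-loop marking — the manifest format, test path, and messages below are made up for illustration, and real matching would be per-pattern rather than per-test:]

```python
import re

# Human-curated: test path -> regexes for failure messages that a sheriff
# has already starred as known intermittent (hypothetical entries).
KNOWN_INTERMITTENT = {
    "dom/tests/test_foo.html": [r"Test timed out", r"leaked \d+ window\(s\)"],
}

def classify_failure(test, message):
    """NEW-FAILURE turns the tree orange; KNOWN-INTERMITTENT is logged and
    counted (e.g. for Orange Factor) but does not orange the tree."""
    for pattern in KNOWN_INTERMITTENT.get(test, []):
        if re.search(pattern, message):
            return "KNOWN-INTERMITTENT"
    return "NEW-FAILURE"

print(classify_failure("dom/tests/test_foo.html", "Test timed out after 300s"))
# KNOWN-INTERMITTENT
```

Because the list is keyed by test and message, a brand-new failure mode in the same test still shows up as orange, which is the fine-grained behaviour Aryeh asks for.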


But let's also not forget that it won't be a one-size-fits-all solution. 
 There are test failure scenarios such as timeouts and crashes which we 
can't easily retry (for timeouts because the test may leave its 
environment in a non-clean state).  There will also be cases where 
reloading the same test will actually test different things (because, 
for example, things have been cached, etc.).


Cheers,
Ehsan


Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread James Graham

On 08/04/14 15:06, Ehsan Akhgari wrote:

On 2014-04-08, 9:51 AM, James Graham wrote:

On 08/04/14 14:43, Andrew Halberstadt wrote:

On 07/04/14 11:49 AM, Aryeh Gregor wrote:

On Mon, Apr 7, 2014 at 6:12 PM, Ted Mielczarek t...@mielczarek.org
wrote:

If a bug is causing a test to fail intermittently, then that test
loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.


To some degree, yes, marking a test as expected intermittent causes it
to lose value.  If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so.  That should be their
decision, not something forced on them by infrastructure issues
(everyone else will suffer if you don't find the cause for this
failure in your test).  Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.


I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.


So, what's the minimum level of infrastructure that you think would be
needed to go ahead with this plan? To me it seems like the current
system already isn't working very well, so the bar for moving forward
with a plan that would increase the amount of data we had available to
diagnose problems with intermittents, and reduce the amount of manual
labour needed in marking them, should be quite low.


dbaron raised the point that there are tests which are supposed to fail
intermittently if they detect a bug.  With that in mind, the tests
cannot be marked as intermittently failing by the sheriffs, much less in
an automated way (see the discussion in bug 918921).


Such tests are problematic indeed, but it seems like they're problematic 
in the current infrastructure too. For example if a test goes from 
always passing to failing 1 time in 10 when it regresses, the first time 
we see the regression is likely to be around 10 testruns after the 
problem is introduced. That presumably makes it rather hard to track 
down when things went wrong. Or are we running such tests N times 
where N is some high enough number that we are confident that the test 
has a 95% (or whatever) chance of failing if there is actually a 
regression? If not maybe we should be. Or perhaps the idea of 
independent testruns isn't useful in the face of all the state we have.
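[The "high enough N" James asks about is easy to compute under an independence assumption. An illustrative sketch, not existing harness code:]

```python
import math

def runs_for_confidence(failure_rate, confidence=0.95):
    """Smallest number of independent runs N such that a test which fails
    with probability `failure_rate` per run fails at least once with
    probability >= `confidence`:  1 - (1 - p)**N >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))

# A regression that makes a test fail 1 time in 10 needs ~29 runs
# before we are 95% sure to see at least one failure.
print(runs_for_confidence(0.10))  # 29
print(runs_for_confidence(0.20))  # 14
```

This is why a test that regresses to failing 1-in-10 typically isn't noticed until many pushes after the offending commit, if each push runs the test only once.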


In any case this kind of test could be explicitly excluded from the 
reruns, which would make the situation the same as it is today.



But to answer your question, I think this is something which can be done
in the test harness itself so we don't need any special infra support
for it.  Note that I don't think that automatically marking such tests
is a good idea either way.


The infra support I had in mind was something like automatically (doing 
something like) starring tests that only passed after being rerun or 
listing all tests that needed a rerun or having a tool to find the 
first build in which the test became intermittent. The goal of this 
extra infrastructure would be to get the new information about reruns 
out of the testharness and address the concern that doing automated 
reruns would mean people paying even less attention to intermittents 
than they do today.




Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread Gavin Sharp
I see only two real goals for the proposed policy:
- ensure that module owners/peers have the opportunity to object to
any disable test decisions before they take effect
- set an expectation that intermittent orange failures are dealt with
promptly ("dealt with" first involves investigation, usually by a
developer familiar with the code, and can then lead to them
being fixed, disabled, or ignored)

Neither of those happen reliably today. Sheriffs are failing to get
the help they need to investigate failures, which leads to loss of
(sometimes quite important) test coverage when they decide to
unilaterally disable the relevant tests. Sheriffs should not be
disabling tests unilaterally; developers should not be ignoring
sheriff requests to investigate failures.

The policy is not intended to suggest that any particular outcome
(i.e. test disabling) is required.

Separately from all of that, we could definitely invest in better
tools for dealing with intermittent failures in general. Anecdotally,
I know chromium has some nice ways of dealing with them, for example.
But I see that as a separate discussion, not really related to the goals
above.

Gavin

On Tue, Apr 8, 2014 at 10:20 AM, L. David Baron dba...@dbaron.org wrote:
 On Tuesday 2014-04-08 14:51 +0100, James Graham wrote:
 So, what's the minimum level of infrastructure that you think would
 be needed to go ahead with this plan? To me it seems like the
 current system already isn't working very well, so the bar for
 moving forward with a plan that would increase the amount of data we
 had available to diagnose problems with intermittents, and reduce
 the amount of manual labour needed in marking them, should be quite
 low.

 Not sure what plan you're talking about, but:

 The first step I'd like to see is having better tools for finding
 where known intermittent failures regressed.  In particular, we
 should have:
  * the ability to retrigger a partial test run (not the whole
suite) on our testing infrastructure.  This doesn't always help,
since some failures will happen only in the context of the whole
suite, but I think it's likely to help most of the time.
  * auto-bisection tools for intermittent failures that use the above
ability when they can

 I think we're pretty good about backing out changesets that cause
 new intermittent failures that happen at ~20% or more failure rates.
 We need to get better about backing out for new intermittent
 failures that are intermittent at lower rates, and being able to do
 that is best done with better tools.


 (One piece of context I'm coming from:  there have been multiple
 times that the tests that I consider necessary to have enabled to
 allow people to add new CSS properties or values have failed
 intermittently at a reasonably high rate for a few months; I think
 both the start and end of these periods of failures has, in the
 cases where we found it, correlated with major or minor changes to
 the Javascript JIT.  I think those JIT bugs, if they shipped, were
 likely causing real problems for users, and we should be detecting
 those bugs rather than disabling our CSS testing coverage and
 putting us in a state where we can't add new CSS features.)


 I also don't think that moving the failure threshold is a long-term
 solution.  There will always be tests that hover on the edge of
 whatever the failure threshold is and give us trouble as a result; I
 think moving the threshold will only give temporary relief due to
 the history of writing tests to a stricter standard.  For example,
 if we retry intermittent failures up to 10 times to see if they
 pass, we'll end up with tests that fail 75% of the time and thus
 fail all 10 retries intermittently (5.6% of the time).
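[David's 5.6% figure follows directly from treating the retries as independent runs; a quick check:]

```python
def still_fails_after_retries(failure_rate, retries):
    """Probability that a test failing with probability `failure_rate` per
    run fails on every one of `retries` independent attempts, i.e. still
    shows up orange despite auto-retry."""
    return failure_rate ** retries

# A test failing 75% of the time survives 10 retries with probability
# 0.75**10, about 5.6% -- the number quoted above.
print(round(still_fails_after_retries(0.75, 10) * 100, 1))  # 5.6
```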

 -David

 --
 𝄞   L. David Baron http://dbaron.org/   𝄂
 𝄢   Mozilla  https://www.mozilla.org/   𝄂
  Before I built a wall I'd ask to know
  What I was walling in or walling out,
  And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)



Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread L. David Baron
On Tuesday 2014-04-08 11:41 -0700, Gavin Sharp wrote:
 I see only two real goals for the proposed policy:
 - ensure that module owners/peers have the opportunity to object to
 any disable test decisions before they take effect
 - set an expectation that intermittent orange failures are dealt with
 promptly ("dealt with" first involves investigation, usually by a
 developer familiar with the code, and can then lead to them
 being fixed, disabled, or ignored)

I'm fine with the initial policy proposed at the top of the thread;
this part of the subthread seemed to be about a proposal to
auto-retry failing tests and report them as passing if they
intermittently pass; that's the bit I'm not comfortable with.

-David

-- 
𝄞   L. David Baron http://dbaron.org/   𝄂
𝄢   Mozilla  https://www.mozilla.org/   𝄂
 Before I built a wall I'd ask to know
 What I was walling in or walling out,
 And to whom I was like to give offense.
   - Robert Frost, Mending Wall (1914)




Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread Ehsan Akhgari

On 2014-04-08, 3:15 PM, Chris Peterson wrote:

On 4/8/14, 11:41 AM, Gavin Sharp wrote:

Separately from all of that, we could definitely invest in better
tools for dealing with intermittent failures in general. Anecdotally,
I know chromium has some nice ways of dealing with them, for example.
 But I see that as a separate discussion, not really related to the goals
above.


Is fixing the known intermittent failures part of the plan? :)

Many of the known failures are test timeouts, which suggests some
low-hanging fruit in fixing test or network infrastructure problems:

http://brasstacks.mozilla.com/orangefactor/


Fixing intermittent timeouts is neither easier nor harder than any other 
kind of intermittent failure.


Cheers,
Ehsan



Re: Policy for disabling tests which run on TBPL

2014-04-08 Thread Karl Tomlinson
Aryeh Gregor writes:

 On Tue, Apr 8, 2014 at 2:41 AM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:
 What you're saying above is true *if* someone investigates the
 intermittent test failure and determines that the bug is not
 important.  But in my experience, that's not what happens at
 all.  I think many people treat intermittent test failures as a
 category of unimportant problems, and therefore some bugs are
 never investigated.  The fact of the matter is that most of
 these bugs are bugs in our tests, which of course will not
 impact our users directly, but I have occasionally come across
 bugs in our code which are exposed as intermittent
 failures.  The real issue is that the work of identifying
 the root of the problem is often the majority of the work
 needed to fix the intermittent test failure, so unless someone
 is willing to investigate the bug we cannot say whether or not
 it impacts our users.

 The same is true for many bugs.  The reported symptom might
 indicate a much more extensive underlying problem.  The fact is,
 though, thoroughly investigating every bug would take a ton of
 resources, and is almost certainly not the best use of our
 manpower.  There are many bugs that are *known* to affect many
 users that don't get fixed in a timely fashion.  Things that
 probably won't affect a single user ever at all, and which are
 likely to be a pain to track down (because they're
 intermittent), should be prioritized relatively low.

New intermittent failures are different from many user reported
bugs because they are known to be a regression and there is some
kind of indication of the regression window.

Regressions should be high priority.  People are getting by
without many new features but people have begun to depend on
existing features, so regressions break real sites and cause
confusion for many people.

The time to address regressions is ASAP, so that responsibility
can be handed over to the person causing the regression.  Waiting
too long means that backing out the cause of the regression is
likely to cause another regression.

I wonder whether the real problem here is that we have too many
bad tests that report false negatives, and these bad tests are
reducing the value of our testsuite in general.  Tests also need
to be well documented so that people can understand what a
negative report really means.  This is probably what is leading to
assumptions that disabling a test is the solution to a new
failure.

Getting bugs on file and seen by the right people is an important
part of dealing with this.  The tricky part is working out how to
prioritize and cope with these bugs.


Re: Policy for disabling tests which run on TBPL

2014-04-07 Thread James Graham

On 07/04/14 04:33, Andrew Halberstadt wrote:

On 06/04/14 08:59 AM, Aryeh Gregor wrote:

On Sat, Apr 5, 2014 at 12:00 AM, Ehsan Akhgari
ehsan.akhg...@gmail.com wrote:

Note that this is only accurate to a certain point.  There are other
things which
we can do to guesswork our way out of the situation for Autoland, but of
course they're resource/time intensive (basically running orange
tests over
and over again, etc.)


Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail?  This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.


Many of our test runners have that ability. But doing this implies that
intermittents are always the fault of the test. We'd be missing whole
classes of regressions (notably race conditions).


In practice how effective are we at identifying bugs that lead to 
instability? Is it more common that we end up disabling the test, or 
marking it as known intermittent and learning to live with the 
instability, both of which options reduce the test coverage, or is it 
more common that we realise that there is a code reason for the 
intermittent, and get it fixed?


If it is the latter then making the instability as obvious as possible 
makes sense, and the current setup where we run each test once can be 
regarded as a compromise between the ideal setup where we run each test 
multiple times and flag it as a fail if it ever fails, and the needs of 
performance.


If the former is true, it makes a lot more sense to do reruns of the 
tests that fail in order to keep them active at all, and store 
information about the fact that reruns occurred so that we can see when 
a test started giving unexpected results. This does rely on having some 
mechanism to make people care about genuine intermittents that they 
caused, but maybe the right way to do that is to have some batch tool 
that takes all the tests that have become intermittent, and does reruns 
until it has identified the commits that introduced the intermittency, 
and then files P1 bugs on the developer(s) it identifies to fix their bugs.



Re: Policy for disabling tests which run on TBPL

2014-04-07 Thread Aryeh Gregor
On Mon, Apr 7, 2014 at 6:33 AM, Andrew Halberstadt
ahalberst...@mozilla.com wrote:
 Many of our test runners have that ability. But doing this implies that
 intermittents are always the fault of the test. We'd be missing whole
 classes of regressions (notably race conditions).

We already are, because we already will star (i.e., ignore) any
failure that looks like a known intermittent failure.  I'm only saying
we should automate that as much as possible.
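[A sketch of what automating that starring could look like on the harness side; `run_test` and the known-message set are hypothetical names, and real matching would be fuzzier than exact strings:]

```python
def run_with_retries(run_test, known_intermittent, max_retries=3):
    """Run a test once; auto-retry only if the failure message is on a
    human-curated known-intermittent list.  Unknown failures still turn
    the tree orange immediately, so new regressions are not masked."""
    ok, message = run_test()
    if ok:
        return "PASS"
    if message not in known_intermittent:
        return "FAIL"  # new failure mode: never auto-retried
    for _ in range(max_retries):
        ok, _ = run_test()
        if ok:
            # Worth logging so reruns stay visible to sheriffs and tooling.
            return "PASS-AFTER-RETRY"
    return "FAIL"  # failed every retry: escalate as a real failure

# A test that times out once with a known message, then passes:
results = iter([(False, "timeout in test_foo"), (True, "")])
print(run_with_retries(lambda: next(results), {"timeout in test_foo"}))
# PASS-AFTER-RETRY
```

Recording the PASS-AFTER-RETRY outcomes somewhere queryable is what would address Andrew's concern that auto-retry hides the data entirely.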


Re: Policy for disabling tests which run on TBPL

2014-04-07 Thread Andrew Halberstadt

On 07/04/14 05:10 AM, James Graham wrote:

On 07/04/14 04:33, Andrew Halberstadt wrote:

On 06/04/14 08:59 AM, Aryeh Gregor wrote:

Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail?  This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.


Many of our test runners have that ability. But doing this implies that
intermittents are always the fault of the test. We'd be missing whole
classes of regressions (notably race conditions).


In practice how effective are we at identifying bugs that lead to
instability? Is it more common that we end up disabling the test, or
marking it as known intermittent and learning to live with the
instability, both of which options reduce the test coverage, or is it
more common that we realise that there is a code reason for the
intermittent, and get it fixed?


I would guess the former is true in most cases. But at least there we 
have a *chance* at tracking down and fixing the failure, even if it 
takes a while before it becomes annoying enough to prioritize. If we made 
it so intermittents never annoyed anyone, there would be even less 
motivation to fix them. Yes in theory we would still have a list of top 
failing intermittents. In practice that list will be ignored.


Case in point, desktop xpcshell does this right now. Open a log and 
ctrl-f for "Retrying tests". Most runs have a few failures that got 
retried. No one knows about these and no one looks at them. Publishing 
results somewhere easy to discover would definitely help, but I'm not 
convinced it will help much.


Doing this would also cause us to miss non-intermittent regressions, e.g. 
where the ordering of tests tickles the platform the wrong way. On the 
retry, the test would get run in a completely different order and might 
show up green 100% of the time.


Either way, the problem is partly culture and partly inadequate 
tooling. I see where this proposal is coming from, but I think there are 
ways of tackling the problem head on. This seems kind of like a last resort.


Andrew


Re: Policy for disabling tests which run on TBPL

2014-04-07 Thread Aryeh Gregor
On Mon, Apr 7, 2014 at 3:20 PM, Andrew Halberstadt
ahalberst...@mozilla.com wrote:
 I would guess the former is true in most cases. But at least there we have a
 *chance* at tracking down and fixing the failure, even if it takes a while
 before it becomes annoying enough to prioritize. If we made it so
 intermittents never annoyed anyone, there would be even less motivation to
 fix them. Yes in theory we would still have a list of top failing
 intermittents. In practice that list will be ignored.

Is this better or worse than the status quo?  Just because a bug
happens to have made its way into our test suite doesn't mean it
should be high priority.  If the bug isn't causing known problems for
users, it makes sense to ignore it in favor of working on bugs that
are known to affect users.  Why not let the relevant developers make
that prioritization decision, and ignore the bug forever if they don't
think it's as important as other things they're working on?


Re: Policy for disabling tests which run on TBPL

2014-04-07 Thread Ted Mielczarek
On 4/7/2014 9:02 AM, Aryeh Gregor wrote:
 On Mon, Apr 7, 2014 at 3:20 PM, Andrew Halberstadt
 ahalberst...@mozilla.com wrote:
 I would guess the former is true in most cases. But at least there we have a
 *chance* at tracking down and fixing the failure, even if it takes a while
 before it becomes annoying enough to prioritize. If we made it so
 intermittents never annoyed anyone, there would be even less motivation to
 fix them. Yes in theory we would still have a list of top failing
 intermittents. In practice that list will be ignored.
 Is this better or worse than the status quo?  Just because a bug
 happens to have made its way into our test suite doesn't mean it
 should be high priority.  If the bug isn't causing known problems for
 users, it makes sense to ignore it in favor of working on bugs that
 are known to affect users.  Why not let the relevant developers make
 that prioritization decision, and ignore the bug forever if they don't
 think it's as important as other things they're working on?


If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.

It's difficult to say whether bugs we find via tests are more or less
important than bugs we find via users. It's entirely possible that lots
of the bugs that cause intermittent test failures cause intermittent
weird behavior for our users, we simply don't have any visibility into
that.

-Ted




Re: Policy for disabling tests which run on TBPL

2014-04-07 Thread Aryeh Gregor
On Mon, Apr 7, 2014 at 6:12 PM, Ted Mielczarek t...@mielczarek.org wrote:
 If a bug is causing a test to fail intermittently, then that test loses
 value. It still has some value in that it can catch regressions that
 cause it to fail permanently, but we would not be able to catch a
 regression that causes it to fail intermittently.

To some degree, yes, marking a test as expected intermittent causes it
to lose value.  If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so.  That should be their
decision, not something forced on them by infrastructure issues
(everyone else will suffer if you don't find the cause for this
failure in your test).  Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.


Re: Policy for disabling tests which run on TBPL

2014-04-07 Thread Mike Hoye

On 2014-04-07, 11:12 AM, Ted Mielczarek wrote:
It's difficult to say whether bugs we find via tests are more or less 
important than bugs we find via users. It's entirely possible that 
lots of the bugs that cause intermittent test failures cause 
intermittent weird behavior for our users, we simply don't have any 
visibility into that.


Is there a way - or could there be a way - for us to push builds that 
generate intermittent test failures out to the larger Mozilla community?


And, more generally: I'd love it if we could economically tie in 
community engagement to our automated test process somehow.


- mhoye


Re: Policy for disabling tests which run on TBPL

2014-04-07 Thread Ehsan Akhgari

On 2014-04-07, 11:49 AM, Aryeh Gregor wrote:

On Mon, Apr 7, 2014 at 6:12 PM, Ted Mielczarek t...@mielczarek.org wrote:

If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.


To some degree, yes, marking a test as expected intermittent causes it
to lose value.  If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so.  That should be their
decision, not something forced on them by infrastructure issues
(everyone else will suffer if you don't find the cause for this
failure in your test).  Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.


What you're saying above is true *if* someone investigates the 
intermittent test failure and determines that the bug is not important. 
 But in my experience, that's not what happens at all.  I think many 
people treat intermittent test failures as a category of unimportant 
problems, and therefore some bugs are never investigated.  The fact of 
the matter is that most of these bugs are bugs in our tests, which of 
course will not impact our users directly, but I have occasionally come 
across bugs in our code which are exposed as intermittent failures. 
The real issue is that identifying the root of the problem is often 
the majority of the work needed to fix the intermittent test failure, 
so unless someone is willing to investigate the bug we cannot say 
whether or not it impacts our users.


The thing that really makes me care about these intermittent failures a 
lot is that ultimately they force us to choose between disabling a 
whole bunch of tests and being unable to manage our tree.  As more and 
more tests get disabled, we lose more and more test coverage, and that 
can have a much more severe impact on the health of our products than 
any individual intermittent test failure.


Cheers,
Ehsan



Re: Policy for disabling tests which run on TBPL

2014-04-06 Thread Aryeh Gregor
On Sat, Apr 5, 2014 at 12:00 AM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:
 Note that that is only accurate to a certain point.  There are other things which
 we can do to guesswork our way out of the situation for Autoland, but of
 course they're resource/time intensive (basically running orange tests over
 and over again, etc.)

Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail?  This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.
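A minimal sketch of the retry policy proposed above (the harness hook, test ids, and failure messages here are hypothetical illustrations, not an actual mochitest API):

```python
# Sketch of an auto-retry policy for known-intermittent tests.
# KNOWN_INTERMITTENT maps test ids to the specific failure messages
# allowed to be retried; any other failure reports orange as usual.

KNOWN_INTERMITTENT = {
    "test_foo.html": {"timed out waiting for event"},
}

def run_with_retries(run_test, test_id, retries=3):
    """Run test_id once; on a known-intermittent failure, rerun it.

    The test counts as passing only if it then passes `retries`
    times in a row after the initial failure.
    """
    ok, message = run_test(test_id)
    if ok:
        return True
    if message not in KNOWN_INTERMITTENT.get(test_id, ()):
        return False  # unknown failure message: no retry
    # Known intermittent: require consecutive green reruns.
    for _ in range(retries):
        ok, _ = run_test(test_id)
        if not ok:
            return False
    return True
```

Restricting retries to specific failure messages, as suggested, keeps the test effective against new, differently-shaped regressions.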


Re: Policy for disabling tests which run on TBPL

2014-04-06 Thread Ehsan Akhgari

On 2014-04-06, 8:59 AM, Aryeh Gregor wrote:

On Sat, Apr 5, 2014 at 12:00 AM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:

Note that that is only accurate to a certain point.  There are other things which
we can do to guesswork our way out of the situation for Autoland, but of
course they're resource/time intensive (basically running orange tests over
and over again, etc.)


Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail?  This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.


I don't think there is a good reason not to do that.  Please file a bug!

Cheers,
Ehsan



Re: Policy for disabling tests which run on TBPL

2014-04-06 Thread Ed Morley

On 06 April 2014 14:58:24, Ehsan Akhgari wrote:

On 2014-04-06, 8:59 AM, Aryeh Gregor wrote:

Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail?


I don't think there is a good reason not to do that.  Please file a bug!


xpcshell already does this; bug 918921 is filed for mochitests, though 
dbaron raises valid concerns in the bug.


That said, I think something that would be worth doing (and is more 
viable now that we're nearing completion of the switch to manifests for 
tests) is having a 'mochitest-flaky' suite where tests that are marked 
in the manifest as unreliable are run. (Similar to what the Chromium 
project currently does.)
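One way such a split could work, sketched below. The manifest layout and the `flaky` key are hypothetical, not the actual mochitest manifest schema:

```python
import configparser

# Hypothetical manifest: one section per test file, with an optional
# "flaky = true" key marking tests known to be unreliable.
MANIFEST = """\
[test_stable.html]

[test_flaky.html]
flaky = true
"""

def split_suites(manifest_text):
    """Split manifest entries into the normal suite and a separate
    'mochitest-flaky' suite, based on the flaky key."""
    parser = configparser.ConfigParser()
    parser.read_string(manifest_text)
    reliable, flaky = [], []
    for test in parser.sections():
        if parser[test].getboolean("flaky", fallback=False):
            flaky.append(test)
        else:
            reliable.append(test)
    return reliable, flaky
```

The flaky suite could then run non-blocking, so unreliable tests keep producing data without turning the tree orange.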


Cheers,

Ed


Re: Policy for disabling tests which run on TBPL

2014-04-06 Thread Karl Tomlinson
On Fri, 4 Apr 2014 12:49:45 -0700 (PDT), jmaher wrote:

 overburdened in other ways (e.g., reviews).  That's why the burden
 needs to be placed on the regressing change rather than the original
 author of the test.

 I am open to ideas to help figure out the offending changes.  My
 understanding is many of the test failures are due to small
 adjustments to the system or even the order the tests are run
 such that it causes the test to fail intermittently.

On Fri, 4 Apr 2014 13:19:48 -0700, Gavin Sharp wrote:

 On Fri, Apr 4, 2014 at 12:12 PM, L. David Baron dba...@dbaron.org wrote:
 In the latter case, the burden should fall on the regressing patch,
 and the regressing patch should get backed out instead of disabling
 the test.

 The majority of the time identifying the regressing patch is
 difficult, and something that requires help from someone who can debug
 the test. Hence step 1). Steps 2) and 3) are an escalation path that
 might result in disabling the test or fixing it, but the primary
 purpose is to ensure that decision is reached with full visibility
 from the relevant owners/developers.

Often enough, reasonable effort does identify the cause of the regression.

For tests failing more than 20% of the time, it should not be too
difficult to identify the regressing push through retriggers.
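As a rough sketch of why retriggers work here (assuming independent runs and a single regressing push; the function and thresholds are illustrative, not an actual sheriffing tool):

```python
import math

def retriggers_needed(failure_rate, confidence=0.95):
    """Number of consecutive green retriggers needed before a push can
    be called 'before the regression' with the given confidence.

    If the regressed test fails each run with probability
    failure_rate, the chance of n green runs in a row *despite* the
    regression present is (1 - failure_rate) ** n; we want that
    chance to fall below 1 - confidence.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))

# A test failing 20% of the time needs 14 green retriggers to clear a
# push at 95% confidence; bisecting k pushes takes about log2(k) such
# rounds, so the hunt is tedious but entirely mechanical.
```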

The key question is: who bears this responsibility?
David points out that authors and module owners can't keep up with
this.  How much investigation is required before the
responsibility passes to the author or owner?  I'd like to see
this addressed in a policy.

Sometimes, even after the regressing changes are identified, there
is no response from the author or reviewer of the regressing
changes.  A back out is the right action here, but this is not
always what happens.

A policy is important to ensure that there is some path through
even when there is no response from the author of the regressing
change, and then no response from the test author or owner.


Re: Policy for disabling tests which run on TBPL

2014-04-06 Thread Karl Tomlinson
On Fri, 4 Apr 2014 11:58:28 -0700 (PDT), jmaher wrote:

 Two exceptions:

 2) When we are bringing a new platform online (Android 2.3, b2g, etc.) many
 tests will need to be disabled prior to getting the tests on tbpl.

It makes sense to disable some tests so that others can run.

I assume bugs are filed for the disabled tests.

What is difficult here is that there is no clear path to ensure
that sufficient tests are running on the new platform before tests
are stopped on an older version of the platform.


Re: Policy for disabling tests which run on TBPL

2014-04-06 Thread Andrew Halberstadt

On 06/04/14 08:59 AM, Aryeh Gregor wrote:

On Sat, Apr 5, 2014 at 12:00 AM, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:

Note that that is only accurate to a certain point.  There are other things which
we can do to guesswork our way out of the situation for Autoland, but of
course they're resource/time intensive (basically running orange tests over
and over again, etc.)


Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail?  This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.


Many of our test runners have that ability. But doing this implies that 
intermittents are always the fault of the test. We'd be missing whole 
classes of regressions (notably race conditions).




Re: Policy for disabling tests which run on TBPL

2014-04-06 Thread Andrew Halberstadt

On 04/04/14 03:44 PM, Ehsan Akhgari wrote:

On 2014-04-04, 3:12 PM, L. David Baron wrote:

Are you talking about newly-added tests, or tests that have been
passing for a long time and recently started failing?

In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.


We have no good way of identifying the regressing patch in many cases.


In cases where the regression isn't caught right away, orangefactor is 
currently the only tool at our disposal. While it is great for 
identifying priorities to fix, it is less great at drilling down through 
raw data to determine regression ranges.  It is also very noisy, which 
can make it hard to interpret the data.


I think we need to spend some time:
A) Exposing more raw data for tools like orangefactor to consume (e.g. 
resource usage)

B) Figuring out how to separate the signal from the noise
C) Making it super easy to drill through this data to identify a 
regression range.



Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread L. David Baron
On Friday 2014-04-04 11:58 -0700, jmaher wrote:
 As the sheriffs know, it is frustrating to deal with hundreds of tests that 
 fail on a daily basis, but are intermittent.
 
 When a single test case is identified to be leaking or failing at least 10% 
 of the time, it is time to escalate.
 
 Escalation path:
 1) Ensure we have a bug on file, with the test author, reviewer, module 
 owner, and any other interested parties, links to logs, etc.
 2) We need to needinfo? and expect a response within 2 business days, this 
 should be clear in a comment.
 3) In the case we don't get a response, request a needinfo? from the module 
 owner
 with the expectation of 2 days for a response and getting someone to take 
 action.
 4) In the case we go another 2 days with no response from a module owner, we 
 will disable the test.

Are you talking about newly-added tests, or tests that have been
passing for a long time and recently started failing?

In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.

 Ideally we will work with the test author to either get the test fixed or 
 disabled depending on available time or difficulty in fixing the test.
 
 This is intended to respect the time of the original test authors by not 
 throwing emergencies in their lap, but also strike a balance with keeping the 
 trees manageable. 

If this plan is applied to existing tests, then it will lead to
style system mochitests being turned off due to other regressions
because I'm the person who wrote them and the module owner, and I
don't always have time to deal with regressions in other parts of
code (e.g., the JS engine) leading to these tests failing
intermittently.

If that happens, we won't have the test coverage we need to add new
CSS properties or values.

More generally, it places a much heavier burden on contributors who
have been part of the project longer, who are also likely to be
overburdened in other ways (e.g., reviews).  That's why the burden
needs to be placed on the regressing change rather than the original
author of the test.

 Two exceptions:
 1) If a test is failing at least 50% of the time, we will file a bug and 
 disable the test first

These 10% and 50% numbers don't feel right to me; I think the
thresholds should probably be substantially lower.  But I think it's
easier to think about these numbers in failures/day, at least for
me.
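The failures/day framing can be made concrete with a back-of-the-envelope conversion. The load figures below are purely illustrative assumptions, not real TBPL numbers:

```python
def failures_per_day(failure_rate, pushes_per_day=100, jobs_per_push=10):
    """Expected intermittent failures per day for one test, given its
    per-job failure rate.  The defaults (pushes/day, jobs/push across
    platforms) are assumed for illustration only."""
    return failure_rate * pushes_per_day * jobs_per_push

# Under these assumed loads, a test failing 10% of the time fails
# roughly 100 times a day, and even a 0.5% test fails several times a
# day, which is why percentage thresholds can feel too high.
```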

 2) When we are bringing a new platform online (Android 2.3, b2g, etc.) many 
 tests will need to be disabled prior to getting the tests on tbpl.

That's reasonable as long as work is done to try to get the tests
enabled (at a minimum, actually enabling all the tests that are
passing reliably, rather than stopping after enabling the passing
tests in only some directories).

-David

-- 
L. David Baron                         http://dbaron.org/
Mozilla                          https://www.mozilla.org/
 Before I built a wall I'd ask to know
 What I was walling in or walling out,
 And to whom I was like to give offense.
   - Robert Frost, Mending Wall (1914)




Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread Ehsan Akhgari

On 2014-04-04, 3:12 PM, L. David Baron wrote:

On Friday 2014-04-04 11:58 -0700, jmaher wrote:

As the sheriffs know, it is frustrating to deal with hundreds of tests that 
fail on a daily basis, but are intermittent.

When a single test case is identified to be leaking or failing at least 10% of 
the time, it is time to escalate.

Escalation path:
1) Ensure we have a bug on file, with the test author, reviewer, module owner, 
and any other interested parties, links to logs, etc.
2) We need to needinfo? and expect a response within 2 business days, this 
should be clear in a comment.
3) In the case we don't get a response, request a needinfo? from the module 
owner
with the expectation of 2 days for a response and getting someone to take 
action.
4) In the case we go another 2 days with no response from a module owner, we 
will disable the test.


Are you talking about newly-added tests, or tests that have been
passing for a long time and recently started failing?

In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.


We have no good way of identifying the regressing patch in many cases.


Ideally we will work with the test author to either get the test fixed or 
disabled depending on available time or difficulty in fixing the test.

This is intended to respect the time of the original test authors by not 
throwing emergencies in their lap, but also strike a balance with keeping the 
trees manageable.


If this plan is applied to existing tests, then it will lead to
style system mochitests being turned off due to other regressions
because I'm the person who wrote them and the module owner, and I
don't always have time to deal with regressions in other parts of
code (e.g., the JS engine) leading to these tests failing
intermittently.

If that happens, we won't have the test coverage we need to add new
CSS properties or values.

More generally, it places a much heavier burden on contributors who
have been part of the project longer, who are also likely to be
overburdened in other ways (e.g., reviews).  That's why the burden
needs to be placed on the regressing change rather than the original
author of the test.


Two exceptions:
1) If a test is failing at least 50% of the time, we will file a bug and 
disable the test first


These 10% and 50% numbers don't feel right to me; I think the
thresholds should probably be substantially lower.  But I think it's
easier to think about these numbers in failures/day, at least for
me.


2) When we are bringing a new platform online (Android 2.3, b2g, etc.) many 
tests will need to be disabled prior to getting the tests on tbpl.


That's reasonable as long as work is done to try to get the tests
enabled (at a minimum, actually enabling all the tests that are
passing reliably, rather than stopping after enabling the passing
tests in only some directories).


One thing I want to note with my "random guy who has been fixing tests 
all around the code base for a few years" hat on: I think we 
collectively have been terrible at acting on intermittent failures.  I 
see this over and over again where someone is needinfo?ed or assigned to 
an orange bug and they *never* comment or interact with the bug in any 
way.  That is perhaps a cultural problem which we need to address, and 
I'm sure in a lot of cases it is because the person in question is busy 
with more important things, but it has one important implication: in a 
lot of cases, the sheriffs basically have no way of knowing whether 
anyone is ever going to work on the test failure in question.  I really 
really hate having to disable tests, but I'm entirely convinced that if 
we did not do that very aggressively, the current tree would basically 
be perma-orange.  And that makes it impossible for anyone to check in anything.


So please try to keep that in mind when thinking about this proposal.

Cheers,
Ehsan


Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread jmaher
 
  4) In the case we go another 2 days with no response from a module owner,
  we will disable the test.
 
  Are you talking about newly-added tests, or tests that have been
  passing for a long time and recently started failing?
 
  In the latter case, the burden should fall on the regressing patch,
  and the regressing patch should get backed out instead of disabling
  the test.

I had overlooked the case of a new test; I agree that backing it out is the right thing.


 
 If this plan is applied to existing tests, then it will lead to
 style system mochitests being turned off due to other regressions
 because I'm the person who wrote them and the module owner, and I
 don't always have time to deal with regressions in other parts of
 code (e.g., the JS engine) leading to these tests failing
 intermittently.
 
 If that happens, we won't have the test coverage we need to add new
 CSS properties or values.

Interesting point.  Are these tests failing often?  Can we invest some minimal 
time into these to make them more reliable from a test case or test harness 
perspective?

As long as there is a dialog in the bugs filed, I would find it hard to believe 
we would just disable a test and have it come as a surprise.

 
 More generally, it places a much heavier burden on contributors who
 have been part of the project longer, who are also likely to be
 overburdened in other ways (e.g., reviews).  That's why the burden
 needs to be placed on the regressing change rather than the original
 author of the test.

I am open to ideas to help figure out the offending changes.  My understanding 
is that many of the test failures are due to small adjustments to the system, 
or even the order in which the tests are run, causing the test to fail 
intermittently.

I know there are a lot of new faces in the Mozilla community every month; could 
we offload some (not all) of this work to mozillians with less on their plate?
 
 
 These 10% and 50% numbers don't feel right to me; I think the
 thresholds should probably be substantially lower.  But I think it's
 easier to think about these numbers in failures/day, at least for
 me.

Great feedback on this.  Maybe we pick the top 10 from orange factor 
(http://brasstacks.mozilla.com/orangefactor/index.html), or we cut the numbers 
in half.  10% and 50% were sort of last-resort numbers I came up with; ideally 
there would have already been a conversation/bug about the problem.

  2) When we are bringing a new platform online (Android 2.3, b2g, etc.) many 
  tests will need to be disabled prior to getting the tests on tbpl.
 
 That's reasonable as long as work is done to try to get the tests 
 enabled (at a minimum, actually enabling all the tests that are
 passing reliably, rather than stopping after enabling the passing
 tests in only some directories).

One thing I have heard is coming online is a way to track the number of tests 
available/disabled per platform; that would really help with ensuring we are 
not ignoring thousands of tests on a new platform.


Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread Chris Peterson

On 4/4/14, 1:19 PM, Gavin Sharp wrote:

The majority of the time identifying the regressing patch is
difficult


Identifying the regressing patch is only difficult because we have so 
many intermittently failing tests.


Intermittent oranges are one of the major blockers for Autoland. If TBPL 
never shows an entirely green test pass on mozilla-inbound, Autoland 
can't programmatically determine when a checkin-needed patch can be safely 
landed.
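The arithmetic behind this is stark. A minimal sketch, under the idealized assumption that per-test intermittent failures are independent (in reality they are often correlated):

```python
def green_run_probability(per_test_orange_rates):
    """Probability that one full test run comes back entirely green,
    given independent per-test intermittent-failure rates.  This is
    an idealized model for illustration."""
    p = 1.0
    for rate in per_test_orange_rates:
        p *= 1.0 - rate
    return p

# Even 100 tests that each fail only 1% of the time leave barely a
# one-in-three chance (~0.37) of an all-green run, so "wait for a
# fully green run" is not a workable Autoland criterion at scale.
```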



chris


Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread Jonathan Griffin
With respect to Autoland, I think we'll need to figure out how to make 
it take intermittents into account.  I don't think we'll ever be in a state 
with 0 intermittents.


Jonathan

On 4/4/2014 1:30 PM, Chris Peterson wrote:

On 4/4/14, 1:19 PM, Gavin Sharp wrote:

The majority of the time identifying the regressing patch is
difficult


Identifying the regressing patch is only difficult because we have so 
many intermittently failing tests.


Intermittent oranges are one of the major blockers for Autoland. If 
TBPL never shows an entirely green test pass on mozilla-inbound, 
Autoland can't programmatically determine when a checkin-needed patch 
can be safely landed.



chris


Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread Ehsan Akhgari

On 2014-04-04, 4:30 PM, Chris Peterson wrote:

On 4/4/14, 1:19 PM, Gavin Sharp wrote:

The majority of the time identifying the regressing patch is
difficult


Identifying the regressing patch is only difficult because we have so
many intermittently failing tests.

Intermittent oranges are one of the major blockers for Autoland. If TBPL
never shows an entirely green test pass on mozilla-inbound, Autoland
can't programmatically determine when a checkin-needed patch can be safely
landed.


Note that that is only accurate to a certain point.  There are other things 
which we can do to guesswork our way out of the situation for Autoland, 
but of course they're resource/time intensive (basically running orange 
tests over and over again, etc.)



Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread Ehsan Akhgari

On 2014-04-04, 4:58 PM, Jonathan Griffin wrote:

With respect to Autoland, I think we'll need to figure out how to make
it take intermittents into account.  I don't think we'll ever be in a state
with 0 intermittents.


That's not true; we were in that state once, before I stopped working on 
this issue.  We can get there again if we want to.  It's just a lot of 
hard work, which won't scale if we only have one person doing it.




Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread Martin Thomson
On 2014-04-04, at 14:02, Ehsan Akhgari ehsan.akhg...@gmail.com wrote:

 That's not true, we were in that state once, before I stopped working on this 
 issue.  We can get there again if we wanted to.  It's just a lot of hard work 
 which won't scale if we only have one person doing it.

It's self-correcting too.  Turn on Autoland and require a green build to land 
a patch.  I'm certain that the number of intermittent failures will drop 
dramatically.


Re: Policy for disabling tests which run on TBPL

2014-04-04 Thread L. David Baron
On Friday 2014-04-04 12:49 -0700, jmaher wrote:
  If this plan is applied to existing tests, then it will lead to
  style system mochitests being turned off due to other regressions
  because I'm the person who wrote them and the module owner, and I
  don't always have time to deal with regressions in other parts of
  code (e.g., the JS engine) leading to these tests failing
  intermittently.
  
  If that happens, we won't have the test coverage we need to add new
  CSS properties or values.
 
 Interesting point.  Are these tests failing often?  Can we invest some 
 minimal time into these to make them more reliable from a test case or test 
 harness perspective?

They're not failing often.  But there has been at least one (and I
think more) occasion where changes in very unrelated code (one was
tracked down to a JS JIT change) caused them, and a bunch of other
tests, to start failing intermittently at higher frequency.  This sort
of pattern, where a low-level change causes large numbers of tests to
fail at moderate frequency, should be detected in a way that doesn't
lead to us disabling large numbers of tests (and then probably failing
to re-enable them when the problem is resolved).

I think we could use better tools for finding regression windows for
intermittent failures.

I'd also note that I've been trying to give frequent intermittent
oranges in layout a reasonably high priority, and encourage others
to do the same.  I think we've been doing a reasonably good job at
this lately, although rates of progress have varied.

-David

-- 
L. David Baron                         http://dbaron.org/
Mozilla                          https://www.mozilla.org/
 Before I built a wall I'd ask to know
 What I was walling in or walling out,
 And to whom I was like to give offense.
   - Robert Frost, Mending Wall (1914)

