On Wednesday, March 8, 2017 at 11:18:03 PM UTC+13, jma...@mozilla.com wrote:
> On Tuesday, March 7, 2017 at 11:45:38 PM UTC-5, Chris Pearce wrote:
> > I recommend that instead of classifying intermittents as tests which fail > 
> > 30 times per week, to instead classify tests that fail more than some 
> > threshold percent as intermittent. Otherwise on a week with lots of 
> > checkins, a test which isn't actually a problem could clear the threshold 
> > and cause unnecessary work for orange triage people and developers alike.
> > 
> > The currently published threshold is 8%:
> > 
> > https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy#Identifying_problematic_tests
> > 
> > 8% seems reasonable to me.
> > 
> > Also, whenever a test is disabled, not only should a bug be filed, but 
> > please _please_ need-info the test owner or at least someone on the 
> > affected team.
> > 
> > If a test for a feature is disabled without the maintainer of that feature 
> > knowing, then we are flying blind and we are putting the quality of our 
> > product at risk.
> > 
> > 
> > cpearce.
> >
> 
> Thanks cpearce for the concern here.  Regarding disabling tests, all tests 
> that we have disabled as part of the stockwell project have started out with 
> a triage where we ni the responsible party and the bug is filed in the 
> component the test is associated with.  I assume that if the bug is filed in 
> the right component, others from the team will be made aware of it.  Right 
> now I assume the triage owner of a component is the owner of the tests and 
> can proxy the request to the correct person on the team (many times the 
> original author is on PTO, busy with a project, left the team, etc.).  Please 
> let me know if this is a false assumption and what we could do to better get 
> bugs in front of the right people.


In the past I have not always been made aware when my tests were disabled, 
which has led to me feeling jaded.


> I agree 8% is a good number; the sheriff policy has other criteria (top 20 on 
> Orange Factor, 100 times/month).

Ok. Let's assume 8% is a reasonable threshold then...


> We picked 30 times/week as that is where bugs start becoming frequent enough 
> to easily reproduce (locally or on try)

I disagree.

I often find that oranges I get pinged on are in fact not easy to reproduce, 
and it takes a few weeks of elapsed time to solve them because they typically 
only reproduce on Try, and I need to work on other high-priority bugs 
concurrently. That also means there's context-switch overhead as I balance 
everything I'm working on.


> and it would be reasonable to expect a fix.

I think it's unreasonable to assume that developers can drop whatever they're 
doing and turn around a fix in two weeks, given how long these things often 
take to fix, and given that developers often have a pre-existing list of other 
high-priority work.


>  There is ambiguity when using a %: on a low-volume week (as most of December 
> was) we see <500 pushes/week, and the % doesn't indicate the number of times 
> the test was run - this is affected by SETA (reducing tests for 4/5 commits to 
> save on load) and by people doing retriggers/backfills.  If last week the 
> test was 8%, and it is 7% this week - do we ignore it?
> 
> Picking a single number like 30 times/7days removes ambiguity and ensures 
> that we can stay focused on things and don't have to worry about 
> recalculations.  It is true on lower volume weeks that 30 times/7days doesn't 
> happen as frequently, yet we have always had many bugs to work on with that 
> threshold.

It sounds like the actual problem is that we haven't taken the time to 
implement good data collection.

Given your example of about 500 pushes/week in December, 30 failures in 500 
pushes is a 6% failure rate, well below the 8% rate the sheriffs are beholden 
to enforce.

So given that you agree an 8% threshold is reasonable, 30 failures/week is 
already too low a threshold.

If we saw 30 failures in a week with 1000 pushes, that would be a 3% failure 
rate, yet by the 30-times/week criterion it would still be considered worthy 
of investigation.

I don't think it's reasonable to consider a test failing 3% of the time as 
worthy of investigation.
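
For what it's worth, here's a minimal sketch of the kind of percentage-based 
check I mean (purely illustrative; the function name and the assumption that 
per-week failure and push counts are readily available are mine, not an 
existing OrangeFactor/Treeherder API):

    # Hypothetical sketch: flag a test as problematic only when its weekly
    # failure rate clears a percentage threshold, rather than a fixed count.
    # Pushes are used as the denominator, as in the examples above; jmaher's
    # point about SETA would argue for dividing by actual test runs instead.
    FAILURE_RATE_THRESHOLD = 0.08  # the 8% from the sheriff policy

    def is_problematic(failures_per_week, pushes_per_week):
        """True if the weekly failure rate meets or exceeds the threshold."""
        return failures_per_week / pushes_per_week >= FAILURE_RATE_THRESHOLD

    print(is_problematic(30, 500))   # 30/500  = 6% -> False
    print(is_problematic(30, 1000))  # 30/1000 = 3% -> False
    print(is_problematic(45, 500))   # 45/500  = 9% -> True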

To me, it feels like we're setting ourselves up to create unnecessary crises, 
and unnecessary tension between the stockwell people and the developers.

I think:
* Acceptable failure rates expressed as an absolute number aren't meaningful; 
we should express them as a percentage.
* Two weeks is simply an unreasonable amount of time in which to expect a fix 
for an intermittent.

