On Wednesday, March 8, 2017 at 11:18:03 PM UTC+13, jma...@mozilla.com wrote:
> On Tuesday, March 7, 2017 at 11:45:38 PM UTC-5, Chris Pearce wrote:
> > I recommend that instead of classifying intermittents as tests which fail
> > 30 times per week, we instead classify tests that fail more than some
> > threshold percentage as intermittent. Otherwise on a week with lots of
> > checkins, a test which isn't actually a problem could clear the threshold
> > and cause unnecessary work for orange triage people and developers alike.
> >
> > The currently published threshold is 8%:
> >
> > https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy#Identifying_problematic_tests
> >
> > 8% seems reasonable to me.
> >
> > Also, whenever a test is disabled, not only should a bug be filed, but
> > please _please_ need-info the test owner or at least someone on the
> > affected team.
> >
> > If a test for a feature is disabled without the maintainer of that
> > feature knowing, then we are flying blind and we are putting the quality
> > of our product at risk.
> >
> > cpearce.
>
> Thanks cpearce for the concern here. Regarding disabling tests, all tests
> that we have disabled as part of the stockwell project have started out
> with a triage where we ni the responsible party and the bug is filed in
> the component the test is associated with. I assume if the bug is filed
> in the right component others from the team will be made aware of this.
> Right now I assume the triage owner of a component is the owner of the
> tests and can proxy the request to the correct person on the team (many
> times the original author is on PTO, busy with a project, left the team,
> etc.). Please let me know if this is a false assumption and what we could
> do to better get bugs in front of the right people.
In the past I have not always been made aware when my tests were disabled,
which has led to me feeling jaded.

> I agree 8% is a good number, the sheriff policy has other criteria (top 20
> on orange factor, 100 times/month).

OK. Let's assume 8% is a reasonable threshold then...

> We picked 30 times/week as that is where bugs start becoming frequent
> enough to easily reproduce (locally or on try)

I disagree. I often find that oranges I get pinged on are in fact not easy
to reproduce, and it takes a few weeks of elapsed time to solve them,
because they typically only reproduce on Try and I need to work on other
high priority bugs concurrently. That means there's a context switch
overhead too, as I balance everything I'm working on.

> and it would be reasonable to expect a fix.

I think it's unreasonable to assume that developers can drop whatever
they're doing and turn around a fix in two weeks, given how long these
things often take to fix, and given that developers often have a
pre-existing list of other high priority work.

> There is ambiguity when using a %, on a low volume week (as most of
> december was) we see <500 pushes/week, also the % doesn't indicate the
> amount of times the test was run- this is affected by SETA (reducing tests
> for 4/5 commits to save on load) and by people doing retriggers/backfills.
> If last week the test was 8%, and it is 7% this week- do we ignore it?
>
> Picking a single number like 30 times/7days removes ambiguity and ensures
> that we can stay focused on things and don't have to worry about
> recalculations. It is true on lower volume weeks that 30 times/7days
> doesn't happen as frequently, yet we have always had many bugs to work on
> with that threshold.

It sounds like the real problem is that we haven't taken the time to
implement good data collection. Given your example of about 500 pushes/week
in December, 30 failures in 500 pushes is a 6% failure rate, well below the
8% rate the sheriffs are beholden to enforce. So given that you say an 8%
threshold is reasonable, 30 failures/week is already too low a threshold.
If we saw 30 failures in 1000 pushes/week, that would be a 3% failure rate,
yet by your reasoning it would still be considered worthy of investigation.
I don't think it's reasonable to consider a test failing 3% of the time as
worthy of investigation. (A short sketch of this arithmetic follows at the
end of this mail.)

To me, it feels like we're setting ourselves up to create unnecessary
crises, and unnecessary tension between the stockwell people and the
developers.

I think:
* Acceptable failure rates expressed as an absolute number aren't
  meaningful; we should express acceptable rates as a percentage.
* Two weeks is simply an unreasonable amount of time in which to expect a
  fix for an intermittent.
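
To make the arithmetic above concrete, here's a minimal sketch (in Python)
of the rate-based classification I'm arguing for. To be clear about what's
assumed: the 8% figure is the one from the sheriffs' published policy, the
push volumes are the hypothetical numbers discussed above, the function
names are mine rather than any existing Mozilla tooling, and dividing by
the number of times the test actually ran (rather than by pushes) is one
possible way to address the SETA/retrigger objection, not settled policy.

    # Sketch only: names and the runs-vs-pushes choice are my assumptions.
    FAILURE_THRESHOLD = 0.08  # 8%, per the Test Disabling Policy wiki page

    def failure_rate(failures: int, runs: int) -> float:
        """Failure rate as a fraction of the times the test actually ran.

        Counting runs rather than pushes sidesteps the objection that
        SETA and retriggers/backfills change how often a test runs.
        """
        return failures / runs if runs else 0.0

    def is_intermittent(failures: int, runs: int) -> bool:
        """Classify a problematic intermittent by rate, not raw count."""
        return failure_rate(failures, runs) > FAILURE_THRESHOLD

    # The arithmetic from above: a fixed 30-failures/week cutoff means
    # very different things depending on weekly volume.
    for pushes in (500, 1000):
        rate = failure_rate(30, pushes)
        print(f"30 failures in {pushes} pushes/week = {rate:.0%} "
              f"-> over 8%? {is_intermittent(30, pushes)}")
    # 30/500  = 6% -> not flagged by rate, but flagged by 30/week
    # 30/1000 = 3% -> not flagged by rate, but flagged by 30/week

The point the loop makes is that the same 30 failures classify differently
depending on volume, which a fixed weekly count simply can't capture.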