This mail introduces the work to tackle the flaky tests in our build.

*Why is it important?*
- Our build history sucks: the last 175 post-commit runs failed. We need to
make it useful.
- To better understand the testing status of our code and, more importantly,
its weak points.
- We know the 2-3 tests which keep failing every now and then, but not
the ~10 nasty ones which fail about 1 out of 50 times and still break our build.
- This isn’t something that can be done manually on a daily basis. We need
automation.

*Changes made so far:*
Code changes: HBASE-15839
<https://issues.apache.org/jira/browse/HBASE-15839>  (Umbrella issue)

*Jenkins changes:*


[Diagram link:
https://issues.apache.org/jira/secure/attachment/12804292/Screen%20Shot%202016-05-16%20at%204.02.46%20PM.png
]

*(new job) HBase-Find-Flaky-Tests*: Gets the test reports of recent builds of
the post-commit job (TRUNK_matrix) and the HBase-Flaky-Tests job (see below) to
find flaky tests. How often it runs determines how fast we catch test
regressions. So if we run it every 4 hours, any test which started failing
in the post-commit job (TRUNK_matrix) in the last 4 hours will be blacklisted.
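
To make the idea concrete, here is a minimal sketch of the bookkeeping such a job could do. It is not the actual job's code: the input shape (one dict per build with "ran" and "failed" sets), the function name and the test names are all assumptions made for illustration, and fetching the reports from Jenkins is left out.

```python
from collections import Counter


def find_flaky_tests(build_results):
    """Summarize failures across recent builds.

    build_results: list of dicts, one per build (hypothetical shape):
        {"ran": set of test names, "failed": set of test names}
    Assumes every failed test also appears in "ran".
    Returns {test_name: (failure_count, run_count)} for every test
    that failed at least once in the given builds.
    """
    runs = Counter()
    failures = Counter()
    for build in build_results:
        runs.update(build["ran"])        # each test counted once per build
        failures.update(build["failed"])
    return {test: (failures[test], runs[test]) for test in failures}


# Hypothetical test names, two recent builds.
recent_builds = [
    {"ran": {"TestFoo", "TestBar"}, "failed": {"TestFoo"}},
    {"ran": {"TestFoo", "TestBar"}, "failed": set()},
]
blacklist = find_flaky_tests(recent_builds)
```

Keeping the failure and run counts next to each name, rather than only the names, also gives a rough per-test failure rate for free.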

*(new job) HBase-Flaky-Tests*: This job runs only the flaky tests. The aim
is to run this job back-to-back to collect as many runs as we can. The higher
the run rate, the better our system will be at catching flaky tests. We
currently run it hourly, so we’ll be able to keep track of flaky tests with
a failure rate of ~5% or more.
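
The link between run rate and the failure rates we can track can be sketched with a little arithmetic. This assumes each run fails independently with a fixed probability, which real flaky tests only approximate, and the function names are made up for this example. Under that assumption, 24 hourly runs give roughly a 70% chance of seeing at least one failure from a test that fails 5% of the time, but under 40% for a test that fails 1 in 50 times.

```python
import math


def detection_probability(failure_rate, runs):
    """Chance of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of runs whose detection probability reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# One day of hourly runs, for a 5% flaky test and a 1-in-50 flaky test.
p_five_percent = detection_probability(0.05, 24)
p_one_in_fifty = detection_probability(0.02, 24)
```

This is why more runs per day matter: the rarer the failure, the more runs it takes before the test shows up at all.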

*Post-commit (TRUNK_matrix) and pre-commit jobs*: Exclude these flaky tests.
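
As a rough illustration of the exclusion step, the list of flaky test names can be turned into a single string of file patterns for the build to skip. The `**/Name.java` pattern shape, the function name and the test names are assumptions for this sketch; how the string is actually handed to the build is not specified here.

```python
def to_exclude_patterns(flaky_tests):
    """Turn test class names into a comma-separated list of file globs.

    Names are sorted so the output is stable from run to run.
    """
    return ",".join("**/%s.java" % name for name in sorted(flaky_tests))


# Hypothetical flaky test names.
excludes = to_exclude_patterns({"TestFoo", "TestBar"})
```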


*So what if a bad commit makes a good test bad?*
Since the test is not on the flaky list yet, it’ll run in the next post-commit
build and fail. The next run of HBase-Find-Flaky-Tests will pick it up and
blacklist it. Blacklisting will help keep the post-commit job and, more
importantly, the pre-commit job clean. Keeping those clean is a problem we
face quite often.

*Are we just tucking away our shit?*
Nope, this will help us:
- first, maintain a list of bad tests (we lack that today).
- second, make our build greener, to the point that a failed/red build is
something we worry about seriously.

Once we are confident that the system is working fine, we’ll set up the
HBase-Find-Flaky-Tests job to send reports to dev@hbase so that devs know
about the bad tests. If the list remains hidden somewhere in a Jenkins job’s
archive, it’s unlikely that we’ll actively work on getting the tests fixed :).

I'll keep posting further updates on this thread.

-- Appy
