On our unit tests...

Stack Fri, 23 Oct 2015 14:55:47 -0700

A few of us have been doing cleanup over the last month or so (see
HBASE-14420). As a project, we had let our unit test suite go to seed. It
was an anthology of mysterious crashes, zombies and flakes.


We are not done yet, but tests are mostly stable again: patch builds pass
close to 100% of the time as long as the patch is good, and trunk and
branch-1/branch-1.2 are tending back toward being blue always. Hanging
tests have been fixed and/or disabled, to be put back after scrubbing.
Mysterious surefire crashes/timeouts have been addressed by purging a
problematic test set that we intend to re-add after tuneup and fix. There
are still a few flakies in the mix.

This is a petition that we go out of our way going forward to keep OUR test
suite blue. We'll all be more productive if we can keep it this way.
Patches will land faster because there'll be less friction getting them in
(landing big patches was taking me a week before starting in on this
effort). We'll catch a slew of problems before commit. New devs won't be
confounded by mysterious unrelated test fails. There'll be no need to keep
up an arcane knowledge of 'known flakies' or hanging tests, nor to expend
extra effort and resources on 'look-it-works-locally-for-me' test runs.

St.Ack

Below are some further notes for those interested in the build and the work
done to our test rig recently; ugly detail is over in HBASE-14420.

Until an alternative shows up, our Apache Jenkins needs to run blue always
if we want to do community development. True, Apache Jenkins is a trying
environment in which to run tests, but it is shared, public, and I have yet
to come across a hang or failure that was Apache-Jenkins-only; the only
difference I've seen is that the incidence of hangs and flakies is higher
on Apache.

The test-patch.sh script had some hacking done to it, mostly removing code
that was finding and killing zombies. We were reporting ANY concurrent
build as a zombie, even those that were not hbase tests, and killing them
in the belief that they were leftovers from previous runs (the script had a
few different techniques for finding and executing adjacent processes).
This made some sense when we were supposed to be the only test running on
the box, but this has not been true for a long time. Killing was papering
over the fact that we were leaving zombies after us.
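
To make the distinction concrete, here is a minimal sketch of reporting only
our own leftovers rather than every concurrent build. It is not what
test-patch.sh does; the function name and the "-Dhbase.test" marker are
assumptions made up for illustration, as is the idea that our test JVMs
carry some recognizable string on their command line.

```python
def find_own_zombies(ps_lines, marker="-Dhbase.test"):
    """Return (pid, command) pairs whose command line contains marker.

    ps_lines: iterable of strings shaped like "<pid> <command line>",
    for example the output of `ps -eo pid,args`.
    marker: hypothetical string identifying our own test JVMs.
    """
    zombies = []
    for line in ps_lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header row and malformed lines
        pid, command = int(parts[0]), parts[1]
        if marker in command:
            zombies.append((pid, command))
    return zombies


if __name__ == "__main__":
    sample = [
        "  PID COMMAND",
        " 4242 java -Dhbase.test -cp surefirebooter.jar",
        " 5151 java -cp other-project-tests.jar",
    ]
    # Only the first java process carries the marker; the other
    # project's build on the same box is left alone.
    print(find_own_zombies(sample))
```

The point of the sketch is only that a report should name processes we can
tie to our own runs; anything else on a shared box is somebody else's build.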

The Jenkins build configuration also had zombie-killing code from
test-patch.sh in it (still does -- a TODO). Builds now dump out the test
machine load and a listing of what else is running on the box at test start
to give a sense of how loaded the test box is.
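
For anyone wanting the same kind of snapshot outside Jenkins, a rough sketch
follows. It is not the actual Jenkins configuration; the function name is
made up, and it assumes a Unix-like box where `ps` is on the PATH.

```python
import os
import subprocess


def machine_snapshot():
    """Return a text report: load averages followed by a process listing."""
    lines = []
    try:
        one, five, fifteen = os.getloadavg()
        lines.append("load average: %.2f %.2f %.2f" % (one, five, fifteen))
    except (AttributeError, OSError):
        # os.getloadavg is missing on some platforms and can fail on others.
        lines.append("load average: unavailable")
    try:
        ps = subprocess.run(
            ["ps", "-eo", "pid,args"],
            capture_output=True,
            text=True,
            check=False,
        )
        lines.append(ps.stdout.rstrip())
    except OSError:
        # Raised, for example, when ps is not installed.
        lines.append("process listing: unavailable")
    return "\n".join(lines)


if __name__ == "__main__":
    print(machine_snapshot())
```

Printing this at the start of a test run leaves a record in the build log of
what the box looked like before our tests added to the load.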

I feel particularly bad for the new contributors. They have it hard enough
already checking out a fat project with a slow build system with hours of
tests to run to verify changes. Let's spare them the added barrier of a
confounding experience when their nice patch throws up a mysterious Jenkins
fail on submit.
