A few of us have been doing cleanup over the last month or so (see HBASE-14420). As a project, we had let our unit test suite go to seed. It was an anthology of mysterious crashes, zombies and flakes.
We are not done yet, but tests are mostly stable again: patch builds pass close to 100% of the time as long as the patch is good, and trunk and branch-1/branch-1.2 are tending back toward being blue always. Hanging tests have been fixed and/or disabled, to be put back after scrubbing. Mysterious surefire crashes/timeouts have been addressed by purging a problematic test set that we intend to re-add after tuneup and fix. There are still a few flakies in the mix.
This is a petition that we go out of our way going forward to keep OUR test suite blue. We'll all be more productive if we can keep it this way. Patches will land faster because there'll be less friction getting them in (landing big patches was taking me a week before starting in on this effort). We'll catch a slew of problems before commit. New devs won't be confounded by mysterious unrelated test fails. There'll be no need to keep up an arcane knowledge of 'known flakies' or hanging tests, or to expend extra effort and resources doing 'look-it-works-locally-for-me' test runs.
St.Ack
Below are some further notes for those interested in the build and the work done to our test rig recently; ugly detail is over in HBASE-14420.
Until an alternative shows up, our Apache Jenkins needs to run blue always if we want to do community development. True, Apache Jenkins is a trying environment in which to run tests, but it is shared, public, and I have yet to come across a hang or failure that was Apache-Jenkins-only; the only difference I've seen is that the incidence of hangs and flakies is higher on Apache.
The test-patch.sh script had some hacking done to it, mostly removing code that was finding and killing zombies. We were reporting ANY concurrent build as a zombie, even those that were not hbase tests, and killing them in the belief that they were leftovers from previous runs (the script had a few different techniques for finding and executing adjacent processes).
This made some sense when we were supposed to be the only test running on the box, but this has not been true for a long time. Killing was papering over the fact that we were leaving zombies after us. The Jenkins build configuration also had zombie code from test-patch.sh in it (still does -- a TODO). Builds now dump out the test machine load and a listing of what else is running on the box at test start, to give a sense of how loaded the test box is.
I feel particularly bad for the new contributors. They have it hard enough already, checking out a fat project with a slow build system and hours of tests to run to verify changes. Let's spare them the added barrier of a confounding experience when their nice patch throws up a mysterious Jenkins fail on submit.
