On Thu, Nov 5, 2015 at 8:07 AM, Andrew Purtell <andrew.purt...@gmail.com> wrote:
> > Hanging tests have been fixed and/or disabled to be put back after
> > scrubbing.
>
> What do you think about an interim step that adds a flakey test category
> and a profile that disables them only on builds.a.o., i.e. the Jenkins job
> configuration turns them off? Is that possible? I'd like to continue
> running these on my build rigs since they are better endowed than build.a.o
> resources. Or at least a profile that can turn them on?
>

We could do such a thing. Probably better than the current hackery where the
test is just disabled with JIRAs to fix ...sometime.

> > This is a petition that we go out of our way going forward to keep OUR
> > test suite blue.
>
> Big +1 here
>

Yeah. It's got to be a group thing.

> BTW it turns out after seeing the results of your effort that most of my
> issues with builds.a.o were probably due to the broken zombie killing
> thing. That's why locally run stuff (also under Jenkins sometimes btw) was
> just so much more stable. Can we have review and SCM of our build
> configurations somehow going forward?
>

Makes sense (and still work to do on zombie detector). Let me work on it.

St.Ack

> > On Oct 23, 2015, at 2:54 PM, Stack <st...@duboce.net> wrote:
> >
> > A few of us have been doing cleanup over the last month or so (see
> > HBASE-14420). As a project, we had let our unit test suite go to seed.
> > It was an anthology of mysterious crashes, zombies and flakes.
> >
> > We are not done yet but tests are mostly stable again with patch builds
> > passing close to 100% of the time as long as the patch is good and trunk
> > and branch-1/branch-1.2 are tending back toward being blue always.
> > Hanging tests have been fixed and/or disabled to be put back after
> > scrubbing. Mysterious surefire crashes/timeouts have been addressed by
> > purging a problematic test set that we intend to re-add after tuneup and
> > fix. There are still a few flakies in the mix.
> >
> > This is a petition that we go out of our way going forward to keep OUR
> > test suite blue. We'll all be more productive if we can keep it this
> > way. Patches will land faster because there'll be less friction getting
> > them in (Landing big patches was taking me a week before starting in on
> > this effort). We'll catch a slew of problems before commit. New devs
> > won't be confounded by mysterious unrelated test fails. There'll be no
> > need to keep up an arcane knowledge of 'known flakies' or hanging tests
> > or the need for expending extra effort and resources doing
> > 'look-it-works-locally-for-me' test runs locally.
> >
> > St.Ack
> >
> > Below are some further notes for those interested in build and work done
> > to our test rig recently; ugly detail is over in HBASE-14420.
> >
> > Until an alternative shows up, our Apache Jenkins needs to run blue
> > always if we want to do community development. True, Apache Jenkins is a
> > trying environment in which to run tests, but it is shared, public, and
> > I have yet to come across a hang or failure that was
> > Apache-Jenkins-only; the only difference I've seen is that the incidence
> > of hangs and flakies is higher on Apache.
> >
> > The test-patch.sh script had some hacking done to it mostly removing
> > code that was finding and killing zombies. We were reporting ANY
> > concurrent build as a zombie, even those that were not hbase tests, and
> > killing them in the belief that they were leftovers from previous runs
> > (the script had a few different techniques for finding and executing
> > adjacent processes). This made some sense when we were supposed to be
> > the only test running on the box but this has not been true for a long
> > time. Killing was papering-over the fact that we were leaving zombies
> > after us.
> >
> > The Jenkins build configuration also had zombie code from test-patch.sh
> > in it (still does -- a TODO). Builds now dump out test machine load and
> > listing of what else is running on the box at test start to give a sense
> > of how loaded the test box is.
> >
> > I feel particularly bad for the new contributors. They have it hard
> > enough already checking out a fat project with a slow build system with
> > hours of tests to run to verify changes. Let's spare them the added
> > barrier of a confounding experience when their nice patch throws up a
> > mysterious Jenkins fail on submit.
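
For anyone picking up the zombie detector: the flaw described in the quoted notes was that ANY concurrent build on the shared box got counted as a zombie. A minimal sketch of a narrower check follows, in Python rather than the shell of test-patch.sh. It is not the actual script. The marker string, the function name and the sample command lines are all hypothetical, and it assumes HBase test JVMs carry some recognizable token on their command line, which the thread does not confirm.

```python
# Hypothetical marker: assumes HBase test JVMs can be recognized by a
# token on their command line. The thread does not say what the real
# marker is, or whether one exists.
HBASE_TEST_MARKER = "-Dhbase.test"


def find_zombies(processes, current_build_pids, marker=HBASE_TEST_MARKER):
    """Return PIDs that look like leftover HBase test processes.

    processes: iterable of (pid, command_line) tuples for the whole box.
    current_build_pids: set of PIDs belonging to the build now running.
    """
    zombies = []
    for pid, cmdline in processes:
        if pid in current_build_pids:
            continue  # part of this build, so not a leftover
        if marker not in cmdline:
            continue  # some other project's job on the shared box
        zombies.append(pid)
    return zombies


# Made-up process listing, for illustration only.
sample = [
    (101, "java -Dhbase.test TestLeftover"),
    (202, "java -cp other-project.jar TestSomethingElse"),
    (303, "java -Dhbase.test TestRunningNow"),
]
leftovers = find_zombies(sample, current_build_pids={303})
```

With the made-up listing above, only PID 101 is reported: 202 belongs to another project and 303 belongs to the current build. Reporting the PIDs rather than killing them fits the point made in the notes that killing was papering over the fact that zombies were being left behind.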