By way of illustration of how loaded the Apache build boxes can be:

Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 383), ProcessCount=142 (was 144), AvailableMemoryMB=819 (was 892), ConnectionCount=0 (was 0)
This seems to have caused a test that usually passes to fail:
https://issues.apache.org/jira/browse/HBASE-9023

St.Ack

On Mon, Jul 22, 2013 at 11:49 AM, Stack <[email protected]> wrote:

> Below is the state of hbase 0.95/trunk unit tests (includes a little
> taxonomy of test failure type definitions).
>
> On Andrew's ec2 build box, 0.95 is passing most of the time:
>
> http://54.241.6.143/job/HBase-0.95/
> http://54.241.6.143/job/HBase-0.95-Hadoop-2/
>
> It is not as good on the Apache build box but it is getting better:
>
> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/
> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95-on-hadoop2/
>
> On Apache, I have seen loads up in the 500s and all file descriptors used
> according to the little resources report printed at the end of each test.
> If these numbers are to be believed (TBD), we may never achieve a 100% pass
> rate on Apache builds.
>
> Andrew's ec2 builds run the integration tests too where the apache builds
> do not -- sometimes we'll fail an integration test run which makes the
> Andrew ec2 red/green ratio look worse than it actually is.
>
> Trunk builds lag.  They are being worked on.
>
> We seem to be over the worst of the flakey unit tests.  We have a few
> stragglers still but they are being hunted down by the likes of the
> merciless Jimmy Xiang and Jeffrey Zhong.
>
> The "zombies" have been mostly nailed too (where "zombies" are tests that
> refuse to die, continuing after the suite has completed and causing the
> build to fail).  The zombie trap from test-patch.sh was ported over to the
> apache and ec2 builds and it caught the last of the undying.
>
> We are now into a new phase where "all" tests pass but the build still
> fails.  Here is an example:
> http://54.241.6.143/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/
> The only clue I have to go on is the fact that when we fail, the number of
> tests run is less than the total that shows for a successful run.
>
> Unless anyone has a better idea, to figure out why the hang, I compare the
> list of tests that show in a good run vs. those of a bad run.  Tests that
> are in the good run but missing from the bad run are deemed suspect.  In
> the absence of other evidence or other ideas, I am blaming these
> "invisibles" for the build fail.
>
> Here is an example:
>
> This is a good 0.95 hadoop2 run (notice how we are running integration
> tests tooooo and they succeed!!  On hadoop2!!!!):
>
> http://54.241.6.143/job/HBase-0.95-Hadoop-2/669/
>
> In hbase-server module:
>
> Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
>
>
> This is a bad run:
>
> http://54.241.6.143/job/HBase-0.95-Hadoop-2/668/
>
> Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
>
>
> If I compare tests, the successful run has:
>
>
> Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
>
>
> ... where the bad run does not show the above test.
> TestHLogSplitCompressed has 34 tests, one of which is disabled, so that
> would seem to account for the discrepancy.
>
> I've started to disable tests that fail like this, putting them aside for
> the original authors or the interested to take a look to see why they fail
> occasionally.  I put them aside so we can enjoy passing builds in the
> meantime.  I've already moved aside or disabled a few tests and test
> classes:
>
> TestMultiTableInputFormat
> TestReplicationKillSlaveRS
> TestHCM.testDeleteForZKConnLeak was disabled
>
> ... and a few others.
>
> Finally (if you are still reading), I would suggest that test failures in
> hadoopqa are now more worthy of investigation.  Illustrative is what
> happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections"
> where the patch had +1s and on its first run, a unit test failed (though it
> passed locally).  The second run obscured the first run's failure.  After
> digging by another, it turned out that the patch had actually broken the
> first test (though it looked unrelated).
> I would suggest that, now that the tests are healthier, test failures
> are worth paying more attention to.
>
> Yours,
> St.Ack
