Slightly related, sorry for hijacking: I can't get HBase trunk to build. In particular, TestHCM.testClusterStatus always fails for me. I tried on my own Jenkins as well as in my IDE (IntelliJ), with the same result (two different machines, CentOS & Mac OS).
mvn -U -PrunAllTests -Dmaven.test.redirectTestOutputToFile=true -Dit.test=noItTest clean install <http://pastebin.com/upFjq09A>

From my MacBook's command line I got the test to pass using the same command, but not in Jenkins or from IntelliJ. I'm happy to post in a new thread if this is distracting and no one else has seen this before.

Any ideas?

Thanks,
Lars

On Tue, Jul 23, 2013 at 7:01 AM, Stack <[email protected]> wrote:

> nvm. I read the resourcechecker code. It is just printing out before-and-afters, so my speculation that we are up against fd limits is just off.
>
> Back to figuring out why tests fail at random....
>
> St.Ack
>
> On Mon, Jul 22, 2013 at 9:50 PM, Stack <[email protected]> wrote:
>
>> Here is another from the tail of https://issues.apache.org/jira/browse/HBASE-5995:
>>
>> 2013-07-23 01:23:29,574 INFO [pool-1-thread-1] hbase.ResourceChecker(171): after: regionserver.wal.TestLogRolling#testLogRollOnPipelineRestart Thread=39 (was 31) - Thread LEAK? -, OpenFileDescriptor=312 (was 272) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 368), ProcessCount=144 (was 142) - ProcessCount LEAK? -, AvailableMemoryMB=906 (was 1995), ConnectionCount=0 (was 0)
>>
>> This one showed up as a zombie too; stuck.
>>
>> Or here, https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/, where we'd had a nice run of passing tests and then, all of a sudden, a test that I've not seen fail before fails:
>>
>> https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/4282/
>>
>> org.apache.hadoop.hbase.master.TestActiveMasterManager.testActiveMasterManagerFromZK
>>
>> Near the end of the test, the resource checker reports:
>>
>> - Thread LEAK? -, OpenFileDescriptor=100 (was 92) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=328 (was 331), ProcessCount=138 (was 138), AvailableMemoryMB=1223 (was 1246), ConnectionCount=0 (was 0)
>>
>> Getting tests to pass on these build boxes (other than hadoopqa, which is a different set of machines) seems unattainable.
>>
>> I will write infra about the 40k file descriptor limit to see if they can do something about that.
>>
>> St.Ack
>>
>> On Mon, Jul 22, 2013 at 9:13 PM, Stack <[email protected]> wrote:
>>
>>> By way of illustration of how loaded the Apache build boxes can be:
>>>
>>> Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 383), ProcessCount=142 (was 144), AvailableMemoryMB=819 (was 892), ConnectionCount=0 (was 0)
>>>
>>> This seems to have caused a test that usually passes to fail: https://issues.apache.org/jira/browse/HBASE-9023
>>>
>>> St.Ack
>>>
>>> On Mon, Jul 22, 2013 at 11:49 AM, Stack <[email protected]> wrote:
>>>
>>>> Below is the state of the hbase 0.95/trunk unit tests (includes a little taxonomy of test-failure types).
>>>>
>>>> On Andrew's ec2 build box, 0.95 is passing most of the time:
>>>>
>>>> http://54.241.6.143/job/HBase-0.95/
>>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/
>>>>
>>>> It is not as good on the Apache build boxes, but it is getting better:
>>>>
>>>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/
>>>> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95-on-hadoop2/
>>>>
>>>> On Apache, I have seen loads up in the 500s and all file descriptors used, according to the little resources report printed at the end of each test. If these numbers are to be believed (TBD), we may never achieve a 100% pass rate on the Apache builds.
>>>>
>>>> Andrew's ec2 builds run the integration tests too, where the apache builds do not -- sometimes we'll fail an integration test run, which makes the Andrew ec2 red/green ratio look worse than it actually is.
>>>>
>>>> Trunk builds lag. They are being worked on.
>>>>
>>>> We seem to be over the worst of the flaky unit tests. We have a few stragglers still, but they are being hunted down by the likes of the merciless Jimmy Xiang and Jeffrey Zhong.
>>>>
>>>> The "zombies" have mostly been nailed too (where "zombies" are tests that refuse to die, continuing after the suite has completed and causing the build to fail). The zombie trap from test-patch.sh was ported over to the apache and ec2 builds and it caught the last of the undying.
>>>>
>>>> We are now into a new phase where "all" tests pass but the build still fails. Here is an example: http://54.241.6.143/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/ The only clue I have to go on is that when we fail, the number of tests run is less than the total shown for a successful run.
>>>>
>>>> Unless anyone has a better idea for figuring out why the hang, I compare the list of tests that show up in a good run vs. those of a bad run. Tests that are in the good run but missing from the bad run are deemed suspect. In the absence of other evidence or other ideas, I am blaming these "invisibles" for the build failures.
>>>>
>>>> Here is an example:
>>>>
>>>> This is a good 0.95 hadoop2 run (notice how we are running the integration tests too and they succeed -- on hadoop2!):
>>>>
>>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/669/
>>>>
>>>> In the hbase-server module:
>>>>
>>>> Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
>>>>
>>>> This is a bad run:
>>>>
>>>> http://54.241.6.143/job/HBase-0.95-Hadoop-2/668/
>>>>
>>>> Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
>>>>
>>>> If I compare tests, the successful run has:
>>>>
>>>> > Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
>>>>
>>>> ... where the bad run does not show the above test. TestHLogSplitCompressed has 34 tests, one of which is disabled, so that would seem to account for the discrepancy.
>>>>
>>>> I've started to disable tests that fail like this, putting them aside for the original authors or anyone interested to take a look at why they fail occasionally. I put them aside so we can enjoy passing builds in the meantime. I've already moved aside or disabled a few tests and test classes:
>>>>
>>>> TestMultiTableInputFormat
>>>> TestReplicationKillSlaveRS
>>>> TestHCM.testDeleteForZKConnLeak was disabled
>>>>
>>>> ... and a few others.
>>>>
>>>> Finally (if you are still reading), I would suggest that test failures in hadoopqa are now more worthy of investigation. Illustrative is what happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections", where the patch had +1s and, on its first run, a unit test failed (though it passed locally). The second run obscured the first run's failure. After some digging by another contributor, it turned out the patch had actually broken that test (though it looked unrelated). I would suggest that now that tests are healthier, test failures are worth paying more attention to.
>>>>
>>>> Yours,
>>>> St.Ack
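
P.S. For anyone wanting to script the good-run-vs-bad-run comparison St.Ack describes above, here is a rough sketch. It assumes the two Jenkins console logs have been saved locally as good.log and bad.log (hypothetical names) and that Surefire's "Running <test class>" lines appear in the console output, as in the example quoted in the thread:

# Collect the test classes that Surefire started in each run,
# then list the classes that started in the good run but never
# started in the bad run -- the "invisibles" suspected of hanging.
grep '^Running org\.apache\.hadoop\.hbase' good.log | sort -u > good-tests.txt
grep '^Running org\.apache\.hadoop\.hbase' bad.log | sort -u > bad-tests.txt
comm -23 good-tests.txt bad-tests.txt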
