After TestSplitTransactionOnCluster timed out, I saw:
"main" prio=10 tid=0x0000000043a39000 nid=0x73f9 in Object.wait() [0x0000000040208000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000ac99d8d8> (a org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread)
        at java.lang.Thread.join(Thread.java:1186)
        - locked <0x00000000ac99d8d8> (a org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread)
        at java.lang.Thread.join(Thread.java:1239)
        at org.apache.hadoop.hbase.util.JVMClusterUtil.shutdown(JVMClusterUtil.java:224)
        at org.apache.hadoop.hbase.LocalHBaseCluster.shutdown(LocalHBaseCluster.java:423)
        at org.apache.hadoop.hbase.MiniHBaseCluster.shutdown(MiniHBaseCluster.java:417)
        at org.apache.hadoop.hbase.HBaseTestingUtility.shutdownMiniHBaseCluster(HBaseTestingUtility.java:462)
        at org.apache.hadoop.hbase.HBaseTestingUtility.shutdownMiniCluster(HBaseTestingUtility.java:438)
        at org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.after(TestSplitTransactionOnCluster.java:77)

I wonder how many such hung surefirebooterXXX processes there are on the
Jenkins build machine.
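
Judging by the two Thread.join frames in the trace, the main thread appears to be parked in a join without a timeout on a RegionServerThread that has not exited. As a rough sketch only (BoundedJoin and joinAll are hypothetical names, not part of HBase or JVMClusterUtil), a shutdown path could join against a deadline and report the threads that are still alive instead of waiting forever:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of HBase. */
final class BoundedJoin {

    /**
     * Waits up to timeoutMillis in total for all threads to finish and
     * returns the ones that are still alive once the deadline has passed.
     */
    static List<Thread> joinAll(List<Thread> threads, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        List<Thread> stuck = new ArrayList<>();
        for (Thread t : threads) {
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            try {
                // join(0) means "wait forever", so only join with a positive budget.
                if (remainingMillis > 0) {
                    t.join(remainingMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
            if (t.isAlive()) {
                stuck.add(t);
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        Thread quick = new Thread(() -> { }, "quick");
        Thread hung = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        }, "hung");
        hung.setDaemon(true); // lets this demo JVM exit even though "hung" is still sleeping
        quick.start();
        hung.start();
        for (Thread t : joinAll(Arrays.asList(quick, hung), 500)) {
            System.out.println("still alive after timeout: " + t.getName());
        }
    }
}
```

With something like this, a teardown method could at least log which threads refused to stop, rather than hanging until surefire kills the forked process.
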
On Mon, Jun 27, 2011 at 2:11 PM, Stack wrote:
> On Sun, Jun 26, 2011 at 5:57 PM, Mikhail Bautin wrote:
> > I am working on porting some features to HBase trunk, and it looks like
> > there are problems with unit tests in the trunk right now.
>
> Yes. We're working on getting them fixed.
>
> > What seems strange to me is that a single test timeout terminates the
> > whole test suite.
>
> I'm not sure how to change that. Here's the config for the maven
> surefire plugin:
> http://maven.apache.org/plugins/maven-surefire-plugin/test-mojo.html
>
> Here is the documentation for the test timeout flag:
>
> forkedProcessTimeoutInSeconds (int, since 2.4): Kill the forked test
> process after a certain number of seconds. If set to 0, wait forever
> for the process, never timing out.
>
> I suppose that if surefire steps in because a test has gone on too
> long, then it fails the whole test run.
>
> I experimented with the JUnit timeout annotation -- a few of our tests
> have this set -- and if its timeout elapses, only that single test
> fails and the rest of the tests keep going.
>
> So it seems that when the maven surefire plugin has to step in, we
> don't keep going. I don't see a config option to change this (maybe
> someone knows?)
>
>
> > Also, apparently the continuous integration server does not handle this
> > situation correctly, because the page at
> > https://builds.apache.org/job/HBase-TRUNK/1989/ claims that there are "no
> > failures", which is far from the truth.
>
> Well, it does report the build as failed.
>
> > To summarize, I think this raises the following questions:
> >
> > * Could the changes in
> > https://builds.apache.org/job/HBase-TRUNK/1987/ introduce this problem
> > with unit test timeouts, and how can it be fixed?
>
> I don't think the commits that went into build 1987 are responsible.
> I've seen this behavior in the past (where the timeout of a single test
> fails the build but doesn't show in the test report).
>
>
> > * Should we make the unit test suite resilient to timeouts in
> > individual tests? It could skip the timed-out test and continue rather than
> > terminate.
>
> Not all of our tests are JUnit 4, but for those that are, we can use
> the timeout annotation; e.g. add it to the tests that tend to run
> longer than the surefire-configured 15 minute timeout, so that the rest
> of our tests continue to run. Let me do this.
>
> > * How can we fix the continuous integration server so that it does not
> > report "zero failures" in this case, where half or most of the suite
> > did not even run?
> >
>
> By following the prescription above.
>
> Thanks for raising this issue Mikhail.
>
> St.Ack
>
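
The thread above contrasts two behaviors: a per-test timeout (in JUnit 4, `@Test(timeout = ...)` with a value in milliseconds) fails only the test that overran, while surefire's forkedProcessTimeoutInSeconds kills the whole forked process. The following is only a plain-Java sketch of the per-test idea, with no JUnit or surefire dependency; PerTestTimeoutRunner and runAll are hypothetical names, and this is not how either tool is actually implemented:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical mini runner; illustrates per-test timeouts only. */
final class PerTestTimeoutRunner {

    /** Runs each named test with its own deadline; returns name -> PASSED, FAILED or TIMED_OUT. */
    static Map<String, String> runAll(Map<String, Runnable> tests, long timeoutMillis) {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, Runnable> test : tests.entrySet()) {
            // Fresh single-thread executor per test, so a hung test cannot block the next one.
            ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "test-" + test.getKey());
                t.setDaemon(true); // a hung test must not keep the JVM alive
                return t;
            });
            Future<?> future = executor.submit(test.getValue());
            try {
                future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                results.put(test.getKey(), "PASSED");
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the overrunning test, then move on
                results.put(test.getKey(), "TIMED_OUT");
            } catch (ExecutionException e) {
                results.put(test.getKey(), "FAILED"); // the test body threw
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                results.put(test.getKey(), "FAILED");
            } finally {
                executor.shutdownNow();
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Runnable> tests = new LinkedHashMap<>();
        tests.put("fast", () -> { });
        tests.put("hangs", () -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        tests.put("afterHang", () -> { });
        System.out.println(runAll(tests, 200));
    }
}
```

The point of the sketch is that the test after the hung one still runs and gets its own result, which is the behavior the timeout annotation is expected to give us for the JUnit 4 tests.
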