[ https://issues.apache.org/jira/browse/HBASE-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Dimiduk updated HBASE-12403: --------------------------------- Attachment: HBASE-12403.00.patch Patch exposes timeouts from Action methods as configuration and IntegrationTestMTTR increases the timeout for startRs from 1 minute to 3 minutes. > IntegrationTestMTTR flaky due to aggressive RS restart timeout > -------------------------------------------------------------- > > Key: HBASE-12403 > URL: https://issues.apache.org/jira/browse/HBASE-12403 > Project: HBase > Issue Type: Test > Components: integration tests > Reporter: Nick Dimiduk > Priority: Minor > Attachments: HBASE-12403.00.patch > > > TL;DR: the CM RestartRS action timeout is only 60 seconds. Considering the RS > must connect to the Master before it can be online, this is not long enough > time in an environment where the Master can also be killed. > Failure from the console says the test failed because a > RestartRsHoldingMetaAction timed out. > {noformat} > Caused by: java.io.IOException: did timeout waiting for region server to > start:ip-172-31-42-248.ec2.internal > at > org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:153) > at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:93) > at > org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartRs(RestartActionBaseAction.java:52) > at > org.apache.hadoop.hbase.chaos.actions.RestartRsHoldingMetaAction.perform(RestartRsHoldingMetaAction.java:38) > at > org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:559) > at > org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:550) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This is only reported at the end of the test run. There's no indication as to > when during the test run this failure happened. The timeout on the start RS > operation is 60 seconds. > Hacking out the start/stop messages from the logs during the time window when > this test ran, it appears that at one point the RS took 2min 12s between when > it was launched and when it reported for duty > {noformat} > Fri Oct 31 14:53:17 UTC 2014 Starting regionserver on ip-172-31-42-248 > 2014-10-31 14:55:29,049 INFO [regionserver60020] regionserver.HRegionServer: > Serving as ip-172-31-42-248.ec2.internal,60020,1414767238992, RpcServer on > ip-172-31-42-248.ec2.internal/172.31.42.248:60020, sessionid=0x249661c2b7b0118 > {noformat} > The RS came up without incident. It spent 1min 4s of that time waiting on the > master to start, attempted to report for duty from 14:54:28 to 14:55:24. -- This message was sent by Atlassian JIRA (v6.3.4#6332)