[ 
https://issues.apache.org/jira/browse/HBASE-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Dimiduk updated HBASE-12403:
---------------------------------
    Attachment: HBASE-12403.00.patch

The patch exposes the timeouts used by the Action methods as configuration, 
and IntegrationTestMTTR raises the timeout for startRs from 1 minute to 3 
minutes.
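
For readers unfamiliar with the pattern, here is a minimal sketch of what 
"exposing a timeout as configuration" can look like. It is not the code in 
the attached patch: the class name, the configuration key, and the use of 
java.util.Properties in place of a Hadoop Configuration are all assumptions 
made for illustration. Only the 60-second default and the 3-minute override 
come from this issue.

```java
import java.util.Properties;
import java.util.function.BooleanSupplier;

// Illustrative only: the class, key, and method names are hypothetical, not HBase's.
final class ConfigurableTimeoutSketch {
    // Hypothetical configuration key; the default mirrors the 60 s cited in the issue.
    static final String START_RS_TIMEOUT_KEY = "example.chaos.action.startrs.timeout.ms";
    static final long START_RS_TIMEOUT_DEFAULT_MS = 60_000L;

    // Reads the timeout from configuration, falling back to the default when unset.
    static long startRsTimeoutMs(Properties conf) {
        String raw = conf.getProperty(START_RS_TIMEOUT_KEY);
        return raw == null ? START_RS_TIMEOUT_DEFAULT_MS : Long.parseLong(raw.trim());
    }

    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition held, false on timeout or interruption.
    static boolean waitFor(BooleanSupplier started, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!started.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, a test that needs a longer wait would set the key to 
180000 before the actions are created, while every other caller keeps the 
60-second default.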

> IntegrationTestMTTR flaky due to aggressive RS restart timeout
> --------------------------------------------------------------
>
>                 Key: HBASE-12403
>                 URL: https://issues.apache.org/jira/browse/HBASE-12403
>             Project: HBase
>          Issue Type: Test
>          Components: integration tests
>            Reporter: Nick Dimiduk
>            Priority: Minor
>         Attachments: HBASE-12403.00.patch
>
>
> TL;DR: the CM (ChaosMonkey) RestartRS action timeout is only 60 seconds. 
> Because the RS must connect to the Master before it can come online, this is 
> not long enough in an environment where the Master can also be killed.
> Failure from the console says the test failed because a 
> RestartRsHoldingMetaAction timed out.
> {noformat}
> Caused by: java.io.IOException: did timeout waiting for region server to start:ip-172-31-42-248.ec2.internal
> at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:153)
> at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:93)
> at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartRs(RestartActionBaseAction.java:52)
> at org.apache.hadoop.hbase.chaos.actions.RestartRsHoldingMetaAction.perform(RestartRsHoldingMetaAction.java:38)
> at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:559)
> at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:550)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This is only reported at the end of the test run. There's no indication as to 
> when during the test run this failure happened. The timeout on the start RS 
> operation is 60 seconds.
> Extracting the start/stop messages from the logs for the time window when 
> this test ran shows that at one point the RS took 2min 12s between when 
> it was launched and when it reported for duty:
> {noformat}
> Fri Oct 31 14:53:17 UTC 2014 Starting regionserver on ip-172-31-42-248
> 2014-10-31 14:55:29,049 INFO  [regionserver60020] regionserver.HRegionServer: 
> Serving as ip-172-31-42-248.ec2.internal,60020,1414767238992, RpcServer on 
> ip-172-31-42-248.ec2.internal/172.31.42.248:60020, sessionid=0x249661c2b7b0118
> {noformat}
> The RS came up without incident. It spent 1min 4s of that time waiting on the 
> master to start, and attempted to report for duty from 14:54:28 to 14:55:24.
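
The intervals quoted above can be checked with a few lines of java.time 
arithmetic. This is only a sketch for verifying the numbers: the class and 
method names are made up for illustration, and the timestamps are copied from 
the log excerpt with the milliseconds dropped.

```java
import java.time.Duration;
import java.time.LocalTime;

// Illustrative helper for checking the timeline; not part of HBase or the patch.
final class StartupTimelineSketch {
    // Elapsed time between two same-day wall-clock times given as HH:mm:ss.
    static Duration between(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to));
    }

    public static void main(String[] args) {
        // Launch (14:53:17) to "Serving as ..." (14:55:29).
        Duration total = between("14:53:17", "14:55:29");
        // Window during which the RS attempted to report for duty.
        Duration reporting = between("14:54:28", "14:55:24");
        System.out.println("launch to serving: " + total.getSeconds() + " s");
        System.out.println("report-for-duty window: " + reporting.getSeconds() + " s");
    }
}
```

The first interval is 132 seconds, which matches the 2min 12s figure. That is 
longer than the 60-second timeout but within the 3-minute value that 
IntegrationTestMTTR uses after the patch.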



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
