[jira] [Commented] (HBASE-12403) IntegrationTestMTTR flaky due to aggressive RS restart timeout

Hudson (JIRA) Sat, 01 Nov 2014 13:09:07 -0700

    [ https://issues.apache.org/jira/browse/HBASE-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193436#comment-14193436 ]


Hudson commented on HBASE-12403:
--------------------------------

FAILURE: Integrated in HBase-0.98 #647 (See 
[https://builds.apache.org/job/HBase-0.98/647/])
HBASE-12403 IntegrationTestMTTR flaky due to aggressive RS restart timeout 
(ndimiduk: rev 414bed7197097db4e2ce638f46d9996fdfb305b1)
* hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
* hbase-it/src/test/java/org/apache/hadoop/hbase/mttr/IntegrationTestMTTR.java


> IntegrationTestMTTR flaky due to aggressive RS restart timeout
> --------------------------------------------------------------
>
>                 Key: HBASE-12403
>                 URL: https://issues.apache.org/jira/browse/HBASE-12403
>             Project: HBase
>          Issue Type: Test
>          Components: integration tests
>            Reporter: Nick Dimiduk
>            Assignee: Nick Dimiduk
>            Priority: Minor
>             Fix For: 2.0.0, 0.98.8, 0.99.2
>
>         Attachments: HBASE-12403.00.patch
>
>
> TL;DR: the CM RestartRS action timeout is only 60 seconds. Because the RS 
> must connect to the Master before it can come online, this is not enough 
> time in an environment where the Master can also be killed.
> The console output says the test failed because a 
> RestartRsHoldingMetaAction timed out.
> {noformat}
> Caused by: java.io.IOException: did timeout waiting for region server to start:ip-172-31-42-248.ec2.internal
> at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:153)
> at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:93)
> at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartRs(RestartActionBaseAction.java:52)
> at org.apache.hadoop.hbase.chaos.actions.RestartRsHoldingMetaAction.perform(RestartRsHoldingMetaAction.java:38)
> at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:559)
> at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:550)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This is only reported at the end of the test run. There's no indication as to 
> when during the test run this failure happened. The timeout on the start RS 
> operation is 60 seconds.
> Pulling the start/stop messages out of the logs for the time window when 
> this test ran, it appears that at one point the RS took 2min 12s between when 
> it was launched and when it reported for duty:
> {noformat}
> Fri Oct 31 14:53:17 UTC 2014 Starting regionserver on ip-172-31-42-248
> 2014-10-31 14:55:29,049 INFO  [regionserver60020] regionserver.HRegionServer: Serving as ip-172-31-42-248.ec2.internal,60020,1414767238992, RpcServer on ip-172-31-42-248.ec2.internal/172.31.42.248:60020, sessionid=0x249661c2b7b0118
> {noformat}
> The RS came up without incident. It spent 1min 4s of that time waiting on the 
> master to start, and attempted to report for duty from 14:54:28 to 14:55:24.
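
The arithmetic behind the description can be checked with a small sketch. It takes the two timestamps from the log excerpt quoted above (transcribed by hand, both assumed to be UTC) and compares the elapsed startup time with the 60-second timeout. The class and method names are hypothetical and are not part of HBase or of the attached patch.

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class StartupTimeCheck {
    // Timeout on the start-RS operation, as stated in the issue description.
    static final Duration START_TIMEOUT = Duration.ofSeconds(60);

    // "Fri Oct 31 14:53:17 UTC 2014 Starting regionserver ..."
    static final LocalDateTime LAUNCHED =
        LocalDateTime.of(2014, 10, 31, 14, 53, 17);

    // "2014-10-31 14:55:29,049 INFO ... Serving as ..." (49 ms expressed in nanoseconds)
    static final LocalDateTime REPORTED =
        LocalDateTime.of(2014, 10, 31, 14, 55, 29, 49_000_000);

    // Time between launching the region server and it reporting for duty.
    static Duration startupDuration() {
        return Duration.between(LAUNCHED, REPORTED);
    }

    // True when the observed startup time is longer than the timeout.
    static boolean exceedsTimeout(Duration observed) {
        return observed.compareTo(START_TIMEOUT) > 0;
    }

    public static void main(String[] args) {
        Duration d = startupDuration();
        System.out.println("startup took " + d.getSeconds() + "s, timeout is "
            + START_TIMEOUT.getSeconds() + "s, exceeded=" + exceedsTimeout(d));
    }
}
```

With these inputs the startup takes 132 whole seconds (2min 12s), more than twice the 60-second timeout, which matches the timeout failure reported in the stack trace.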



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
