[ 
https://issues.apache.org/jira/browse/HBASE-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565682#comment-16565682
 ] 

stack commented on HBASE-20992:
-------------------------------

For now I have upped hbase.client.retries.number from its default of 15 to 25.
It would be better if everything recovered inside two minutes.
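
To get a feel for how much extra time the higher retry count buys, here is a
rough sketch of the arithmetic. It is not HBase code: the multiplier table is
what I recall of HConstants.RETRY_BACKOFF, the 100 ms base pause is the assumed
default of hbase.client.pause, and the class and method names are made up for
illustration. It ignores jitter and the time spent inside each failed RPC, so
real elapsed time will be longer.

```java
// Sketch only: estimates how long a client sleeps between retries before
// giving up. The multiplier table and the 100 ms base pause are assumptions
// (recalled from HConstants.RETRY_BACKOFF and hbase.client.pause); check
// them against the HBase version you actually run.
class RetryWindowSketch {
    // Assumed back-off multipliers; the last entry is reused once exhausted.
    static final int[] BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    // Sum of the pauses slept between attempts, ignoring jitter and RPC time.
    static long totalPauseMs(int pauses, long basePauseMs) {
        long total = 0;
        for (int i = 0; i < pauses; i++) {
            int idx = Math.min(i, BACKOFF.length - 1);
            total += basePauseMs * BACKOFF[idx];
        }
        return total;
    }

    public static void main(String[] args) {
        long basePauseMs = 100; // assumed default of hbase.client.pause
        for (int retries : new int[] {15, 25}) {
            long ms = totalPauseMs(retries, basePauseMs);
            System.out.printf("retries=%d -> about %.1f s of pauses%n", retries, ms / 1000.0);
        }
    }
}
```

Under those assumptions, 15 retries sleep for roughly 128 seconds in total and
25 retries for roughly 328 seconds, so the bump gives a region a little over
five minutes to come back instead of a little over two.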

> MTTR, Chaos, and ITBLL
> ----------------------
>
>                 Key: HBASE-20992
>                 URL: https://issues.apache.org/jira/browse/HBASE-20992
>             Project: HBase
>          Issue Type: Sub-task
>          Components: integration tests, MTTR
>            Reporter: stack
>            Priority: Major
>
> I've been having trouble getting a sustained, large ITBLL run to complete 
> over the last few days. I'm seeing a bunch of the below:
>  * A region splits or is moved
>  * Chaos kills the Master in the middle of the Split or Move Procedure after 
> a Region has been offlined
>  * Master takes a while to come back: it is not restarted until a couple of 
> minutes have passed, and then there is some recovery to be done.
> So a region can be offline for minutes. By default we retry up to 16 times, 
> which comes to about 2.5 minutes before we give up.
> So I can up the retries when running larger tests, but the region should 
> also come back online faster.
> Let me hang ITBLL fixes/notes off here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
