[ https://issues.apache.org/jira/browse/HBASE-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565682#comment-16565682 ]
stack commented on HBASE-20992:
-------------------------------

For now upped hbase.client.retries.number from its default of 15 to 25. Would be better if all recovered inside two minutes.

> MTTR, Chaos, and ITBLL
> ----------------------
>
> Key: HBASE-20992
> URL: https://issues.apache.org/jira/browse/HBASE-20992
> Project: HBase
> Issue Type: Sub-task
> Components: integration tests, MTTR
> Reporter: stack
> Priority: Major
>
> I've been having trouble getting a sustained, large ITBLL run to complete
> over the last few days. I'm seeing a bunch of the below:
> * A region splits or is moved
> * Chaos kills the Master in the middle of the Split or Move Procedure after
> a Region has been offlined
> * Master takes a while to come back whether because it is not started until
> a couple of minutes have passed and then there is some recovery to be done.
> So a region can be offline for minutes. Default we retry up to 16 times which
> ends up at about 2.5 minutes before we give up.
> So, I can up the retries when running larger tests but also, the region
> should come back online faster.
> Let me hang ITBLL fixes/notes off here.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
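
For a rough feel of how the retry count maps to wall-clock time before the client gives up, here is a minimal sketch of the backoff arithmetic. It is not HBase code. It assumes a base pause of 100 ms (what I believe hbase.client.pause defaults to) and a multiplier table copied from memory of HConstants.RETRY_BACKOFF; both should be checked against the version under test. It also ignores jitter and the time spent in each failed RPC, so real runs will take longer. The class and method names are made up for illustration.

```java
public class RetryBackoffSketch {
    // ASSUMPTION: multipliers as recalled from HConstants.RETRY_BACKOFF;
    // verify against the HBase version you are running.
    static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    // Sum of the pauses taken across `retries` attempts, in milliseconds.
    // Past the end of the table the last multiplier is reused.
    // Jitter and per-RPC time are deliberately left out.
    static long totalPauseMillis(int retries, long basePauseMillis) {
        long total = 0;
        for (int i = 0; i < retries; i++) {
            int idx = Math.min(i, RETRY_BACKOFF.length - 1);
            total += basePauseMillis * RETRY_BACKOFF[idx];
        }
        return total;
    }

    public static void main(String[] args) {
        long basePauseMillis = 100; // ASSUMPTION: default hbase.client.pause
        for (int retries : new int[] {15, 16, 25}) {
            long millis = totalPauseMillis(retries, basePauseMillis);
            System.out.println("retries=" + retries + " -> total pause " + millis + " ms");
        }
    }
}
```

Under these assumptions, 16 pauses sum to 148100 ms, which lines up with the "about 2.5 minutes" in the description, and 25 pauses sum to 328100 ms, roughly five and a half minutes. That is why upping the retries papers over a Master that is slow to come back, but it does nothing for the region being offline that long in the first place.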