[ 
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418863#comment-13418863
 ] 

Gregory Chanan commented on HBASE-5843:
---------------------------------------

Looks great so far, nkeywal.

Some questions:

{quote}
2) Kill -9 of a RS; wait for all regions to become online again:
0.92: 980s
0.96: ~13s
=> The 180s gap comes from HBASE-5844. For master, HBASE-5926 is not tested but 
should bring similar results.
{quote}

I'm confused as to what the 180s gap refers to.  I see 980 (test 2) - 800 
(test 1) = 180, but that is against 0.92, which doesn't have HBASE-5970, right?  
Could you clarify?

{quote}
3) Start of the cluster after a clean stop; wait for all regions to
become online.
0.92: ~1020s
0.94: ~1023s (tested once only)
0.96: ~31s
=> The benefit is visible at startup
=> This does not come from something implemented for 0.94
{quote}

Awesome. Do we think this is also due to HBASE-5970 and HBASE-6109? (I 
assume HBASE-5844 and HBASE-5926 do not apply in this case.)

{quote}
7) With 2 RS, Insert 20M simple puts; then kill -9 the second one. See how long 
it takes to have all the regions available.
0.92) 180s detection time+ then hangs twice out of 2 tests.
0.96) 14s (hangs once out of 3)
=> There's a bug 
{quote}
Has a JIRA been filed?

{quote}
Test to be changed to get a real difference when we need to replay the wal.
{quote}
Could you clarify what you mean here?

                
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>
> A part of the approach is described here: 
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - a failure impacts client applications only through an added delay in 
> executing a query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
