[ 
https://issues.apache.org/jira/browse/HBASE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546185#comment-13546185
 ] 

nkeywal commented on HBASE-7407:
--------------------------------

bq. We have a DoNotRetryIOException. Does this mean all other exceptions are 
retriable? 
Unfortunately no. For example, ServerNotRunningYetException is retriable but 
RegionServerStoppedException is not. Still, they both extend IOException.

bq. Do we need a PleaseRetryException?
If we want to distinguish the different cases between:
- retriable
- not retriable
- don't know 

then yes :-). And the "don't know" case can mean either 'not coded' or 'I really don't know'.
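
To make the three cases concrete, here is a minimal sketch of how a caller could tell them apart. It is only an illustration: PleaseRetryException is the hypothetical marker class floated in this thread, DoNotRetryIOException is redeclared locally as a stand-in so the snippet compiles without HBase on the classpath, and the RetryClassifier / Retriable names are invented for the example.

```java
import java.io.IOException;

public class RetryClassifier {

    // Local stand-in for HBase's DoNotRetryIOException, so the sketch is self-contained.
    public static class DoNotRetryIOException extends IOException {}

    // Hypothetical marker exception discussed in this thread; not an existing class.
    public static class PleaseRetryException extends IOException {}

    // The three cases: retriable, not retriable, don't know.
    public enum Retriable { YES, NO, UNKNOWN }

    // Classify by marker type; anything that carries no marker is "don't know".
    public static Retriable classify(IOException e) {
        if (e instanceof DoNotRetryIOException) {
            return Retriable.NO;
        }
        if (e instanceof PleaseRetryException) {
            return Retriable.YES;
        }
        // A plain IOException tells us nothing: either 'not coded' or 'I really don't know'.
        return Retriable.UNKNOWN;
    }
}
```

With only DoNotRetryIOException available, YES and UNKNOWN collapse into one bucket, which is the ambiguity described above.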

bq. Why do we need another lock? The caller of processRegionsInTransition 
should already have the lock, right?
Yes, you're right. I will fix this.

bq. One enhancement to the original logic we can do, is that we can time out 
those region transitions earlier so that timeout monitor can reassign them 
earlier, if needed.
I'm not a big fan: it adds an extra case. Resending seems much better. 
What's the issue you're seeing?




                
> TestMasterFailover under tests some cases and over tests some others
> --------------------------------------------------------------------
>
>                 Key: HBASE-7407
>                 URL: https://issues.apache.org/jira/browse/HBASE-7407
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment, test
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 7407.v1.patch, 7407.v2.patch, 7407.v3.patch
>
>
> The tests are done with these settings:
>     conf.setInt("hbase.master.assignment.timeoutmonitor.period", 2000);
>     conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 4000);
> As a result:
> 1) some tests seem to work, but in real life the recovery would take 5 
> minutes or more, as in production these values are always higher. So we don't 
> see the real issues.
> 2) The tests include specific cases that should not happen in production. They 
> work because the timeout catches everything, but these scenarios do not need 
> to be optimized, as they cannot happen. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
