[ 
https://issues.apache.org/jira/browse/HBASE-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619492#comment-13619492
 ] 

ramkrishna.s.vasudevan commented on HBASE-8127:
-----------------------------------------------

[~rajesh23]
I think you and Jeffrey had thought of different approaches and finally decided 
on patch_2.
I tried removing the code in AM from your patch, and all the testcases, 
including TestMasterFailover, are passing.
So I feel the handling for DISABLED and DISABLING tables while processing RIT 
during master initialization may not be needed.
The SSH code will anyway be triggered for the RS that went down during master 
start up.
There we try to handle the case based on the ZKTable state.  That should be 
fine.  In fact, SSH in normal cases will now also handle 
DISABLED/DISABLING tables.
What do you think?
bq.Let us take an example. There are 250 regions in RIT that need to be 
processed during initialization, and 250 regions of a DISABLED table were 
opened before a server went down. In that case SSH will take the same time as 
master initialization, because in both cases we need to read from zk.
I can see that you have given an example to substantiate your point.  But I do 
feel that having the same code doing the same job in two places will make our 
code less readable, and any later bug here would need to be fixed in 2 places.
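
To make the "one place" argument concrete, here is a minimal sketch of a 
single dead-server handler that decides what to do with each region based on 
the table state. This is not the actual HBase code: the class 
DeadServerRitSketch, the TableState enum, Outcome and handleDeadServer are 
all hypothetical names, and plain maps stand in for ZKTable and the 
regions-in-transition map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeadServerRitSketch {
    /** Hypothetical stand-in for the table state that ZKTable tracks. */
    enum TableState { ENABLED, DISABLING, DISABLED }

    /** Hypothetical summary of what the handler did with each region. */
    static final class Outcome {
        final List<String> reassigned = new ArrayList<>();
        final List<String> offlined = new ArrayList<>();
    }

    /**
     * Handles the regions of one dead server in a single code path.
     * Regions of DISABLED/DISABLING tables are removed from the RIT map and
     * only recorded as offline; all other regions are queued for reassignment.
     * A real assignment would also update the RIT entry of reassigned regions;
     * this sketch leaves those entries untouched.
     */
    static Outcome handleDeadServer(Map<String, String> regionToTable,
                                    Map<String, TableState> tableStates,
                                    Map<String, String> regionsInTransition) {
        Outcome outcome = new Outcome();
        for (Map.Entry<String, String> entry : regionToTable.entrySet()) {
            String region = entry.getKey();
            // Assumption: a table with no recorded state is treated as ENABLED.
            TableState state =
                tableStates.getOrDefault(entry.getValue(), TableState.ENABLED);
            if (state == TableState.DISABLED || state == TableState.DISABLING) {
                // Clear the stale RIT entry so the region is not stuck forever.
                regionsInTransition.remove(region);
                outcome.offlined.add(region);
            } else {
                outcome.reassigned.add(region);
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        Map<String, String> regionToTable = new LinkedHashMap<>();
        regionToTable.put("region-a", "enabledTable");
        regionToTable.put("region-b", "disabledTable");

        Map<String, TableState> tableStates = new LinkedHashMap<>();
        tableStates.put("enabledTable", TableState.ENABLED);
        tableStates.put("disabledTable", TableState.DISABLED);

        Map<String, String> rit = new LinkedHashMap<>();
        rit.put("region-b", "OPENING");

        Outcome outcome = handleDeadServer(regionToTable, tableStates, rit);
        System.out.println("reassigned=" + outcome.reassigned);
        System.out.println("offlined=" + outcome.offlined);
        System.out.println("still in RIT=" + rit.keySet());
    }
}
```

If master initialization simply leaves the regions of a dead RS to a handler 
of this kind, the DISABLED/DISABLING check lives in only one place.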
                
> Region of a disabling or disabled table could be stuck in transition state 
> when RS dies during Master initialization
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8127
>                 URL: https://issues.apache.org/jira/browse/HBASE-8127
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.5
>            Reporter: Jeffrey Zhong
>            Assignee: rajeshbabu
>             Fix For: 0.94.7
>
>         Attachments: HBASE-8127_94_2.patch, HBASE-8127_94_3.patch, 
> HBASE-8127_feedback.patch, HBASE-8127.patch, hbase-8127_v1.patch, 
> reproduce-hang.patch
>
>
> The issue happens when an RS dies while a master is starting up. If the RS 
> reports open to the new master instance and dies immediately thereafter, the 
> regions of disabling (or disabled) tables on the dead RS will stay in RIT 
> state forever.
> I attached a patch to simulate the situation and you can run the following 
> command to reproduce the issue:
> {code}mvn test -PlocalTests 
> -Dtest=TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS{code}
> Basically, we skip regions of a dead server inside 
> AM.processDeadServersAndRecoverLostRegions, as in the following code, and 
> rely on SSH to process those skipped regions:
> {code}
>           for (Pair<HRegionInfo, Result> deadRegion : deadServer.getValue()) {
>             nodes.remove(deadRegion.getFirst().getEncodedName());
>           }
> {code} 
> In SSH, however, we again skip regions of disabling (or disabled) tables in 
> the function processDeadRegion. As a result, the RITs of disabling (or 
> disabled) tables are stuck there forever.
>  

