[ 
https://issues.apache.org/jira/browse/HBASE-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282580#comment-13282580
 ] 

ramkrishna.s.vasudevan commented on HBASE-5916:
-----------------------------------------------

@Chunhui
The suggestion given above can be avoided simply by taking the actual online 
servers list after getting the logFolders.  This ensures that we do not 
split any new RS that has checked in.

In joinCluster(), as per the existing code, if any new server has checked in 
and root/meta got assigned to it, joinCluster() may treat it as a dead server 
because we have already passed in the online servers.  Hence, as per the 
patch, we get the actual online list.
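
To make the ordering concrete, here is a minimal sketch in plain Java.  It is 
not the patch and does not use HBase classes: the class name, the method 
name, the two suppliers and the "host,port,startcode" strings are all 
assumptions made for illustration.  The only point it shows is that the 
online list is read after the log folders have been listed, so an RS that 
checks in meanwhile is not selected for splitting.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from HBase.
final class LogSplitOrderingSketch {

    /**
     * Returns the servers whose logs would be split.
     *
     * @param listLogFolders       yields one "host,port,startcode" entry per
     *                             log folder found (assumed format)
     * @param currentOnlineServers yields the online servers at the moment
     *                             it is called
     */
    static Set<String> serversToSplit(Supplier<List<String>> listLogFolders,
                                      Supplier<Set<String>> currentOnlineServers) {
        // Step 1: list the log folders first.
        List<String> logFolders = listLogFolders.get();

        // Step 2: only now read the online servers, so that any RS that
        // checked in while step 1 was running is included.
        Set<String> online = currentOnlineServers.get();

        // Step 3: split only the folders whose server is not online.
        Set<String> toSplit = new LinkedHashSet<>(logFolders);
        toSplit.removeAll(online);
        return toSplit;
    }
}
```

If the online list were captured before step 1 instead, a server that 
checked in during the listing would have a log folder but would be missing 
from the stale list, and so would wrongly be treated as dead.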

The problem that you have mentioned here
bq.if Regionserver A with startcode 001 is restarted, and then Regionserver A 
with startcode 002 is in the onlineServers, but Regionserver A with startcode 
001 is in the process by SSH, not in the deadServers

We are trying to avoid this in our current v6 patch by not removing from the 
dead servers any restarted server that comes up during master initialization. 
Later, after master initialization, we clear the dead servers that match the 
current online servers with the same host name and port.
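
A rough sketch of that bookkeeping, again in plain Java rather than the 
patch itself.  The class, its fields and methods, and the 
"host,port,startcode" string format are assumptions for illustration; the 
behaviour of a check-in after initialization is also assumed here.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names come from HBase.
final class DeadServerCleanupSketch {

    private final Set<String> deadServers = new LinkedHashSet<>();
    private boolean masterInitialized = false;

    // "host,port,startcode" -> "host,port" (assumed name format).
    static String hostAndPort(String serverName) {
        return serverName.substring(0, serverName.lastIndexOf(','));
    }

    void markDead(String serverName) {
        deadServers.add(serverName);
    }

    /** Called when an RS checks in with the master. */
    void onServerCheckIn(String serverName) {
        // During initialization the old instance stays in the dead list,
        // so it is still seen as dead while it is being processed.
        if (masterInitialized) {
            clearDeadEntriesFor(serverName); // assumed post-init behaviour
        }
    }

    /** Called once master initialization has completed. */
    void finishInitialization(Set<String> onlineServers) {
        masterInitialized = true;
        // Clear dead entries that share host name and port with an
        // online server, whatever their startcode.
        for (String online : onlineServers) {
            clearDeadEntriesFor(online);
        }
    }

    private void clearDeadEntriesFor(String onlineServer) {
        String key = hostAndPort(onlineServer);
        deadServers.removeIf(dead -> hostAndPort(dead).equals(key));
    }

    Set<String> deadServers() {
        return new LinkedHashSet<>(deadServers);
    }
}
```

With this shape, Regionserver A with startcode 001 remains in the dead list 
when A with startcode 002 checks in during initialization, and is cleared 
only once initialization has finished.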

There are other problems during SSH and master initialization that may lead to 
double assignment or a concurrent modification exception.  We will address 
these in a new JIRA.
Please review the current patch and provide your suggestions.
                
> RS restart just before master intialization we make the cluster non operative
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-5916
>                 URL: https://issues.apache.org/jira/browse/HBASE-5916
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.94.1
>
>         Attachments: HBASE-5916_trunk.patch, HBASE-5916_trunk_1.patch, 
> HBASE-5916_trunk_1.patch, HBASE-5916_trunk_2.patch, HBASE-5916_trunk_3.patch, 
> HBASE-5916_trunk_4.patch, HBASE-5916_trunk_v5.patch
>
>
> Consider a case where my master is getting restarted.  An RS that was alive 
> when the master restart started gets restarted before the master 
> initializes the ServerShutDownHandler.
> {code}
> serverShutdownHandlerEnabled = true;
> {code}
> In this case, when the RS tries to register with the master, the master will 
> try to expire the server, but the server cannot be expired because the 
> serverShutdownHandler is not yet enabled.
> This case may happen when only one RS gets restarted or when all the RSs 
> get restarted at the same time (before assignRootandMeta).
> {code}
> LOG.info(message);
> if (existingServer.getStartcode() < serverName.getStartcode()) {
>   LOG.info("Triggering server recovery; existingServer " +
>     existingServer + " looks stale, new server:" + serverName);
>   expireServer(existingServer);
> }
> {code}
> If another RS is brought up, the cluster comes back to normal.
> This may be a very rare corner case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
