[ 
https://issues.apache.org/jira/browse/HBASE-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283349#comment-13283349
 ] 

ramkrishna.s.vasudevan commented on HBASE-5916:
-----------------------------------------------

@Chunhui
I like your idea too.  As I said, we are planning to raise an improvement 
activity for master restart and SSH (ServerShutdownHandler).
Even with the above approach, there is one more problematic scenario.  Please 
note that this scenario can occur even without your suggestion.

There are two region servers, and both go down while the flow is in 
AM.joinCluster().  As no RS is available at that time, we will not make any 
assignment, and all the regions will go into RIT, waiting for the timeout 
monitor.  SSH is also waiting, as master initialization is not complete (this 
step is as per your suggestion).  Now suppose there are 100 regions, all 
waiting to be assigned.
If a new RS comes up now, the following code in TimeoutMonitor applies:
{code}
if (regionState.getStamp() + timeout <= now) {
  // decide on action upon timeout
  actOnTimeOut(regionState);
} else if (this.allRegionServersOffline && !allRSsOffline) {
  // if some RSs just came back online, we can start
  // the assignment right away
  actOnTimeOut(regionState);
}
{code}
It will immediately trigger assignment.  At the same time, master 
initialization has completed by then, so SSH is also able to carry on with 
assignment.  This will lead to double assignment.  In fact, in HBASE-5816 
Stack suggested having one common queue through which every assignment is 
done, so that SSH does not interfere with it or vice versa.  
I suggest we get in the patch that addresses the current JIRA problem and 
work on a different JIRA to address the master restart and SSH area, which is 
troublesome.
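
To make the common-queue idea concrete, here is a rough sketch of how a single 
entry point could reject a second assignment request for a region that already 
has one pending.  This is only an illustration: the class 
SingleAssignmentQueue and its methods are hypothetical names, not part of 
HBase or of any patch on HBASE-5816, and the real design would also have to 
deal with ZK state, retries and timeouts.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only: none of these names come from HBase.
class SingleAssignmentQueue {
    // Regions that have an assignment requested but not yet completed.
    private final Set<String> inFlight = new HashSet<>();
    // Accepted requests in arrival order, kept only for illustration.
    private final List<String> accepted = new ArrayList<>();

    // Returns true if the request is accepted, false if the region
    // already has a pending assignment from any requester.
    synchronized boolean requestAssign(String regionName, String requester) {
        if (!inFlight.add(regionName)) {
            return false;
        }
        accepted.add(requester + " -> " + regionName);
        return true;
    }

    // Called once the region is open, so that later requests are allowed.
    synchronized void markAssigned(String regionName) {
        inFlight.remove(regionName);
    }

    synchronized List<String> acceptedRequests() {
        return new ArrayList<>(accepted);
    }

    public static void main(String[] args) {
        SingleAssignmentQueue queue = new SingleAssignmentQueue();
        // The timeout monitor asks first and is accepted.
        System.out.println(queue.requestAssign("region-1", "TimeoutMonitor"));
        // SSH asks for the same region while it is pending and is rejected.
        System.out.println(queue.requestAssign("region-1", "ServerShutdownHandler"));
    }
}
```

With something along these lines, both TimeoutMonitor and SSH would go through 
requestAssign(), and whichever comes second for the same region would simply 
be dropped instead of causing a double assignment.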

                
> RS restart just before master intialization we make the cluster non operative
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-5916
>                 URL: https://issues.apache.org/jira/browse/HBASE-5916
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.94.1
>
>         Attachments: HBASE-5916_trunk.patch, HBASE-5916_trunk_1.patch, 
> HBASE-5916_trunk_1.patch, HBASE-5916_trunk_2.patch, HBASE-5916_trunk_3.patch, 
> HBASE-5916_trunk_4.patch, HBASE-5916_trunk_v5.patch, 
> HBASE-5916_trunk_v6.patch, HBASE-5916_trunk_v7.patch, HBASE-5916v8.patch
>
>
> Consider a case where my master is being restarted.  An RS that was alive 
> when the master restart began gets restarted before the master enables the 
> ServerShutdownHandler:
> {code}
> serverShutdownHandlerEnabled = true;
> {code}
> In this case, when the RS tries to register with the master, the master will 
> try to expire the server, but the server cannot be expired because the 
> serverShutdownHandler is not yet enabled.
> This can happen when only one RS is restarted, or when all the RSs are 
> restarted at the same time (before assignRootandMeta).
> {code}
> LOG.info(message);
> if (existingServer.getStartcode() < serverName.getStartcode()) {
>   LOG.info("Triggering server recovery; existingServer " +
>     existingServer + " looks stale, new server:" + serverName);
>   expireServer(existingServer);
> }
> {code}
> If another RS is brought up, the cluster returns to normal.
> This may be a very rare corner case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
