[jira] [Commented] (HBASE-5916) RS restart just before master intialization we make the cluster non operative

chunhui shen (JIRA) Tue, 22 May 2012 19:06:43 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281357#comment-13281357
 ]


chunhui shen commented on HBASE-5916:
-------------------------------------

I have a concern about patch v5:
{code}
+    Set<ServerName> actualDeadServers = this.serverManager.getDeadServers();
     for (Map.Entry<ServerName, List<Pair<HRegionInfo, Result>>> deadServer: deadServers.entrySet()) {
+      // skip regions of dead servers because SSH will process regions during rs expiration.
+      // see HBASE-5916
+      if(actualDeadServers.contains(deadServer.getKey())){
+        continue;
+      }
{code}

Let's look at ServerManager#getDeadServers() and the methods behind it:
{code}
public Set<ServerName> getDeadServers() {
  return this.deadservers.clone();
}

public synchronized Set<ServerName> clone() {
  Set<ServerName> clone = new HashSet<ServerName>(this.deadServers.size());
  clone.addAll(this.deadServers);
  return clone;
}

public boolean cleanPreviousInstance(final ServerName newServerName) {
  ServerName sn =
    ServerName.findServerWithSameHostnamePort(this.deadServers, newServerName);
  if (sn == null) return false;
  return this.deadServers.remove(sn);
}
{code}

If RegionServer A with startcode 001 is restarted, RegionServer A with 
startcode 002 is in onlineServers, while RegionServer A with startcode 001 is 
still being processed by SSH but is no longer in deadServers 
(cleanPreviousInstance() removes the entry with the same hostname and port).

So the check in the patch will not skip it, and we will assign the regions 
carried by RegionServer A with startcode 001 more than once.
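
To make the scenario concrete, here is a minimal, self-contained sketch of the bookkeeping described above. It is not HBase code: DeadServerSketch is a hypothetical stand-in, server names are modeled as plain "host,port,startcode" strings, and it assumes cleanPreviousInstance() is invoked when the restarted instance registers.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the master's dead-server set; not the HBase class. */
class DeadServerSketch {
    // Assumption: a server name is modeled as the string "host,port,startcode".
    private final Set<String> deadServers = new HashSet<>();

    synchronized void add(String serverName) {
        deadServers.add(serverName);
    }

    /** Returns a copy of the set, like the clone() quoted above. */
    synchronized Set<String> snapshot() {
        return new HashSet<>(deadServers);
    }

    /**
     * Drops dead entries that share host and port with the new server.
     * Simplification: removes every match, not just the first one found.
     */
    synchronized boolean cleanPreviousInstance(String newServerName) {
        String hostPort = hostAndPort(newServerName);
        return deadServers.removeIf(sn -> hostAndPort(sn).equals(hostPort));
    }

    static String hostAndPort(String serverName) {
        return serverName.substring(0, serverName.lastIndexOf(','));
    }

    public static void main(String[] args) {
        DeadServerSketch dead = new DeadServerSketch();
        String oldA = "hostA,60020,001";
        String newA = "hostA,60020,002";

        dead.add(oldA);                    // old instance expired, SSH still working on it
        dead.cleanPreviousInstance(newA);  // assumed to run when the new instance registers

        // The patch's check consults a snapshot of the dead-server set.
        boolean skipped = dead.snapshot().contains(oldA);
        System.out.println("old instance skipped by the check: " + skipped);
    }
}
```

Under these assumptions the old instance is absent from the snapshot, so a contains() check like the one in the patch would not skip its regions even though SSH is still processing them.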

BTW, to fix this issue, why is the following not enough?
{code}
-      throw new PleaseHoldException(message);
+      if (services.isServerShutdownHandlerEnabled()) {
+        // master has completed the initialization
+        throw new PleaseHoldException(message);
+      }
{code}

Correct me if I'm wrong. Thanks.



                
> RS restart just before master intialization we make the cluster non operative
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-5916
>                 URL: https://issues.apache.org/jira/browse/HBASE-5916
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1, 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.94.1
>
>         Attachments: HBASE-5916_trunk.patch, HBASE-5916_trunk_1.patch, 
> HBASE-5916_trunk_1.patch, HBASE-5916_trunk_2.patch, HBASE-5916_trunk_3.patch, 
> HBASE-5916_trunk_4.patch, HBASE-5916_trunk_v5.patch
>
>
> Consider a case where the master is being restarted. An RS that was alive when 
> the master restart began is itself restarted before the master enables the 
> ServerShutdownHandler.
> {code}
> serverShutdownHandlerEnabled = true;
> {code}
> In this case, when the RS tries to register with the master, the master will 
> try to expire the old server instance, but it cannot be expired because the 
> ServerShutdownHandler is not yet enabled.
> This can happen when only one RS is restarted or when all the RSs are 
> restarted at the same time (before assignRootandMeta).
> {code}
> LOG.info(message);
> if (existingServer.getStartcode() < serverName.getStartcode()) {
>   LOG.info("Triggering server recovery; existingServer " +
>     existingServer + " looks stale, new server:" + serverName);
>   expireServer(existingServer);
> }
> {code}
> If another RS is brought up, the cluster returns to normal.
> This may be a rare corner case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
