[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593935#comment-14593935
 ] 

Andrew Purtell commented on HBASE-13937:
----------------------------------------

bq. We will not treat exceptions coming from server ping differently, but 
instead will keep retrying to ping.

Looking at the V2 patch. So we check once if the server is in the dead list and 
then proceed to ping. This patch hoists out this check:
{code}
+      synchronized (this.onlineServers) {
+        if (this.deadservers.isDeadServer(server)) {
+          return false;
+        }
+      }
{code}
that HBASE-13172 put into the ping loop. We retain this change from HBASE-13172:
{code}
@@ -851,13 +858,21 @@ public class ServerManager {
           return info != null && info.hasServerName()
             && server.getStartcode() == info.getServerName().getStartCode();
         }
+      } catch (RegionServerStoppedException | ServerNotRunningYetException e) {
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Couldn't reach " + server, e);
+        }
+        break;
       } catch (IOException ioe) {
-        LOG.debug("Couldn't reach " + server + ", try=" + retryCounter.getAttemptTimes()
-          + " of " + retryCounter.getMaxAttempts(), ioe);
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Couldn't reach " + server + ", try=" + retryCounter.getAttemptTimes() + " of "
+              + retryCounter.getMaxAttempts(), ioe);
+        }
         try {
           retryCounter.sleepUntilNextRetry();
         } catch(InterruptedException ie) {
           Thread.currentThread().interrupt();
+          break;
         }
       }
     }
{code}
that breaks out of the ping loop if we catch RegionServerStoppedException or 
ServerNotRunningYetException.

This lgtm for application to 0.98, except that the multicatch (Java 7+ only) 
will need to be converted to the equivalent Java 6 idiom.
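
For reference, a minimal sketch of what that Java 6 conversion could look like: the single multicatch becomes two separate catch blocks with the same body, placed before the generic IOException handler. This is not the actual 0.98 patch. The exception classes below are hypothetical stand-ins for the HBase ones (assumed here to extend IOException), and Pinger, FailingPinger and isReachable are illustrative names only. Logging and the sleep between retries are left out.

```java
import java.io.IOException;

public class PingLoopJava6 {
  /** Hypothetical stand-in for HBase's RegionServerStoppedException. */
  static class RegionServerStoppedException extends IOException {
    RegionServerStoppedException(String msg) { super(msg); }
  }

  /** Hypothetical stand-in for HBase's ServerNotRunningYetException. */
  static class ServerNotRunningYetException extends IOException {
    ServerNotRunningYetException(String msg) { super(msg); }
  }

  /** Hypothetical abstraction of the ping RPC to a region server. */
  interface Pinger {
    boolean ping() throws IOException;
  }

  /** Illustrative helper: always throws the given exception and counts calls. */
  static class FailingPinger implements Pinger {
    final IOException toThrow;
    int calls = 0;

    FailingPinger(IOException toThrow) { this.toThrow = toThrow; }

    public boolean ping() throws IOException {
      calls++;
      throw toThrow;
    }
  }

  /** Pings up to maxAttempts times; gives up early on the two "not serving" exceptions. */
  static boolean isReachable(Pinger pinger, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return pinger.ping();
      } catch (RegionServerStoppedException e) {
        // Java 6 has no multicatch: one catch block per type, body duplicated.
        break;
      } catch (ServerNotRunningYetException e) {
        break;
      } catch (IOException ioe) {
        // Any other IOException: fall through to the next attempt.
      }
    }
    return false;
  }
}
```

The two specific catch blocks have to come before the IOException one, otherwise the compiler rejects them as unreachable.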

> Partially revert HBASE-13172 
> -----------------------------
>
>                 Key: HBASE-13937
>                 URL: https://issues.apache.org/jira/browse/HBASE-13937
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Region Assignment
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0
>
>         Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch
>
>
> HBASE-13172 is supposed to fix a UT issue, but causes other problems that 
> parent jira (HBASE-13605) is attempting to fix. 
> However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, 
> to put it mildly, major design flaws in AM / RS. 
> Regardless of 13605, the issue with 13172 is that we catch 
> {{ServerNotRunningYetException}} from {{isServerReachable()}} and return 
> false, which then puts the server in the {{RegionStates.deadServers}} list. 
> Once it is in that list, we can still assign and unassign regions to the RS 
> after it has started (because regular assignment does not check whether the 
> server is in {{RegionStates.deadServers}}). However, after the first assign 
> and unassign, we cannot assign the region again since then the check for the 
> lastServer will think that the server is dead. 
> It turns out that a proper patch for 13605 is very hard without fixing the 
> rest of the broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a 
> colorful history). For 1.1.1, I think we should just revert parts of 
> HBASE-13172 for now. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
