keith-turner commented on code in PR #6049:
URL: https://github.com/apache/accumulo/pull/6049#discussion_r2873648857


##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -1192,15 +1219,35 @@ private List<TabletMigration> 
checkMigrationSanity(Set<TabletServerId> current,
               > MAX_BAD_STATUS_COUNT) {
             if (shutdownServerRateLimiter.tryAcquire()) {
               log.warn("attempting to stop {}", server);
-              try {
-                TServerConnection connection2 = 
tserverSet.getConnection(server);
-                if (connection2 != null) {
-                  connection2.halt(managerLock);
+              var gracefulHaltTimer = 
tserverHaltRpcAttempts.computeIfAbsent(server,
+                  s -> new GracefulHaltTimer(getConfiguration()));
+              if (gracefulHaltTimer.shouldForceHalt()) {
+                log.warn("tserver {} is not responding to halt requests, 
deleting zlock", server);
+                var zk = getContext().getZooReaderWriter();
+                var iid = getContext().getInstanceID();
+                String tserversPath = Constants.ZROOT + "/" + iid + 
Constants.ZTSERVERS;
+                try {
+                  ServiceLock.deleteLocks(zk, tserversPath, 
server.getHostAndPort()::equals,
+                      log::info, false);
+                  tserverHaltRpcAttempts.remove(server);
+                  badServers.remove(server);
+                } catch (KeeperException | InterruptedException e) {
+                  log.error("Failed to delete zlock for server {}", server, e);
+                }
+              } else {
+                try {
+                  TServerConnection connection2 = 
tserverSet.getConnection(server);
+                  if (connection2 != null) {
+                    connection2.halt(managerLock);
+                  }
+                } catch (TTransportException e1) {
+                  // ignore: it's probably down so log the exception at trace
+                  log.trace("error attempting to halt tablet server {}", 
server, e1);
+                } catch (Exception e2) {
+                  log.info("error talking to troublesome tablet server {}", 
server, e2);
+                } finally {
+                  gracefulHaltTimer.startTimer();

Review Comment:
   Seems like this has the potential to keep resetting the timer.  Maybe 
startTimer should only create a new timer object if there is not one.



##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -1143,6 +1146,30 @@ private List<TabletMigration> 
checkMigrationSanity(Set<TabletServerId> current,
 
   }
 
+  /**
+   * This class tracks details about the haltRPCs used
+   */
+  private static class GracefulHaltTimer {
+
+    Duration maxHaltGraceDuration;
+    Timer timer;
+
+    public GracefulHaltTimer(AccumuloConfiguration config) {
+      timer = null;
+      maxHaltGraceDuration =
+          
Duration.ofMillis(config.getTimeInMillis(Property.MANAGER_TSERVER_HALT_DURATION));
+    }
+
+    public void startTimer() {
+      timer = Timer.startNew();
+    }
+
+    public boolean shouldForceHalt() {

Review Comment:
   This code is called in thread pool so over time different threads may call 
startTimer() and shouldForceHalt().  The threads may not see previous threads 
changes.  If both methods are made synchronized then threads would always see 
changes.  Also defends against concurrent calls, but not sure that will happen 
w/ the current code.  Seeing changes does seem to be an issue though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to