Copilot commented on code in PR #13345:
URL: https://github.com/apache/cloudstack/pull/13345#discussion_r3354005739


##########
engine/orchestration/src/main/java/com/cloud/agent/manager/ClusteredAgentManagerImpl.java:
##########
@@ -285,8 +290,9 @@ protected AgentAttache createAttache(final HostVO host) {
             _agents.put(host.getId(), attache);
         }
         if (old != null) {
-            logger.debug("Remove stale agent attache from current management 
server");
-            removeAgent(old, Status.Removed);
+            logger.debug("Remove stale agent attache from current management 
server {}", _nodeId);
+            // just remove agent but do not deinitialize
+            removeAgent(old.getId(), attache);
         }

Review Comment:
   Bug: this call removes the newly created forwarding attache from the _agents 
map. At this point the map already contains `attache`, so 
`removeAgent(old.getId(), attache)` will `remove(hostId)` and return without 
restoring it, leaving no attache registered for the host on this node.



##########
engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java:
##########
@@ -1376,62 +1763,129 @@ protected AgentAttache createAttacheForConnect(final 
HostVO host, final Link lin
         return attache;
     }
 
-    private AgentAttache sendReadyAndGetAttache(HostVO host, ReadyCommand 
ready, Link link, StartupCommand[] startupCmds) throws ConnectionException {
+    private AgentAttache sendReadyAndGetAttache(HostVO host, ReadyCommand 
ready, Link link, StartupCommand[] startup) throws ConnectionException {
         AgentAttache attache;
         GlobalLock joinLock = getHostJoinLock(host.getId());
         try {
-            if (!joinLock.lock(60)) {
-                throw new ConnectionException(true, String.format("Unable to 
acquire lock on host %s, to process agent connection", host));
+            long processStart = System.currentTimeMillis();
+            if (joinLock.lock(getTimeoutSec())) {
+                logProcessingStart(host, joinLock);
+                try {
+                    updateReadyCommandWithMSList(host, ready, startup);
+                    attache = createAndNotifyAttache(host, link, startup);
+                } finally {
+                    logProcessingFinish(host, joinLock, processStart);
+                    joinLock.unlock();
+                }
+            } else {
+                throw createLockAcquisitionException(host, joinLock, 
processStart);
             }
-
-            logger.debug("Acquired lock on host {}, to process agent 
connection", host);
-            attache = connectHostAgent(host, ready, link, startupCmds, 
joinLock);
         } finally {
             joinLock.releaseRef();
         }
 
         return attache;
     }
 
-    private AgentAttache connectHostAgent(HostVO host, ReadyCommand ready, 
Link link, StartupCommand[] startupCmds, GlobalLock joinLock) throws 
ConnectionException {
-        AgentAttache attache;
-        try {
-            final List<String> agentMSHostList = new ArrayList<>();
-            String lbAlgorithm = null;
-            if (startupCmds != null && startupCmds.length > 0) {
-                final String agentMSHosts = startupCmds[0].getMsHostList();
-                if (StringUtils.isNotEmpty(agentMSHosts)) {
-                    String[] msHosts = agentMSHosts.split("@");
-                    if (msHosts.length > 1) {
-                        lbAlgorithm = msHosts[1];
-                    }
-                    
agentMSHostList.addAll(Arrays.asList(msHosts[0].split(",")));
+    private void logProcessingStart(HostVO host, GlobalLock joinLock) {
+        StringBuilder msgBuilder = getSummaryMsgBuilder("Processing Host 
connection started",
+                EventTypes.EVENT_HOST_RECONNECT, host.getUuid(), 
host.getName(), joinLock.getName(), null, null, null);
+        logger.info(msgBuilder.toString());
+    }
+
+    private void logProcessingFinish(HostVO host, GlobalLock joinLock, long 
processStart) {
+        long processFinish = System.currentTimeMillis() - processStart;
+        StringBuilder msgBuilder = getSummaryMsgBuilder("Processing Host 
connection finished",
+                EventTypes.EVENT_HOST_RECONNECT, host.getUuid(), 
host.getName(), joinLock.getName(), null, null,
+                processFinish);
+        logger.fatal(msgBuilder.toString());
+    }
+
+    private void updateReadyCommandWithMSList(HostVO host, ReadyCommand ready, 
StartupCommand[] startup) {
+        List<String> agentMSHostList = new ArrayList<>();
+        String lbAlgorithm = null;
+
+        if (startup != null) {
+            String agentMSHosts = startup[0].getMsHostList();
+            if (StringUtils.isNotEmpty(agentMSHosts)) {
+                String[] msHosts = agentMSHosts.split("@");

Review Comment:
   Potential AIOOBE: `startup[0]` is accessed when `startup != null` but 
without checking `startup.length > 0`. An empty startup array would crash the 
connect flow.



##########
engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java:
##########
@@ -1544,24 +2008,26 @@ protected void runInContext() {
     }
 
     protected void connectAgent(final Link link, final Command[] cmds, final 
Request request) {
-        // send startupanswer to agent in the very beginning, so agent can 
move on without waiting for the answer for an undetermined time, if we put this 
logic into another
-        // thread pool.
-        final StartupAnswer[] answers = new StartupAnswer[cmds.length];
+        // send startup answer to agent in the very beginning, so agent can 
move on without waiting for the answer for an undetermined time,
+        // if we put this logic into another thread pool.
+        Map<String, String> backoffConfiguration = 
ConfigKeyUtil.toMap(BackoffConfiguration.value());
+        StartupAnswer[] answers = new StartupAnswer[cmds.length];
         Command cmd;
         for (int i = 0; i < cmds.length; i++) {
             cmd = cmds[i];
-            if (cmd instanceof StartupRoutingCommand || cmd instanceof 
StartupProxyCommand || cmd instanceof StartupSecondaryStorageCommand ||
-                    cmd instanceof StartupStorageCommand) {
-                answers[i] = new StartupAnswer((StartupCommand) cmds[i], 0, 
"", "", mgmtServiceConf.getPingInterval());
-                break;
+            if (cmd instanceof StartupRoutingCommand || cmd instanceof 
StartupProxyCommand || cmd instanceof StartupSecondaryStorageCommand
+                    || cmd instanceof StartupStorageCommand) {
+                StartupAnswer answer = new StartupAnswer((StartupCommand) 
cmds[i], 0, "", "", mgmtServiceConf.getPingInterval());
+                answer.setParams(backoffConfiguration);
+                
answer.setAgentHostStatusCheckDelaySec(AgentHostStatusCheckDelay.value());
+                answers[i] = answer;
             }
         }
-        Response response;
-        response = new Response(request, answers[0], _nodeId, -1);
+        Response response = new Response(request, answers[0], _nodeId, -1);

Review Comment:
   Possible NPE: the loop fills `answers[i]`, but the response always uses 
`answers[0]`. If the first command isn't a Startup* command, `answers[0]` stays 
null and `new Response(...)` will fail. Even when it is, the loop should stop 
once the first startup answer is prepared.



##########
framework/cluster/src/main/java/com/cloud/cluster/ClusterServiceServletImpl.java:
##########
@@ -137,6 +140,12 @@ public boolean ping(final String callingPeer) throws 
RemoteException {
 
     private String executePostMethod(final CloseableHttpClient client, final 
HttpPost method) {
         String result = null;
+        String request = null;
+        try {
+            request = EntityUtils.toString(method.getEntity(), 
Charset.defaultCharset());
+        } catch (Exception e) {
+            logger.warn("Failed to retrieve request entity for POST {}", 
serviceUrl, e);
+        }

Review Comment:
   `executePostMethod` always materializes the request entity into a String, 
even when debug logging is disabled. This adds overhead and also risks logging 
sensitive internal payloads (e.g. gsonPackage) when debug is enabled. Consider 
only building/logging the request string when needed, and redacting 
large/sensitive fields.



##########
utils/src/main/java/com/cloud/utils/nio/Link.java:
##########
@@ -354,16 +376,28 @@ public InetSocketAddress getSocketAddress() {
         return _addr;
     }
 
+    public Integer getLocalPort() {
+        return _localPort;
+    }
+
     public String getIpAddress() {
         return _addr.getAddress().toString();
     }
 
     public synchronized void terminated() {
+        if (LOGGER.isTraceEnabled()) {
+            LOGGER.debug("Terminating connection to {}", _addr);
+        }
         _key = null;

Review Comment:
   Logging-level mismatch: this is guarded by `isTraceEnabled()` but logs at 
DEBUG level. That makes the message appear at DEBUG even when TRACE is disabled 
(and also means the guard is ineffective).



##########
utils/src/main/java/com/cloud/utils/backoff/impl/ExponentialWithJitterBackoff.java:
##########
@@ -0,0 +1,166 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+package com.cloud.utils.backoff.impl;
+
+import com.cloud.utils.DateUtil;
+import com.cloud.utils.NumbersUtil;
+import com.cloud.utils.backoff.BackoffAlgorithm;
+import com.cloud.utils.backoff.BackoffFactory;
+import com.cloud.utils.component.AdapterBase;
+
+import java.security.SecureRandom;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Random;
+import java.util.concurrent.ConcurrentHashMap;
+
+/**
+ * Exponential backoff with up/down cycling.
+ * Delay grows exponentially until a maximum, then decreases back to base, 
then repeats.
+ *
+ * @author mprokopchuk
+ */
+public class ExponentialWithJitterBackoff extends AdapterBase implements 
BackoffAlgorithm,
+        ExponentialWithJitterBackoffMBean {
+
+    /**
+     * Property name for the minimal delay to be used either by {@code 
agent.properties} file or by configuration key.
+     */
+    public static final String MIN_DELAY_MS_CONFIG_KEY = 
"backoff.min_delay_ms";
+
+    /**
+     * Property name for the maximal delay to be used either by {@code 
agent.properties} file or by configuration key.
+     */
+    public static final String MAX_DELAY_MS_CONFIG_KEY = 
"backoff.max_delay_ms";
+
+    /**
+     * Default value for minimal delay for the property {@link 
ExponentialWithJitterBackoff#MIN_DELAY_MS_DEFAULT}.
+     */
+    public static final int MIN_DELAY_MS_DEFAULT = 5_000;
+
+    /**
+     * Default value for maximal delay for the property {@link 
ExponentialWithJitterBackoff#MAX_DELAY_MS_DEFAULT}.
+     */
+    public static final int MAX_DELAY_MS_DEFAULT = 15_000;
+
+    private final Map<String, Thread> asleep = new ConcurrentHashMap<>();
+    private final Random random = new SecureRandom();
+
+    private int minDelayMs;
+    private int maxDelayMs;
+    private int maxAttempts;
+    private int attemptNumber;
+    private boolean increasing;
+
+    @Override
+    public void waitBeforeRetry() {
+        boolean interrupted = false;
+        long waitMs = getTimeToWait();
+        Thread current = Thread.currentThread();
+        try {
+            asleep.put(current.getName(), current);
+            logger.debug(String.format("Going to sleep for %s", 
DateUtil.formatMillis(waitMs)));
+            Thread.sleep(waitMs);
+            logger.debug(String.format("Sleep done for %s", 
DateUtil.formatMillis(waitMs)));
+        } catch (InterruptedException e) {
+            logger.info(String.format("Thread %s interrupted while waiting for 
retry", current.getName()), e);
+        } finally {
+            asleep.remove(current.getName());
+            calculateNextAttempt();
+            if (interrupted) {
+                Thread.currentThread().interrupt();
+            }
+        }
+    }
+
+    /**
+     * Calculates next attempt and direction.
+     */
+    private void calculateNextAttempt() {
+        if (increasing) {
+            int nextAttemptNumber = attemptNumber + 1;
+            increasing = getNextDelay() <= maxDelayMs && nextAttemptNumber <= 
maxAttempts;
+            if (increasing) {
+                attemptNumber = nextAttemptNumber;
+            }
+        } else {
+            int nextAttemptNumber = Math.max(attemptNumber - 1, 0);
+            increasing = nextAttemptNumber == 0;
+            if (!increasing) {
+                attemptNumber = nextAttemptNumber;
+            }
+        }
+    }
+
+    @Override
+    public void reset() {
+        attemptNumber = 0;
+    }
+
+    @Override
+    public Map<String, String> getConfiguration() {
+        Map<String, String> configuration = new HashMap<>();
+        configuration.put(MIN_DELAY_MS_CONFIG_KEY, String.valueOf(minDelayMs));
+        configuration.put(MAX_DELAY_MS_CONFIG_KEY, String.valueOf(maxDelayMs));
+        configuration.put(BackoffFactory.BACKOFF_IMPLEMENTATION_KEY, 
this.getClass().getName());
+        return configuration;
+    }
+
+    @Override
+    public boolean configure(String name, Map<String, Object> params) {
+        minDelayMs = NumbersUtil.parseInt((String) 
params.get(MIN_DELAY_MS_CONFIG_KEY), MIN_DELAY_MS_DEFAULT);
+        maxDelayMs = NumbersUtil.parseInt((String) 
params.get(MAX_DELAY_MS_CONFIG_KEY), MAX_DELAY_MS_DEFAULT);
+        maxAttempts = (int) Math.round(Math.log((double) maxDelayMs / 
minDelayMs) / Math.log(2));
+
+        attemptNumber = random.nextInt(maxAttempts + 1);
+        increasing = random.nextBoolean();
+        // do nothing
+        return true;
+    }
+
+    @Override
+    public Collection<String> getWaiters() {
+        return asleep.keySet();
+    }
+
+    @Override
+    public boolean wakeup(String threadName) {
+        Thread th = asleep.get(threadName);
+        if (th != null) {
+            th.interrupt();
+            return true;
+        }
+        return false;
+    }
+
+    private long getNextDelay() {
+        return (long) Math.min(minDelayMs * Math.pow(2, attemptNumber), 
maxDelayMs);
+    }
+
+    @Override
+    public long getTimeToWait() {
+        long delay = getNextDelay();
+        int jitter = random.nextInt((int) delay / 2);
+        return delay + jitter;
+    }

Review Comment:
   `getTimeToWait()` can throw `IllegalArgumentException` when `delay / 2` is 0 
(e.g., if `minDelayMs` is configured as 0 or 1). It also doesn't guard against 
invalid `minDelayMs/maxDelayMs` values (division by zero / negative delays).



##########
engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java:
##########
@@ -1376,62 +1763,129 @@ protected AgentAttache createAttacheForConnect(final 
HostVO host, final Link lin
         return attache;
     }
 
-    private AgentAttache sendReadyAndGetAttache(HostVO host, ReadyCommand 
ready, Link link, StartupCommand[] startupCmds) throws ConnectionException {
+    private AgentAttache sendReadyAndGetAttache(HostVO host, ReadyCommand 
ready, Link link, StartupCommand[] startup) throws ConnectionException {
         AgentAttache attache;
         GlobalLock joinLock = getHostJoinLock(host.getId());
         try {
-            if (!joinLock.lock(60)) {
-                throw new ConnectionException(true, String.format("Unable to 
acquire lock on host %s, to process agent connection", host));
+            long processStart = System.currentTimeMillis();
+            if (joinLock.lock(getTimeoutSec())) {
+                logProcessingStart(host, joinLock);
+                try {
+                    updateReadyCommandWithMSList(host, ready, startup);
+                    attache = createAndNotifyAttache(host, link, startup);
+                } finally {
+                    logProcessingFinish(host, joinLock, processStart);
+                    joinLock.unlock();
+                }
+            } else {
+                throw createLockAcquisitionException(host, joinLock, 
processStart);
             }
-
-            logger.debug("Acquired lock on host {}, to process agent 
connection", host);
-            attache = connectHostAgent(host, ready, link, startupCmds, 
joinLock);
         } finally {
             joinLock.releaseRef();
         }
 
         return attache;
     }
 
-    private AgentAttache connectHostAgent(HostVO host, ReadyCommand ready, 
Link link, StartupCommand[] startupCmds, GlobalLock joinLock) throws 
ConnectionException {
-        AgentAttache attache;
-        try {
-            final List<String> agentMSHostList = new ArrayList<>();
-            String lbAlgorithm = null;
-            if (startupCmds != null && startupCmds.length > 0) {
-                final String agentMSHosts = startupCmds[0].getMsHostList();
-                if (StringUtils.isNotEmpty(agentMSHosts)) {
-                    String[] msHosts = agentMSHosts.split("@");
-                    if (msHosts.length > 1) {
-                        lbAlgorithm = msHosts[1];
-                    }
-                    
agentMSHostList.addAll(Arrays.asList(msHosts[0].split(",")));
+    private void logProcessingStart(HostVO host, GlobalLock joinLock) {
+        StringBuilder msgBuilder = getSummaryMsgBuilder("Processing Host 
connection started",
+                EventTypes.EVENT_HOST_RECONNECT, host.getUuid(), 
host.getName(), joinLock.getName(), null, null, null);
+        logger.info(msgBuilder.toString());
+    }
+
+    private void logProcessingFinish(HostVO host, GlobalLock joinLock, long 
processStart) {
+        long processFinish = System.currentTimeMillis() - processStart;
+        StringBuilder msgBuilder = getSummaryMsgBuilder("Processing Host 
connection finished",
+                EventTypes.EVENT_HOST_RECONNECT, host.getUuid(), 
host.getName(), joinLock.getName(), null, null,
+                processFinish);
+        logger.fatal(msgBuilder.toString());

Review Comment:
   `logProcessingFinish` logs at FATAL level even on a normal successful 
connect path. This will likely pollute logs / monitoring and may trigger alerts 
despite success.



##########
engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java:
##########
@@ -1376,62 +1763,129 @@ protected AgentAttache createAttacheForConnect(final 
HostVO host, final Link lin
         return attache;
     }
 
-    private AgentAttache sendReadyAndGetAttache(HostVO host, ReadyCommand 
ready, Link link, StartupCommand[] startupCmds) throws ConnectionException {
+    private AgentAttache sendReadyAndGetAttache(HostVO host, ReadyCommand 
ready, Link link, StartupCommand[] startup) throws ConnectionException {
         AgentAttache attache;
         GlobalLock joinLock = getHostJoinLock(host.getId());
         try {
-            if (!joinLock.lock(60)) {
-                throw new ConnectionException(true, String.format("Unable to 
acquire lock on host %s, to process agent connection", host));
+            long processStart = System.currentTimeMillis();
+            if (joinLock.lock(getTimeoutSec())) {
+                logProcessingStart(host, joinLock);
+                try {
+                    updateReadyCommandWithMSList(host, ready, startup);
+                    attache = createAndNotifyAttache(host, link, startup);
+                } finally {
+                    logProcessingFinish(host, joinLock, processStart);
+                    joinLock.unlock();
+                }
+            } else {
+                throw createLockAcquisitionException(host, joinLock, 
processStart);
             }
-
-            logger.debug("Acquired lock on host {}, to process agent 
connection", host);
-            attache = connectHostAgent(host, ready, link, startupCmds, 
joinLock);
         } finally {
             joinLock.releaseRef();
         }
 
         return attache;
     }
 
-    private AgentAttache connectHostAgent(HostVO host, ReadyCommand ready, 
Link link, StartupCommand[] startupCmds, GlobalLock joinLock) throws 
ConnectionException {
-        AgentAttache attache;
-        try {
-            final List<String> agentMSHostList = new ArrayList<>();
-            String lbAlgorithm = null;
-            if (startupCmds != null && startupCmds.length > 0) {
-                final String agentMSHosts = startupCmds[0].getMsHostList();
-                if (StringUtils.isNotEmpty(agentMSHosts)) {
-                    String[] msHosts = agentMSHosts.split("@");
-                    if (msHosts.length > 1) {
-                        lbAlgorithm = msHosts[1];
-                    }
-                    
agentMSHostList.addAll(Arrays.asList(msHosts[0].split(",")));
+    private void logProcessingStart(HostVO host, GlobalLock joinLock) {
+        StringBuilder msgBuilder = getSummaryMsgBuilder("Processing Host 
connection started",
+                EventTypes.EVENT_HOST_RECONNECT, host.getUuid(), 
host.getName(), joinLock.getName(), null, null, null);
+        logger.info(msgBuilder.toString());
+    }
+
+    private void logProcessingFinish(HostVO host, GlobalLock joinLock, long 
processStart) {
+        long processFinish = System.currentTimeMillis() - processStart;
+        StringBuilder msgBuilder = getSummaryMsgBuilder("Processing Host 
connection finished",
+                EventTypes.EVENT_HOST_RECONNECT, host.getUuid(), 
host.getName(), joinLock.getName(), null, null,
+                processFinish);
+        logger.fatal(msgBuilder.toString());
+    }
+
+    private void updateReadyCommandWithMSList(HostVO host, ReadyCommand ready, 
StartupCommand[] startup) {
+        List<String> agentMSHostList = new ArrayList<>();
+        String lbAlgorithm = null;
+
+        if (startup != null) {
+            String agentMSHosts = startup[0].getMsHostList();
+            if (StringUtils.isNotEmpty(agentMSHosts)) {
+                String[] msHosts = agentMSHosts.split("@");
+                if (msHosts.length > 1) {
+                    lbAlgorithm = msHosts[1];
                 }
+                agentMSHostList.addAll(Arrays.asList(msHosts[0].split(",")));
             }
+        }
 
-            if 
(!indirectAgentLB.compareManagementServerListAndLBAlgorithm(host.getId(), 
host.getDataCenterId(), agentMSHostList, lbAlgorithm)) {
-                final List<String> newMSList = 
indirectAgentLB.getManagementServerList(host.getId(), host.getDataCenterId(), 
null);
-                ready.setMsHostList(newMSList);
-                String newLBAlgorithm = indirectAgentLB.getLBAlgorithmName();
-                ready.setLbAlgorithm(newLBAlgorithm);
-                logger.debug("Agent's management server host list or lb 
algorithm is not up to date, sending list and algorithm update: {}, {}", 
newMSList, newLBAlgorithm);
-            }
+        if 
(!indirectAgentLB.compareManagementServerListAndLBAlgorithm(host.getId(), 
host.getDataCenterId(), agentMSHostList, lbAlgorithm)) {
+            setReadyCommandMSList(host, ready);
+        }
+    }
 
-            final List<String> avoidMsList = _mshostDao.listNonUpStateMsIPs();
-            ready.setAvoidMsHostList(avoidMsList);
-            
ready.setLbCheckInterval(indirectAgentLB.getLBPreferredHostCheckInterval(host.getClusterId()));
-            ready.setArch(host.getArch().getType());
+    private void setReadyCommandMSList(HostVO host, ReadyCommand ready) {
+        List<String> newMSList = 
indirectAgentLB.getManagementServerList(host.getId(), host.getDataCenterId(), 
null);
+        ready.setMsHostList(newMSList);
+        List<String> avoidMsList = _mshostDao.listNonUpStateMsIPs();
+        ready.setAvoidMsHostList(avoidMsList);
+        ready.setLbAlgorithm(indirectAgentLB.getLBAlgorithmName());
+        
ready.setLbCheckInterval(indirectAgentLB.getLBPreferredHostCheckInterval(host.getClusterId()));
+        logger.debug("Agent's management server host list is not up to date, 
sending list update:" + newMSList);
+    }
 
-            attache = createAttacheForConnect(host, link);
-            attache = notifyMonitorsOfConnection(attache, startupCmds, false);
-        } finally {
-            joinLock.unlock();
+    private AgentAttache createAndNotifyAttache(HostVO host, Link link, 
StartupCommand[] startup) throws ConnectionException {
+        AgentAttache attache = createAttacheForConnect(host, link);
+        return notifyMonitorsOfConnection(attache, startup, false);
+    }
+
+    private ConnectionException createLockAcquisitionException(HostVO host, 
GlobalLock joinLock, long processStart) {
+        long processFinish = System.currentTimeMillis() - processStart;
+        String title = "Failed to connect Host";
+        String summary = String.format("Failure due to unable to acquire lock 
%s during %s seconds",
+                joinLock.getName(), 60);
+

Review Comment:
   `createLockAcquisitionException` hard-codes `60` seconds in the summary, but 
the actual lock acquisition timeout is randomized via `getTimeoutSec()`. This 
makes the error summary misleading during troubleshooting.



##########
agent/src/main/java/com/cloud/agent/Agent.java:
##########
@@ -623,98 +559,351 @@ public Task create(final Task.Type type, final Link 
link, final byte[] data) {
         return new ServerHandler(type, link, data);
     }
 
-    protected void reconnect(final Link link) {
-        reconnect(link, null, false);
+    protected void closeAndTerminateLink(Link link) {
+        Optional.ofNullable(link)
+                .map(Link::attachment)
+                .filter(ServerAttache.class::isInstance)
+                .map(ServerAttache.class::cast)
+                .ifPresentOrElse(ServerAttache::disconnect, () -> {
+                    if (link != null) {
+                        link.close();
+                        link.terminated();
+                    }
+                });
     }
 
-    protected void reconnect(final Link link, String preferredMSHost, boolean 
forTransfer) {
-        if (!(forTransfer || reconnectAllowed)) {
-            logger.debug("Reconnect requested but it is not allowed {}", () -> 
getLinkLog(link));
+    protected void stopAndCleanupConnection() {
+        if (connection == null) {
             return;
         }
-        cancelStartupTask();
-        closeAndTerminateLink(link);
-        closeAndTerminateLink(this.link);
-        setLink(null);
-        cancelTasks();
-        serverResource.disconnected();
-        logger.info("Lost connection to host: {}. Attempting reconnection 
while we still have {} commands in progress.", shell.getConnectedHost(), 
commandsInProgress.get());
-        stopAndCleanupConnection(true);
-        String host = preferredMSHost;
-        if (org.apache.commons.lang3.StringUtils.isBlank(host)) {
-            host = shell.getNextHost();
-        }
-        List<String> avoidMSHostList = shell.getAvoidHosts();
-        do {
-            if (CollectionUtils.isEmpty(avoidMSHostList) || 
!avoidMSHostList.contains(host)) {
-                connection = new NioClient(getAgentName(), host, 
shell.getPort(), shell.getWorkers(), shell.getSslHandshakeTimeout(), this);
-                logger.info("Reconnecting to host: {}", host);
-                try {
-                    connection.start();
-                } catch (final NioConnectionException e) {
-                    logger.info("Attempted to re-connect to the server, but 
received an unexpected exception, trying again...", e);
-                    stopAndCleanupConnection(false);
-                }
+        NioConnection connection = this.connection;
+        connection.stop();
+        try {
+            connection.cleanUp();
+        } catch (final IOException e) {
+            logger.warn("Fail to clean up old connection", e);
+        }
+
+        try {
+            while (connection.isStartup()) {
+                logger.debug("Waiting for connection graceful stop");
+                shell.getBackoffAlgorithm().waitBeforeRetry();
+                connection.stop();
             }
-            shell.getBackoffAlgorithm().waitBeforeRetry();
-            host = shell.getNextHost();
-        } while (!connection.isStartup());
-        shell.updateConnectedHost(((NioClient)connection).getHost());
-        logger.info("Connected to the host: {}", shell.getConnectedHost());
+        } catch (Exception e) {
+            logger.warn("Failed to gracefully stop connection", e);
+        }
+        logger.debug("Connection stopped");
+    }
+
+    /**
+     * Select the host to reconnect to based on priority:
+     * 1. preferredHost if defined and not blank
+     * 2. Link's socket address IP if available and not null
+     * 3. shell.getNextHost() if the above two options are not met
+     *
+     * @param preferredHost the preferred host to connect to
+     * @param link the current link which may contain socket address 
information
+     * @return the host to connect to
+     */
+    protected String selectReconnectionHost(String preferredHost, Link link) {
+        return Optional.ofNullable(preferredHost)
+                .filter(org.apache.commons.lang3.StringUtils::isNotBlank)
+                .orElseGet(() -> Optional.ofNullable(link)
+                        .map(Link::getSocketAddress)
+                        .map(InetSocketAddress::getAddress)
+                        .map(InetAddress::getHostAddress)
+                        .orElseGet(shell::getNextHost));
     }
 
-    protected void closeAndTerminateLink(final Link link) {
-        if (link == null) {
+    /**
+     * Reconnect to Management Server.
+     *
+     * @param link           - connection holder
+     * @param preferredHost  - if defined, reconnect will be performed to this 
Host first,
+     *                       otherwise will be used {@link 
IAgentShell#getNextHost()}
+     * @param forceReconnect - expected to be true if called by {@link 
MigrateAgentConnectionCommand},
+     *                       this is only "switch Management Server", it does 
not perform full Host Connect process.
+     */
+    protected void reconnect(Link link, String preferredHost, boolean 
forceReconnect) {
+        if (!reconnectLock.compareAndSet(false, true)) {
+            logger.warn("Reconnect is already running, exiting");
             return;
         }
-        link.close();
-        link.terminated();
+        String requestedLink = 
Optional.ofNullable(link).map(Link::toString).orElse("N/A");
+        String currentLink = 
Optional.ofNullable(this.link).map(Link::toString).orElse("N/A");
+        logger.info("Reconnect info: provided link: {}, agent link: {}, 
preferred host: {}, force" +
+                " reconnect: {}", requestedLink, currentLink, preferredHost, 
forceReconnect);
+
+        try {
+            logger.debug("Obtained reconnect lock");
+            if (!(forceReconnect || reconnectAllowed)) {
+                if (logger.isDebugEnabled()) {
+                    logger.debug("Reconnect requested but it is not allowed 
{}", link);
+                }
+                return;
+            }
+
+            if (isReconnectStormDetected(link, preferredHost, requestedLink, 
currentLink)) {
+                return;
+            }
+
+            cleanupConnectionBeforeReconnect(link);
+            // start with preferred host
+            String host = selectReconnectionHost(preferredHost, link);
+
+            String hostLog = LogUtils.getHostLog(host, shell.getPort());
+            List<String> avoidMsHostList = 
Optional.ofNullable(shell.getAvoidHosts()).orElseGet(List::of);
+            // pointer to the first element of "refuse loop"
+            AtomicReference<String> firstRefuseLoopHostRef = new 
AtomicReference<>(null);
+            // to break deadlock where "non-avoid" MS Hosts are down and only 
"avoid" are up
+            AtomicBoolean ignoreAvoidMsHostListRef = new AtomicBoolean(false);
+            do {
+                AtomicBoolean skipTimeoutRef = new AtomicBoolean(false);
+                String parentLogContextId = (String) 
ThreadContext.get("logcontextid");
+                if (parentLogContextId != null) {
+                    ThreadContext.put("logcontextid-parent", 
parentLogContextId);
+                }
+                ThreadContext.put("logcontextid", 
UuidUtils.first(UUID.randomUUID().toString()));
+                if (ignoreAvoidMsHostListRef.get() || 
!avoidMsHostList.contains(host)) {
+                    connection = new NioClient(getAgentName(), host, 
shell.getPort(), shell.getWorkers(),
+                            shell.getSslHandshakeTimeout(), this);
+                    logger.info("Reconnecting to host: {}", hostLog);
+                    try {
+                        connection.start();
+                        // successfully connected, skip the rest
+                        continue;
+                    } catch (Exception e) {
+                        logReconnectionFailure(e, hostLog);
+
+                        try {
+                            stopAndCleanupConnection();
+                        } catch (Exception ex) {
+                            logger.warn("Got an exception during stop and 
cleanup connection", e);
+                        }
+
+                        updateRefuseLoopState(e, host, firstRefuseLoopHostRef, 
ignoreAvoidMsHostListRef, skipTimeoutRef);
+                    }
+                } else {
+                    logger.debug("Next host {} is in avoid list, skipped", 
hostLog);
+                    if 
(org.apache.commons.lang3.StringUtils.isBlank(preferredHost)) {
+                        logHostLists(avoidMsHostList);
+                        skipTimeoutRef.set(true);
+                    }
+                }
+                if (!skipTimeoutRef.get()) {
+                    shell.getBackoffAlgorithm().waitBeforeRetry();
+                }
+                host = shell.getNextHost();
+                hostLog = LogUtils.getHostLog(host, shell.getPort());
+                logger.debug("Next host to connect: {}", hostLog);
+            } while (!connection.isStartup());
+            // successfully connected
+            shell.updateConnectedHost(((NioClient) connection).getHost());
+            String msg = String.format("Connected to the host: %s (%s)", 
shell.getConnectedHost(), this.link);
+            logger.info(msg);
+        } finally {
+            reconnectLock.set(false);
+            logger.debug("Removed reconnect lock");
+        }
     }
 
-    protected void stopAndCleanupConnection(boolean waitForStop) {
-        if (connection == null) {
-            return;
+    /**
+     * Handles "Connection refused" loop detection and determines if backoff 
timeout should be skipped.
+     * Manages refuse loop state to detect when all management servers have 
been tried and need
+     * to ignore avoid list to prevent deadlock.
+     *
+     * @param e the exception from connection attempt
+     * @param host the current host being attempted
+     * @param firstRefuseLoopHostRef reference to first host in refuse loop 
(modified by this method)
+     * @param ignoreAvoidMsHostListRef flag to ignore avoid list (modified by 
this method)
+     * @return true if timeout should be skipped (connection refused), false 
otherwise
+     */
+    private void updateRefuseLoopState(Exception e, String host, 
AtomicReference<String> firstRefuseLoopHostRef, AtomicBoolean 
ignoreAvoidMsHostListRef, AtomicBoolean skipTimeoutRef) {
+        // we are skipping timeout for "Connection refused" to not waste time 
on down MS
+        boolean skipTimeout = Optional.ofNullable(e.getCause())
+                .filter(ConnectException.class::isInstance)
+                .map(Throwable::getMessage)
+                .filter(CONNECTION_REFUSED_MSG::equalsIgnoreCase)
+                .isPresent();
+        skipTimeoutRef.set(skipTimeout);
+        String firstRefuseLoopHost = firstRefuseLoopHostRef.get();
+        // for each "Connection refused" (maybe need to have a copy of 
variable with better name)
+        // start "refuse loop"
+        if (skipTimeout && firstRefuseLoopHost == null) {
+            firstRefuseLoopHostRef.set(host);
+            ignoreAvoidMsHostListRef.set(false);
+            logger.debug("Started refuse loop for host {}", 
firstRefuseLoopHost);
+            // closed "refuse loop"
+        } else if (skipTimeout && firstRefuseLoopHost.equalsIgnoreCase(host)) {
+            ignoreAvoidMsHostListRef.set(true);
+            logger.debug("Closed refuse loop for host {}", 
firstRefuseLoopHost);
+            // got non "refuse" related issue, break "refuse loop"
+        } else if (!skipTimeout && (firstRefuseLoopHostRef != null || 
ignoreAvoidMsHostListRef.get())) {
+            logger.debug("Broke refuse loop for host {} by {}", 
firstRefuseLoopHost, host);
+            firstRefuseLoopHostRef.set(null);
+            ignoreAvoidMsHostListRef.set(false);
         }
-        connection.stop();
+    }
+
+    /**
+     * Logs reconnection failure with appropriate level based on rejection 
reason.
+     * If connection was rejected due to max concurrent connections limit 
(Broken pipe),
+     * logs as warning. Otherwise logs as info.
+     *
+     * @param e the exception that occurred during reconnection attempt
+     * @param hostLog the formatted host log string for logging
+     */
+    private void logReconnectionFailure(Exception e, String hostLog) {
+        // check if got NIO Connection exception, caused by IO Exception 
"Broken pipe"
+        boolean rejectedByMs = 
Optional.of(e).filter(NioConnectionException.class::isInstance)
+                .map(Exception::getCause)
+                .filter(IOException.class::isInstance)
+                .map(IOException.class::cast)
+                .map(IOException::getMessage)
+                .filter(BROKEN_PIPE_MSG::equalsIgnoreCase)
+                .isPresent();
+        if (rejectedByMs) {
+            logger.warn("Attempted to re-connect to {}, but rejected" +
+                    " due to 'agent.max.concurrent.new.connections' reached 
limit," +
+                    " will try again", hostLog, e);
+        } else {
+            logger.info("Attempted to re-connect to {}, but got exception," +
+                    " will try again", hostLog, e);
+        }
+    }
+
+    /**
+     * Logs all management server host lists for debugging reconnection logic.
+     * Outputs defined hosts, hosts to avoid, and calculated available hosts.
+     *
+     * @param avoidMsHostList list of management server hosts to avoid during 
reconnection
+     */
+    private void logHostLists(List<String> avoidMsHostList) {
+        logger.debug("Preferred host is not defined");
         try {
-            connection.cleanUp();
-        } catch (final IOException e) {
-            logger.warn("Fail to clean up old connection. {}", e);
+            List<String> hostsList = Optional.ofNullable(shell.getHosts())
+                    .map(Arrays::asList)
+                    .orElseGet(List::of);
+
+            List<String> hostsShortList = new ArrayList<>(hostsList);
+            hostsShortList.removeAll(avoidMsHostList);
+
+            logger.info("Defined hosts: {} Avoid hosts: {} Available hosts: 
{}",
+                    String.join(", ", hostsList), String.join(", ", 
avoidMsHostList), String.join(", ", hostsShortList));
+        } catch (Exception e) {
+            logger.warn("Failed to calculate next host logic", e);
         }
-        if (!waitForStop) {
-            return;
+    }
+
+    /**
+     * Cleans up current connection state before attempting reconnection.
+     * Stops host connect process, terminates links, cancels scheduled tasks,
+     * notifies server resource about disconnection, and resets connection 
tracking.
+     *
+     * @param link the link that triggered reconnection
+     */
+    private void cleanupConnectionBeforeReconnect(Link link) {
+        String lastConnectedHost = shell.getConnectedHost();
+        try {
+            // reset Host status track and Startup process initiating
+            logger.debug("Stopping Host Connect process");
+            hostConnectProcess.stop();
+            closeAndTerminateLink(link);
+            closeAndTerminateLink(this.link);
+            setLink(null);
+            cancelTasks();
+            serverResource.disconnected();
+            stopAndCleanupConnection();
+            shell.updateConnectedHost(null);
+        } catch (Exception ex) {
+            logger.error("Failed to cleanup previous connection", ex);
+        }
+        logger.info("Lost connection to host: {}. Attempting reconnection 
while we still have" +
+                " {} commands in progress.", lastConnectedHost, 
commandsInProgress.get());
+    }
+
+    /**
+     * Detects reconnection storm by checking if the reconnect request is 
redundant.
+     * This prevents processing stale reconnection requests for old links when
+     * agent has already established a new connection.
+     *
+     * @param link the link requesting reconnection
+     * @param preferredHost the preferred host to reconnect to (may be null)
+     * @param requestedLink string representation of the requested link for 
logging
+     * @param currentLink string representation of the current agent link for 
logging
+     * @return true if reconnection storm is detected and request should be 
skipped, false otherwise
+     */
+    private boolean isReconnectStormDetected(Link link, String preferredHost, 
String requestedLink, String currentLink) {
+        logger.debug("Calling storm guard");
+        boolean reconnectForCurrentLink = link == this.link;
+        boolean currentLinkTerminated = this.link != null && 
this.link.isTerminated();
+        boolean reconnectForNewHost = this.hostname != null && 
this.hostname.equals(preferredHost);
+        // if none of the above is true
+        boolean stormDetected = ! (reconnectForCurrentLink || 
currentLinkTerminated || reconnectForNewHost);
+        // connection storm guard
+        if (stormDetected) {
+            logger.warn("Reconnect requested for the connection {} but current 
connection is " +
+                    "{} and preferred host {}, skipping", requestedLink, 
!currentLinkTerminated ? currentLink : currentLink + " (terminated)", 
preferredHost);
         }
-        do {
-            shell.getBackoffAlgorithm().waitBeforeRetry();
-        } while (connection.isStartup());
+        return stormDetected;
     }
 
-    public void processStartupAnswer(final Answer answer, final Response 
response, final Link link) {
-        boolean answerValid = cancelStartupTask();
-        final StartupAnswer startup = (StartupAnswer)answer;
+    public void processStartupAnswer(final StartupAnswer startup, final 
Response response, final Link link) {
+        setBackoffAlgorithm(startup);
+
         if (!startup.getResult()) {
-            logger.error("Not allowed to connect to the server: {}", 
answer.getDetails());
+            logger.error("Not allowed to connect to the server: {}", 
startup.getDetails());
             if (serverResource != null && !serverResource.isExitOnFailures()) {
                 logger.trace("{} does not allow exit on failure, reconnecting",
                         serverResource.getClass().getSimpleName());
-                reconnect(link);
+                logger.info("Reconnecting for {}", link);
+                requestHandler.submit(() -> reconnect(link, null, false));
                 return;
             }
+            logger.fatal("Got unsuccessful result {} from the answer {}, 
details: {}",
+                    startup.getResult(), startup.getClass().getSimpleName(), 
startup.getDetails());
             System.exit(1);
         }
-        if (!answerValid) {
+
+        boolean processWasRunning = hostConnectProcess.stop();
+        if (!processWasRunning) {
             logger.warn("Threw away a startup answer because we're 
reconnecting.");
             return;
         }
 
+        handleStartupAnswer(startup, response, link);
+    }
+
+    private void handleStartupAnswer(StartupAnswer startup, Response response, 
Link link) {
         logger.info("Process agent startup answer, agent [id: {}, uuid: {}, 
name: {}] connected to the server",
                 startup.getHostId(), startup.getHostUuid(), 
startup.getHostName());
 
         setId(startup.getHostId());
-        setUuid(startup.getHostUuid());
-        setName(startup.getHostName());
+        // older builds do not send host uuid and names, do not set it, 
otherwise null pointer exception will be thrown
+        String hostUuid = startup.getHostUuid();
+        if (org.apache.commons.lang3.StringUtils.isNotEmpty(hostUuid)) {
+            setUuid(hostUuid);
+        }
+        String hostName = startup.getHostName();
+        if (org.apache.commons.lang3.StringUtils.isNotEmpty(hostName)) {
+            setName(hostUuid);
+        }

Review Comment:
   Bug: when `hostName` is present, the agent name is set to `hostUuid` instead 
of `hostName`, so the in-memory agent name becomes wrong (and may even become 
null if UUID was not provided).



##########
utils/src/main/java/com/cloud/utils/nio/NioConnection.java:
##########
@@ -80,66 +81,72 @@ public abstract class NioConnection implements 
Callable<Boolean> {
     protected ExecutorService _executor;
     protected ExecutorService _sslHandshakeExecutor;
     protected CAService caService;
-    protected Set<SocketChannel> socketChannels = 
ConcurrentHashMap.newKeySet();
-    protected Integer sslHandshakeTimeout = null;
+    protected Set<SocketChannel> socketChannels;
+    protected Integer sslHandshakeTimeout;
     private final int factoryMaxNewConnectionsCount;
-    protected boolean blockNewConnections;
+    protected volatile boolean blockNewConnections;
 
     public NioConnection(final String name, final int port, final int workers, 
final HandlerFactory factory) {
+        socketChannels = ConcurrentHashMap.newKeySet();
         _name = name;
-        _isRunning = false;
-        blockNewConnections = false;
-        _selector = null;
         _port = port;
         _workers = workers;
         _factory = factory;
         this.factoryMaxNewConnectionsCount = 
factory.getMaxConcurrentNewConnectionsCount();
-        initWorkersExecutor();
-        initSSLHandshakeExecutor();
     }
 
-    public void setCAService(final CAService caService) {
+    public void setCAService(CAService caService) {
         this.caService = caService;
     }
 
     public void start() throws NioConnectionException {
         _todos = new ArrayList<>();
 
+        if (_executor == null || _executor.isShutdown() || 
_executor.isTerminated()) {
+            initWorkersExecutor();
+        }
+
+        if (_sslHandshakeExecutor == null || 
_sslHandshakeExecutor.isShutdown() || _sslHandshakeExecutor.isTerminated()) {
+            initSSLHandshakeExecutor();
+        }
+
         try {
             init();
         } catch (final ConnectException e) {
-            logger.warn("Unable to connect to remote: is there a server 
running on port {}?", _port, e);
-            throw new NioConnectionException(e.getMessage(), e);
-        } catch (final IOException e) {
-            logger.error("Unable to initialize the threads.", e);
-            throw new NioConnectionException(e.getMessage(), e);
+            String msg = String.format("Unable to connect to remote: is there 
a server running on port %s: %s", _port, e.getMessage());
+            throw new NioConnectionException(msg, e);
         } catch (final Exception e) {
-            logger.error("Unable to initialize the threads due to unknown 
exception.", e);
-            throw new NioConnectionException(e.getMessage(), e);
+            String msg = String.format("Unable to initialize the threads: %s", 
e.getMessage());
+            throw new NioConnectionException(msg, e);
         }
         _isStartup = true;
 
-        if (_executor.isShutdown()) {
-            initWorkersExecutor();
-        }
-        if (_sslHandshakeExecutor.isShutdown()) {
-            initSSLHandshakeExecutor();
-        }
         _threadExecutor = Executors.newSingleThreadExecutor(new 
NamedThreadFactory(this._name + "-NioConnectionHandler"));
         _isRunning = true;
+        // in case start() called after stop()
         blockNewConnections = false;
         _futureTask = _threadExecutor.submit(this);
     }
 
     public void stop() {
-        _executor.shutdown();
-        _sslHandshakeExecutor.shutdown();
-        _isRunning = false;
-        blockNewConnections = true;
+        if (_executor != null) {
+            if (logger.isTraceEnabled()) {
+                logger.trace("Shutting down handler tasks");
+            }
+            _executor.shutdownNow();
+        }
+        if (_sslHandshakeExecutor != null) {
+            if (logger.isTraceEnabled()) {
+                logger.trace("Shutting down SSL Handshake executor");
+            }
+            _sslHandshakeExecutor.shutdownNow();
+        }
         if (_threadExecutor != null) {
             _futureTask.cancel(false);
-            _threadExecutor.shutdown();
+            _threadExecutor.shutdownNow();
         }

Review Comment:
   `stop()` cancels `_futureTask` without a null-check. Since executor 
initialization/submit is now deferred to `start()`, a caller that calls 
`stop()` after a failed/partial `start()` can hit an NPE here.



##########
utils/src/main/java/com/cloud/utils/backoff/impl/ExponentialWithJitterBackoff.java:
##########
@@ -0,0 +1,166 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+package com.cloud.utils.backoff.impl;
+
+import com.cloud.utils.DateUtil;
+import com.cloud.utils.NumbersUtil;
+import com.cloud.utils.backoff.BackoffAlgorithm;
+import com.cloud.utils.backoff.BackoffFactory;
+import com.cloud.utils.component.AdapterBase;
+
+import java.security.SecureRandom;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Random;
+import java.util.concurrent.ConcurrentHashMap;
+
+/**
+ * Exponential backoff with up/down cycling.
+ * Delay grows exponentially until a maximum, then decreases back to base, 
then repeats.
+ *
+ * @author mprokopchuk
+ */
+public class ExponentialWithJitterBackoff extends AdapterBase implements 
BackoffAlgorithm,
+        ExponentialWithJitterBackoffMBean {
+
+    /**
+     * Property name for the minimal delay to be used either by {@code 
agent.properties} file or by configuration key.
+     */
+    public static final String MIN_DELAY_MS_CONFIG_KEY = 
"backoff.min_delay_ms";
+
+    /**
+     * Property name for the maximal delay to be used either by {@code 
agent.properties} file or by configuration key.
+     */
+    public static final String MAX_DELAY_MS_CONFIG_KEY = 
"backoff.max_delay_ms";
+
+    /**
+     * Default value for minimal delay for the property {@link 
ExponentialWithJitterBackoff#MIN_DELAY_MS_DEFAULT}.
+     */
+    public static final int MIN_DELAY_MS_DEFAULT = 5_000;
+
+    /**
+     * Default value for maximal delay for the property {@link 
ExponentialWithJitterBackoff#MAX_DELAY_MS_DEFAULT}.
+     */
+    public static final int MAX_DELAY_MS_DEFAULT = 15_000;
+
+    private final Map<String, Thread> asleep = new ConcurrentHashMap<>();
+    private final Random random = new SecureRandom();
+
+    private int minDelayMs;
+    private int maxDelayMs;
+    private int maxAttempts;
+    private int attemptNumber;
+    private boolean increasing;
+
+    @Override
+    public void waitBeforeRetry() {
+        boolean interrupted = false;
+        long waitMs = getTimeToWait();
+        Thread current = Thread.currentThread();
+        try {
+            asleep.put(current.getName(), current);
+            logger.debug(String.format("Going to sleep for %s", 
DateUtil.formatMillis(waitMs)));
+            Thread.sleep(waitMs);
+            logger.debug(String.format("Sleep done for %s", 
DateUtil.formatMillis(waitMs)));
+        } catch (InterruptedException e) {
+            logger.info(String.format("Thread %s interrupted while waiting for 
retry", current.getName()), e);
+        } finally {
+            asleep.remove(current.getName());
+            calculateNextAttempt();
+            if (interrupted) {
+                Thread.currentThread().interrupt();
+            }
+        }

Review Comment:
   `waitBeforeRetry()` never re-sets the interrupt flag: `interrupted` is 
initialized to false and never set true in the catch block, so thread 
interruption gets swallowed.



##########
framework/db/src/main/java/com/cloud/utils/db/GlobalLock.java:
##########
@@ -45,127 +43,248 @@
   * </p>
   */
 public class GlobalLock {
-    protected Logger logger = LogManager.getLogger(getClass());
+    protected final static Logger logger = 
LogManager.getLogger(GlobalLock.class);
 
     private String name;
-    private int lockCount = 0;
-    private Thread ownerThread = null;
-
-    private int referenceCount = 0;
-    private long holdingStartTick = 0;
-
-    private static Map<String, GlobalLock> s_lockMap = new HashMap<String, 
GlobalLock>();
 
+    /**
+     * DB lock count.
+     * Increments on {@link GlobalLock#lock(int)} and decrements on {@link 
GlobalLock#unlock()}.
+     * Upon {@link GlobalLock#unlock()}, if {@link GlobalLock#lockCount} is 
less than 1, then lock removed from DB
+     */
+    private int lockCount;
+
+    /**
+     * Internal (in-memory) lock count.
+     * Increments on {@link GlobalLock#addRef()} and indirectly on {@link 
GlobalLock#getInternLock(String)} and
+     * decrements on {@link GlobalLock#releaseRef()}, {@link 
GlobalLock#unlock()} and on {@link GlobalLock#lock(int)}
+     * if DB lock is unsuccessful
+     */
+    private int referenceCount;
+
+    /**
+     * Thread that owns lock. If lock called from different thread, it will be 
waiting for the owner to unlock it
+     * within requested timeout. If owner thread call {@link 
GlobalLock#lock(int)} again, then
+     * {@link GlobalLock#lockCount} will be incremented.
+     * If {@link GlobalLock#unlock()} called by owner thread, or DB lock will 
be unsuccessful, then owner thread will be
+     * nullified.
+     */
+    private Thread ownerThread;
+
+    /**
+     * Variable to hold lock duration in milliseconds. Used for information 
only.
+     */
+    private long holdingStartTick;
+
+    /**
+     * Holds all created locks.
+     */
+    private static Map<String, GlobalLock> s_lockMap = new HashMap<>();
+
+    /**
+     * Create lock.
+     *
+     * @param name lock name
+     */
     private GlobalLock(String name) {
         this.name = name;
     }
 
+    /**
+     * Increment reference count to lock.
+     *
+     * @return reference count
+     */
     public int addRef() {
         synchronized (this) {
             referenceCount++;
             return referenceCount;
         }
     }
 
+    /**
+     * Decrement reference count to lock.
+     *
+     * @return reference count
+     */
     public int releaseRef() {
-        int refCount;
-
         boolean needToRemove = false;
         synchronized (this) {
+            if (logger.isDebugEnabled()) {
+                logger.debug("Releasing reference for internal lock {}, 
reference count: {}, lock count: {}",
+                        name, referenceCount, lockCount);
+            }
             referenceCount--;
-            refCount = referenceCount;
-
-            if (referenceCount < 0)
-                logger.warn("Unmatched Global lock " + name + " reference 
usage detected, check your code!");
 
-            if (referenceCount == 0)
+            if (referenceCount < 0) {
+                logger.warn("Unmatched internal lock {} reference usage 
detected (reference count: {}, " +
+                        "lock count: {}), check your code!", name, 
referenceCount, lockCount);
+            } else if (referenceCount < 1) {
                 needToRemove = true;
+            }
         }
 
-        if (needToRemove)
+        if (needToRemove) {
+            if (logger.isDebugEnabled()) {
+                logger.debug("Need to release internal lock {}", name);
+            }
             releaseInternLock(name);
+        }
+        if (logger.isDebugEnabled()) {
+            logger.debug("Released reference for lock {}, reference count: 
{}", name, referenceCount);
+        }
+        return referenceCount;
+    }
 
-        return refCount;
+    public static boolean isLockAvailable(String name) {
+        if (logger.isDebugEnabled()) {
+            logger.debug("Checking lock present for {}", name);
+        }
+        boolean result = false;
+        try {
+            result = DbUtil.isFreeLock(name);
+        } finally {
+            if (logger.isDebugEnabled()) {
+                logger.debug("Result of checking lock present for {}: {}", 
name, result);
+            }
+        }
+        return result;
     }
 
+    /**
+     * Registers internal lock (in memory) object. Does not create any lock in 
DB yet.
+     *
+     * @param name lock name
+     * @return lock object
+     */
     public static GlobalLock getInternLock(String name) {
         synchronized (s_lockMap) {
+            GlobalLock lock;
             if (s_lockMap.containsKey(name)) {
-                GlobalLock lock = s_lockMap.get(name);
-                lock.addRef();
-                return lock;
+                lock = s_lockMap.get(name);
+                if (logger.isDebugEnabled()) {
+                    logger.debug("Internal lock {} already exists with 
reference count {} and lock count {}",
+                            name, lock.referenceCount, lock.lockCount);
+                }
             } else {
-                GlobalLock lock = new GlobalLock(name);
-                lock.addRef();
+                lock = new GlobalLock(name);
+                if (logger.isDebugEnabled()) {
+                    logger.debug("Internal lock {} does not exist, adding", 
name);
+                }
                 s_lockMap.put(name, lock);
-                return lock;
             }
+            lock.addRef();
+            if (logger.isDebugEnabled()) {
+                logger.debug("Added reference to internal lock {}, reference 
count {}, lock count {}",
+                        name, lock.referenceCount, lock.lockCount);
+            }
+            return lock;
         }
     }
 
+    /**
+     * Unregister internal lock (in memory) object. Does not remove any lock 
from DB.
+     *
+     * @param name lock name
+     */
     private void releaseInternLock(String name) {
         synchronized (s_lockMap) {
             GlobalLock lock = s_lockMap.get(name);
             if (lock != null) {
-                if (lock.referenceCount == 0)
+                if (lock.referenceCount == 0) {
+                    if (logger.isDebugEnabled()) {
+                        logger.debug("Released internal lock {}", name);
+                    }
                     s_lockMap.remove(name);
+                } else {
+                    if (logger.isDebugEnabled()) {
+                        logger.debug("Not releasing internal lock {} as it has 
references count: {}, lock count: {}",
+                                name, lock.referenceCount, lock.lockCount);
+                    }
+                }
             } else {
-                logger.warn("Releasing " + name + ", but it is already 
released.");
+                logger.warn("Internal lock {} already released", name);
             }
         }
     }
 
+    /**
+     * Acquire or join existing DB lock.
+     *
+     * @param timeoutSeconds time in seconds during which lock needs to be 
obtained (it is not the lock duration)
+     * @return true if lock successfully obtained
+     */
     public boolean lock(int timeoutSeconds) {
         int remainingMilliSeconds = timeoutSeconds * 1000;
         Profiler profiler = new Profiler();
         boolean interrupted = false;
         try {
             while (true) {
                 synchronized (this) {
-                    if (ownerThread != null && ownerThread == 
Thread.currentThread()) {
-                        logger.warn("Global lock re-entrance detected");
-
+                    if (ownerThread == Thread.currentThread()) {
+                        logger.warn("Global lock {} re-entrance detected, 
owner thread: {}, reference count: {}, " +
+                                "lock count: {}", getThreadName(ownerThread), 
name, referenceCount, lockCount);
+                        // if it is re-entrance, then we may have more lock 
counts than needed?

Review Comment:
   Logging bug: this message's placeholders don't match the arguments. It says 
"Global lock {lockName} ... owner thread {threadName}", but the first argument 
passed is the thread name and the second is the lock name, so the log line is 
swapped/misleading.



##########
core/src/main/java/com/cloud/resource/ServerResource.java:
##########
@@ -84,7 +84,8 @@ default StartupCommand[] initialize(boolean 
isTransferredConnection) {
     void setAgentControl(IAgentControl agentControl);
 
     default boolean isExitOnFailures() {
-        return true;
+        // true would cause unnecessary Agent service restart, don't want it 
by default
+        return false;
     }

Review Comment:
   This changes the default behavior for *all* `ServerResource` implementations 
that don't override `isExitOnFailures()`. Previously the agent would exit by 
default; now it will keep running/reconnecting by default, which is a 
significant behavioral change beyond indirect-agent connection handling.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to