vishesh92 opened a new pull request, #8089:
URL: https://github.com/apache/cloudstack/pull/8089
### Description
Depending on the agents' configuration, restarting a management server
(the agent's preferred MS) makes the agent connect to another management
server (a non-preferred MS). When the preferred MS comes back up, the agent
tries to disconnect from the non-preferred MS and connect to the preferred MS.
A race condition can occur during this process, in which the disconnection
from the non-preferred MS completes after the connection to the preferred MS.
This leaves the agent's host in the `Alert` state, even though the agent is
still sending Ping commands to the preferred MS.
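The ordering problem can be illustrated with a small, deterministic sketch. This is not CloudStack code: `HostRaceSketch`, `HostRecord`, `connect`, and `disconnect` are hypothetical stand-ins, and the effect of a disconnect is simplified to setting the state to `Alert`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one host row shared by two management servers (MS).
final class HostRaceSketch {
    enum Status { UP, ALERT }

    static final class HostRecord {
        Status status = Status.UP;
        long ownerMsId;
        final List<String> log = new ArrayList<>();
    }

    // The preferred MS accepts the agent and marks the host Up.
    static void connect(HostRecord host, long msId) {
        host.status = Status.UP;
        host.ownerMsId = msId;
        host.log.add("connect by MS " + msId);
    }

    // The non-preferred MS finishes its disconnect handling and, without
    // checking which MS owns the host now, marks the host as Alert.
    static void disconnect(HostRecord host, long msId) {
        host.status = Status.ALERT;
        host.log.add("disconnect by MS " + msId);
    }

    // Replays the problematic order: the disconnect lands after the connect.
    static Status replayLateDisconnect() {
        HostRecord host = new HostRecord();
        host.ownerMsId = 2L;   // agent is currently on the non-preferred MS
        connect(host, 1L);     // preferred MS is back, agent reconnects
        disconnect(host, 2L);  // delayed disconnect from the old MS arrives
        return host.status;    // host ends up in ALERT although it is connected
    }
}
```

In this simplified model the host ends in `ALERT` purely because of the order of the two updates, which is the situation described above.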
This PR solves the issue by:
* Taking a lock in the database, which ensures that for a given host only one
connection or disconnection can be processed at a time across the different
management servers.
* Requesting, while processing a `Ping` command for a host that is not in the
`Up` state, that the agent send a startup command again on that connection.
If the startup succeeds, the host returns to the `Up` state.
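The two measures can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `HostTransitionSketch`, `withHostLock`, `handlePing`, and `handleStartup` are hypothetical names, and an in-process `ReentrantLock` per host stands in for the database lock, which in the real system must work across separate management server processes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: serialize host transitions and recover on Ping.
final class HostTransitionSketch {
    enum Status { UP, ALERT }
    enum PingReply { PING_ANSWER, REQUEST_STARTUP }

    // Stand-in for the database lock: one lock per host id (assumption).
    private final ConcurrentHashMap<Long, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Long, Status> status = new ConcurrentHashMap<>();

    // Runs the action while holding the lock for this host, so a connect and
    // a disconnect for the same host can never interleave.
    private void withHostLock(long hostId, Runnable action) {
        ReentrantLock lock = locks.computeIfAbsent(hostId, id -> new ReentrantLock());
        lock.lock();
        try {
            action.run();
        } finally {
            lock.unlock();
        }
    }

    void connect(long hostId) {
        withHostLock(hostId, () -> status.put(hostId, Status.UP));
    }

    void disconnect(long hostId) {
        withHostLock(hostId, () -> status.put(hostId, Status.ALERT));
    }

    // A Ping for a host that is not Up asks the agent to send startup again.
    PingReply handlePing(long hostId) {
        return status.get(hostId) == Status.UP
                ? PingReply.PING_ANSWER
                : PingReply.REQUEST_STARTUP;
    }

    // A successful startup goes through the normal connect path.
    void handleStartup(long hostId) {
        connect(hostId);
    }
}
```

In this sketch the lock only serializes the transitions. If a disconnect is still applied last, the Ping path is what brings the host back to `Up`.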
To reproduce the issue:
1. Apply the patch below.
```
diff --git a/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
index b74c11cf..aec8a8dd 100644
--- a/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
+++ b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
@@ -836,6 +836,9 @@ public class AgentManagerImpl extends ManagerBase implements AgentManager, Handl
         removeAgent(attache, nextStatus);
         // update the DB
+        try {
+            Thread.sleep(5000L);
+        } catch (Exception e){}
         if (host != null && transitState) {
             disconnectAgent(host, event, _nodeId);
         }
```
2. Set up an environment with two management servers and the following global
configuration:
```
agent.lb.enabled = true
indirect.agent.lb.algorithm = roundrobin
indirect.agent.lb.check.interval = 30
```
3. Restart the preferred management server.
4. Check the status of the host in the database or in the UI.
### Types of changes
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
### Feature/Enhancement Scale or Bug Severity
#### Feature/Enhancement Scale
- [ ] Major
- [x] Minor
#### Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [x] Minor
- [ ] Trivial
### Screenshots (if appropriate):
### How Has This Been Tested?
1. Tested by applying the above patch.
2. Set the state of a host to `Alert` in the database. After the next ping
from the agent, the agent is asked to send a startup command, after which the
host returns to the `Up` state.
#### How did you try to break this feature and the system with this change?