vishesh92 opened a new pull request, #8089:
URL: https://github.com/apache/cloudstack/pull/8089

   ### Description
   
   Depending on the agents' configuration, restarting a management server 
(the preferred MS for the agent) makes the agent connect to another management 
server (a non-preferred MS). When the preferred MS comes back up, the agent 
disconnects from the non-preferred MS and connects to the preferred MS. A race 
condition can occur during this switch: the disconnection from the 
non-preferred MS can complete after the connection to the preferred MS. This 
leaves the host in the `Alert` state, even though the agent is still sending 
Ping commands to the preferred MS.
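
The ordering problem described above can be illustrated with a deliberately simplified model. This is only a sketch: the class, enum, and method names are hypothetical and not taken from the CloudStack code base, and the real host state machine has more states and transitions. It assumes that each completed event simply overwrites the host status, so whichever event completes last decides the outcome.

```java
import java.util.List;

// Hypothetical, simplified model of the race; not CloudStack code.
final class AgentSwitchRaceSketch {
    enum Status { Up, Alert }
    enum Event { CONNECT_TO_PREFERRED, DISCONNECT_FROM_NON_PREFERRED }

    // Replays the events in the order in which they complete.
    // Assumption: each event overwrites the status, so the last one wins.
    static Status finalStatus(List<Event> completionOrder) {
        Status status = Status.Up;
        for (Event event : completionOrder) {
            status = (event == Event.CONNECT_TO_PREFERRED) ? Status.Up : Status.Alert;
        }
        return status;
    }
}
```

In this model, completing the disconnect first and the connect second yields `Up`, while the reverse order (the late disconnect from the description) yields `Alert`.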
   
   This PR solves this issue by:
   * Taking a lock in the database, which ensures that for a given host only 
one connection or disconnection is processed at a time across the different 
MS.
   * While processing the `Ping` command, if the host is not in the `Up` 
state, requesting the agent to send a startup command again over the 
connection. If the startup is successful, the host comes back to the `Up` 
state.
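
The per-host mutual exclusion from the first bullet can be sketched as follows. This is not the PR's implementation: the PR takes the lock in the database so that it holds across management servers, whereas this sketch assumes an in-memory `ReentrantLock` per host ID inside a single JVM purely to show the pattern. All class and method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical in-memory stand-in for the per-host database lock.
final class HostTransitionLockSketch {
    private final ConcurrentHashMap<Long, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Runs a connect or disconnect transition while holding the host's lock.
    void withHostLock(long hostId, Runnable transition) {
        ReentrantLock lock = locks.computeIfAbsent(hostId, id -> new ReentrantLock());
        lock.lock();
        try {
            transition.run();
        } finally {
            lock.unlock(); // always release, even if the transition throws
        }
    }

    // Runs a "connect" and a "disconnect" for the same host on two threads and
    // returns the highest number of transitions seen in flight at once.
    static int maxConcurrentTransitions() {
        HostTransitionLockSketch guard = new HostTransitionLockSketch();
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        Runnable transition = () -> {
            int now = inFlight.incrementAndGet();
            maxSeen.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(50); // simulate slow processing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            inFlight.decrementAndGet();
        };
        Thread connect = new Thread(() -> guard.withHostLock(1L, transition));
        Thread disconnect = new Thread(() -> guard.withHostLock(1L, transition));
        connect.start();
        disconnect.start();
        try {
            connect.join();
            disconnect.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return maxSeen.get();
    }
}
```

Because both threads use the same host ID, `maxConcurrentTransitions()` never observes more than one transition in flight. Note that a lock of this kind only serializes the two operations; it does not by itself decide which one runs first.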
   
   To reproduce the issue:
   1. Apply the patch below.
   ```
   diff --git a/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
   index b74c11cf..aec8a8dd 100644
   --- a/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
   +++ b/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java
   @@ -836,6 +836,9 @@ public class AgentManagerImpl extends ManagerBase implements AgentManager, Handl
    
            removeAgent(attache, nextStatus);
            // update the DB
   +        try {
   +            Thread.sleep(5000L);
   +        } catch (Exception e){}
            if (host != null && transitState) {
                disconnectAgent(host, event, _nodeId);
            }
   ```
   2. Set up an environment with two management servers and the global 
configuration below.
   ```
   agent.lb.enabled = true
   indirect.agent.lb.algorithm = roundrobin
   indirect.agent.lb.check.interval = 30
   ```
   3. Restart the preferred management server.
   4. Check the status of the host in the database or in the UI.
   
   ### Types of changes
   
   - [ ] Breaking change (fix or feature that would cause existing 
functionality to change)
   - [ ] New feature (non-breaking change which adds functionality)
   - [x] Bug fix (non-breaking change which fixes an issue)
   - [ ] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   - [ ] build/CI
   
   ### Feature/Enhancement Scale or Bug Severity
   
   #### Feature/Enhancement Scale
   
   - [ ] Major
   - [x] Minor
   
   #### Bug Severity
   
   - [ ] BLOCKER
   - [ ] Critical
   - [ ] Major
   - [x] Minor
   - [ ] Trivial
   
   
   ### Screenshots (if appropriate):
   
   
   ### How Has This Been Tested?
   1. Tested with the above patch applied.
   2. Set the state of a host to `Alert` in the database. After the next 
ping, the agent is asked to send a startup command, after which the host 
returns to the `Up` state.
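
The recovery path exercised in step 2 can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the enum values other than `Up` and `Alert`, the reply names, and the method names are all hypothetical and do not reflect the actual CloudStack handler.

```java
// Hypothetical model of the Ping handling described in this PR; not CloudStack code.
final class PingRecoverySketch {
    enum Status { Up, Alert, Disconnected }
    enum Reply { PING_ANSWER, REQUEST_STARTUP }

    // On Ping: answer normally if the host is Up, otherwise ask the agent
    // to send its startup command again.
    static Reply onPing(Status hostStatus) {
        return hostStatus == Status.Up ? Reply.PING_ANSWER : Reply.REQUEST_STARTUP;
    }

    // On the repeated startup: a successful startup brings the host back to Up;
    // a failed one leaves the status unchanged.
    static Status afterStartup(Status current, boolean startupSucceeded) {
        return startupSucceeded ? Status.Up : current;
    }
}
```

For a host forced into `Alert`, `onPing` returns `REQUEST_STARTUP`, and a successful startup then moves the host back to `Up`, matching the behaviour observed in the test above.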
   
   #### How did you try to break this feature and the system with this change?
   
   

