rozaqi commented on issue #13054:
URL: https://github.com/apache/cloudstack/issues/13054#issuecomment-4323833739
Update based on investigation from the support team. The following factors
were identified:
1. A design limitation in the CloudStack agent “Ping” mechanism
2. Performance issues observed on AlmaLinux 8 (not reproducible on AlmaLinux 9)
3. Node-level configuration related to hostname resolution
### Summary of Findings
CloudStack agents send periodic “Ping” messages containing the state of
running VMs. If Ping data collection is delayed, the reported VM state may
already be outdated, leading to stale information being accepted by the
management server.
### Root Cause Details
During the Ping process, the agent performs multiple RPC calls to
`libvirtd`, including repeated hostname resolution per VM.
On AlmaLinux 8 nodes, hostname resolution is slow (~0.33s per call) when the
hostname is not defined in `/etc/hosts`. Because resolution runs multiple
times per VM, the delay can accumulate to ~30 seconds, depending on VM count.
As a result, outdated VM state may be reported, causing database
inconsistencies, incorrect host operations, and potential split-brain scenarios
(duplicate VM instances).
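To make the accumulation concrete, the sketch below multiplies the per-call resolution time by the number of calls per VM and the number of VMs. Only the ~0.33s figure comes from the measurements in this comment. The helper name `estimate_ping_delay` and the sample values of 3 calls per VM and 30 VMs are hypothetical, chosen only to show how the total can approach ~30 seconds.

```shell
#!/usr/bin/env bash
# Usage: estimate_ping_delay PER_CALL_SECONDS CALLS_PER_VM VM_COUNT
# Prints the accumulated resolution delay in seconds, to one decimal place.
# The calls-per-VM value is an assumption; the real count depends on the agent.
estimate_ping_delay() {
  LC_ALL=C awk -v t="$1" -v c="$2" -v n="$3" \
    'BEGIN { printf "%.1f\n", t * c * n }'
}
```

For example, `estimate_ping_delay 0.33 3 30` prints `29.7`.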
### Workaround
Adding the hostname entry to `/etc/hosts` reduces resolution time from ~0.33s
to ~0.00s per call and eliminates the delay.
Example:
`echo "xx.xx.xx.xx <hostname> <hostname>" >> /etc/hosts`
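Because a plain `echo ... >>` appends a duplicate line every time it is run, an idempotent variant may be safer on nodes managed by scripts. The following is a minimal sketch: the function name `ensure_hosts_entry` is hypothetical, and treating the two name fields as a fully qualified name plus a short name is an assumption, since the example above only shows two `<hostname>` placeholders.

```shell
#!/usr/bin/env bash
# Usage: ensure_hosts_entry IP FQDN SHORTNAME [HOSTS_FILE]
# Appends "IP FQDN SHORTNAME" to HOSTS_FILE (default: /etc/hosts) unless a
# non-comment line already lists SHORTNAME among its names.
ensure_hosts_entry() {
  local ip="$1" fqdn="$2" short="$3" file="${4:-/etc/hosts}"
  if awk -v h="$short" '
       { sub(/#.*/, "") }    # strip comments before looking at the names
       { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
       END { exit found ? 0 : 1 }' "$file"; then
    return 0                 # entry already present, nothing to do
  fi
  printf '%s %s %s\n' "$ip" "$fqdn" "$short" >> "$file"
}
```

Run as root against the real file, for example `ensure_hosts_entry 192.0.2.10 node1.example.test "$(hostname -s)"`, where the address and domain are placeholders. Passing a fourth argument lets the function be tried against a scratch copy first.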
Before the workaround:
`for _ in {1..5}; do /usr/bin/time -f 'ahosts=%es' getent ahosts "$(hostname -s)" >/dev/null; done`
Result: ~0.33s per resolution call
After the workaround:
`for _ in {1..5}; do /usr/bin/time -f 'ahosts=%es' getent ahosts "$(hostname -s)" >/dev/null; done`
Result: ~0.00s per resolution call
Even with this workaround, a potential race condition remains in the current
design: stale Ping data may still be accepted.
Sharing this for visibility and to help improve handling of stale Ping data.