timmyzhu opened a new issue, #13497:
URL: https://github.com/apache/cloudstack/issues/13497
### problem
Symptom: My hosts get stuck in the Alert state with CloudStack
4.22.1.0. Restarting the agents, rebooting the hosts, and even reinstalling and
re-adding the hosts does not fix the issue.
Cause: When the management server sends a ReadyCommand to the agent, the
command takes an excessively long time to complete, so the management server
tries to reinitialize the agent and eventually kills the connection. The agent
can otherwise communicate with the management server without problems, so this
is not a network or SSL issue: the SSL handshake succeeds and the logs show
the two sides communicating.
Root cause: The ReadyCommand handling was modified in 4.22.1.0 in a way that
can make it excessively slow. The change comes from #12970, in the
detectVddkLibDir() function, which is called even when neither instance
conversion nor VDDK is in use. The function executes the shell command defined
in VDDK_AUTODETECT_PATH_CMD, which runs a Linux find over the entire host
filesystem. A search like this should never be on the critical path or on
anything that needs to complete quickly. Our hosts have large network
filesystems mounted, so searching the entire filesystem takes minutes, which
leads to the timeouts and the resulting Alert state.
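To check whether a host is likely to be affected, a comparable search can be timed by hand. The sketch below is only an approximation: the exact contents of VDDK_AUTODETECT_PATH_CMD are not reproduced here, so the find expression, the helper name `probe_search`, and the time limit passed to it are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of CloudStack): run a recursive find under a
# time limit and report whether it finished or hit the limit.
# Usage: probe_search <root> <name-pattern> <limit-seconds>
probe_search() {
    local root="$1" pattern="$2" limit="$3"
    local start end status
    start=$(date +%s)
    # timeout(1) exits with status 124 when the time limit is reached.
    timeout "$limit" find "$root" -type f -name "$pattern" -print -quit \
        >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s"
        return 1
    fi
    echo "finished in $((end - start))s"
}
```

For example, `probe_search / 'libvixDiskLib.so*' 60` reports a timeout if a search from the root directory cannot complete within a minute on that host.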
### versions
CloudStack version: 4.22.1.0
Hypervisor: KVM
Storage: NFS mounted filesystems
### The steps to reproduce the bug
1. Have a complex host OS filesystem with many directories, some of which
may be network mounted. Basically any setup where doing a search of the entire
filesystem from the root directory takes more than a few minutes.
2. Restart an agent on a host.
3. The management server shows the host in the Alert state after it has been
in the Connecting state for a couple of minutes.
### What to do about it?
Workaround: Until a fix is implemented, my current workaround is to create a
dummy VDDK directory on each host and set vddk.lib.dir in that host's
agent.properties file to point to it. This avoids the expensive search, which
allows my hosts to finish the ReadyCommand quickly and enter the Up state.
Here's an example script that performs the workaround:
```shell
#!/bin/bash
# Create a dummy VDDK directory layout so the expensive search is avoided.
sudo mkdir -p /workaround/vmware-vix-disklib-distrib/lib64
sudo touch /workaround/vmware-vix-disklib-distrib/lib64/libvixDiskLib.so
# Point vddk.lib.dir at the dummy directory unless the property is already set.
if ! sudo grep -q "vddk.lib.dir" /etc/cloudstack/agent/agent.properties; then
    echo "vddk.lib.dir=/workaround/vmware-vix-disklib-distrib" \
        | sudo tee -a /etc/cloudstack/agent/agent.properties
fi
```
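After running the script, it can be useful to confirm which value ended up in the properties file. The helper below is a minimal sketch; `get_prop` is a hypothetical name, and it simply prints the last uncommented assignment of a key, which is an assumption about how duplicate entries should be read.

```shell
#!/bin/bash
# Hypothetical helper: print the value of the last uncommented "key=value"
# line for the given key in a properties file.
# Usage: get_prop <file> <key>
get_prop() {
    local file="$1" key="$2"
    # Escape dots so that "vddk.lib.dir" is matched literally.
    local key_re="${key//./\\.}"
    grep -E "^[[:space:]]*${key_re}[[:space:]]*=" "$file" \
        | tail -n 1 \
        | cut -d= -f2-
}
```

With the workaround applied, `sudo bash -c "$(declare -f get_prop); get_prop /etc/cloudstack/agent/agent.properties vddk.lib.dir"` should print the dummy directory.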
Fix: I don't know what the desired long-term fix is, but it should
definitely not involve recursively searching the entire root filesystem when
trying to connect a host to the management server. Removing the library
directory auto-detection may be the easiest fix since users could just specify
the library path if they choose to enable the optional vddk feature. Another
possibility is to ensure the optional features are enabled before trying to
search for libraries. The hostSupportsVddk function calls
hostSupportsInstanceConversion() at the end, but that check could run first,
before the expensive detectVddkLibDir() function is called. However,
changes like this may be hiding the true issue of performing an expensive
filesystem search in the critical path of connecting hosts. If there's a faster
way of finding the library, that would be an ideal solution, but that may not
be possible without knowing where it's installed. Restricting the search to
well-known library installation locations may be one way to reduce the search
time. Lastly, it would
be good if the command had a timeout specified rather than the default timeout,
which is 1 hour. I saw some other places use the timeout specified in the
agent.properties file, but that didn't apply to this command. Users may
struggle to find detailed timeout configurations, so this wouldn't be a great
fix, but at least it would allow the timeout to be user-controllable.
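
The last two ideas (restricting the search to known locations and bounding it with a timeout) could be combined roughly as follows. This is a sketch under stated assumptions rather than a proposed patch: the function name `find_vddk_lib_dir`, the `VDDK_SEARCH_TIMEOUT` variable, the depth limit, and any candidate directories passed in are hypothetical and do not reflect CloudStack's actual code or configuration.

```shell
#!/bin/bash
# Hypothetical sketch: look for libvixDiskLib.so only under the candidate
# roots given as arguments, with a depth limit and a time limit per root.
# Prints the distribution directory (the parent of lib64) on success.
find_vddk_lib_dir() {
    local limit="${VDDK_SEARCH_TIMEOUT:-10}"   # seconds; hypothetical setting
    local root hit
    for root in "$@"; do
        [ -d "$root" ] || continue
        # -xdev keeps find on the filesystem containing $root, so network
        # filesystems mounted beneath it are not descended into.
        hit=$(timeout "$limit" find "$root" -xdev -maxdepth 4 -type f \
                  -name 'libvixDiskLib.so*' -print -quit 2>/dev/null)
        if [ -n "$hit" ]; then
            # Strip "/lib64/<file>" to get the distribution directory.
            dirname "$(dirname "$hit")"
            return 0
        fi
    done
    return 1
}
```

A call such as `find_vddk_lib_dir /opt /usr/local /usr/lib` (an illustrative candidate list) either prints a directory quickly or returns non-zero, so the worst-case cost is bounded by the number of roots times the time limit instead of the size of the whole filesystem.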