Hi,

Over the last two days I looked into an incident on a cluster where HA kicked in because a host was marked as down after the Agent disconnected.

The problem was that libvirt didn't respond to the call the Agent was making.

The underlying cause was that the Qemu/KVM process was having issues and never responded to libvirt over the monitor socket, so libvirt in turn never responded to the Agent.

In the logs I saw:

Ping Interval has gone past 300000.  Attempting to reconnect.

DEBUG [utils.nio.NioConnection] (Agent-Selector:null) Closing socket Socket[addr=/XX.XX.XX.X,port=8250,localport=49098]

[cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands...

[cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 1 commands in progress.

[cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands...

[cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 1 commands in progress.

[cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands...

[cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 1 commands in progress.

This kept repeating until I restarted the Agent. The remaining command would never complete, because libvirt was blocking.

For scripts we have a timeout, so when qemu-img doesn't complete in time we give up, but for other commands like this we don't have such a timeout.
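
To illustrate what a similar timeout for other commands could look like, here is a minimal sketch that runs a blocking call on a worker thread and gives up after a deadline. BoundedCall and callWithTimeout are hypothetical names, not existing Agent code. Note that a thread stuck inside a native libvirt call may not react to the interrupt, so this only frees the caller, not the stuck thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Agent code base.
public class BoundedCall {
    // Daemon threads, so a call that hangs forever does not keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    /** Runs the call on a worker thread and gives up after timeoutMs. */
    public static <T> T callWithTimeout(Callable<T> call, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Requests an interrupt; a worker blocked in native code may ignore it.
            future.cancel(true);
            throw e;
        }
    }
}
```

The caller would wrap the blocking libvirt call in a Callable and treat a TimeoutException as a failed command.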

What I did as a test for now is break out of the loop where we wait for any remaining commands, so that the Agent reconnects anyway. But I don't know if that is a good decision.
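
A less drastic variant of breaking out of the loop would be to bound the wait instead of removing it. The sketch below is only an illustration under assumptions: ReconnectGate, waitForCommands and the in-progress counter are hypothetical names and do not reflect how the Agent tracks its commands today.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the Agent code base.
public class ReconnectGate {
    /**
     * Polls until inProgress reaches zero or maxWaitMs has elapsed.
     * Returns true if all commands finished, false if we gave up waiting.
     */
    public static boolean waitForCommands(AtomicInteger inProgress, long maxWaitMs, long pollMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + maxWaitMs * 1_000_000L;
        while (inProgress.get() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed with commands still outstanding
            }
            Thread.sleep(pollMs);
        }
        return true;
    }
}
```

On a false return the Agent could log that it is abandoning the outstanding commands and reconnect anyway, which keeps the current behaviour for commands that finish in time.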

We currently assume that libvirt always responds, but that is not the case. There can be any number of reasons why libvirt can't respond.

Any suggestions on how to handle this case?

Wido
