Hi Dimitris,
On 17:47 Mon 20 Jan, Dimitris Aragiorgis wrote:
> Hi all!
>
>
> Recently we observed some strange TypeErrors in node-daemon.log related to
> QMP message handling. Specifically the exceptions were inside
> GetInstanceInfo().
>
> Please do the following test:
>
> 1) Client A connects to qmp socket with socat
> 2) Client A gets greeting message {"QMP": {"version": ..}
> 3) Client A waits (select on the socket's fd)
> 4) Client B tries to connect to the *same* qmp socket with socat
> 5) Client B does *NOT* get any greeting message
> 6) Client B waits (select on the socket's fd)
> 7) Client B closes connection (kill socat)
> 8) Client A quits too
> 9) Client C connects to qmp socket
> 10) Client C gets *two* greeting messages!!!
>
> This case is reproducible in Ganeti:
> 1) Allocator runs and invokes GetAllInstancesInfo()
Doesn't the IAllocator run with the nodes locked?
> 2) GetInstanceInfo() creates a QMPConnection()
> 3) Meanwhile a client wants to get instance runtime info
> 4) GetInstanceInfo() creates a new QMPConnection(), but _Recv() blocks when
> trying to read from the socket and a timeout exception is raised. This
> is caught and re-raised as a HypervisorError, which is *silently* caught inside
> GetInstanceInfo()
> 5) First connection closes
> 6) When a third connection arrives, it gets *two* greetings!
> 7) The first greeting will be consumed inside connect()
> 8) The second will be consumed as response to qmp_capabilities execute command
> 9) info cpus gets {'return': {}}!! which results in a TypeError.
>
> After digging qemu source we found that monitor's output buffer is
> shared between clients. Only one client can interact with the monitor.
> The others are considered as blocked. Still their greeting message is
> written to the output buffer and a callback is registered to flush the
> message to the socket when possible. This means that if a client closes
> the connection before getting the greeting, these data remain in the
> buffer and will be flushed upon a future connection.
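The behaviour you describe can be replayed in a few lines. Here is a toy
Python model of that shared output buffer (class names and the flush
policy are mine for illustration, not qemu's actual code), walking
through the socat test from your mail:

```python
import json

GREETING = json.dumps({"QMP": {"version": {}}}).encode() + b"\r\n"

class Client(object):
    def __init__(self):
        self.recv_buf = b""

class MonitorModel(object):
    """Toy model of the monitor: one shared output buffer, and only one
    "active" client that the buffer is ever flushed to."""
    def __init__(self):
        self.out_buf = b""
        self.active = None

    def connect(self, client):
        # The greeting is always appended to the shared buffer ...
        self.out_buf += GREETING
        # ... but only flushed if nobody else holds the monitor.
        if self.active is None:
            self.active = client
            self._flush()

    def _flush(self):
        self.active.recv_buf += self.out_buf
        self.out_buf = b""

    def disconnect(self, client):
        if self.active is client:
            self.active = None

# Replaying the socat test:
mon = MonitorModel()
a, b, c = Client(), Client(), Client()
mon.connect(a)      # A gets its greeting immediately
mon.connect(b)      # B is "blocked": its greeting stays in out_buf
mon.disconnect(b)   # B gives up without ever reading it
mon.disconnect(a)
mon.connect(c)      # C gets B's stale greeting plus its own: two!
```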
This sounds like a qemu bug, and as such it should IMHO be discussed
with the qemu upstream first. Is it also present in qemu-1.6? Earlier
versions?
>
> To fix this on the Ganeti side, first of all, a finally: section that closes
> the socket must be added for each QMPConnection(); in practice this leaves a
> smaller window for concurrent connections. Additionally we
> should parse as many greetings as are available when starting the connection,
> and finally parse the output of the qmp_capabilities command.
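A minimal sketch of that draining step (the helper name and the CRLF
framing assumption are mine, not the actual QMPConnection code):

```python
import json

def parse_pending(buf):
    """Split one read from the monitor socket into parsed JSON messages.

    QMP terminates every message with CRLF, so stale greetings left over
    from earlier aborted connections show up as extra complete lines in
    the very first read -- consume them all instead of assuming one.
    """
    return [json.loads(line) for line in buf.split(b"\r\n") if line.strip()]

# A stale greeting from a dead client, followed by our own greeting:
first_read = (b'{"QMP": {"version": {}}}\r\n'
              b'{"QMP": {"version": {}}}\r\n')
greetings = parse_pending(first_read)   # both consumed at connect() time
```

Together with closing the socket in a finally: block around every
QMPConnection() use, this would keep any leftover greeting from being
mistaken for the qmp_capabilities response.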
On top of that you would need a per-monitor lock to be safe, but even if
you do all of this, you will not be safe from external tools trying to
use the QMP socket (including the KVM daemon IIUC?).
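For the in-process half of that, one possible shape (names are mine, and
the path in the usage comment is hypothetical) is a lock keyed on the
monitor socket path; as noted, it cannot protect against a separate
process or an external socat poking the same socket:

```python
import collections
import threading

_locks_guard = threading.Lock()
# One lock per monitor socket path, created lazily on first use.
_monitor_locks = collections.defaultdict(threading.Lock)

def monitor_lock(socket_path):
    """Return the lock serializing QMP access to one monitor socket.

    Only cooperating threads in this process are serialized; external
    tools talking to the same socket bypass this entirely.
    """
    with _locks_guard:
        return _monitor_locks[socket_path]

# usage sketch:
# with monitor_lock("/path/to/instance.qmp"):
#     ... open QMPConnection(), talk, close in finally: ...
```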
Regards,
Apollon