Hi all!
Recently we observed some strange TypeErrors in node-daemon.log related to
QMP message handling. Specifically the exceptions were inside GetInstanceInfo().
Please do the following test:
1) Client A connects to qmp socket with socat
2) Client A gets greeting message {"QMP": {"version": ..}
3) Client A waits (select on the socket's fd)
4) Client B tries to connect to the *same* qmp socket with socat
5) Client B does *NOT* get any greating message
6) Client B waits (select on the socket's fd)
7) Client B closes connection (kill socat)
8) Client A quits too
9) Client C connects to qmp socket
10) Client C gets *two* greeting messages!!!
This case is reproducable in Ganeti:
1) Allocator runs and invokes GetAllInstancesInfo()
2) GetInstanceInfo() creates a QMPConnection()
3) Meanwhile a client wants to get instance runtime info
4) GetInstanceInfo() creates a new QMPConnection() but _Recv() blocks when
trying to read from socket and a timeout exception is raised. This
is catched and raised as HypervisorError which is *silently* catched inside
GetInstanceInfo()
5) First connection closes
6) When a third connection will arive it will get *two* greetings!
7) The first greeting will be consumed inside connect()
8) The second will be consumed as response to qmp_capabilities execute command
9) info cpus will get {'return': {}}!! which will result to TypeError.
After digging qemu source we found that monitor's output buffer is
shared between clients. Only one client can interact with the monitor.
The others are considered as blocked. Still their greeting message is
written to the output buffer and a callback is registered to flush the
message to the socket when possible. This means that if a client closes
the connection before getting the greeting, these data remain in the
buffer and will be flushed upon a future connection.
To fix this on the Ganeti side, first of all, a finally: section that closes
the socket must be added for each QMPConnection(), that practically leaves
smaller timeslot for more than one concurrent connections. Additionally we
should parse as many greetings available when starting the connection,
and finally parse the output of qmp_capabilities command.
Do you agree with all the above?
Should I proceed with a patch? Against stable 2.9?
Thanks,
dimara
PS: All this have been tested with QEMU 1.7
signature.asc
Description: Digital signature
