On Mon, Jan 20, 2014 at 4:47 PM, Dimitris Aragiorgis <[email protected]>wrote:
> Hi all!
>
>
> Recently we observed some strange TypeErrors in node-daemon.log related to
> QMP message handling. Specifically the exceptions were inside
> GetInstanceInfo().
>
> Please do the following test:
>
> 1) Client A connects to qmp socket with socat
> 2) Client A gets greeting message {"QMP": {"version": ..}
> 3) Client A waits (select on the socket's fd)
> 4) Client B tries to connect to the *same* qmp socket with socat
> 5) Client B does *NOT* get any greating message
> 6) Client B waits (select on the socket's fd)
> 7) Client B closes connection (kill socat)
> 8) Client A quits too
> 9) Client C connects to qmp socket
> 10) Client C gets *two* greeting messages!!!
>
> This case is reproducable in Ganeti:
> 1) Allocator runs and invokes GetAllInstancesInfo()
> 2) GetInstanceInfo() creates a QMPConnection()
> 3) Meanwhile a client wants to get instance runtime info
> 4) GetInstanceInfo() creates a new QMPConnection() but _Recv() blocks when
> trying to read from socket and a timeout exception is raised. This
> is catched and raised as HypervisorError which is *silently* catched inside
> GetInstanceInfo()
> 5) First connection closes
> 6) When a third connection will arive it will get *two* greetings!
> 7) The first greeting will be consumed inside connect()
> 8) The second will be consumed as response to qmp_capabilities execute
> command
> 9) info cpus will get {'return': {}}!! which will result to TypeError.
>
> After digging qemu source we found that monitor's output buffer is
> shared between clients. Only one client can interact with the monitor.
> The others are considered as blocked. Still their greeting message is
> written to the output buffer and a callback is registered to flush the
> message to the socket when possible. This means that if a client closes
> the connection before getting the greeting, these data remain in the
> buffer and will be flushed upon a future connection.
>
> To fix this on the Ganeti side, first of all, a finally: section that
> closes
> the socket must be added for each QMPConnection(), that practically leaves
> smaller timeslot for more than one concurrent connections. Additionally we
> should parse as many greetings available when starting the connection,
> and finally parse the output of qmp_capabilities command.
>
> Do you agree with all the above?
>
> Should I proceed with a patch? Against stable 2.9?
>
I believe that the bug should be reported to the QEMU community to hear
their response first. As you have more details on how to do this and a
reproduction case easily available, I hope you might be able to do it?
If not, I can do the deed as well.
But after this is very likely registered as a bug, a patch would be
welcome. If this happens early enough, even stable-2.8 could be a good
target for the patch - I'll try and reproduce the bug there.
>
> Thanks,
> dimara
>
>
> PS: All this have been tested with QEMU 1.7
>
Thanks for the detailed analysis!
Cheers,
Riba