Issue 1117 in ganeti: (patch) GlusterFS 3.6 breaks qemu start -> enhance RunCmd utility

ganeti Fri, 10 Jul 2015 07:42:11 -0700

Status: New
Owner: ----

New issue 1117 by [email protected]: (patch) GlusterFS 3.6 breaks qemu start -> enhance RunCmd utility

https://code.google.com/p/ganeti/issues/detail?id=1117


What software version are you running? Please provide the output of
"gnt-cluster --version", "gnt-cluster version", and "hspace --version".
gnt-cluster (ganeti v2.11.6) 2.11.6

What distribution are you using?
Ubuntu 14.04

What steps will reproduce the problem?
1. Have a cluster with qemu and disks in GlusterFS
2. Upgrade GlusterFS to 3.6
3. VMs won't start at all

What is the expected output? What do you see instead?
The VM should start; instead, gnt-instance times out.

Please provide any additional information below.

I've dug quite deeply into the issue. The problem is that the qemu command launched by noded never returns. RunCmd is called and writes the command to the log, but never finishes, although the -daemonize option is given on the qemu command line.

After a lot of investigation, I've found that glusterfs calls dup(2) somewhere for its own usage. As a result, the FD of the pipe that is set up by ganeti's popen call is never closed (it's passed on to the child), even after qemu forks itself for daemonization.

This means the polling of FDs (process.py, _RunCmdPipe) used for reading the command output never finishes: the FD is still open, but no data will come and nothing will close it (the pipe is held by the daemonized qemu).
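
The hang described above can be reproduced in miniature without qemu or GlusterFS. The following is only an illustrative sketch, assuming a POSIX system with Python 3; `DAEMONIZING_CHILD_SRC` and `demo_pipe_outlives_child` are hypothetical names, and the grandchild here simply inherits stdout rather than obtaining the FD through dup(2). It shows that the direct child can already have exited while the pipe reports neither data nor EOF.

```python
import select
import subprocess
import sys

# Hypothetical stand-in for a daemonizing qemu: the direct child starts a
# grandchild that inherits stdout (the pipe) and then exits right away.
DAEMONIZING_CHILD_SRC = (
    "import subprocess, sys\n"
    "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(3)'])\n"
)


def demo_pipe_outlives_child():
    """Return (child exit status, poll events seen after the child exited)."""
    child = subprocess.Popen(
        [sys.executable, "-c", DAEMONIZING_CHILD_SRC],
        stdout=subprocess.PIPE,
    )
    # The direct child is gone once wait() returns.
    child.wait()

    # Poll the read end for 500 ms. The grandchild still holds the write
    # end, so no POLLIN (data) and no POLLHUP (EOF) is reported.
    poller = select.poll()
    poller.register(child.stdout, select.POLLIN)
    events = poller.poll(500)

    child.stdout.close()
    return child.returncode, events


if __name__ == "__main__":
    print(demo_pipe_outlives_child())
```

A reader loop that waits only for EOF on this pipe would block until the grandchild exits, which for a real daemon may be never.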

It's not obvious at first, but this demonstrates that relying on the FDs being closed to conclude that a child has terminated is too simple. We have to check whether it is really gone or not. In this case the child is in fact a zombie, since ganeti never wait()'ed for it: ganeti is stuck polling FDs which no longer belong to the direct child it started but to a daemonization sub-child.

Attached is a small patch that checks whether the child is still alive each time before attempting a new poll on the FDs. This solves the problem for me.

It might be better to set a default timeout when polling the FDs, since a race condition may occur if the child dies between the check and the polling. Since I'm no ganeti dev I've no idea if this is a good idea, but setting pt=10000 or something like that would probably be better (it makes the loop always run every 10s, even if no event was caught on the FDs).
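
The check-before-poll idea, combined with a poll timeout, could be sketched as follows. This is not the attached patch and not ganeti's `_RunCmdPipe`; `run_and_collect`, `poll_timeout_ms` and `DAEMON_SRC` are hypothetical names, and the sketch assumes a POSIX system with Python 3. The loop stops either on EOF or once the direct child has exited and nothing more is readable.

```python
import os
import select
import subprocess
import sys

# Hypothetical stand-in for a daemonizing command: prints one line, leaves
# a grandchild holding the pipe open for 5 seconds, then exits.
DAEMON_SRC = (
    "import subprocess, sys\n"
    "print('started', flush=True)\n"
    "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(5)'])\n"
)


def run_and_collect(argv, poll_timeout_ms=200):
    """Collect stdout until EOF or until the direct child has exited."""
    child = subprocess.Popen(argv, stdout=subprocess.PIPE)
    fd = child.stdout.fileno()
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    chunks = []
    eof = False
    while not eof:
        # Liveness check before every poll; this also reaps the child.
        exited = child.poll() is not None
        # Once the child is gone, never block: only drain pending data.
        # Otherwise the timeout bounds the check-then-poll race window.
        events = poller.poll(0 if exited else poll_timeout_ms)
        for _fd, flags in events:
            if flags & select.POLLIN:
                data = os.read(fd, 4096)
                if data:
                    chunks.append(data)
                else:
                    eof = True
            elif flags & (select.POLLHUP | select.POLLERR | select.POLLNVAL):
                eof = True
        if exited and not events:
            # Child is dead and nothing is readable: stop even though a
            # sub-child may still hold the write end of the pipe.
            break
    child.stdout.close()
    return child.wait(), b"".join(chunks)


if __name__ == "__main__":
    print(run_and_collect([sys.executable, "-c", DAEMON_SRC]))
```

With the timeout in place, a child that dies just after the liveness check is noticed at the next loop iteration instead of leaving the poll blocked indefinitely.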




Attachments:
        patch-ganeti-qemu-fds.diff  407 bytes

--

You received this message because this project is configured to send all issue notifications to this address.

You may adjust your notification preferences at:
https://code.google.com/hosting/settings
