Okay, just got distracted by this happening again. The way I'm reproducing it
is by ^C'ing a Cassandane run (which cleanly shuts down the cyrus instances it
had started) during the JMAP tests, and I guess I'm sometimes getting lucky on
the timing and hitting this.
On Wed, Dec 4, 2019, at 2:22 PM, ellie timoney wrote:
> I think the really useful information to collect next time this happens
> (and while the master process is still running) is:
>
> * What does lsof tell us about that ready file descriptor (in the
> example, fd 11)? I would be very interested to know if it's a client
> socket, or an internal messaging socket (that service processes use to
> tell master their status).
master 3121 cyrus 11u IPv4 49559608 0t0 TCP localhost:9146 (LISTEN)
This is the port the http service was listening on.
> * If you can attach a debugger and step through a couple of iterations
> of master's big "for (;;) {" loop, what path is it taking? What
> decisions is it making?
So the problem pops up here:
/* connections */
if (y >= 0 && Services[i].ready_workers == 0 &&
Services[i].nactive < Services[i].max_workers &&
!service_is_fork_limited(&Services[i])) {
if (verbose > 2)
syslog(LOG_DEBUG, "listening for connections for %s/%s",
Services[i].name, Services[i].familyname);
FD_SET(y, &rfds);
if (y > maxfd) maxfd = y;
}
(gdb) p Services[i]
$28 = {name = 0x55df363956c0 "http", listen = 0x55df36395170 "127.0.0.1:9146",
proto = 0x55df36395680 "tcp", exec = 0x55df363956e0, babysit = 0,
associate = 0, family = 2, familyname = 0x55df35a8c780 "ipv4", socket = 11,
stat = {12, 13}, desired_workers = 0, max_workers = 2147483647, maxfds = 0,
maxforkrate = 0, ready_workers = 0, nforks = 1, nactive = 1,
nconnections = 1, forkrate = 0, nreadyfails = 0, lastreadyfail = 0,
last_interval_start = {tv_sec = 1578455582, tv_usec = 342312},
interval_forks = 1}
The http service has shut down already... sort of. It's hanging around as a
defunct process:
cyrus 3143 3121 0 14:53 ? 00:00:00 [httpd] <defunct>
We have no ready_workers, and nactive is less than max_workers, so we're adding
the service socket (fd 11) to the listen set. If verbose were large enough, I
think we would be logging "listening for connections for %s/%s" during this
loop.
So, we have data on the socket for this service, and if we weren't already
in_shutdown we'd be trying to spawn a new service process to handle it. But we
ARE in_shutdown, so we don't spawn a service, so the data on the socket remains
unhandled.
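To restate that in code shape (this is a paraphrase from memory of the
connection-handling branch after pselect() returns, not the literal source;
spawn_service() is, as I remember it, the worker-forking function, and the rest
is condensed):

/* paraphrase of the post-pselect() connection handling, not the
 * literal master.c source */
if (y >= 0 && FD_ISSET(y, &rfds)) {
    /* a client is waiting on the service's listen socket */
    if (!in_shutdown) {
        spawn_service(i);      /* normal path: fork a worker to take it */
    }
    /* during shutdown nothing accept()s it, so the socket stays
     * readable and the next pselect() wakes up straight away again */
}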
We don't simply finish shutting down, because that nactive=1 in the service
table entry means we think we still have children, so we call child_janitor and
then go round the loop again.
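In other words, the exit path during shutdown looks roughly like this (again a
paraphrase from memory, with total_children() standing in for "how many entries
are left in the child table"):

/* paraphrase, not the literal source */
if (in_shutdown && total_children() == 0)
    exit(0);                   /* never reached: the stale BUSY entry
                                * keeps the count above zero */
child_janitor(/* current time */);
/* ...and back to the top of for (;;) */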
Sooooo we're waiting for child_janitor to clean up after the http service, I
guess.
Inside child_janitor, here's the entry:
(gdb) print **p
$36 = {pid = 3143, service_state = SERVICE_STATE_BUSY, janitor_deadline = 0,
  si = 1, desc = 0x55df36397f30 "type:SERVICE name:http path:/dev/shm/cyrus/main/libexec/httpd",
  spawntime = {tv_sec = 1578455582, tv_usec = 682977}, sighuptime = -1, next = 0x0}
So master thinks the http process is still busy, and we keep spinning while we
wait for it... but the defunct state tells us it has already exited, and the
kernel is just waiting for the parent process (master) to waitpid() it before
getting rid of it completely.
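For anyone following along, here's a tiny standalone demo (nothing
Cyrus-specific) of what <defunct> means: the exited child sticks around as a
zombie until its parent calls waitpid() on it.

/* standalone demo, nothing Cyrus-specific: an exited child stays
 * <defunct> (a zombie) until its parent calls waitpid() on it. */
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) _exit(0);    /* child exits immediately */

    sleep(20);                 /* meanwhile, "ps -ef" in another terminal
                                  shows the child as <defunct> */

    waitpid(pid, NULL, 0);     /* parent reaps it... */
    sleep(20);                 /* ...and now the zombie is gone */
    return 0;
}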
It's pretty clear that there's still client data on that socket, but I'm not
sure if "shut down Cyrus while there's still a client sending data to httpd" is
enough to trigger it, or if it's more subtle. The data pending on the socket
is the reason we see pselect() called over and over again, but I don't think
it's the cause of the problem: without that pending data, we'd still be hanging
here instead of shutting down cleanly, but we'd be blocked in pselect() instead.
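That "pending work makes the select call return immediately, forever" behaviour
is easy to demonstrate in isolation with a listening socket that has an
un-accept()ed connection waiting on it. This little standalone program (nothing
Cyrus-specific, plain select() rather than pselect()) connects to its own
listen socket and never accept()s it; the socket is reported readable on every
single iteration:

/* standalone demo: a listening socket with a pending, never-accept()ed
 * connection is reported readable by select() on every single call,
 * because readiness is level-triggered. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = 0;                            /* pick any free port */
    bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
    listen(lfd, 5);
    getsockname(lfd, (struct sockaddr *)&sin, &len);

    /* make one connection to ourselves and deliberately never accept() it */
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    connect(cfd, (struct sockaddr *)&sin, sizeof(sin));

    for (int i = 0; i < 3; i++) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(lfd, &rfds);
        select(lfd + 1, &rfds, NULL, NULL, NULL);
        printf("iteration %d: listen fd readable = %d\n",
               i, FD_ISSET(lfd, &rfds) ? 1 : 0);
        /* no accept(), so the next select() returns immediately again */
    }

    close(cfd);
    close(lfd);
    return 0;
}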
We _should_ have gotten a sigchld from httpd shutting down, and then reap_child
should have waitpid'd it and cleaned up the child table, so we shouldn't be
spinning like this. So I guess the failure occurred somewhere in the sigchld
handling, but by the time I notice it spinning it's too late to see what
happened. Did the sigchld go missing? Did reap_child do the wrong thing? Who
knows!
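For reference, the textbook shape of what should happen on SIGCHLD is something
like the standalone sketch below. I'm not claiming this is how master/reap_child
is actually structured (the handler/main-loop split may well differ); it's just
the generic pattern: waitpid() with WNOHANG in a loop, then update the
bookkeeping, which in master's case would mean marking the ctable entry dead
and decrementing the service's nactive. It's that last bookkeeping step that
evidently didn't happen here.

/* standalone sketch of the generic SIGCHLD/reap pattern, not the real
 * reap_child(): the handler loops waitpid(WNOHANG) so one signal
 * delivery can reap several exited children (SIGCHLD is not queued). */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t reaped;

static void sigchld_handler(int sig)
{
    int saved_errno = errno;   /* waitpid() may clobber errno */
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;              /* in master: mark the child-table entry
                                  dead and drop the service's nactive */
    errno = saved_errno;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGCHLD, &sa, NULL);

    for (int i = 0; i < 3; i++)
        if (fork() == 0) _exit(0);     /* three short-lived children */

    while (reaped < 3)
        sleep(1);                      /* interrupted sleeps are fine here */

    printf("reaped %d children\n", (int)reaped);
    return 0;
}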
I don't have any further insight at the moment. I guess if I can find a way to
reproduce it reasonably reliably, then I can litter reap_child with debug
output and see what it reports just before it goes pear-shaped....
Cheers,
ellie