Okay, just got distracted by this happening again. The way I'm reproducing it is by ^C'ing a Cassandane run (which cleanly shuts down the cyrus instances it had started) during the JMAP tests, and I guess I'm sometimes getting lucky on the timing and hitting this.
On Wed, Dec 4, 2019, at 2:22 PM, ellie timoney wrote:

> I think the really useful information to collect next time this happens
> (and while the master process is still running) is:
>
> * What does lsof tell us about that ready file descriptor (in the
>   example, fd 11)? I would be very interested to know if it's a client
>   socket, or an internal messaging socket (that service processes use
>   to tell master their status).

    master  3121 cyrus   11u  IPv4 49559608  0t0  TCP localhost:9146 (LISTEN)

This is the port the http service was listening on.

> * If you can attach a debugger and step through a couple of iterations
>   of master's big "for (;;) {" loop, what path is it taking? What
>   decisions is it making?

So the problem pops up here:

    /* connections */
    if (y >= 0 && Services[i].ready_workers == 0 &&
        Services[i].nactive < Services[i].max_workers &&
        !service_is_fork_limited(&Services[i]))
    {
        if (verbose > 2)
            syslog(LOG_DEBUG,
                   "listening for connections for %s/%s",
                   Services[i].name, Services[i].familyname);

        FD_SET(y, &rfds);
        if (y > maxfd) maxfd = y;
    }

    (gdb) p Services[i]
    $28 = {name = 0x55df363956c0 "http", listen = 0x55df36395170 "127.0.0.1:9146",
      proto = 0x55df36395680 "tcp", exec = 0x55df363956e0, babysit = 0,
      associate = 0, family = 2, familyname = 0x55df35a8c780 "ipv4",
      socket = 11, stat = {12, 13}, desired_workers = 0,
      max_workers = 2147483647, maxfds = 0, maxforkrate = 0,
      ready_workers = 0, nforks = 1, nactive = 1, nconnections = 1,
      forkrate = 0, nreadyfails = 0, lastreadyfail = 0,
      last_interval_start = {tv_sec = 1578455582, tv_usec = 342312},
      interval_forks = 1}

The http service has shut down already... sort of.  It's hanging around
as a defunct process:

    cyrus     3143  3121  0 14:53 ?        00:00:00 [httpd] <defunct>

We have no ready_workers, and nactive is less than max_workers, so we're
adding the service socket (fd 11) to the listen set.  If verbose were
large enough, I think we would be logging "listening for connections for
%s/%s" during this loop.

So, we have data on the socket for this service, and if we weren't
already in_shutdown we'd be trying to spawn a new service process to
handle it.  But we ARE in_shutdown, so we don't spawn a service, and the
data on the socket remains unhandled.

We don't simply finish shutting down, because that nactive=1 in the
service table entry means we think we still have children, so we call
child_janitor and then go round the loop again.

Sooooo we're waiting for child_janitor to clean up after the http
service, I guess.  Inside child_janitor, here's the entry:

    (gdb) print **p
    $36 = {pid = 3143, service_state = SERVICE_STATE_BUSY, janitor_deadline = 0,
      si = 1, desc = 0x55df36397f30 "type:SERVICE name:http path:/dev/shm/cyrus/main/libexec/httpd",
      spawntime = {tv_sec = 1578455582, tv_usec = 682977}, sighuptime = -1,
      next = 0x0}

So, it thinks the http process is still busy, so we're spinning and
waiting for it... but the defunct state tells us it's already exited, and
the system is just waiting for the parent process (master) to waitpid()
it before getting rid of it completely.

It's pretty clear that there's still client data on that socket, but I'm
not sure if "shut down Cyrus while there's still a client sending data to
httpd" is enough to trigger it, or if it's more subtle.  The data pending
on the socket is the reason we see pselect() called over and over again,
but I don't think it's the cause of the problem: without that pending
data we'd still be hanging here instead of shutting down cleanly, we'd
just be blocked in pselect() instead.
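As a tiny standalone illustration of that last point (plain POSIX, not
Cyrus code, error checking omitted for brevity): a listening socket with
a pending connection that nobody ever accept()s stays readable, so a
select()/pselect() loop over it returns immediately on every pass instead
of blocking:

    /* Toy demo, not Cyrus code: a listening socket with one pending,
     * never-accept()ed connection is readable forever, so select()
     * returns instantly on every pass.  Error checking omitted. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct sockaddr_in addr;
        socklen_t alen = sizeof(addr);

        /* listener on 127.0.0.1, ephemeral port */
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;
        bind(lfd, (struct sockaddr *) &addr, sizeof(addr));
        listen(lfd, 5);

        /* connect a "client" to it, but never accept() the connection */
        getsockname(lfd, (struct sockaddr *) &addr, &alen);
        int cfd = socket(AF_INET, SOCK_STREAM, 0);
        connect(cfd, (struct sockaddr *) &addr, sizeof(addr));

        /* no timeout, so select() ought to block -- except the pending
         * connection keeps the listener readable, so it never does */
        for (int i = 0; i < 5; i++) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(lfd, &rfds);
            int n = select(lfd + 1, &rfds, NULL, NULL, NULL);
            printf("pass %d: select returned %d, listener readable\n", i, n);
        }

        close(cfd);
        close(lfd);
        return 0;
    }

Which is the same shape as what master is doing here: fd 11 stays in
rfds, pselect() keeps returning immediately, but because we're
in_shutdown nothing ever accept()s the connection.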
We _should_ have gotten a SIGCHLD from httpd shutting down, and then
reap_child should have waitpid()'d it and cleaned up the child table, so
we shouldn't be spinning like this.  So I guess the failure occurred
somewhere in the SIGCHLD handling, but by the time I notice it spinning
it's too late to see what happened.  Did the SIGCHLD go missing?  Did
reap_child do the wrong thing?  Who knows!

I don't have any further insight at the moment.  I guess if I can find a
way to reproduce it reasonably reliably, then I can litter reap_child
with debug output and see what it reports just before it goes
pear-shaped....

Cheers,

ellie
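P.S. On the "did the SIGCHLD go missing" question: the general POSIX
gotcha (not a claim about what master.c does or doesn't do) is that
SIGCHLD isn't queued, so two children exiting close together can collapse
into a single delivery.  The usual defensive pattern is to reap in a loop
with WNOHANG, something like this (function and helper names made up):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <errno.h>

    /* General defensive pattern: one SIGCHLD may stand for several
     * exited children, so keep reaping until waitpid() says there is
     * nothing left to collect. */
    static void reap_all_children(void)
    {
        pid_t pid;
        int status;

        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            /* look pid up in the child table and mark it exited, so the
             * service's nactive count gets decremented --
             * handle_exited_child() is a made-up name, not master.c's */
            /* handle_exited_child(pid, status); */
        }

        if (pid < 0 && errno != ECHILD) {
            /* unexpected waitpid() failure worth logging */
        }
    }

If the reaping already loops like that, then a coalesced signal isn't the
explanation, and the child-table bookkeeping would be the next suspect.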