Wow, I'm surprised this hasn't been fixed upstream yet... I had this problem about a year ago.

The master process runs in a loop, checking it's communication pipes with all the different types of child processes it spawns, keeping track of the child processes, and distributing work to them. The child processes send messages to the master like "I'm ready to handle a new connection" or "I'm busy" etc. via pipes.

What was happening to me, and looks like is happening to you, is that the master gets a sigchld signal, and executes the reap_child() process *while there are still messages from the child in the pipe*. Then when it goes to check the pipe, it finds this "busy" or "ready" message from the dead child process. But because it already executed reap_child(), it no longer has a record of this process in its table of children. So it assumes the child process is so brand-new that it isn't in the child table yet (thus the "not registered yet" log message) and creates a new entry in the child table with status UNKNOWN.

Contributing to the problem was the fact that the master process was only dealing with one message from each service type each time it looped. From master.c:

            if (FD_ISSET(x, &rfds)) {
                r = read(x, &msg, sizeof(msg));
                if (r != sizeof(msg)) {
                    syslog(LOG_ERR, "got incorrectly sized response from child: 
%x", i);
                    continue;
                }
                
                process_msg(i, &msg);
            }

Henrique gave me a patch which fixed this behavior, replacing the above segment with the code between #####:

#####
          if (FD_ISSET(x, &rfds)) {
                while ((r = read_msg(x, &msg)) == 0)
                    process_msg(&Services[i], &msg);

if (r == 2) {
syslog(LOG_ERR,
"got incorrectly sized response from child: %x", i);
continue;
}
if (r < 0) {
syslog(LOG_ERR,
"error while receiving message from child %x: %m", i);
continue;
}
}
#####


And adding this new function to master.c:

int read_msg(int fd, struct notify_message *msg)
{
        ssize_t r;
        size_t off = 0;
        int s = sizeof(*msg);

        while (s > 0) {
            do
                r = read(fd, msg + off, s);
            while ((r == -1) && (errno == EINTR));
            if (r <= 0) break;
            s -= r;
            off += r;
        }
        if ( ((r == 0) && (off == 0)) ||
             ((r == -1) && (errno == EAGAIN)) )
            return 1;
        if (r == -1) return -1;
        if (s != 0) return 2;
        return 0;
}




Wolfgang Powisch wrote:
Hi,

Some days ago I had a strange problem with pop3d, suddenly not accepting
connections anymore. pop3d was set with maxchilds=60 in cyrus.conf and
we seamed to have really reached 60 simultanous pop3 connections at this
point.
 From the logs, the master process complaine about UNKNOWN states and
a negative number of pop3d worker processes, and this negtive number was
increasing continuously.
Some minutes after restarting cyrus completely, the same again.
Then I increased pop3 maxchilds to 100, restarted cyrus and no problem
anymore.

I'm using cyrus-imapd-2.2.12 with perdition as pop3/imap frontend.
What should be the expected behaviour when reaching the maxchilds limit.
What's the reason for the negative worker number ?



Part of the logs:

Apr 5 15:23:19 imap01 master[18697]: about to exec /usr/lib/cyrus-imapd/pop3d
Apr 5 15:23:19 imap01 master[4065]: service pop3 pid 17896: while trying to process message 0x3: not registered yet
Apr 5 15:23:20 imap01 master[4065]: service pop3 pid 17896 in UNKNOWN state: processing message 0x3
Apr 5 15:23:21 imap01 master[4065]: service pop3 pid 17896 in UNKNOWN state: now in BUSY state and serving connection
Apr 5 15:23:22 imap01 master[4065]: process 17383 exited, status 0
Apr 5 15:23:23 imap01 master[4065]: process 18647 exited, status 0
Apr 5 15:23:25 imap01 master[4065]: process 18650 exited, status 0
Apr 5 15:23:26 imap01 master[4065]: process 18697 exited, status 0
Apr 5 15:23:27 imap01 master[4065]: process 17734 exited, status 0
Apr 5 15:23:28 imap01 master[18768]: about to exec /usr/lib/cyrus-imapd/pop3d
Apr 5 15:23:28 imap01 master[4065]: pop3 has -1 workers?!?




regards,


--- Cyrus Home Page: http://asg.web.cmu.edu/cyrus Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Reply via email to