Occasional signaled to death by 6 followed by increasing numbers of hung processes

2001-03-05 Thread Irelann Kerry Anderson

We recently converted our main mail server (30,000+ users) from cyrus-1.6 to
cyrus-2.0.12. We had converted a smaller server (6000+ users) to 2.0.9 some
time earlier. We had also tried 2.0.9 on this larger server, but that version
has severe performance problems with that many mailboxes.

Things looked pretty good initially, but after a few days the server stopped
responding to POP and IMAP requests. lsof and ps showed hundreds of lmtpd
processes, and the count was increasing. Shortly afterward we could get no
response at all from the machine and were forced to reboot before we could
gather more information.

This has happened 4 more times since, at intervals of 1 to 4 days (always
during off hours, although that may not be significant). On one of these
occasions I was able to get in and send a TERM signal to the master process;
everything shut down cleanly and worked fine once I restarted the master
process. From this it appears that when a process is aborted in this fashion,
some resource remains locked, causing all new processes (lmtpd, imapd and pop)
to hang.

On examining the logs, I found that each of these incidents was immediately
preceded by the message:

"signaled to death by 6"

Four times the process in question was imapd; once it was lmtpd.

No core file was produced. I've since changed the startup script to cd into a
directory writable by cyrus and removed the "ulimit -c 0" from it, but I've not
yet gotten a core file to look at.

In the meantime, I'm posting this to the list on the off chance someone else has
seen and debugged this problem.

The mail server is a dual Pentium III 500 with 1 GB RAM and a 100 GB hardware
RAID, running Red Hat 7.0 with all current updates applied except the kernel,
which is kernel-smp-2.2.16-22.

--
Irelann Kerry Anderson  phone:(207)581-3508
Systems Group   internet  [EMAIL PROTECTED]
UNET (formerly CAPS) Technology Services
University of Maine System






Re: Occasional signaled to death by 6 followed by increasing numbers of hung processes

2001-03-05 Thread Lawrence Greenfield

   Date: Mon, 05 Mar 2001 11:32:01 -0500
   From: Irelann Kerry Anderson [EMAIL PROTECTED]

   We recently converted our main mail server (30,000+ users) from
   cyrus-1.6 to cyrus-2.0.12. We had converted a smaller server (6000+
   users) to 2.0.9 some time earlier. We had also tried 2.0.9 on this
   larger server, but that version has severe performance problems
   with that many mailboxes.

   Things looked pretty good initially, but after a few days the
   server stopped responding to POP and IMAP requests. lsof and ps
   showed hundreds of lmtpd processes, and the count was increasing.
   Shortly afterward we could get no response at all from the machine
   and were forced to reboot before we could gather more information.

   This has happened 4 more times since, at intervals of 1 to 4 days
   (always during off hours, although that may not be significant).
   On one of these occasions I was able to get in and send a TERM
   signal to the master process; everything shut down cleanly and
   worked fine once I restarted the master process. From this it
   appears that when a process is aborted in this fashion, some
   resource remains locked, causing all new processes (lmtpd, imapd
   and pop) to hang.

This is consistent with a lock being held in the Berkeley DB
environment when a process crashes.

   On examining the logs, I found that each of these incidents was
   immediately preceded by the message:

   "signaled to death by 6"

   Four times the process in question was imapd; once it was lmtpd.

Signal 6 on my Linux system is SIGABRT, which is usually caused by an
assert() failing or an abort() call.  This should always dump core.
Since imapd does chdir(), it could be dumping core in some user's
mailbox; I'd run

find /var/spool/imap -type f -name core

to track down the core files and find out what's causing them if they
exist (I'm sure you'll have some with that many users).
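
To illustrate the signal-number point above, here is a minimal Python
sketch, assuming a Linux or other Unix host. It shows that a child
process which calls abort() is reported to its parent as killed by
signal 6 (SIGABRT). The function name is made up for this example and
is not part of Cyrus; the child disables its own core file so the demo
leaves nothing behind.

```python
import os
import resource
import signal


def abort_child_signal():
    """Fork a child that calls abort(); return the signal that killed it."""
    pid = os.fork()
    if pid == 0:
        # Child: suppress the core file for this demo only, then abort.
        # os.abort() raises SIGABRT in the calling process.
        resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
        os.abort()
    # Parent: reap the child and decode its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return os.WTERMSIG(status)
    return None


if __name__ == "__main__":
    sig = abort_child_signal()
    print("child killed by signal", sig, "; SIGABRT is", int(signal.SIGABRT))
```

A parent that logs the terminating signal of its children would report
the same number for an imapd or lmtpd that hit an assert() or abort().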

   No core file was produced. I've since changed the startup script
   to cd into a directory writable by cyrus and removed the
   "ulimit -c 0" from it, but I've not yet gotten a core file to look
   at.

I'm surprised the lmtpd didn't dump core into that directory.
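
Two of the preconditions mentioned in this thread (a non-zero core
size limit and a writable working directory) can be checked from
inside a process. The following is a rough Python sketch with a
hypothetical helper name; it is not a complete test, since a kernel
may decline to write a core file for other reasons as well (for
example, restrictions on processes that have changed user ID).

```python
import os
import resource


def core_dump_preconditions(directory="."):
    """Return (limit_ok, dir_writable) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    # A soft limit of 0 is what "ulimit -c 0" sets; it disables core files.
    limit_ok = soft != 0
    # By default the core file is written to the working directory,
    # so that directory must be writable by the process's user.
    dir_writable = os.access(directory, os.W_OK)
    return limit_ok, dir_writable


if __name__ == "__main__":
    print(core_dump_preconditions(os.getcwd()))
```

Run as the cyrus user from the directory the startup script changes
into, this gives a quick sanity check of those two settings.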

   In the meantime, I'm posting this to the list on the off chance
   someone else has seen and debugged this problem.

   The mail server is a dual Pentium III 500 with 1 GB RAM and a
   100 GB hardware RAID, running Red Hat 7.0 with all current updates
   applied except the kernel, which is kernel-smp-2.2.16-22.

Since with this many users you may be somewhat desperate, I'll mention
that it's possible to run Cyrus v2 with /var/imap/mailboxes.db stored
as a flat file instead of in Berkeley DB format.

Doing this conversion may solve the symptom but not the problem, and
will also cause your CREATE/RENAME/etc. performance to be
approximately what it is with v1.6.  If you can't debug this, we can
talk about how to make this change.

Larry