Date: Mon, 05 Mar 2001 11:32:01 -0500
From: Irelann Kerry Anderson [EMAIL PROTECTED]
We recently converted our main mail server (30,000+ users) from
cyrus-1.6 to cyrus-2.0.12. We had converted a smaller server (6000+
users) to 2.0.9 some time earlier. We had tried 2.0.9 on this larger
server, but that version has severe performance problems with that
many mailboxes.
Things looked pretty good initially, but after a few days the server
stopped responding to POP and IMAP requests. Running lsof and ps
showed hundreds of lmtpd processes, and the number was growing. Around
that time we could get no response at all from the machine and were
forced to reboot before we could gather more information.
This has happened 4 more times since, at intervals of 1 to 4 days
(always during off hours, although that may not be significant). One
of these times I was able to get in and send a TERM signal to the
master process; everything shut down cleanly, and things worked fine
when I restarted the master process. From this it appears that when
a process is aborted in this fashion, some resource remains locked,
causing all new processes (lmtpd, imapd and pop3d) to hang.
This is consistent with a lock being held in the Berkeley DB
environment when a process crashes.
On examining the logs, I found that each of these incidents was
immediately preceded by the message:
"signaled to death by 6"
Four times the process in question was imapd; once it was lmtpd.
Signal 6 on my Linux system is SIGABRT, which is usually caused by a
failed assert() or a call to abort(). This should always dump core.
Since imapd does a chdir(), it could be dumping core in some user's
mailbox directory; I'd run
find /var/spool/imap -type f -name core
to track down the core files, if they exist, and find out what's
causing them (I'm sure you'll find some with that many users).
No core file was produced. I've since changed the startup script to
cd into a directory writable by cyrus and removed the "ulimit -c 0"
from it, but I haven't yet gotten a core file to look at.
I'm surprised the lmtpd didn't dump core into that directory.
In the meantime, I'm posting this to the list on the off chance
someone else has seen and debugged this problem.
The mail server is a dual Pentium III 500 with 1GB RAM and 100GB of
hardware RAID, running Red Hat 7.0 with all current updates applied
except the kernel, which is kernel-smp-2.2.16-22.
Since with this many users you may be somewhat desperate, I'll mention
that it's possible to run Cyrus v2 with /var/imap/mailboxes.db kept as
a flat file instead of as a Berkeley DB database.
Doing this conversion may cure the symptom but not the problem, and it
will also make your CREATE/RENAME/etc. performance roughly what it was
with v1.6. If you can't debug this, we can talk about how to make the
change.
Larry