We are running a Tru64 TruCluster system with 2 members in the cluster, running Cyrus IMAP 2.2.1b. We typically ran the system with Cyrus under CAA control, running on only one member at a time. The service would relocate to the other cluster member if for some reason it could not run on the first one, or if we had to take that member down for maintenance.
Well, it appears that this new version uses a lot more memory than 2.0.16 did, with many of the processes settling at 27MB or 28MB of resident memory (not virtual memory, which is always reported higher, but real memory in use). On Tru64, there is no way to determine exactly where that memory is going, unlike Solaris, where you can run the proc tools, such as pmap, to get a breakdown of what memory is shared, what is in the heap, and what is consumed by the stack. Running lsof doesn't help either, as the processes all show the same thing... interestingly enough, our mailboxes.db file is about 27MB in size, but I can find plenty of processes that are only a couple of megabytes in size with that file open as well, so I think the match is just a coincidence. Has anyone else noticed the larger memory footprint?
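For what it's worth, without pmap the closest I can get to a number is summing the resident set sizes that ps reports. A rough sketch (the ps flags and the "imapd" process name are assumptions here; adjust for your platform):

```python
import subprocess

def total_rss_kb(ps_output, name="imapd"):
    """Sum the RSS column (in KB) for lines whose command matches `name`.

    Expects `ps` output with two columns: RSS and COMMAND, header first.
    """
    total = 0
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        if len(fields) == 2 and name in fields[1]:
            total += int(fields[0])
    return total

if __name__ == "__main__":
    # Something close to this flag set works on most SysV-ish platforms:
    out = subprocess.run(["ps", "-e", "-o", "rss,comm"],
                         capture_output=True, text=True).stdout
    print("total imapd RSS: %d KB" % total_rss_kb(out))
```

It won't tell you what is shared versus private the way pmap would, so the sum overstates real usage when the processes share text pages, but it at least tracks the trend between versions.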
So, with 3000+ Cyrus processes averaging about 20MB each, they consumed pretty much all of our real RAM (we have 8GB on each cluster member). I would say about 6GB of memory went to Cyrus processes alone.
We decided to run Cyrus on both cluster members at the same time. Since we are using a cluster file system that uses flock() to keep things consistent, this shouldn't be a problem. For those not familiar with Tru64's cluster file system, this is not NFS: as far as each member is concerned, it is a local file system, but it is shared across all the members the way NFS would be.
Anyways, as Cyrus starts up on each member, it runs "ctl_cyrusdb -r". The problem is that if I start the members at the same time, each runs its own copy, so mailboxes.db has two recovery processes hitting it at once. Worse, one member may finish faster than the other and start accepting connections before the other member has completed its recovery.
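Since CFS does propagate flock() across members, one way to avoid two members recovering at once might be to wrap the recovery in a cluster-wide lock, so the second member waits for the first to finish. A sketch of the idea (the lock-file path and the wrapper itself are my own invention, not anything Cyrus ships):

```python
import fcntl
import subprocess

LOCKFILE = "/var/imap/recovery.lock"  # hypothetical path on the shared CFS

def run_recovery_serialized(cmd=("ctl_cyrusdb", "-r"), lockfile=LOCKFILE):
    """Take an exclusive flock() on a file in the shared file system,
    then run the recovery command. On a cluster file system that honors
    flock() across members, only one member recovers at a time."""
    with open(lockfile, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the other member is done
        subprocess.run(cmd, check=True)
        # the lock is released when the file is closed
```

Each member would call this from its startup script in place of the bare ctl_cyrusdb invocation; whichever member gets the lock first recovers, and the other simply blocks until the database is consistent.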
This doesn't appear to cause any side effects, but I would like to know whether it could... especially if users are hitting the file while a recovery is in progress.
Also, it takes a really *really* long time for the recovery process to run, which means even a simple restart is felt by all, as it takes several minutes to complete. In 2.0.16, with a flat-file database, there was no wait at all on restart, and most people probably never noticed one, since their email clients would silently reopen the IMAP connections that were closed on them.
Is there any way to shorten the recovery process? For instance, would increasing the frequency of checkpoints considerably help? Could I run the recovery on a schedule (say, once a night) instead of at startup time, to cut down on the overhead?
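On the checkpoint idea, my understanding is that the frequency is controlled in the EVENTS section of cyrus.conf, and that checkpointing more often leaves less transaction log for "ctl_cyrusdb -r" to replay. Something like this (the period value is just an example; the shipped default is, I believe, 30 minutes):

```
# excerpt from /etc/cyrus.conf
EVENTS {
  # checkpoint more often so recovery has less log to replay
  checkpoint    cmd="ctl_cyrusdb -c" period=5
}
```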
Anyways, I am looking for some insight into this process...
Thanks,
     Scott
--
+-----------------------------------------------------------------------+
 Scott W. Adkins                http://www.cns.ohiou.edu/~sadkins/
 UNIX Systems Engineer          mailto:[EMAIL PROTECTED]
 ICQ 7626282                    Work (740)593-9478  Fax (740)593-1944
+-----------------------------------------------------------------------+
     PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/