We are running a Tru64 TruCluster system with 2 members in the cluster, running Cyrus IMAP 2.2.1b. We typically ran the system with Cyrus under CAA control, running on only one member at a time. The service would relocate to the other cluster member if for some reason it could not run on the first one, or if we had to take that member down for maintenance.
Well, it appears that this new version uses a lot more memory than 2.0.16 did, with many of the processes settling at 27MB or 28MB of resident memory (not virtual memory, which is always reported higher, but real memory in use). On Tru64, there is no way to determine exactly where that memory is going, unlike Solaris, where you can run the proc tools, such as pmap, to get a breakdown of what memory is shared, what is in the heap, and what is consumed by the stack. Running lsof doesn't help either, as the processes all show the same thing... interestingly enough, our mailboxes.db file is about 27MB in size, but I can find plenty of processes that are only a couple of megabytes in size with that file open as well, so I think the match is just a coincidence. Has anyone else noticed the larger memory footprint?
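For what it's worth, without pmap the closest I can get to a number is summing the resident set sizes that ps reports. A rough sketch (the ps flags and the "imapd" process name are assumptions here; adjust for your platform):

```python
import subprocess

def total_rss_kb(ps_output, name="imapd"):
    """Sum the RSS column (in KB) for lines whose command matches `name`.

    Expects `ps` output with two columns: RSS and COMMAND, header first.
    """
    total = 0
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        if len(fields) == 2 and name in fields[1]:
            total += int(fields[0])
    return total

if __name__ == "__main__":
    # Something close to this flag set works on most SysV-ish platforms:
    out = subprocess.run(["ps", "-e", "-o", "rss,comm"],
                         capture_output=True, text=True).stdout
    print("total imapd RSS: %d KB" % total_rss_kb(out))
```

It won't tell you what is shared versus private the way pmap would, so the sum overstates real usage when the processes share text pages, but it at least tracks the trend between versions.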
So, with 3000+ Cyrus processes averaging about 20MB each, they consumed pretty much all of our real RAM (we have 8GB on each cluster member). I would say about 6GB of memory went to Cyrus processes alone.
We decided to run Cyrus on both cluster members at the same time. Since we are using a cluster file system that uses flock() to keep things consistent, this shouldn't be a problem. For those not familiar with Tru64's cluster file system, this is not NFS: as far as each member is concerned, it is a local file system, but it is shared across all the members the way NFS would be.
Anyways, as Cyrus starts up on each member, it runs "ctl_cyrusdb -r". The problem is that if I start the members at the same time, each runs its own copy, so mailboxes.db has two recovery processes hitting it at once. Worse, one member may finish faster than the other and start accepting connections before the other member has completed its recovery.
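Since CFS does propagate flock() across members, one way to avoid two members recovering at once might be to wrap the recovery in a cluster-wide lock, so the second member waits for the first to finish. A sketch of the idea (the lock-file path and the wrapper itself are my own invention, not anything Cyrus ships):

```python
import fcntl
import subprocess

LOCKFILE = "/var/imap/recovery.lock"  # hypothetical path on the shared CFS

def run_recovery_serialized(cmd=("ctl_cyrusdb", "-r"), lockfile=LOCKFILE):
    """Take an exclusive flock() on a file in the shared file system,
    then run the recovery command. On a cluster file system that honors
    flock() across members, only one member recovers at a time."""
    with open(lockfile, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the other member is done
        subprocess.run(cmd, check=True)
        # the lock is released when the file is closed
```

Each member would call this from its startup script in place of the bare ctl_cyrusdb invocation; whichever member gets the lock first recovers, and the other simply blocks until the database is consistent.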
This doesn't appear to cause any side effects, but I would like to know whether it could... especially if users are hitting the file while a recovery is in progress.
Also, it takes a really *really* long time for the recovery process to run, which means even a simple restart is felt by all, as it takes several minutes to complete. In 2.0.16, with a flat-file database, there was no wait at all on restart, and most people probably never noticed one, since their email clients would silently reopen the IMAP connections that were closed on them.
Is there any way to shorten the recovery process? For instance, would increasing the frequency of checkpoints considerably help? Could I run the recovery on a schedule (say, once a night) instead of at startup time, to cut down on the overhead?
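On the checkpoint idea, my understanding is that the frequency is controlled in the EVENTS section of cyrus.conf, and that checkpointing more often leaves less transaction log for "ctl_cyrusdb -r" to replay. Something like this (the period value is just an example; the shipped default is, I believe, 30 minutes):

```
# excerpt from /etc/cyrus.conf
EVENTS {
  # checkpoint more often so recovery has less log to replay
  checkpoint    cmd="ctl_cyrusdb -c" period=5
}
```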
Anyways, I am looking for some insight into this process...
Thanks,
     Scott
--
+-----------------------------------------------------------------------+
 Scott W. Adkins                http://www.cns.ohiou.edu/~sadkins/
 UNIX Systems Engineer          mailto:[EMAIL PROTECTED]
 ICQ 7626282                    Work (740)593-9478  Fax (740)593-1944
+-----------------------------------------------------------------------+
     PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/