Hi All,

Had another locking problem today and was able to get reasonably complete info.    
Unfortunately, I
only had a limited time to debug since I di not identify the problem until long after 
it started..
So here goes.

At the point I identified the problem, I had 6 blocked IMAP processes and 10 blocked 
lmtpd's.     The
earliest imapd was process 2122,  from the syslog:
-----------------------------------------------------------------------
Nov 14 07:33:03 borg master[2122]: about to exec /usr/cyrus/bin/imapd
Nov 14 07:33:03 borg service-imap[2122]: executed
Nov 14 07:33:03 borg imapd[2122]: accepted connection
Nov 14 07:33:06 borg imapd[2122]: login: occ50-2-172.oakton.edu[10.50.2.172] che
rmak plaintext
Nov 14 07:33:06 borg imapd[2122]: seen_db: user chermak opened /var/imap/user/c/
chermak.seen
Nov 14 07:33:06 borg imapd[2122]: open: user chermak opened Inbox
Nov 14 07:34:23 borg imapd[2122]: seen_db: user chermak opened /var/imap/user/c/
chermak.seen
Nov 14 11:57:03 borg master[4292]: process 2122 exited, signaled to death by 15
-----------------------------------------------------------------------
The final entry is from me killing the process around noon.

a gdb backtrace showed the following
-----------------------------------------------------------------------
(gdb) bt
#0  0x4028a8a1 in __flock () from /lib/libc.so.6
#1  0x8079c8f in lock_reopen (fd=23,
    filename=0x8111be8 "/var/imap/user/c/chermak.seen", sbuf=0xbfffebf0,
    failaction=0xbfffebec) at lock_flock.c:84
#2  0x807b8a7 in starttxn_or_refetch (db=0x8111bd0, mytid=0x8111bc0)
    at cyrusdb_flat.c:211
#3  0x807ba07 in myfetch (db=0x8111bd0, key=0x8112428 "753096e9389d0186",
    keylen=16, data=0xbfffecc4, datalen=0xbfffecc8, mytid=0x8111bc0)
    at cyrusdb_flat.c:270
#4  0x806f933 in seen_readit (seendb=0x8111bb0, lastreadptr=0xbfffed38,
    lastuidptr=0xbfffed3c, lastchangeptr=0xbfffed40, seenuidsptr=0xbfffed44,
    rw=1) at seen_db.c:268
#5  0x806fab8 in seen_lockread (seendb=0x8111bb0, lastreadptr=0xbfffed38,
    lastuidptr=0xbfffed3c, lastchangeptr=0xbfffed40, seenuidsptr=0xbfffed44)
    at seen_db.c:332
#6  0x8060dc6 in append_addseen (mailbox=0xbfffedb0,
    userid=0xbffff05c "chermak", msgrange=0x8112478 "20263") at append.c:896
#7  0x805f945 in append_commit (as=0xbfffedb0, uidvalidity=0xbfffeda4,
    start=0xbfffeda8, num=0xbfffedac) at append.c:240
#8  0x8059a4b in index_copy (mailbox=0x80ffe00, sequence=0x810fbd8 "19326",
    usinguid=1, name=0xbffff2c0 "user.chermak.Trash", copyuidp=0xbffff2bc)
    at index.c:1226
#9  0x80531eb in cmd_copy (tag=0x810fac8 "188", sequence=0x810fbd8 "19326",
---Type <return> to continue, or q <return> to quit---
    name=0x810fc48 "INBOX.Trash", usinguid=1) at imapd.c:3192
#10 0x804d878 in cmdloop () at imapd.c:791
#11 0x804cf50 in service_main (argc=1, argv=0xbffffe14, envp=0xbffffe1c)
    at imapd.c:546
#12 0x804bd00 in main ()
\#13 0x401d2f31 in __libc_start_main (main=0x804b8fc <main>, argc=1,
    ubp_av=0xbffffe14, init=0x804a9e4 <_init>, fini=0x807e83c <_fini>,
    rtld_fini=0x4000e274 <_dl_fini>, stack_end=0xbffffe0c)
    at ../sysdeps/generic/libc-start.c:129
(gdb)
-----------------------------------------------------------------------
A check of other processes indicated that they were waiting for locks on mailbox 
header files that
were already locked by this process.  It appears that this process (2122)  blocked 
trying to get an
exclusive lock on the seen file: /var/imap/user/c/chermak.seen while doing a message 
copy from the
inbox (user.chermak) to the trash subfolder (user.inbox.trash)   Mail client was 
Netscape Communicator
4.51 configured to use the move to trash delete option so I suspect the user was 
trying to delete a
message.

>From an lsof:
-----------------------------------------------------------------------
[root@borg /root]# lsof | grep chermak.seen
imapd      2059  cyrus   17u   REG       8,10     1087     112299 /var/imap/user
/c/chermak.seen.NEW (deleted)
imapd      2122  cyrus   17u   REG       8,10     1087     112299 /var/imap/user
/c/chermak.seen.NEW (deleted)
imapd      2122  cyrus   23u   REG       8,10     1111     112788 /var/imap/user
/c/chermak.seen.NEW (deleted)
imapd      9585  cyrus   18u   REG       8,10     1087     112737 /var/imap/user
/c/chermak.seen.NEW (deleted)
imapd     12411  cyrus   18u   REG       8,10     1087     112823 /var/imap/user
/c/chermak.seen.NEW (deleted)
imapd     21007  cyrus  mem    REG       8,10     1087     112849 /var/imap/user
/c/chermak.seen
imapd     21007  cyrus   18u   REG       8,10     1087     112849 /var/imap/user
/c/chermak.seen
imapd     32590  cyrus   18u   REG       8,10     1087     112779 /var/imap/user
/c/chermak.seen.NEW (deleted)
-----------------------------------------------------------------------
I then grep'ed /proc/locks to find out what locks were already set on these files, the 
only entry was
for inode 112788
-----------------------------------------------------------------------
[root@borg /root]# grep 112788 /proc/locks
9: FLOCK  ADVISORY  WRITE 2122 08:0a:112788 0 9223372036854775807 f68f5a00 f68f5
e60 e03b2280 00000000 f57a4ae0
9: -> FLOCK  ADVISORY  WRITE 2122 08:0a:112788 0 9223372036854775807 f57a4ae0 00
000000 00000000 00000000 f68f5a00
-----------------------------------------------------------------------
This seems to imply the the imapd process had locked the file, and then failed to 
release the lock and
then attempted to lock it again. (and blocked waiting for itself to release the lock)  
  I was
suprised to see this block, thinking that a subsequent call to flock() for the same 
process to the
same inode with the same file descriptor should just proceed.  (In fact a test program 
I wrote worked
this way, the only time i had it block was when I reopened the file with a new file 
descriptor and
tried to lock it again.)    Whether or not it should block, clearly there was a bug 
somewhere that did
not unlock the file when it was supposed to.

I still don't understand under what conditions the imapd creates a new seen file,  all 
the testing I
did left the seen file with the original inode number, but clearly here the processes 
are creating
these .NEW files and then deleting them.

I do have a little more info about this specific crash if it would be helpful, and I 
think I will have
to recompile with seen debug turned on on this box to see if we can get more info 
about when it failed
to unlock.  (I am just a little afraid of the log file sizes, since I already did this 
on my other
less heavily used box and the logs are huge.)

Larry, do you have any other suggestions on what to look at?    I did write a patch 
for the
lock_reopen subroutine that will keep retrying the lock non-blocking and then 
eventually time out with
an error to syslog.   This works in test, but I was reluctant to put it in production, 
since I wanted
to try to get more info for you.   Based on the evidence here I could also modify the 
lock_reopen
function to try to unlock the file first before attempting the lock, but all of these 
are work
arounds.

Any suggestions would be appreciated,  it is very reproducible in my environment,

Thanks,
John


>
> > Lawrence Greenfield wrote:
> >
> > > It's really crucial to understand what process is holding this lock.
> > > It might not be reused---it's probably blocked doing something else
> > > and never releasing the lock.  imapds should never do something that
> > > will block for a long time while holding a lock but evidentally
> > > something is wrong.
> > >
> > > Finding the process that holds the lock, doing a backtrace on it and
> > > (even better) figuring out how it gets into that state is really the
> > > crucial issue for fixing this problem.
> >

Reply via email to