Hi All, Had another locking problem today and was able to get reasonably complete info. Unfortunately, I only had a limited time to debug since I di not identify the problem until long after it started.. So here goes.
At the point I identified the problem, I had 6 blocked IMAP processes and 10 blocked lmtpd's. The earliest imapd was process 2122, from the syslog: ----------------------------------------------------------------------- Nov 14 07:33:03 borg master[2122]: about to exec /usr/cyrus/bin/imapd Nov 14 07:33:03 borg service-imap[2122]: executed Nov 14 07:33:03 borg imapd[2122]: accepted connection Nov 14 07:33:06 borg imapd[2122]: login: occ50-2-172.oakton.edu[10.50.2.172] che rmak plaintext Nov 14 07:33:06 borg imapd[2122]: seen_db: user chermak opened /var/imap/user/c/ chermak.seen Nov 14 07:33:06 borg imapd[2122]: open: user chermak opened Inbox Nov 14 07:34:23 borg imapd[2122]: seen_db: user chermak opened /var/imap/user/c/ chermak.seen Nov 14 11:57:03 borg master[4292]: process 2122 exited, signaled to death by 15 ----------------------------------------------------------------------- The final entry is from me killing the process around noon. a gdb backtrace showed the following ----------------------------------------------------------------------- (gdb) bt #0 0x4028a8a1 in __flock () from /lib/libc.so.6 #1 0x8079c8f in lock_reopen (fd=23, filename=0x8111be8 "/var/imap/user/c/chermak.seen", sbuf=0xbfffebf0, failaction=0xbfffebec) at lock_flock.c:84 #2 0x807b8a7 in starttxn_or_refetch (db=0x8111bd0, mytid=0x8111bc0) at cyrusdb_flat.c:211 #3 0x807ba07 in myfetch (db=0x8111bd0, key=0x8112428 "753096e9389d0186", keylen=16, data=0xbfffecc4, datalen=0xbfffecc8, mytid=0x8111bc0) at cyrusdb_flat.c:270 #4 0x806f933 in seen_readit (seendb=0x8111bb0, lastreadptr=0xbfffed38, lastuidptr=0xbfffed3c, lastchangeptr=0xbfffed40, seenuidsptr=0xbfffed44, rw=1) at seen_db.c:268 #5 0x806fab8 in seen_lockread (seendb=0x8111bb0, lastreadptr=0xbfffed38, lastuidptr=0xbfffed3c, lastchangeptr=0xbfffed40, seenuidsptr=0xbfffed44) at seen_db.c:332 #6 0x8060dc6 in append_addseen (mailbox=0xbfffedb0, userid=0xbffff05c "chermak", msgrange=0x8112478 "20263") at append.c:896 #7 0x805f945 in append_commit (as=0xbfffedb0, uidvalidity=0xbfffeda4, start=0xbfffeda8, num=0xbfffedac) at append.c:240 #8 0x8059a4b in index_copy (mailbox=0x80ffe00, sequence=0x810fbd8 "19326", usinguid=1, name=0xbffff2c0 "user.chermak.Trash", copyuidp=0xbffff2bc) at index.c:1226 #9 0x80531eb in cmd_copy (tag=0x810fac8 "188", sequence=0x810fbd8 "19326", ---Type <return> to continue, or q <return> to quit--- name=0x810fc48 "INBOX.Trash", usinguid=1) at imapd.c:3192 #10 0x804d878 in cmdloop () at imapd.c:791 #11 0x804cf50 in service_main (argc=1, argv=0xbffffe14, envp=0xbffffe1c) at imapd.c:546 #12 0x804bd00 in main () \#13 0x401d2f31 in __libc_start_main (main=0x804b8fc <main>, argc=1, ubp_av=0xbffffe14, init=0x804a9e4 <_init>, fini=0x807e83c <_fini>, rtld_fini=0x4000e274 <_dl_fini>, stack_end=0xbffffe0c) at ../sysdeps/generic/libc-start.c:129 (gdb) ----------------------------------------------------------------------- A check of other processes indicated that they were waiting for locks on mailbox header files that were already locked by this process. It appears that this process (2122) blocked trying to get an exclusive lock on the seen file: /var/imap/user/c/chermak.seen while doing a message copy from the inbox (user.chermak) to the trash subfolder (user.inbox.trash) Mail client was Netscape Communicator 4.51 configured to use the move to trash delete option so I suspect the user was trying to delete a message. >From an lsof: ----------------------------------------------------------------------- [root@borg /root]# lsof | grep chermak.seen imapd 2059 cyrus 17u REG 8,10 1087 112299 /var/imap/user /c/chermak.seen.NEW (deleted) imapd 2122 cyrus 17u REG 8,10 1087 112299 /var/imap/user /c/chermak.seen.NEW (deleted) imapd 2122 cyrus 23u REG 8,10 1111 112788 /var/imap/user /c/chermak.seen.NEW (deleted) imapd 9585 cyrus 18u REG 8,10 1087 112737 /var/imap/user /c/chermak.seen.NEW (deleted) imapd 12411 cyrus 18u REG 8,10 1087 112823 /var/imap/user /c/chermak.seen.NEW (deleted) imapd 21007 cyrus mem REG 8,10 1087 112849 /var/imap/user /c/chermak.seen imapd 21007 cyrus 18u REG 8,10 1087 112849 /var/imap/user /c/chermak.seen imapd 32590 cyrus 18u REG 8,10 1087 112779 /var/imap/user /c/chermak.seen.NEW (deleted) ----------------------------------------------------------------------- I then grep'ed /proc/locks to find out what locks were already set on these files, the only entry was for inode 112788 ----------------------------------------------------------------------- [root@borg /root]# grep 112788 /proc/locks 9: FLOCK ADVISORY WRITE 2122 08:0a:112788 0 9223372036854775807 f68f5a00 f68f5 e60 e03b2280 00000000 f57a4ae0 9: -> FLOCK ADVISORY WRITE 2122 08:0a:112788 0 9223372036854775807 f57a4ae0 00 000000 00000000 00000000 f68f5a00 ----------------------------------------------------------------------- This seems to imply the the imapd process had locked the file, and then failed to release the lock and then attempted to lock it again. (and blocked waiting for itself to release the lock) I was suprised to see this block, thinking that a subsequent call to flock() for the same process to the same inode with the same file descriptor should just proceed. (In fact a test program I wrote worked this way, the only time i had it block was when I reopened the file with a new file descriptor and tried to lock it again.) Whether or not it should block, clearly there was a bug somewhere that did not unlock the file when it was supposed to. I still don't understand under what conditions the imapd creates a new seen file, all the testing I did left the seen file with the original inode number, but clearly here the processes are creating these .NEW files and then deleting them. I do have a little more info about this specific crash if it would be helpful, and I think I will have to recompile with seen debug turned on on this box to see if we can get more info about when it failed to unlock. (I am just a little afraid of the log file sizes, since I already did this on my other less heavily used box and the logs are huge.) Larry, do you have any other suggestions on what to look at? I did write a patch for the lock_reopen subroutine that will keep retrying the lock non-blocking and then eventually time out with an error to syslog. This works in test, but I was reluctant to put it in production, since I wanted to try to get more info for you. Based on the evidence here I could also modify the lock_reopen function to try to unlock the file first before attempting the lock, but all of these are work arounds. Any suggestions would be appreciated, it is very reproducible in my environment, Thanks, John > > > Lawrence Greenfield wrote: > > > > > It's really crucial to understand what process is holding this lock. > > > It might not be reused---it's probably blocked doing something else > > > and never releasing the lock. imapds should never do something that > > > will block for a long time while holding a lock but evidentally > > > something is wrong. > > > > > > Finding the process that holds the lock, doing a backtrace on it and > > > (even better) figuring out how it gets into that state is really the > > > crucial issue for fixing this problem. > >