A warning to those considering upgrading to Debian 10 (buster): we have seen 
occasional NFS hangs with Dovecot when using the stock Debian buster kernel 
(4.19.67-2+deb10u1).

When we downgrade to the Debian stretch kernel (4.9.189-3+deb9u1), the issue 
does not occur. Note that we *only* downgraded the kernel; the rest of the OS 
is still Debian buster, running Dovecot 2.3.8.

A little more info: we have a Dovecot cluster that uses mdbox for storage on an 
NFS server (NetApp Cmode version 9.6P2). We use a Dovecot director layer, so a 
user is always connected to the same back-end Dovecot server. The NFS hang 
occurs on the back-end server.

Once the process hangs, other processes trying to write to the same mailbox 
get an error like this:

Timeout (180s) while waiting for lock for transaction log file 
/var/mail/.../index/storage/dovecot.map.index.log (WRITE lock held by pid XXXX)
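
To collect the pids of the lock holders from a log file, something like the 
following rough sketch could help. The pattern is modeled only on the error 
text quoted above, so the exact wording may differ between Dovecot versions; 
the function names are made up for illustration.

```python
import re

# Pattern modeled on the error text quoted above (an assumption, not an
# official Dovecot log format). \s+ after "file" tolerates a wrapped line.
LOCK_TIMEOUT_RE = re.compile(
    r"Timeout \((?P<seconds>\d+)s\) while waiting for lock for "
    r"transaction log file\s+(?P<path>\S+) "
    r"\((?P<mode>\w+) lock held by pid (?P<pid>\d+)\)"
)


def parse_lock_timeout(line):
    """Return (seconds, path, mode, pid), or None if the line does not match."""
    m = LOCK_TIMEOUT_RE.search(line)
    if m is None:
        return None
    return (int(m.group("seconds")), m.group("path"),
            m.group("mode"), int(m.group("pid")))


def holder_pids(lines):
    """Collect the distinct pids reported as lock holders, sorted."""
    parsed = (parse_lock_timeout(line) for line in lines)
    return sorted({p[3] for p in parsed if p is not None})
```

Feeding the mail log through holder_pids() gives a short list of pids to 
inspect on the back-end server.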

The stuck process itself doesn't seem to do anything: it sits in "D" 
(uninterruptible sleep) state, and "strace" doesn't show anything (after 
attaching, strace itself needs a kill -KILL to stop). The only way to unwedge 
the process seems to be a kill -KILL of the stuck process. Reading from the 
mailbox is still possible.
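
For anyone who wants to check their own back-ends for processes in this state, 
here is a minimal sketch that scans /proc for tasks in "D" state. It assumes a 
Linux /proc layout; the helper names are hypothetical, and the wchan file may 
be empty or show "0" depending on kernel configuration and privileges.

```python
import os


def parse_stat(stat_text):
    """Split the start of /proc/<pid>/stat into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so we look for the last ')' instead of splitting on whitespace.
    """
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """Return a sorted list of (pid, comm, wchan) for tasks in 'D' state."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(proc_root, entry, "wchan")) as fh:
                wchan = fh.read().strip()
        except OSError:
            wchan = ""  # kernel wait channel not available to us
        results.append((pid, comm, wchan))
    return sorted(results)


if __name__ == "__main__":
    for pid, comm, wchan in find_d_state():
        print(pid, comm, wchan)
```

A process that is only briefly in "D" state is normal during disk or NFS I/O, 
so it is worth running this a few times and looking for pids that stay listed.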

We are in the process of contacting the linux-nfs folks about this, but I'm 
trying to reproduce this on a test-cluster first, to be able to present a 
well-documented case. Since this hang doesn't happen immediately, but takes a 
few hours to a day to occur in the wild, or a few thousand writes to the same 
mailbox, it's a bit hard to debug.

--
Jan-Pieter Cornet <joh...@xs4all.net>
Systeembeheer XS4ALL Internet bv
www.xs4all.nl

