Hi,

[Formatting is a bit rough, replying from a trimmed digest email]

Message: 1
Date: Fri, 6 Apr 2018 15:04:35 +0200
From: Michael Grimm <trash...@ellael.org>
To: Dovecot Mailing List <dovecot@dovecot.org>
Subject: Re: 2.3.1 Replication is throwing scary errors
Message-ID: <e7e7a927-68f8-443f-ba59-e66ced8fe...@ellael.org>
Content-Type: text/plain;       charset=utf-8

Reuben Farrelly wrote:
From: Michael Grimm <trash...@ellael.org>

[This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct 
servers.]
I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error 
messages at server1 (and vice versa at server2) as follows:
        | Apr  2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: 
dsync(server2.lan): I/O has stalled, \
                no activity for 600 seconds (last sent=mail_change, last 
recv=mail_change (EOL))
        | Apr  2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: 
Timeout during state=sync_mails \
                (send=changes recv=mail_requests)
        [?]
        | Apr  2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: 
dsync(server2.lan): I/O has stalled, \
                no activity for 600 seconds (last sent=mail, last recv=mail 
(EOL))
        | Apr  2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: 
Timeout during state=sync_mails \
                (send=mails recv=recv_last_common)
I cannot see in my personal account any missing replications, *but* I haven't 
tested this thoroughly enough. I do have customers being serviced at these 
productive servers, *thus* I'm back to 2.2.35 until I do understand or have 
learned what is going on.

In my reply to this statement of mine I mentioned that I have seen those 
timeouts quite some times during the past year. Thus, I upgraded to 2.3.1 
again, and boom: after some hours I ended up in hanging processes [1] like (see 
Remko's mail in addition) ...

        doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server)

? at server2 paired with a file like ?

        -rw------- 1 vmail dovecot uarch 0 Apr 3 16:52 
/home/to/USER1/.dovecot-sync.lock

Corresponding logfile entries at server2 are like ?

   Apr  3 17:10:49 <mail.err> server2.lan dovecot: doveadm: Error: Couldn't 
lock /home/to/USER1/.dovecot-sync.lock: \
   fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking 
failed: Timed out after 30 seconds \
   (WRITE lock held by pid 51110)

[1] Even stopping dovecot will not end those processes. One has to manually 
kill those before restarting dovecot.

After one day of testing 2.3.1 with a couple of those episodes of 
locking/timeout, and now missing mails depending with server your MUA will 
connect to, I went back to 2.2.35. After two days at that version I never had 
such an episode again.

It's not just you.  This issue hit me recently, and it was impacting
replication noticeably.  I am following git master-2.3 .
[...]
There is also a second issue of a long standing race with replication
occurring somewhere whereby if a mail comes in, is written to disk, is
replicated and then deleted in short succession, it will reappear
again to the MUA.  I suspect the mail is being replicated back from
the remote.  A few people have reported it over the years but it's not
reliable or consistent, so it has never been fixed.
And lastly there has been an ongoing but seemingly minor issue
relating to locking timing out after 30s particularly on the remote
host that is being replicated to.  I rarely see the problem on my
local disk where almost all of the mail comes in, it's almost always
occurring on the replicate/remote system.

It might be time to describe our setups in order to possibly find common 
grounds that might trigger this issue you describe and Rimko and myself ran 
into as well.

Servers:        Cloud Instances (both identical), around 25ms latency apart.
                Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class 
CPU)
                Both servers are connected via IPsec/racoon tunnels
OS:             FreeBSD 11.1-STABLE (both servers)
Filesystem:     ZFS
MTA:            postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
IMAP:           dovecot running in FreeBSD jails (issues with 2.3.1, fine with 
2.2.35)
Replication:    unsecured tcp / master-master
MUA:            mainly iOS or macOS mail.app, rarely roundcube

For me:

Servers:        Main: VM on a VMWare ESXi local system (light
                load), local SSD disks (no NFS)
                Redundant: Linode VM in Singapore, around 250ms away
                Also no NFS.  Linode use SSDs for IO.
                There is native IPv6 connectivity between both VMs.
                As I am using TCPs I don't have a VPN between them -
                just raw IPv6 end to end.
OS:             Gentoo Linux x86_64 kept well up to date
Filesystem:     EXT4 for both
MTA:            Postfix 3.4-x (same as you)
IMAP:           Dovecot running natively on the machine (no jail/chroot)
Replication:    Secured TCP / master-master (but only one master is
                actively used for SMTP and MUA access)
MUA:            Thunderbird, Maildroid, MS Mail, Gmail client,
                Roundcube, LG Mail client on handset etc

I believe it is worthwhile to mention here that I run a poor man's fail-over 
approach (round-robin DNS) as follows:

DNS:            mail.server.tld resolves to one IP4 and one IP6 address of each 
server, thus 4 IP addresses in total

For me I have a simple pair of servers that replicate mail to each other. Clients connect to the main server only, which is where mail comes in. The second one that experiences the problem with locking is only a redundant server which is a secondary MX, that only gets occasional SMTP deliveries from time to time. So the amount of replication back from the "secondary" is very minimal; the vast bulk of the replication is the primary replicating changes out.

Reuben

Reply via email to