Re: Host ... is being updated before previous update had finished
It's usually very troublesome to try to reproduce and debug these.. It looks like the directors got into some kind of a broken state that wouldn't fix itself. Some host update going around and around in the director ring without being detected as a duplicate/obsolete update. For now I think the only workaround is to do what you did: stop all directors (or maybe a single one could have been left running) and start them up again. > On 21 Apr 2017, at 18.50, Mark Moseley <moseleym...@gmail.com> wrote: > > Timo/Aki/Docecot guys, any hints here? Is this a bug? Design issue? > > On Fri, Apr 7, 2017 at 10:10 AM Mark Moseley <moseleym...@gmail.com> wrote: > >> On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley <moseleym...@gmail.com> >> wrote: >> >>> We just had a bunch of backend boxes go down due to a DDoS in our >>> director cluster. When the DDoS died down, our director ring was a mess. >>> >>> Each box had thousands (and hundreds per second, which is a bit much) of >>> log lines like the following: >>> >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (up -> >>> down) - setting to state=down vhosts=100 >>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >>> 10.1.17.15 is being updated before previous update had finished (down -> >>> up) - setting to state=up vhosts=100 >>> >>> This was on every director box and the status of all of the directors in >>> 'doveadm director ring status' was 'handshaking'. >>> >>> Here's a sample packet between directors: >>> >>> 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq >>> 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr >>> 1721203906], length 981 >>> >>> Q. [f.|.HOST 10.1.20.10 90901006732 10.1.17.15 >>> 100 D1491260800 >>> HOST10.1.20.10 90901006733 10.1.17.15 100 >>> U1491260800 >>> HOST10.1.20.10 90901006734 10.1.17.15 100 >>> D1491260800 >>> HOST10.1.20.10 90901006735 10.1.17.15 100 >>> U1491260800 >>> HOST10.1.20.10 90901006736 10.1.17.15 100 >>> D1491260800 >>> HOST10.1.20.10 90901006737 10.1.17.15 100 >>> U1491260800 >>> HOST10.1.20.10 90901006738 10.1.17.15 100 >>> D1491260800 >>> HOST10.1.20.10 90901006739 10.1.17.15 100 >>> U1491260800 >>> HOST10.1.20.10 90901006740 10.1.17.15 100 >>> D1491260800 >>> HOST10.1.20.10 90901006741 10.1.17.15 100 >>> U1491260800 >>> HOST10.1.20.10 90901006742 10.1.17.15 100 >>> D1491260800 >>> HOST10.1.20.10 909010
Re: Host ... is being updated before previous update had finished
Timo/Aki/Docecot guys, any hints here? Is this a bug? Design issue? On Fri, Apr 7, 2017 at 10:10 AM Mark Moseley <moseleym...@gmail.com> wrote: > On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley <moseleym...@gmail.com> > wrote: > >> We just had a bunch of backend boxes go down due to a DDoS in our >> director cluster. When the DDoS died down, our director ring was a mess. >> >> Each box had thousands (and hundreds per second, which is a bit much) of >> log lines like the following: >> >> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >> 10.1.17.15 is being updated before previous update had finished (up -> >> down) - setting to state=down vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >> 10.1.17.15 is being updated before previous update had finished (down -> >> up) - setting to state=up vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >> 10.1.17.15 is being updated before previous update had finished (up -> >> down) - setting to state=down vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >> 10.1.17.15 is being updated before previous update had finished (down -> >> up) - setting to state=up vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host >> 10.1.17.15 is being updated before previous update had finished (up -> >> down) - setting to state=down vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >> 10.1.17.15 is being updated before previous update had finished (down -> >> up) - setting to state=up vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >> 10.1.17.15 is being updated before previous update had finished (up -> >> down) - setting to state=down vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >> 10.1.17.15 is being updated before previous update had finished (down -> >> up) - setting to state=up vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >> 10.1.17.15 is being updated before previous update had finished (up -> >> down) - setting to state=down vhosts=100 >> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host >> 10.1.17.15 is being updated before previous update had finished (down -> >> up) - setting to state=up vhosts=100 >> >> This was on every director box and the status of all of the directors in >> 'doveadm director ring status' was 'handshaking'. >> >> Here's a sample packet between directors: >> >> 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq >> 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr >> 1721203906], length 981 >> >> Q. [f.|.HOST 10.1.20.10 90901006732 10.1.17.15 >> 100 D1491260800 >> HOST10.1.20.10 90901006733 10.1.17.15 100 >> U1491260800 >> HOST10.1.20.10 90901006734 10.1.17.15 100 >> D1491260800 >> HOST10.1.20.10 90901006735 10.1.17.15 100 >> U1491260800 >> HOST10.1.20.10 90901006736 10.1.17.15 100 >> D1491260800 >> HOST10.1.20.10 90901006737 10.1.17.15 100 >> U1491260800 >> HOST10.1.20.10 90901006738 10.1.17.15 100 >> D1491260800 >> HOST10.1.20.10 90901006739 10.1.17.15 100 >> U1491260800 >> HOST10.1.20.10 90901006740 10.1.17.15 100 >> D1491260800 >> HOST10.1.20.10 90901006741 10.1.17.15 100 >> U1491260800 >> HOST10.1.20.10 90901006742 10.1.17.15 100 >> D1491260800 >> HOST10.1.20.10 90901006743 10.1.17.15 100 >> U1491260800 >> HOST10.1.20.10 90901006744 10.1.17.15 100 >> D1491260800 >> HOST10.1.20.10 90901006745 10.1.17.15 100 >> U1491260800 >> HOST10.1.20.10 90901006746 10.1.17.15 100 >> D1491260800 >> HOST10.1.20.10 90901006747 10.1.17.15 100 >> U1491260800 >> SYNC10.1.20.10 90901011840 7 1491263483 3377546382 >> >> I'm guessing that D1491260800 is the user hash (with D for down), and the >> U version is for 'up'. >> >> I'm happy to provide the full tcpdump (and/or doveconf -a), though the >> tcpdump is basically all identical the one I pasted (same hash, same host). >> >> This seems pretty fragile. There shou
Re: Host ... is being updated before previous update had finished
On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley <moseleym...@gmail.com> wrote: > We just had a bunch of backend boxes go down due to a DDoS in our director > cluster. When the DDoS died down, our director ring was a mess. > > Each box had thousands (and hundreds per second, which is a bit much) of > log lines like the following: > > Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host > 10.1.17.15 is being updated before previous update had finished (up -> > down) - setting to state=down vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host > 10.1.17.15 is being updated before previous update had finished (down -> > up) - setting to state=up vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host > 10.1.17.15 is being updated before previous update had finished (up -> > down) - setting to state=down vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host > 10.1.17.15 is being updated before previous update had finished (down -> > up) - setting to state=up vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host > 10.1.17.15 is being updated before previous update had finished (up -> > down) - setting to state=down vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host > 10.1.17.15 is being updated before previous update had finished (down -> > up) - setting to state=up vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host > 10.1.17.15 is being updated before previous update had finished (up -> > down) - setting to state=down vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host > 10.1.17.15 is being updated before previous update had finished (down -> > up) - setting to state=up vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host > 10.1.17.15 is being updated before previous update had finished (up -> > down) - setting to state=down vhosts=100 > Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host > 10.1.17.15 is being updated before previous update had finished (down -> > up) - setting to state=up vhosts=100 > > This was on every director box and the status of all of the directors in > 'doveadm director ring status' was 'handshaking'. > > Here's a sample packet between directors: > > 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq > 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr > 1721203906], length 981 > > Q. [f.|.HOST 10.1.20.10 90901006732 10.1.17.15 > 100 D1491260800 > HOST10.1.20.10 90901006733 10.1.17.15 100 > U1491260800 > HOST10.1.20.10 90901006734 10.1.17.15 100 > D1491260800 > HOST10.1.20.10 90901006735 10.1.17.15 100 > U1491260800 > HOST10.1.20.10 90901006736 10.1.17.15 100 > D1491260800 > HOST10.1.20.10 90901006737 10.1.17.15 100 > U1491260800 > HOST10.1.20.10 90901006738 10.1.17.15 100 > D1491260800 > HOST10.1.20.10 90901006739 10.1.17.15 100 > U1491260800 > HOST10.1.20.10 90901006740 10.1.17.15 100 > D1491260800 > HOST10.1.20.10 90901006741 10.1.17.15 100 > U1491260800 > HOST10.1.20.10 90901006742 10.1.17.15 100 > D1491260800 > HOST10.1.20.10 90901006743 10.1.17.15 100 > U1491260800 > HOST10.1.20.10 90901006744 10.1.17.15 100 > D1491260800 > HOST10.1.20.10 90901006745 10.1.17.15 100 > U1491260800 > HOST10.1.20.10 90901006746 10.1.17.15 100 > D1491260800 > HOST10.1.20.10 90901006747 10.1.17.15 100 > U1491260800 > SYNC10.1.20.10 90901011840 7 1491263483 3377546382 > > I'm guessing that D1491260800 is the user hash (with D for down), and the > U version is for 'up'. > > I'm happy to provide the full tcpdump (and/or doveconf -a), though the > tcpdump is basically all identical the one I pasted (same hash, same host). > > This seems pretty fragile. There should be some sort of tie break for > that, instead of bringing the entire cluster to its knees. Or just drop the > backend host completely. Or something, anything besides hosing things > pretty badly. > > This is 2.2.27, on both the directors and backend. If the answer is > upgrade to 2.2.28, then I'll upgrade immediately. I see commit > a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but the > commit predates 2.2.27 by a month and a half. > > In the meantime, is there a
Host ... is being updated before previous update had finished
We just had a bunch of backend boxes go down due to a DDoS in our director cluster. When the DDoS died down, our director ring was a mess. Each box had thousands (and hundreds per second, which is a bit much) of log lines like the following: Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 This was on every director box and the status of all of the directors in 'doveadm director ring status' was 'handshaking'. Here's a sample packet between directors: 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr 1721203906], length 981 Q. [f.|.HOST 10.1.20.10 90901006732 10.1.17.15 100 D1491260800 HOST10.1.20.10 90901006733 10.1.17.15 100 U1491260800 HOST10.1.20.10 90901006734 10.1.17.15 100 D1491260800 HOST10.1.20.10 90901006735 10.1.17.15 100 U1491260800 HOST10.1.20.10 90901006736 10.1.17.15 100 D1491260800 HOST10.1.20.10 90901006737 10.1.17.15 100 U1491260800 HOST10.1.20.10 90901006738 10.1.17.15 100 D1491260800 HOST10.1.20.10 90901006739 10.1.17.15 100 U1491260800 HOST10.1.20.10 90901006740 10.1.17.15 100 D1491260800 HOST10.1.20.10 90901006741 10.1.17.15 100 U1491260800 HOST10.1.20.10 90901006742 10.1.17.15 100 D1491260800 HOST10.1.20.10 90901006743 10.1.17.15 100 U1491260800 HOST10.1.20.10 90901006744 10.1.17.15 100 D1491260800 HOST10.1.20.10 90901006745 10.1.17.15 100 U1491260800 HOST10.1.20.10 90901006746 10.1.17.15 100 D1491260800 HOST10.1.20.10 90901006747 10.1.17.15 100 U1491260800 SYNC10.1.20.10 90901011840 7 1491263483 3377546382 I'm guessing that D1491260800 is the user hash (with D for down), and the U version is for 'up'. I'm happy to provide the full tcpdump (and/or doveconf -a), though the tcpdump is basically all identical the one I pasted (same hash, same host). This seems pretty fragile. There should be some sort of tie break for that, instead of bringing the entire cluster to its knees. Or just drop the backend host completely. Or something, anything besides hosing things pretty badly. This is 2.2.27, on both the directors and backend. If the answer is upgrade to 2.2.28, then I'll upgrade immediately. I see commit a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but the commit predates 2.2.27 by a month and a half. In the meantime, is there any doveadm command I could've done to fix this? I tried removing the host (doveadm director remove 10.1.17.15) but that didn't do anything. I didn't think to try to flush the mapping for that user till just now. I suspect that with the ring unsync'd, flushing the user wouldn't have helped. The only remedy was to kill dovecot on every box in the director cluster and then (with dovecot down on *all* of them) start dovecot back up. Restarting each director's dovecot (with other directors' dovecots still running) did nothing. Only by brining the ent