Re: Host ... is being updated before previous update had finished

2017-04-24 Thread Timo Sirainen
It's usually very troublesome to try to reproduce and debug these.. It looks 
like the directors got into some kind of a broken state that wouldn't fix 
itself. Some host update going around and around in the director ring without 
being detected as a duplicate/obsolete update. For now I think the only 
workaround is to do what you did: stop all directors (or maybe a single one 
could have been left running) and start them up again.

> On 21 Apr 2017, at 18.50, Mark Moseley <moseleym...@gmail.com> wrote:
> 
> Timo/Aki/Docecot guys, any hints here? Is this a bug? Design issue?
> 
> On Fri, Apr 7, 2017 at 10:10 AM Mark Moseley <moseleym...@gmail.com> wrote:
> 
>> On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley <moseleym...@gmail.com>
>> wrote:
>> 
>>> We just had a bunch of backend boxes go down due to a DDoS in our
>>> director cluster. When the DDoS died down, our director ring was a mess.
>>> 
>>> Each box had thousands (and hundreds per second, which is a bit much) of
>>> log lines like the following:
>>> 
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>>> 10.1.17.15 is being updated before previous update had finished (up ->
>>> down) - setting to state=down vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>>> 10.1.17.15 is being updated before previous update had finished (down ->
>>> up) - setting to state=up vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>>> 10.1.17.15 is being updated before previous update had finished (up ->
>>> down) - setting to state=down vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>>> 10.1.17.15 is being updated before previous update had finished (down ->
>>> up) - setting to state=up vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>>> 10.1.17.15 is being updated before previous update had finished (up ->
>>> down) - setting to state=down vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>>> 10.1.17.15 is being updated before previous update had finished (down ->
>>> up) - setting to state=up vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>>> 10.1.17.15 is being updated before previous update had finished (up ->
>>> down) - setting to state=down vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>>> 10.1.17.15 is being updated before previous update had finished (down ->
>>> up) - setting to state=up vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>>> 10.1.17.15 is being updated before previous update had finished (up ->
>>> down) - setting to state=down vhosts=100
>>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>>> 10.1.17.15 is being updated before previous update had finished (down ->
>>> up) - setting to state=up vhosts=100
>>> 
>>> This was on every director box and the status of all of the directors in
>>> 'doveadm director ring status' was 'handshaking'.
>>> 
>>> Here's a sample packet between directors:
>>> 
>>> 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq
>>> 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr
>>> 1721203906], length 981
>>> 
>>> Q.  [f.|.HOST   10.1.20.10  90901006732 10.1.17.15
>>> 100 D1491260800
>>> HOST10.1.20.10  90901006733 10.1.17.15  100
>>> U1491260800
>>> HOST10.1.20.10  90901006734 10.1.17.15  100
>>> D1491260800
>>> HOST10.1.20.10  90901006735 10.1.17.15  100
>>> U1491260800
>>> HOST10.1.20.10  90901006736 10.1.17.15  100
>>> D1491260800
>>> HOST10.1.20.10  90901006737 10.1.17.15  100
>>> U1491260800
>>> HOST10.1.20.10  90901006738 10.1.17.15  100
>>> D1491260800
>>> HOST10.1.20.10  90901006739 10.1.17.15  100
>>> U1491260800
>>> HOST10.1.20.10  90901006740 10.1.17.15  100
>>> D1491260800
>>> HOST10.1.20.10  90901006741 10.1.17.15  100
>>> U1491260800
>>> HOST10.1.20.10  90901006742 10.1.17.15  100
>>> D1491260800
>>> HOST10.1.20.10  909010

Re: Host ... is being updated before previous update had finished

2017-04-21 Thread Mark Moseley
Timo/Aki/Docecot guys, any hints here? Is this a bug? Design issue?

On Fri, Apr 7, 2017 at 10:10 AM Mark Moseley <moseleym...@gmail.com> wrote:

> On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley <moseleym...@gmail.com>
> wrote:
>
>> We just had a bunch of backend boxes go down due to a DDoS in our
>> director cluster. When the DDoS died down, our director ring was a mess.
>>
>> Each box had thousands (and hundreds per second, which is a bit much) of
>> log lines like the following:
>>
>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>> 10.1.17.15 is being updated before previous update had finished (up ->
>> down) - setting to state=down vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>> 10.1.17.15 is being updated before previous update had finished (down ->
>> up) - setting to state=up vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>> 10.1.17.15 is being updated before previous update had finished (up ->
>> down) - setting to state=down vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>> 10.1.17.15 is being updated before previous update had finished (down ->
>> up) - setting to state=up vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
>> 10.1.17.15 is being updated before previous update had finished (up ->
>> down) - setting to state=down vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>> 10.1.17.15 is being updated before previous update had finished (down ->
>> up) - setting to state=up vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>> 10.1.17.15 is being updated before previous update had finished (up ->
>> down) - setting to state=down vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>> 10.1.17.15 is being updated before previous update had finished (down ->
>> up) - setting to state=up vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>> 10.1.17.15 is being updated before previous update had finished (up ->
>> down) - setting to state=down vhosts=100
>> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
>> 10.1.17.15 is being updated before previous update had finished (down ->
>> up) - setting to state=up vhosts=100
>>
>> This was on every director box and the status of all of the directors in
>> 'doveadm director ring status' was 'handshaking'.
>>
>> Here's a sample packet between directors:
>>
>> 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq
>> 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr
>> 1721203906], length 981
>>
>> Q.  [f.|.HOST   10.1.20.10  90901006732 10.1.17.15
>>  100 D1491260800
>> HOST10.1.20.10  90901006733 10.1.17.15  100
>> U1491260800
>> HOST10.1.20.10  90901006734 10.1.17.15  100
>> D1491260800
>> HOST10.1.20.10  90901006735 10.1.17.15  100
>> U1491260800
>> HOST10.1.20.10  90901006736 10.1.17.15  100
>> D1491260800
>> HOST10.1.20.10  90901006737 10.1.17.15  100
>> U1491260800
>> HOST10.1.20.10  90901006738 10.1.17.15  100
>> D1491260800
>> HOST10.1.20.10  90901006739 10.1.17.15  100
>> U1491260800
>> HOST10.1.20.10  90901006740 10.1.17.15  100
>> D1491260800
>> HOST10.1.20.10  90901006741 10.1.17.15  100
>> U1491260800
>> HOST10.1.20.10  90901006742 10.1.17.15  100
>> D1491260800
>> HOST10.1.20.10  90901006743 10.1.17.15  100
>> U1491260800
>> HOST10.1.20.10  90901006744 10.1.17.15  100
>> D1491260800
>> HOST10.1.20.10  90901006745 10.1.17.15  100
>> U1491260800
>> HOST10.1.20.10  90901006746 10.1.17.15  100
>> D1491260800
>> HOST10.1.20.10  90901006747 10.1.17.15  100
>> U1491260800
>> SYNC10.1.20.10  90901011840 7   1491263483  3377546382
>>
>> I'm guessing that D1491260800 is the user hash (with D for down), and the
>> U version is for 'up'.
>>
>> I'm happy to provide the full tcpdump (and/or doveconf -a), though the
>> tcpdump is basically all identical the one I pasted (same hash, same host).
>>
>> This seems pretty fragile. There shou

Re: Host ... is being updated before previous update had finished

2017-04-07 Thread Mark Moseley
On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley <moseleym...@gmail.com> wrote:

> We just had a bunch of backend boxes go down due to a DDoS in our director
> cluster. When the DDoS died down, our director ring was a mess.
>
> Each box had thousands (and hundreds per second, which is a bit much) of
> log lines like the following:
>
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
>
> This was on every director box and the status of all of the directors in
> 'doveadm director ring status' was 'handshaking'.
>
> Here's a sample packet between directors:
>
> 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq
> 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr
> 1721203906], length 981
>
> Q.  [f.|.HOST   10.1.20.10  90901006732 10.1.17.15
>  100 D1491260800
> HOST10.1.20.10  90901006733 10.1.17.15  100
> U1491260800
> HOST10.1.20.10  90901006734 10.1.17.15  100
> D1491260800
> HOST10.1.20.10  90901006735 10.1.17.15  100
> U1491260800
> HOST10.1.20.10  90901006736 10.1.17.15  100
> D1491260800
> HOST10.1.20.10  90901006737 10.1.17.15  100
> U1491260800
> HOST10.1.20.10  90901006738 10.1.17.15  100
> D1491260800
> HOST10.1.20.10  90901006739 10.1.17.15  100
> U1491260800
> HOST10.1.20.10  90901006740 10.1.17.15  100
> D1491260800
> HOST10.1.20.10  90901006741 10.1.17.15  100
> U1491260800
> HOST10.1.20.10  90901006742 10.1.17.15  100
> D1491260800
> HOST10.1.20.10  90901006743 10.1.17.15  100
> U1491260800
> HOST10.1.20.10  90901006744 10.1.17.15  100
> D1491260800
> HOST10.1.20.10  90901006745 10.1.17.15  100
> U1491260800
> HOST10.1.20.10  90901006746 10.1.17.15  100
> D1491260800
> HOST10.1.20.10  90901006747 10.1.17.15  100
> U1491260800
> SYNC10.1.20.10  90901011840 7   1491263483  3377546382
>
> I'm guessing that D1491260800 is the user hash (with D for down), and the
> U version is for 'up'.
>
> I'm happy to provide the full tcpdump (and/or doveconf -a), though the
> tcpdump is basically all identical the one I pasted (same hash, same host).
>
> This seems pretty fragile. There should be some sort of tie break for
> that, instead of bringing the entire cluster to its knees. Or just drop the
> backend host completely. Or something, anything besides hosing things
> pretty badly.
>
> This is 2.2.27, on both the directors and backend. If the answer is
> upgrade to 2.2.28, then I'll upgrade immediately. I see commit
> a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but the
> commit predates 2.2.27 by a month and a half.
>
> In the meantime, is there a

Host ... is being updated before previous update had finished

2017-04-03 Thread Mark Moseley
We just had a bunch of backend boxes go down due to a DDoS in our director
cluster. When the DDoS died down, our director ring was a mess.

Each box had thousands (and hundreds per second, which is a bit much) of
log lines like the following:

Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100

This was on every director box and the status of all of the directors in
'doveadm director ring status' was 'handshaking'.

Here's a sample packet between directors:

19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq
4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr
1721203906], length 981

Q.  [f.|.HOST   10.1.20.10  90901006732 10.1.17.15  100
D1491260800
HOST10.1.20.10  90901006733 10.1.17.15  100
U1491260800
HOST10.1.20.10  90901006734 10.1.17.15  100
D1491260800
HOST10.1.20.10  90901006735 10.1.17.15  100
U1491260800
HOST10.1.20.10  90901006736 10.1.17.15  100
D1491260800
HOST10.1.20.10  90901006737 10.1.17.15  100
U1491260800
HOST10.1.20.10  90901006738 10.1.17.15  100
D1491260800
HOST10.1.20.10  90901006739 10.1.17.15  100
U1491260800
HOST10.1.20.10  90901006740 10.1.17.15  100
D1491260800
HOST10.1.20.10  90901006741 10.1.17.15  100
U1491260800
HOST10.1.20.10  90901006742 10.1.17.15  100
D1491260800
HOST10.1.20.10  90901006743 10.1.17.15  100
U1491260800
HOST10.1.20.10  90901006744 10.1.17.15  100
D1491260800
HOST10.1.20.10  90901006745 10.1.17.15  100
U1491260800
HOST10.1.20.10  90901006746 10.1.17.15  100
D1491260800
HOST10.1.20.10  90901006747 10.1.17.15  100
U1491260800
SYNC10.1.20.10  90901011840 7   1491263483  3377546382

I'm guessing that D1491260800 is the user hash (with D for down), and the U
version is for 'up'.

I'm happy to provide the full tcpdump (and/or doveconf -a), though the
tcpdump is basically all identical the one I pasted (same hash, same host).

This seems pretty fragile. There should be some sort of tie break for that,
instead of bringing the entire cluster to its knees. Or just drop the
backend host completely. Or something, anything besides hosing things
pretty badly.

This is 2.2.27, on both the directors and backend. If the answer is upgrade
to 2.2.28, then I'll upgrade immediately. I see
commit a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but
the commit predates 2.2.27 by a month and a half.

In the meantime, is there any doveadm command I could've done to fix this?
I tried removing the host (doveadm director remove 10.1.17.15) but that
didn't do anything. I didn't think to try to flush the mapping for that
user till just now. I suspect that with the ring unsync'd, flushing the
user wouldn't have helped.

The only remedy was to kill dovecot on every box in the director cluster
and then (with dovecot down on *all* of them) start dovecot back up.
Restarting each director's dovecot (with other directors' dovecots still
running) did nothing. Only by brining the ent