Hi Emerson,

On Tue, Jan 15, 2019 at 12:21:07PM +0100, Emerson Gomes wrote:
> Hi Willy, Tim,
>
> I am providing some more details about my setup if you wish to try to
> reproduce the issue.
> As I mentioned before, I have 5 HAProxy nodes, all of them listening to
> public IPs.
> My DNS is setup with round-robin mode on AWS R53, resolving to one of the
> HAProxy nodes individual IPs for each request.
> It means that very commonly one client will have multiple connections with
> many (or even) all nodes in the cluster - Also they do tend to
> connect/disconnect fast (little keep-alive usage), making this racing
> condition quite likely to happen.
>
> I suppose the scenario Tim described earlier is accurate:
>
> - Connect to peer A (A=1, B=0)
> - Peer A sends 1 to B (A=1, B=1)
> - Kill connection to A (A=0, B=1)
> - Connect to peer B (A=0, B=2)
> - Peer A sends 0 to B (A=0, B=0)
> - Peer B sends 0/2 to A (A=?, B=0)
> - Kill connection to B (A=?, B=-1)
> - Peer B sends -1 to A (A=-1, B=-1)
Got it! I thought the problem was local to a process and that we replicated
bad data, but in fact not, it's a distributed race. In this case there is no
other short-term solution, and the drift has no reason to significantly
accumulate over time.

The only long-term solution I'd be seeing to work around this specific
pattern would be to keep such values as differential pairs:

  - count and synchronize the number of ++
  - count and synchronize the number of --

In this case the real value is the difference between the two. But it's a
bit overkill and is still prone to other races when connections appear in
parallel on the two peers. Then at this point better use an external
aggregator.

OK I'm merging Tim's patch now.

Thanks!
Willy
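
For illustration only, here is a minimal Python sketch of the race quoted above and of the differential-pair idea. It is not HAProxy code: every name in it (`replay_absolute`, `PairCounter`, `merge`, and so on) is hypothetical. Two things are assumptions rather than anything stated in the thread: the ambiguous "0/2" step is replayed as B sending 0, and a receiving peer merges the two totals by keeping the larger of each, which relies on both totals only ever growing.

```python
# Illustrative model only: two peers tracking one client's concurrent
# connection count. Not HAProxy code; all names are hypothetical.

def replay_absolute():
    """Replay the quoted sequence when peers push the absolute value."""
    a = b = 0
    a += 1   # connect to peer A      (A=1,  B=0)
    b = a    # peer A sends 1 to B    (A=1,  B=1)
    a -= 1   # kill connection to A   (A=0,  B=1)
    b += 1   # connect to peer B      (A=0,  B=2)
    b = a    # peer A sends 0 to B    (A=0,  B=0)  B's own connection is lost
    a = b    # peer B sends 0 to A    (A=0,  B=0)  assumption: the "0" case
    b -= 1   # kill connection to B   (A=0,  B=-1)
    a = b    # peer B sends -1 to A   (A=-1, B=-1)
    return a, b


class PairCounter:
    """Differential pair: total number of ++ and total number of --."""

    def __init__(self):
        self.inc = 0  # connections opened so far
        self.dec = 0  # connections closed so far

    def connect(self):
        self.inc += 1

    def disconnect(self):
        self.dec += 1

    def merge(self, other):
        # Assumption (not from the thread): both totals only grow, so the
        # receiver keeps the larger of each instead of overwriting.
        self.inc = max(self.inc, other.inc)
        self.dec = max(self.dec, other.dec)

    def value(self):
        # The real value is the difference between the two totals.
        return self.inc - self.dec


def replay_pairs():
    """Same sequence of events, with differential pairs."""
    a, b = PairCounter(), PairCounter()
    a.connect()      # connect to peer A
    b.merge(a)       # peer A syncs to B
    a.disconnect()   # kill connection to A
    b.connect()      # connect to peer B
    b.merge(a)       # peer A syncs to B: B's own ++ survives
    a.merge(b)       # peer B syncs to A
    b.disconnect()   # kill connection to B
    a.merge(b)       # peer B syncs to A
    return a.value(), b.value()


def parallel_connects():
    """Remaining race: one connect on each peer before any sync."""
    a, b = PairCounter(), PairCounter()
    a.connect()
    b.connect()
    a.merge(b)
    b.merge(a)
    return a.value(), b.value()


if __name__ == "__main__":
    print("absolute values:   ", replay_absolute())
    print("differential pairs:", replay_pairs())
    print("parallel connects: ", parallel_connects())
```

In this model the absolute-value replay ends at (-1, -1) as in the quoted scenario, while the pair version returns to (0, 0). The last function shows the remaining weakness: two connections opened in parallel on different peers are counted as one after merging, which is the case where an external aggregator becomes the better option.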

