On 2016-03-29 15:13, Christian Ruppert wrote:
On 2016-03-29 10:58, Christian Ruppert wrote:
Hi Willy,
On 2016-03-25 18:17, Willy Tarreau wrote:
On Fri, Mar 25, 2016 at 01:53:50PM +0100, Willy Tarreau wrote:
I think it's even different (but could be wrong) since Christian spoke
about counters suddenly doubling. The issue you faced, Sylvain, which I
unfortunately still have no idea how to fix, is that the peers applet
is not always woken up when a connection is established on the other
side, so it may simply miss an event, resulting in everything remaining
stable and appearing frozen until the connection closes. Here it seems
data are exchanged but incorrect. This one could be easier to reproduce,
though. We'll check.
OK, I found it. Indeed it was easy to reproduce. The frequency counters
are sent as "now - freq.date", which is a positive age relative to the
current date. But on receipt, this age was *added* to the current date
instead of subtracted. So since the date was always in the future, they
were always expired if the activity changed sides in less than the
counter's measuring period (e.g. 10s).
I'm committing this simple fix, which you can apply to your tree for
now.
Cheers,
Willy
diff --git a/src/peers.c b/src/peers.c
index c29ea73..9918dac 100644
--- a/src/peers.c
+++ b/src/peers.c
@@ -1153,7 +1153,7 @@ switchstate:
case STD_T_FRQP: {
struct freq_ctr_period data;
- data.curr_tick = tick_add(now_ms, intdecode(&msg_cur, msg_end));
+ data.curr_tick = tick_add(now_ms, -intdecode(&msg_cur, msg_end));
if (!msg_cur) {
/* malformed message */
appctx->st0 = PEER_SESS_ST_ERRPROTO;
Thanks a lot for the fast investigation! The proposed patch seems to
do the trick :)
Hrm, or not. At least not completely.
There's still something wrong it seems:
20160329 15:07:03: 0x3bca858: key=xx.xx.xx.xx use=0 exp=28799601
gpc0=0 conn_cnt=682 conn_rate(10000)=1 conn_cur=3 sess_cnt=1
sess_rate(10000)=-1032058827 http_req_cnt=0 http_req_rate(10000)=2272
http_err_cnt=3 http_err_rate(10000)=1143800 bytes_in_cnt=0
bytes_out_cnt=247977
Note that sess_rate is a negative int. http_err_rate seems to be
affected as well, and even http_req_rate still seems to be wrong in
some cases.
20160329 15:11:38: 0x3e67318: key=xx.xx.xx.xx use=0 exp=28605259
gpc0=0 conn_cnt=86 conn_rate(10000)=0 conn_cur=7 sess_cnt=0
sess_rate(10000)=0 http_req_cnt=0 http_req_rate(10000)=349038424
http_err_cnt=6 http_err_rate(10000)=0 bytes_in_cnt=0
bytes_out_cnt=3261818950
We're using httpclose, so in this case it should *actually* match
conn_cnt, i.e. 86.
I haven't had enough time to dig into it yet, but it looks like I had
one case where now_ms itself was used as the value. If more is then
added on top of it, that would explain the integer overflow in
sess_rate.
--
Regards,
Christian Ruppert