[ Apologies for consuming yet more vertical space ]

With this log-format in the .cfg :

log-format {"date":"%t","lbtype":"haproxy","lbname":"%H","cip":"%ci","pid":"%pid","name_f":"%f","name_b":"%b","name_s":"%s","time_cr":"%Tq","time_dq":"%Tw","time_sc":"%Tc","time_sr":"%Tr","time_t":"%Tt","scode":"%ST","bytes_c":"%U","bytes_s":"%B","termstat":"%ts","con_act":"%ac","con_frnt":"%fc","con_back":"%bc","con_srv":"%sc","rtry":"%rc","queue_s":"%sq","queue_b":"%bq","rqst":"%r","hdrs":"%hr"}

these requests were logged with large %Tt values (one of them a request
for favicon.ico, which does get answered) :
=====
4/21/16
3:06:36.268 PM
{
    bytes_c:  578
    bytes_s:  2485558
    cip:  10.107.152.81
    con_act:  43
    con_back:  0
    con_frnt:  0
    con_srv:  0
    date:  21/Apr/2016:21:06:36.268
    hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
    lbname:  haproxy01
    lbtype:  haproxy
    name_b:  haproxy_stats
    name_f:  haproxy_stats
    name_s:  <STATS>
    pid:  20030
    queue_b:  0
    queue_s:  0
    rqst:  GET /favicon.ico HTTP/1.1
    rtry:  0
    scode:  200
    termstat:  LR
    time_cr:  5874
    time_dq:  0
    time_sc:  0
    time_sr:  0
    time_t:  992288
}
host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy

4/21/16
3:06:36.268 PM
{
    bytes_c:  577
    bytes_s:  3091670
    cip:  10.107.152.81
    con_act:  198
    con_back:  0
    con_frnt:  1
    con_srv:  0
    date:  21/Apr/2016:21:06:36.268
    hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
    lbname:  haproxy01
    lbtype:  haproxy
    name_b:  haproxy_stats
    name_f:  haproxy_stats
    name_s:  <STATS>
    pid:  20030
    queue_b:  0
    queue_s:  0
    rqst:  GET / HTTP/1.1
    rtry:  0
    scode:  200
    termstat:  LR
    time_cr:  107
    time_dq:  0
    time_sc:  0
    time_sr:  0
    time_t:  2493
}
host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy

4/21/16
3:05:06.722 PM
{
    bytes_c:  577
    bytes_s:  2448514
    cip:  10.107.152.81
    con_act:  1133
    con_back:  0
    con_frnt:  0
    con_srv:  0
    date:  21/Apr/2016:21:05:06.722
    hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
    lbname:  haproxy01
    lbtype:  haproxy
    name_b:  haproxy_stats
    name_f:  haproxy_stats
    name_s:  <STATS>
    pid:  20030
    queue_b:  0
    queue_s:  0
    rqst:  GET / HTTP/1.1
    rtry:  0
    scode:  200
    termstat:  LR
    time_cr:  126
    time_dq:  0
    time_sc:  0
    time_sr:  0
    time_t:  88490
}
host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy
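
For the favicon entry above, Td = Tt - (Tq + Tw + Tc + Tr) = 992288 -
(5874 + 0 + 0 + 0) = 986414 ms, i.e. ~16 minutes spent neither queued
nor waiting on the server.  To fish more of these out of the raw log,
something like this should work (untested sketch; assumes jq(1) is
available, and that the timer fields arrive as strings since the
log-format above quotes them) :

=====
# keep only the JSON payload of each line, then print any request
# where Td = Tt - (Tq + Tw + Tc + Tr) exceeds 60s
grep -o '{.*}' /logs/haproxy.log \
  | jq -c 'select(((.time_t|tonumber)
                   - ((.time_cr|tonumber) + (.time_dq|tonumber)
                      + (.time_sc|tonumber) + (.time_sr|tonumber))) > 60000)
           | {date, rqst, scode, time_t}'
=====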

On Thu, Apr 21, 2016 at 5:10 PM, Jim Freeman <sovr...@gmail.com> wrote:
> Another alert+followup :
>
> CPU pegged again - connected to the host and ran :
> ======
> # netstat -pantu | egrep "(^Proto|:50000)"
> Proto Recv-Q Send-Q Local Address           Foreign Address         State        PID/Program name
> tcp        0      0 0.0.0.0:50000           0.0.0.0:*               LISTEN       7944/haproxy
> tcp        0      0 10.33.176.98:50000      10.34.157.166:53155     TIME_WAIT    -
> tcp        0 191520 10.33.176.98:50000      10.107.152.81:59029     ESTABLISHED  20030/haproxy
> tcp        0      0 10.33.176.98:50000      10.34.155.182:43154     TIME_WAIT    -
> tcp        0      0 10.33.176.98:50000      10.34.157.165:37806     TIME_WAIT    -
>
> # the request with un-ACK'd Send-Q data looks suspicious - kill it
> # ./killcx 10.107.152.81:59029
> killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - http://killcx.sourceforge.net/
> [PARENT] checking connection with [10.107.152.81:59029]
> [PARENT] found connection with [10.33.176.98:50000] (ESTABLISHED)
> [PARENT] forking child
> [CHILD]  interface not defined, will use [eth0]
> [CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
> [PARENT] sending spoofed SYN to [10.33.176.98:50000] with bogus SeqNum
> [CHILD]  hooked ACK from [10.33.176.98:50000]
> [CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
> [CHILD]  sending spoofed RST to [10.33.176.98:50000] with SeqNum [2424084881]
> [CHILD]  sending RST to remote host as well with SeqNum [2973703732]
> [CHILD]  all done, sending USR1 signal to parent [8077] and exiting
> [PARENT] received child signal, checking results...
>          => success : connection has been closed !
> ======
>
> Right after that, CPU and latency look normal again.
>
> I'm unsure whether this is a leading or a lagging anomaly - it seems
> to follow another strangeness : ~5 minutes prior, the CPU usage across
> several haproxy hosts drops by 40 percentage points [graph attached].
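>
> If it recurs, I'll probably cron something like this first (rough,
> untested sketch; the :50000 port and 100000-byte threshold are just
> what matched this incident) to flag connections sitting on a large
> un-ACK'd Send-Q as killcx candidates :
>
> ======
> #!/bin/sh
> # print the foreign address of any ESTABLISHED connection to :50000
> # whose Send-Q exceeds 100000 bytes (netstat fields: $3=Send-Q,
> # $4=Local Address, $5=Foreign Address, $6=State)
> netstat -pant | awk '$4 ~ /:50000$/ && $6 == "ESTABLISHED" \
>     && $3+0 > 100000 { print $5 }'
> ======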
>
> On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman <sovr...@gmail.com> wrote:
>> Followup: alert triggered this AM - I'll provide what bits I was able
>> to glean.  [ HA-Proxy version 1.5.17 ]
>>
>> A proxy's CPU1 pegged at 10:21.  To isolate the connections onto a
>> non-listening nanny process, I did a '-sf' reload at 10:24.
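>>
>> (The reload is the stock soft-restart - the paths below are our local
>> layout, not gospel :
>>
>> haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
>>     -sf $(cat /var/run/haproxy.pid)
>>
>> The old process unbinds its listeners, keeps servicing its existing
>> connections, and exits when the last one closes.)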
>>
>> After the reload, latencies on the proxy of interest rose by an order
>> of magnitude (historically, when this condition lingers, request
>> timings across all proxies in the system often suffer substantially).
>>
>> At about 10:35 the pegged CPU resolved spontaneously (connections on
>> the nanny process were finishing - perhaps the connection triggering
>> the epoll_wait() busyloop terminated?), and timings returned to
>> normal.
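>>
>> (A rough way to watch that drain - not what I ran at the time - is to
>> count the old process's open fds, a loose proxy for its remaining
>> connections :
>>
>> watch -n5 'ls /proc/$(pgrep -o haproxy)/fd | wc -l'
>>
>> pgrep -o picks the oldest matching pid, i.e. the pre-reload nanny.)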
>>
>> Splunk graphs attached (if they're allowed through).
>> cpuBusy.png (y-axis => %cpuBusy)
>> latency.png (y-axis => Td = Tt - (Tq + Tw + Tc + Tr))
>>
>> If it's of any use, here's the Splunk search that triggers the alert :
>> index=os sourcetype=cpu host=haproxy0* | multikv | search CPU=1 | eval
>> cpuBusy=100-pctIdle | anomalousvalue pthresh=0.02 maxanofreq=0.2
>> minsupcount=50 action=annotate cpuBusy | search cpuBusy=100
>> Anomaly_Score_Num\(cpuBusy\)>0 | stats count dc(host) as hosts | where
>> count > hosts
>>
>> On Fri, Apr 15, 2016 at 3:20 PM, Jim Freeman <sovr...@gmail.com> wrote:
>>> I have haproxy pinned to the 2nd CPU (CPU1), with frequent config
>>> changes, each followed by a '-sf' soft-stop that leaves the now-old,
>>> non-listening process nannying old connections.
>>>
>>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>>> latencies suffer across multiple haproxy peers.
>>>
>>> An strace of the nanny haproxy process shows a tight loop of :
>>>
>>> epoll_wait(0, {}, 200, 0)               = 0
>>> epoll_wait(0, {}, 200, 0)               = 0
>>> epoll_wait(0, {}, 200, 0)               = 0
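>>>
>>> (Captured with something along the lines of - adjust pid discovery
>>> to taste :
>>>
>>> strace -tt -p $(pgrep -o haproxy)
>>>
>>> The zero timeout in the last argument is what makes this a busyloop -
>>> epoll_wait() returns immediately with no events, over and over.)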
>>>
>>> I've searched the archives and found old-ish complaints about
>>> similar circumstances, but those came with fixes/patches.
>>>
>>> This has happened with both 1.5.3 and 1.5.17.
>>>
>>> Insights ?
>>>
>>> ===========
>>>
>>> # cat  /proc/version
>>> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
>>> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
>>> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
>>>
>>> # haproxy -vv
>>> HA-Proxy version 1.5.17 2016/04/13
>>> Copyright 2000-2016 Willy Tarreau <wi...@haproxy.org>
>>>
>>> Build options :
>>>   TARGET  = linux2628
>>>   CPU     = generic
>>>   CC      = gcc
>>>   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
>>> -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>>>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
>>>
>>> Default settings :
>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>>>
>>> Encrypted password support via crypt(3): yes
>>> Built with zlib version : 1.2.7
>>> Compression algorithms supported : identity, deflate, gzip
>>> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>>> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>>> OpenSSL library supports TLS extensions : yes
>>> OpenSSL library supports SNI : yes
>>> OpenSSL library supports prefer-server-ciphers : yes
>>> Built with PCRE version : 8.30 2012-02-04
>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>> Built with transparent proxy support using: IP_TRANSPARENT
>>> IPV6_TRANSPARENT IP_FREEBIND
>>>
>>> Available polling systems :
>>>       epoll : pref=300,  test result OK
>>>        poll : pref=200,  test result OK
>>>      select : pref=150,  test result OK
>>> Total: 3 (3 usable), will use epoll.
