[ Apologies for consuming yet more vertical space ]

With this in the .cfg :

log-format {"date":"%t","lbtype":"haproxy","lbname":"%H","cip":"%ci","pid":"%pid","name_f":"%f","name_b":"%b","name_s":"%s","time_cr":"%Tq","time_dq":"%Tw","time_sc":"%Tc","time_sr":"%Tr","time_t":"%Tt","scode":"%ST","bytes_c":"%U","bytes_s":"%B","termstat":"%ts","con_act":"%ac","con_frnt":"%fc","con_back":"%bc","con_srv":"%sc","rtry":"%rc","queue_s":"%sq","queue_b":"%bq","rqst":"%r","hdrs":"%hr"}
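As a quick sanity check on that format, something like this picks out the high-%Tt requests from the JSON-ish log lines (a rough sketch - the "time_t" field name matches the log-format above, but the sample line and the 60-second threshold are just illustrative):

```shell
# Rough sketch: extract the "time_t" (%Tt, in ms) value from a JSON-formatted
# haproxy log line and flag anything over 60000 ms. The sample line is
# illustrative; in practice the input would be the haproxy log file.
sample='{"date":"21/Apr/2016:21:06:36.268","time_t":"992288","rqst":"GET /favicon.ico HTTP/1.1"}'

echo "$sample" | awk -F'"time_t":"' 'NF > 1 {
    split($2, a, "\"")          # a[1] is the %Tt value in milliseconds
    if (a[1] + 0 > 60000) print "slow:", a[1]
}'
# prints: slow: 992288
```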
These requests were logged with large %Tt values (one of them for favicon.ico, which does get answered?):
=====
4/21/16 3:06:36.268 PM
  { bytes_c: 578   bytes_s: 2485558   cip: 10.107.152.81   con_act: 43
    con_back: 0   con_frnt: 0   con_srv: 0   date: 21/Apr/2016:21:06:36.268
    hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
    lbname: haproxy01   lbtype: haproxy   name_b: haproxy_stats
    name_f: haproxy_stats   name_s: <STATS>   pid: 20030   queue_b: 0   queue_s: 0
    rqst: GET /favicon.ico HTTP/1.1   rtry: 0   scode: 200   termstat: LR
    time_cr: 5874   time_dq: 0   time_sc: 0   time_sr: 0   time_t: 992288 }
  host = haproxy01.a   source = /logs/haproxy.log   sourcetype = haproxy

4/21/16 3:06:36.268 PM
  { bytes_c: 577   bytes_s: 3091670   cip: 10.107.152.81   con_act: 198
    con_back: 0   con_frnt: 1   con_srv: 0   date: 21/Apr/2016:21:06:36.268
    hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
    lbname: haproxy01   lbtype: haproxy   name_b: haproxy_stats
    name_f: haproxy_stats   name_s: <STATS>   pid: 20030   queue_b: 0   queue_s: 0
    rqst: GET / HTTP/1.1   rtry: 0   scode: 200   termstat: LR
    time_cr: 107   time_dq: 0   time_sc: 0   time_sr: 0   time_t: 2493 }
  host = haproxy01.a   source = /logs/haproxy.log   sourcetype = haproxy

4/21/16 3:05:06.722 PM
  { bytes_c: 577   bytes_s: 2448514   cip: 10.107.152.81   con_act: 1133
    con_back: 0   con_frnt: 0   con_srv: 0   date: 21/Apr/2016:21:05:06.722
    hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
    lbname: haproxy01   lbtype: haproxy   name_b: haproxy_stats
    name_f: haproxy_stats   name_s: <STATS>   pid: 20030   queue_b: 0   queue_s: 0
    rqst: GET / HTTP/1.1   rtry: 0   scode: 200   termstat: LR
    time_cr: 126   time_dq: 0   time_sc: 0   time_sr: 0   time_t: 88490 }
  host = haproxy01.a   source = /logs/haproxy.log   sourcetype = haproxy

On Thu, Apr 21, 2016 at 5:10 PM, Jim Freeman <sovr...@gmail.com> wrote:
> Another alert + followup :
>
> CPU pegged again - connected to the host and ran :
> ======
> # netstat -pantu | egrep "(^Proto|:50000)"
> Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
> tcp        0      0 0.0.0.0:50000           0.0.0.0:*               LISTEN      7944/haproxy
> tcp        0      0 10.33.176.98:50000      10.34.157.166:53155     TIME_WAIT   -
> tcp        0 191520 10.33.176.98:50000      10.107.152.81:59029     ESTABLISHED 20030/haproxy
> tcp        0      0 10.33.176.98:50000      10.34.155.182:43154     TIME_WAIT   -
> tcp        0      0 10.33.176.98:50000      10.34.157.165:37806     TIME_WAIT   -
>
> # the connection with un-ACK'd Send-Q data looks suspicious - kill it
> # ./killcx 10.107.152.81:59029
> killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - http://killcx.sourceforge.net/
> [PARENT] checking connection with [10.107.152.81:59029]
> [PARENT] found connection with [10.33.176.98:50000] (ESTABLISHED)
> [PARENT] forking child
> [CHILD]  interface not defined, will use [eth0]
> [CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
> [PARENT] sending spoofed SYN to [10.33.176.98:50000] with bogus SeqNum
> [CHILD]  hooked ACK from [10.33.176.98:50000]
> [CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
> [CHILD]  sending spoofed RST to [10.33.176.98:50000] with SeqNum [2424084881]
> [CHILD]  sending RST to remote host as well with SeqNum [2973703732]
> [CHILD]  all done, sending USR1 signal to parent [8077] and exiting
> [PARENT] received child signal, checking results...
> => success : connection has been closed !
> ======
>
> Right after that, cpu/latency returned to normal.
>
> I'm unsure whether this is a leading or a lagging anomaly - it seems to
> follow another strangeness: ~5 minutes prior, the cpu usage across
> several haproxy hosts drops by 40 percentage points [graph attached].
>
> On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman <sovr...@gmail.com> wrote:
>> Followup: the alert triggered this AM - I'll provide what bits I was
>> able to glean. [ HA-Proxy version 1.5.17 ]
>>
>> A proxy's CPU1 pegged @10:21. To isolate the connections to a
>> non-listening nanny process, I did a '-sf' reload at 10:24.
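[ Side note: the manual netstat scan above could be scripted. A hedged sketch - the port and Send-Q threshold are illustrative, and the column positions assume Linux net-tools netstat output: ]

```shell
# Flag ESTABLISHED connections on :50000 whose un-ACK'd Send-Q (column 3)
# is suspiciously large - the same condition spotted by eye above.
# The 100000-byte threshold is illustrative.
netstat -pant 2>/dev/null | awk '
    $6 == "ESTABLISHED" && $4 ~ /:50000$/ && $3 + 0 > 100000 {
        print "suspect:", $3, $5, $7    # Send-Q, peer, PID/program
    }'
```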
>>
>> After the reload, latencies on the proxy of interest rose by an order
>> of magnitude (historically, when this condition lingers, request
>> timings across all proxies/systems often suffer substantially).
>>
>> At about 10:35 the pegged CPU resolved spontaneously (connections on
>> the nanny process were finishing - perhaps the connection triggering
>> the epoll_wait() busyloop terminated?), and timings returned to normal.
>>
>> Splunk graphs attached (if they're allowed through).
>> cpuBusy.png (y-axis => %cpuBusy)
>> latency.png (y-axis => Td = Tt - (Tq + Tw + Tc + Tr))
>>
>> If it's of any use, here's the splunk search that triggers the alert :
>> index=os sourcetype=cpu host=haproxy0* | multikv | search CPU=1
>> | eval cpuBusy=100-pctIdle
>> | anomalousvalue pthresh=0.02 maxanofreq=0.2 minsupcount=50 action=annotate cpuBusy
>> | search cpuBusy=100 Anomaly_Score_Num\(cpuBusy\)>0
>> | stats count dc(host) as hosts
>> | where count > hosts
>>
>> On Fri, Apr 15, 2016 at 3:20 PM, Jim Freeman <sovr...@gmail.com> wrote:
>>> I have haproxy pinned to the 2nd cpu (CPU1), with frequent config
>>> changes and a '-sf' soft-stop leaving the now-old, non-listening
>>> process nannying old connections.
>>>
>>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>>> latencies suffer across multiple haproxy peers.
>>>
>>> An strace of the nanny haproxy process shows a tight loop of :
>>>
>>> epoll_wait(0, {}, 200, 0) = 0
>>> epoll_wait(0, {}, 200, 0) = 0
>>> epoll_wait(0, {}, 200, 0) = 0
>>>
>>> I've searched the archives and found old-ish complaints about similar
>>> circumstances, with fixes/patches mentioned.
>>>
>>> This has happened with both 1.5.3 and 1.5.17.
>>>
>>> Insights ?
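[ Annotating inline: the Td metric graphed above can be sanity-checked against the logged fields. A worked example, with numbers taken from the long-%Tt favicon.ico event logged earlier: ]

```shell
# Td = Tt - (Tq + Tw + Tc + Tr), all in milliseconds, using the favicon.ico
# event above: time_cr(%Tq)=5874, time_dq(%Tw)=0, time_sc(%Tc)=0,
# time_sr(%Tr)=0, time_t(%Tt)=992288.
Tq=5874; Tw=0; Tc=0; Tr=0; Tt=992288
Td=$(( Tt - (Tq + Tw + Tc + Tr) ))
echo "Td=${Td}ms"    # prints: Td=986414ms (~986 s of data-transfer/idle time)
```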
>>>
>>> ===========
>>>
>>> # cat /proc/version
>>> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
>>> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
>>> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
>>>
>>> # haproxy -vv
>>> HA-Proxy version 1.5.17 2016/04/13
>>> Copyright 2000-2016 Willy Tarreau <wi...@haproxy.org>
>>>
>>> Build options :
>>>   TARGET  = linux2628
>>>   CPU     = generic
>>>   CC      = gcc
>>>   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
>>>             -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>>>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
>>>
>>> Default settings :
>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>>>
>>> Encrypted password support via crypt(3): yes
>>> Built with zlib version : 1.2.7
>>> Compression algorithms supported : identity, deflate, gzip
>>> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>>> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>>> OpenSSL library supports TLS extensions : yes
>>> OpenSSL library supports SNI : yes
>>> OpenSSL library supports prefer-server-ciphers : yes
>>> Built with PCRE version : 8.30 2012-02-04
>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>> Built with transparent proxy support using: IP_TRANSPARENT
>>> IPV6_TRANSPARENT IP_FREEBIND
>>>
>>> Available polling systems :
>>>       epoll : pref=300,  test result OK
>>>        poll : pref=200,  test result OK
>>>      select : pref=150,  test result OK
>>> Total: 3 (3 usable), will use epoll.
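[ One possible isolation step, offered as a hedged suggestion rather than something from the thread: since all three pollers test OK above, haproxy's global-section `noepoll` keyword could be used to force the fallback to poll(), which would help confirm whether the busyloop is epoll-specific: ]

```
global
    noepoll    # disable the epoll poller; fall back to poll() as a diagnostic
```

If the busyloop disappears under poll(), that narrows the bug to the epoll event handling; if it persists, the problem lies elsewhere in the soft-stop/nanny path.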