Another alert+followup : CPU pegged again - connected to the host and ran :

======
# netstat -pantu | egrep "(^Proto|:50000)"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:50000           0.0.0.0:*               LISTEN      7944/haproxy
tcp        0      0 10.33.176.98:50000      10.34.157.166:53155     TIME_WAIT   -
tcp        0 191520 10.33.176.98:50000      10.107.152.81:59029     ESTABLISHED 20030/haproxy
tcp        0      0 10.33.176.98:50000      10.34.155.182:43154     TIME_WAIT   -
tcp        0      0 10.33.176.98:50000      10.34.157.165:37806     TIME_WAIT   -
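For future triage, connections like that one can be flagged automatically. A minimal sketch - the column positions assume the `netstat -pantu` output format shown above (Send-Q is column 3):

```shell
# Flag ESTABLISHED connections with un-ACK'd data sitting in Send-Q
# (column 3 of `netstat -pantu` output).  Sketch only -- the column
# layout assumes the netstat output format shown above.
netstat -pantu 2>/dev/null |
  awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" && $3+0 > 0 {
         print $5, "Send-Q=" $3, $7
       }'
```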
# the connection with un-ACK'd Send-Q data looks suspicious - kill it
# ./killcx 10.107.152.81:59029
killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - http://killcx.sourceforge.net/

[PARENT] checking connection with [10.107.152.81:59029]
[PARENT] found connection with [10.33.176.98:50000] (ESTABLISHED)
[PARENT] forking child
[CHILD]  interface not defined, will use [eth0]
[CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
[PARENT] sending spoofed SYN to [10.33.176.98:50000] with bogus SeqNum
[CHILD]  hooked ACK from [10.33.176.98:50000]
[CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
[CHILD]  sending spoofed RST to [10.33.176.98:50000] with SeqNum [2424084881]
[CHILD]  sending RST to remote host as well with SeqNum [2973703732]
[CHILD]  all done, sending USR1 signal to parent [8077] and exiting
[PARENT] received child signal, checking results...
=> success : connection has been closed !
======

Right after that, CPU and latency returned to normal. I'm unsure whether this is a leading or lagging anomaly - it seems to follow another oddity: ~5 minutes prior, CPU usage across several haproxy hosts dropped by 40 percentage points [graph attached].

On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman <sovr...@gmail.com> wrote:
> Followup: alert triggered this AM - I'll provide what bits I was able
> to glean.  [ HA-Proxy version 1.5.17 ]
>
> A proxy's CPU1 pegged @10:21.  To isolate the connections to a
> non-listening nanny proc, I did a '-sf' reload at 10:24.
>
> After the reload, latencies on the proxy of interest rose by an order
> of magnitude (historically, when this condition lingers, request
> timings across all proxies/systems often suffer substantially).
>
> At about 10:35 the pegged CPU resolved spontaneously (connections on
> the nanny process were finishing - perhaps a connection triggering the
> epoll_wait() busyloop terminated?), and timings returned to normal.
>
> Splunk graphs attached (if they're allowed through).
> cpuBusy.png (y-axis => %cpuBusy)
> latency.png (y-axis => Td = Tt - (Tq + Tw + Tc + Tr))
>
> If it's of any use, here's the splunk search that triggers the alert :
> index=os sourcetype=cpu host=haproxy0* | multikv | search CPU=1 | eval
> cpuBusy=100-pctIdle | anomalousvalue pthresh=0.02 maxanofreq=0.2
> minsupcount=50 action=annotate cpuBusy | search cpuBusy=100
> Anomaly_Score_Num\(cpuBusy\)>0 | stats count dc(host) as hosts | where
> count > hosts
>
> On Fri, Apr 15, 2016 at 3:20 PM, Jim Freeman <sovr...@gmail.com> wrote:
>> I have haproxy pinned to the 2nd cpu (CPU1), with frequent config changes
>> and a '-sf' soft-stop leaving the now-old, non-listening process nannying
>> old connections.
>>
>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>> latencies suffer across multiple haproxy peers.
>>
>> An strace of the nanny haproxy process shows a tight loop of :
>>
>> epoll_wait(0, {}, 200, 0) = 0
>> epoll_wait(0, {}, 200, 0) = 0
>> epoll_wait(0, {}, 200, 0) = 0
>>
>> I've searched the archives and found old-ish reports of similar
>> circumstances, but those mention fixes/patches having been applied.
>>
>> This has happened with both 1.5.3 and 1.5.17.
>>
>> Insights ?
>> >> =========== >> >> # cat /proc/version >> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org) >> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian >> 3.16.7-ckt25-1~bpo70+1 (2016-04-02) >> >> # haproxy -vv >> HA-Proxy version 1.5.17 2016/04/13 >> Copyright 2000-2016 Willy Tarreau <wi...@haproxy.org> >> >> Build options : >> TARGET = linux2628 >> CPU = generic >> CC = gcc >> CFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 >> -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 >> OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1 >> >> Default settings : >> maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200 >> >> Encrypted password support via crypt(3): yes >> Built with zlib version : 1.2.7 >> Compression algorithms supported : identity, deflate, gzip >> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013 >> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013 >> OpenSSL library supports TLS extensions : yes >> OpenSSL library supports SNI : yes >> OpenSSL library supports prefer-server-ciphers : yes >> Built with PCRE version : 8.30 2012-02-04 >> PCRE library supports JIT : no (USE_PCRE_JIT not set) >> Built with transparent proxy support using: IP_TRANSPARENT >> IPV6_TRANSPARENT IP_FREEBIND >> >> Available polling systems : >> epoll : pref=300, test result OK >> poll : pref=200, test result OK >> select : pref=150, test result OK >> Total: 3 (3 usable), will use epoll.
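As an aside: on newer kernels, killcx's spoofed-RST dance isn't strictly needed - iproute2's `ss --kill` can destroy a matching socket from inside the kernel via the SOCK_DESTROY operation. A hedged sketch only; it requires kernel >= 4.9 built with CONFIG_INET_DIAG_DESTROY plus a recent iproute2, so it would not apply to the 3.16 kernel quoted above:

```shell
# In-kernel alternative to killcx's spoofed-RST approach.
# Requires kernel >= 4.9 with CONFIG_INET_DIAG_DESTROY (SOCK_DESTROY)
# and a recent iproute2 -- NOT available on the 3.16 kernel shown
# above; illustration only.
ss -K dst 10.107.152.81 dport = :59029
```

Unlike killcx, this doesn't depend on sniffing and injecting packets, so it also works where spoofed traffic would be filtered.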