Another alert+followup : CPU pegged again - connected to the host and ran :

======
# netstat -pantu | egrep "(^Proto|:50000)"
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:50000           0.0.0.0:*               LISTEN      7944/haproxy
tcp        0      0 10.33.176.98:50000      10.34.157.166:53155     TIME_WAIT   -
tcp        0 191520 10.33.176.98:50000      10.107.152.81:59029     ESTABLISHED 20030/haproxy
tcp        0      0 10.33.176.98:50000      10.34.155.182:43154     TIME_WAIT   -
tcp        0      0 10.33.176.98:50000      10.34.157.165:37806     TIME_WAIT   -
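For future triage, connections like that one can be flagged automatically. A minimal sketch - the column positions assume the `netstat -pantu` output format shown above (Send-Q is column 3):

```shell
# Flag ESTABLISHED connections with un-ACK'd data sitting in Send-Q
# (column 3 of `netstat -pantu` output).  Sketch only -- the column
# layout assumes the netstat output format shown above.
netstat -pantu 2>/dev/null |
  awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" && $3+0 > 0 {
         print $5, "Send-Q=" $3, $7
       }'
```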
# the connection with un-ACK'd Send-Q data looks suspicious - kill it
# ./killcx 10.107.152.81:59029
killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - http://killcx.sourceforge.net/

[PARENT] checking connection with [10.107.152.81:59029]
[PARENT] found connection with [10.33.176.98:50000] (ESTABLISHED)
[PARENT] forking child
[CHILD]  interface not defined, will use [eth0]
[CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
[PARENT] sending spoofed SYN to [10.33.176.98:50000] with bogus SeqNum
[CHILD]  hooked ACK from [10.33.176.98:50000]
[CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
[CHILD]  sending spoofed RST to [10.33.176.98:50000] with SeqNum [2424084881]
[CHILD]  sending RST to remote host as well with SeqNum [2973703732]
[CHILD]  all done, sending USR1 signal to parent [8077] and exiting
[PARENT] received child signal, checking results...
=> success : connection has been closed !
======

Right after that, CPU and latency returned to normal. I'm unsure whether this is a leading or lagging anomaly - it seems to follow another oddity: ~5 minutes prior, CPU usage across several haproxy hosts dropped by 40 percentage points [graph attached].

On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman <sovr...@gmail.com> wrote:
> Followup: alert triggered this AM - I'll provide what bits I was able
> to glean.  [ HA-Proxy version 1.5.17 ]
>
> A proxy's CPU1 pegged @10:21.  To isolate the connections to a
> non-listening nanny proc, I did a '-sf' reload at 10:24.
>
> After the reload, latencies on the proxy of interest rose by an order
> of magnitude (historically, when this condition lingers, request
> timings across all proxies/systems often suffer substantially).
>
> At about 10:35 the pegged CPU resolved spontaneously (connections on
> the nanny process were finishing - perhaps a connection triggering the
> epoll_wait() busyloop terminated?), and timings returned to normal.
>
> Splunk graphs attached (if they're allowed through).
> cpuBusy.png (y-axis => %cpuBusy)
> latency.png (y-axis => Td = Tt - (Tq + Tw + Tc + Tr))
>
> If it's of any use, here's the splunk search that triggers the alert :
> index=os sourcetype=cpu host=haproxy0* | multikv | search CPU=1 | eval
> cpuBusy=100-pctIdle | anomalousvalue pthresh=0.02 maxanofreq=0.2
> minsupcount=50 action=annotate cpuBusy | search cpuBusy=100
> Anomaly_Score_Num\(cpuBusy\)>0 | stats count dc(host) as hosts | where
> count > hosts
>
> On Fri, Apr 15, 2016 at 3:20 PM, Jim Freeman <sovr...@gmail.com> wrote:
>> I have haproxy pinned to the 2nd cpu (CPU1), with frequent config changes
>> and a '-sf' soft-stop leaving the now-old, non-listening process nannying
>> old connections.
>>
>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>> latencies suffer across multiple haproxy peers.
>>
>> An strace of the nanny haproxy process shows a tight loop of :
>>
>> epoll_wait(0, {}, 200, 0) = 0
>> epoll_wait(0, {}, 200, 0) = 0
>> epoll_wait(0, {}, 200, 0) = 0
>>
>> I've searched the archives and found old-ish reports of similar
>> circumstances, but those mention fixes/patches having been applied.
>>
>> This has happened with both 1.5.3 and 1.5.17.
>>
>> Insights ?
>> >> =========== >> >> # cat /proc/version >> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org) >> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian >> 3.16.7-ckt25-1~bpo70+1 (2016-04-02) >> >> # haproxy -vv >> HA-Proxy version 1.5.17 2016/04/13 >> Copyright 2000-2016 Willy Tarreau <wi...@haproxy.org> >> >> Build options : >> TARGET = linux2628 >> CPU = generic >> CC = gcc >> CFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 >> -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 >> OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1 >> >> Default settings : >> maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200 >> >> Encrypted password support via crypt(3): yes >> Built with zlib version : 1.2.7 >> Compression algorithms supported : identity, deflate, gzip >> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013 >> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013 >> OpenSSL library supports TLS extensions : yes >> OpenSSL library supports SNI : yes >> OpenSSL library supports prefer-server-ciphers : yes >> Built with PCRE version : 8.30 2012-02-04 >> PCRE library supports JIT : no (USE_PCRE_JIT not set) >> Built with transparent proxy support using: IP_TRANSPARENT >> IPV6_TRANSPARENT IP_FREEBIND >> >> Available polling systems : >> epoll : pref=300, test result OK >> poll : pref=200, test result OK >> select : pref=150, test result OK >> Total: 3 (3 usable), will use epoll.
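As an aside: on newer kernels, killcx's spoofed-RST dance isn't strictly needed - iproute2's `ss --kill` can destroy a matching socket from inside the kernel via the SOCK_DESTROY operation. A hedged sketch only; it requires kernel >= 4.9 built with CONFIG_INET_DIAG_DESTROY plus a recent iproute2, so it would not apply to the 3.16 kernel quoted above:

```shell
# In-kernel alternative to killcx's spoofed-RST approach.
# Requires kernel >= 4.9 with CONFIG_INET_DIAG_DESTROY (SOCK_DESTROY)
# and a recent iproute2 -- NOT available on the 3.16 kernel shown
# above; illustration only.
ss -K dst 10.107.152.81 dport = :59029
```

Unlike killcx, this doesn't depend on sniffing and injecting packets, so it also works where spoofed traffic would be filtered.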