Hi,

I updated from 1.5.17 to 1.5.18 today, but sadly this issue still exists in the
latest version in our environment. However, downgrading to 1.5.14 "fixed" the
issue for us.

Running CentOS6, Linux XXX 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed Jan 28 21:11:36 
UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Build options :
  TARGET  = linux26
  CPU     = generic
  CC      = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
  OPTIONS = USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built without zlib support (USE_ZLIB not set)
Compression algorithms supported : identity
Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.


Bye,
Sebastian
 
 
-----Original Message-----
> From: Jim Freeman <sovr...@gmail.com>
> Sent: Fri 22 April 2016 01:21
> To: HAProxy <haproxy@formilux.org>
> Subject: Re: 100% cpu , epoll_wait()
> 
> [ Apologies for consuming yet more vertical space ]
> 
> With this in .cfg :
> log-format 
> {"date":"%t","lbtype":"haproxy","lbname":"%H","cip":"%ci","pid":"%pid","name_f":"%f","name_b":"%b","name_s":"%s","time_cr":"%Tq","time_dq":"%Tw","time_sc":"%Tc","time_sr":"%Tr","time_t":"%Tt","scode":"%ST","bytes_c":"%U","bytes_s":"%B","termstat":"%ts","con_act":"%ac","con_frnt":"%fc","con_back":"%bc","con_srv":"%sc","rtry":"%rc","queue_s":"%sq","queue_b":"%bq","rqst":"%r","hdrs":"%hr"}
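
(A quick Python sketch, not part of the config: pull entries with a large %Tt out of the JSON-per-line log the above log-format emits. Field names come from that log-format; the 900-second threshold is arbitrary, chosen only for illustration.)

```python
import json

# Yield (request, total_time_ms) for records whose %Tt ("time_t")
# exceeds a threshold. 900s is an arbitrary illustration value.
SLOW_MS = 900000

def slow_requests(lines):
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # skip non-JSON noise in the log
        if int(rec.get("time_t", 0)) > SLOW_MS:
            yield rec["rqst"], int(rec["time_t"])

sample = ['{"rqst":"GET /favicon.ico HTTP/1.1","time_t":"992288"}',
          '{"rqst":"GET / HTTP/1.1","time_t":"2493"}']
```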
> 
> , these requests were logged with a large %Tt (one request for favicon.ico,
> which gets answered?):
> =====
> 4/21/16
> 3:06:36.268 PM
> { [-]
>     bytes_c:  578
>     bytes_s:  2485558
>     cip:  10.107.152.81
>     con_act:  43
>     con_back:  0
>     con_frnt:  0
>     con_srv:  0
>     date:  21/Apr/2016:21:06:36.268
>     hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
>     lbname:  haproxy01
>     lbtype:  haproxy
>     name_b:  haproxy_stats
>     name_f:  haproxy_stats
>     name_s:  <STATS>
>     pid:  20030
>     queue_b:  0
>     queue_s:  0
>     rqst:  GET /favicon.ico HTTP/1.1
>     rtry:  0
>     scode:  200
>     termstat:  LR
>     time_cr:  5874
>     time_dq:  0
>     time_sc:  0
>     time_sr:  0
>     time_t:  992288
> }
> host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy
> 
> 4/21/16
> 3:06:36.268 PM
> { [-]
>     bytes_c:  577
>     bytes_s:  3091670
>     cip:  10.107.152.81
>     con_act:  198
>     con_back:  0
>     con_frnt:  1
>     con_srv:  0
>     date:  21/Apr/2016:21:06:36.268
>     hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
>     lbname:  haproxy01
>     lbtype:  haproxy
>     name_b:  haproxy_stats
>     name_f:  haproxy_stats
>     name_s:  <STATS>
>     pid:  20030
>     queue_b:  0
>     queue_s:  0
>     rqst:  GET / HTTP/1.1
>     rtry:  0
>     scode:  200
>     termstat:  LR
>     time_cr:  107
>     time_dq:  0
>     time_sc:  0
>     time_sr:  0
>     time_t:  2493
> }
> host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy
> 
> 4/21/16
> 3:05:06.722 PM
> { [-]
>     bytes_c:  577
>     bytes_s:  2448514
>     cip:  10.107.152.81
>     con_act:  1133
>     con_back:  0
>     con_frnt:  0
>     con_srv:  0
>     date:  21/Apr/2016:21:05:06.722
>     hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
>     lbname:  haproxy01
>     lbtype:  haproxy
>     name_b:  haproxy_stats
>     name_f:  haproxy_stats
>     name_s:  <STATS>
>     pid:  20030
>     queue_b:  0
>     queue_s:  0
>     rqst:  GET / HTTP/1.1
>     rtry:  0
>     scode:  200
>     termstat:  LR
>     time_cr:  126
>     time_dq:  0
>     time_sc:  0
>     time_sr:  0
>     time_t:  88490
> }
> host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy
> 
> On Thu, Apr 21, 2016 at 5:10 PM, Jim Freeman <sovr...@gmail.com> wrote:
> > Another alert+followup :
> >
> > Cpu pegged again - connected to host and ran :
> > ======
> > # netstat -pantu | egrep "(^Proto|:50000)"
> > Proto Recv-Q Send-Q Local Address           Foreign Address
> > State       PID/Program name
> > tcp        0      0 0.0.0.0:50000           0.0.0.0:*
> > LISTEN      7944/haproxy
> > tcp        0      0 10.33.176.98:50000      10.34.157.166:53155
> > TIME_WAIT   -
> > tcp        0 191520 10.33.176.98:50000      10.107.152.81:59029
> > ESTABLISHED 20030/haproxy
> > tcp        0      0 10.33.176.98:50000      10.34.155.182:43154
> > TIME_WAIT   -
> > tcp        0      0 10.33.176.98:50000      10.34.157.165:37806
> > TIME_WAIT   -
> >
> > # the request with un-ACK'd Send-Q data looks suspicious - kill it
> > # ./killcx 10.107.152.81:59029
> > killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - 
> > http://killcx.sourceforge.net/
> > [PARENT] checking connection with [10.107.152.81:59029]
> > [PARENT] found connection with [10.33.176.98:50000] (ESTABLISHED)
> > [PARENT] forking child
> > [CHILD]  interface not defined, will use [eth0]
> > [CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
> > [PARENT] sending spoofed SYN to [10.33.176.98:50000] with bogus SeqNum
> > [CHILD]  hooked ACK from [10.33.176.98:50000]
> > [CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
> > [CHILD]  sending spoofed RST to [10.33.176.98:50000] with SeqNum 
> > [2424084881]
> > [CHILD]  sending RST to remote host as well with SeqNum [2973703732]
> > [CHILD]  all done, sending USR1 signal to parent [8077] and exiting
> > [PARENT] received child signal, checking results...
> >          => success : connection has been closed !
> > ======
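
(Aside: the suspicious-socket hunt above can be scripted. A rough sketch, assuming the usual one-line-per-socket netstat output rather than the wrapped form quoted here:)

```shell
# Print the peer address of every established TCP connection that still
# has un-ACK'd data in its Send-Q (3rd netstat column) - the pattern
# that was killed with killcx above.
stuck_peers() {
  awk '$1 ~ /^tcp/ && $3 > 0 && /ESTABLISHED/ { print $5, "Send-Q=" $3 }'
}

# usage: netstat -pantu | stuck_peers
```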
> >
> > Right after that, cpu/latency show normal.
> >
> > I'm unsure if this is a leading or lagging anomaly - it seems to
> > follow another strangeness: ~5 minutes prior, the cpu usage across
> > several haproxy hosts drops by 40 %-points [graph attached].
> >
> > On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman <sovr...@gmail.com> wrote:
> >> Followup: alert triggered this AM - I'll provide what bits I was able
> >> to glean.  [ HA-Proxy version 1.5.17 ]
> >>
> >> A proxy's CPU1 pegged @10:21.  To isolate the connections to a
> >> non-listening nanny proc, I did a '-sf' reload at 10:24.
> >>
> >> After the reload, latencies on the proxy of interest rose by an order
> >> of magnitude (historically, when this condition lingers, request
> >> timings across all proxies/system often suffer substantially).
> >>
> >> At about  10:35 the pegged CPU resolved spontaneously (connections on
> >> the nanny process were finishing - a connection triggering the
> >> epoll_wait() busyloop terminated?), and timings returned to normal.
> >>
> >> Splunk graphs attached (if they're allowed through).
> >> cpuBusy.png (y-axis => %cpuBusy)
> >> latency.png (y-axis => Td = Tt - (Tq + Tw + Tc + Tr))
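
(For reference, Td can be computed from the JSON fields of the log-format earlier in the thread - %Tq, %Tw, %Tc, %Tr, %Tt map to time_cr, time_dq, time_sc, time_sr, time_t there:)

```python
# Td = Tt - (Tq + Tw + Tc + Tr): total time minus the accounted-for
# haproxy timers, i.e. the unexplained remainder graphed above.
def td(rec):
    timers = ("time_cr", "time_dq", "time_sc", "time_sr")
    return int(rec["time_t"]) - sum(int(rec[t]) for t in timers)

rec = {"time_t": "992288", "time_cr": "5874", "time_dq": "0",
       "time_sc": "0", "time_sr": "0"}
print(td(rec))  # 986414
```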
> >>
> >> If it's of any use, here's the Splunk search that triggers the alert :
> >> index=os sourcetype=cpu host=haproxy0* | multikv | search CPU=1 | eval
> >> cpuBusy=100-pctIdle | anomalousvalue pthresh=0.02 maxanofreq=0.2
> >> minsupcount=50 action=annotate cpuBusy | search cpuBusy=100
> >> Anomaly_Score_Num\(cpuBusy\)>0 | stats count dc(host) as hosts | where
> >> count > hosts
> >>
> >> On Fri, Apr 15, 2016 at 3:20 PM, Jim Freeman <sovr...@gmail.com> wrote:
> >>> I have haproxy pinned to the 2nd cpu (CPU1), with frequent config changes
> >>> and a '-sf' soft-stop with the now-old non-listening process nannying
> >>> old connections.
> >>>
> >>> Sometimes CPU1 goes to 100%, and then a few minutes later request
> >>> latencies suffer across multiple haproxy peers.
> >>>
> >>> An strace of the nanny haproxy process shows a tight loop of :
> >>>
> >>> epoll_wait(0, {}, 200, 0)               = 0
> >>> epoll_wait(0, {}, 200, 0)               = 0
> >>> epoll_wait(0, {}, 200, 0)               = 0
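
(A quick way to quantify this from a capture - strace.out would come from something like `strace -e trace=epoll_wait -p <nanny-pid> -o strace.out`, the pid being hypothetical here; the function just counts zero-timeout calls that return 0 ready events immediately:)

```shell
# Count busyloop iterations: epoll_wait called with a 0ms timeout and
# returning 0 events, as in the tight loop shown above.
busy_calls() {
  grep -c 'epoll_wait(.*, 0) *= 0' "$1"
}

# usage: busy_calls strace.out
```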
> >>>
> >>> I've searched the archives and found old-ish complaints about
> >>> similar circumstances, but those mention fixes/patches.
> >>>
> >>> This has happened with both 1.5.3 and 1.5.17.
> >>>
> >>> Insights ?
> >>>
> >>> ===========
> >>>
> >>> # cat  /proc/version
> >>> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
> >>> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
> >>> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
> >>>
> >>> # haproxy -vv
> >>> HA-Proxy version 1.5.17 2016/04/13
> >>> Copyright 2000-2016 Willy Tarreau <wi...@haproxy.org>
> >>>
> >>> Build options :
> >>>   TARGET  = linux2628
> >>>   CPU     = generic
> >>>   CC      = gcc
> >>>   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
> >>> -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
> >>>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
> >>>
> >>> Default settings :
> >>>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
> >>>
> >>> Encrypted password support via crypt(3): yes
> >>> Built with zlib version : 1.2.7
> >>> Compression algorithms supported : identity, deflate, gzip
> >>> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
> >>> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
> >>> OpenSSL library supports TLS extensions : yes
> >>> OpenSSL library supports SNI : yes
> >>> OpenSSL library supports prefer-server-ciphers : yes
> >>> Built with PCRE version : 8.30 2012-02-04
> >>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
> >>> Built with transparent proxy support using: IP_TRANSPARENT
> >>> IPV6_TRANSPARENT IP_FREEBIND
> >>>
> >>> Available polling systems :
> >>>       epoll : pref=300,  test result OK
> >>>        poll : pref=200,  test result OK
> >>>      select : pref=150,  test result OK
> >>> Total: 3 (3 usable), will use epoll.
> 
> 
