Re: 100% cpu , epoll_wait()
Hi Holger,

On Fri, Jun 10, 2016 at 04:32:55PM +0200, Holger Just wrote:
> Hi Willy et al.,
>
> > Thank you for this report, it helps. How often does it happen, and/or after
> > how long on average after you start it ? What's your workload ? Do you use
> > SSL, compression, TCP and/or HTTP mode, peers synchronization, etc ?
>
> Yesterday, we upgraded from 1.5.14 to 1.5.18 and now observed exactly
> this issue in production. After rolling back to 1.5.14, it didn't occur
> anymore.
>
> We have mostly HTTP traffic and a little TCP, with about 100-200 req/s and
> about 2000 concurrent connections overall. Almost all traffic is SSL
> terminated. We use no peer synchronization and no compression.
>
> An strace on the process reveals this (with most of the calls being
> epoll_wait):
>
> [...]
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {{EPOLLIN, {u32=796, u64=796}}}, 200, 0) = 1
> read(796, "\357\275Y\231\275'b\5\216#\33\220\337'\370\312\215sG4\316\275\277y-%\v\v\211\331\342"..., 5872) = 1452
> read(796, 0x9fa26ec, 4420) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> [...]

Thank you for the report. I'll inspect the SSL part just in case I missed
something. Don't take risks in your production, of course.

Best regards,
Willy
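A note for readers following the trace: with TLS in the picture, the kernel's
notion of readability and OpenSSL's can diverge, because decrypted bytes may
sit buffered inside the SSL object where epoll cannot see them. Below is a
minimal generic sketch of a non-blocking TLS read loop, assuming an
epoll-managed socket; this is not HAProxy's actual read path, and drain_ssl()
is a made-up helper:

/* Minimal sketch, assuming a non-blocking socket managed by epoll; this
 * is NOT haproxy's read path, and drain_ssl() is a hypothetical helper.
 * OpenSSL may hold an already-decrypted record in its own buffer where
 * epoll cannot see it, so a reader must consult SSL_pending() and map
 * SSL_ERROR_WANT_READ back to "wait for the socket".  Getting this
 * bookkeeping wrong is one generic way a poller and a reader can
 * disagree on whether work remains. */
#include <openssl/ssl.h>

/* returns bytes read; 0 means "wait for the socket"; -1 means error */
static int drain_ssl(SSL *ssl, char *buf, int size)
{
    int total = 0;

    while (total < size) {
        int ret = SSL_read(ssl, buf + total, size - total);

        if (ret > 0) {
            total += ret;
            if (SSL_pending(ssl) == 0)
                break;          /* no decrypted bytes left inside OpenSSL */
            continue;           /* plaintext still buffered: keep reading */
        }
        if (SSL_get_error(ssl, ret) == SSL_ERROR_WANT_READ)
            break;              /* genuinely need more TCP data */
        return -1;              /* error or shutdown */
    }
    return total;
}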
Re: 100% cpu , epoll_wait()
Hi Willy et al.,

> Thank you for this report, it helps. How often does it happen, and/or after
> how long on average after you start it ? What's your workload ? Do you use
> SSL, compression, TCP and/or HTTP mode, peers synchronization, etc ?

Yesterday, we upgraded from 1.5.14 to 1.5.18 and now observed exactly this
issue in production. After rolling back to 1.5.14, it didn't occur anymore.

We have mostly HTTP traffic and a little TCP, with about 100-200 req/s and
about 2000 concurrent connections overall. Almost all traffic is SSL
terminated. We use no peer synchronization and no compression.

An strace on the process reveals this (with most of the calls being
epoll_wait):

[...]
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {{EPOLLIN, {u32=796, u64=796}}}, 200, 0) = 1
read(796, "\357\275Y\231\275'b\5\216#\33\220\337'\370\312\215sG4\316\275\277y-%\v\v\211\331\342"..., 5872) = 1452
read(796, 0x9fa26ec, 4420) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
[...]

The strace was done after reloading using -sf. However, the process was at
100% load even before the reload. Since we kept the process running after the
reload (it still holds some connections), I was able to run a second strace
about half an hour later, which now shows different behavior:

[...]
epoll_wait(0, {}, 200, 4) = 0
epoll_wait(0, {}, 200, 7) = 0
epoll_wait(0, {}, 200, 3) = 0
epoll_wait(0, {}, 200, 6) = 0
epoll_wait(0, {}, 200, 3) = 0
epoll_wait(0, {}, 200, 3) = 0
epoll_wait(0, {}, 200, 10) = 0
epoll_wait(0, {}, 200, 3) = 0
epoll_wait(0, {}, 200, 27) = 0
epoll_wait(0, {}, 200, 6) = 0
epoll_wait(0, {}, 200, 4) = 0
[...]

The CPU load taken by the process is now back to more or less idle, without
further intervention on the process.

`haproxy -vv` of the process running into the busy loop shows

HA-Proxy version 1.5.18 2016/05/10
Copyright 2000-2016 Willy Tarreau

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1t 3 May 2016
Running on OpenSSL version : OpenSSL 1.0.1t 3 May 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.35 2014-04-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Unfortunately, since we have rolled back production to 1.5.14, we now have
little possibility to reproduce this anymore. The process which shows the
behavior is still running for the time being, though.

Regards,
Holger
Re: 100% cpu , epoll_wait()
Hi Veiko,

On Mon, May 23, 2016 at 11:39:02AM +0300, Veiko Kukk wrote:
> I can confirm that on CentOS 6 with HAProxy 1.6.5 this 100% CPU load still
> happens. Exactly the same:
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, ^CProcess 6200 detached
>
> # haproxy -vv
> HA-Proxy version 1.6.5 2016/05/10
> Copyright 2000-2016 Willy Tarreau
(...)

Thank you for this report, it helps. How often does it happen, and/or after
how long on average after you start it ? What's your workload ? Do you use
SSL, compression, TCP and/or HTTP mode, peers synchronization, etc ?

Willy
Re: 100% cpu , epoll_wait()
On 18/05/16 15:42, Willy Tarreau wrote:
> Hi Sebastian,
>
> On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:
> > Hi Lukas,
> >
> > starting from around 200mbit/s in&out, haproxy processes (nbproc 6) are
> > hitting 100% cpu regularly (noticed up to 3 processes at the same time with
> > 100%), but recover again on their own after some time.
> >
> > stracing such a process yesterday showed the following:
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> >
> > Unfortunately I can't do any more debugging in this setup. HAProxy 1.5.14
> > is never anywhere near 10% cpu usage with way higher bandwidth.
>
> So far I've got good reports from people having experienced similar issues
> with recent versions, thus I'm thinking about something: are you certain
> that you did a make clean after upgrading and before rebuilding ? Sometimes
> we tend to forget it, especially after a simple "git pull". It is very
> possible that some old .o files were not properly rebuilt and still contain
> these bugs. If in doubt, you can simply keep a copy of your latest haproxy
> binary, make clean, build again and run cmp between them. It should not
> report any difference, otherwise it means there was an issue (which would
> be great news).

I can confirm that on CentOS 6 with HAProxy 1.6.5 this 100% CPU load still
happens. Exactly the same:

epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, ^CProcess 6200 detached

# haproxy -vv
HA-Proxy version 1.6.5 2016/05/10
Copyright 2000-2016 Willy Tarreau

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.3
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Veiko
Re: 100% cpu , epoll_wait()
On Wed, May 18, 2016 at 03:39:34PM +0200, Sebastian Heid wrote:
> Hi Willy,
>
> thanks for the feedback. I'm using OBS (Open Build Service) for creating
> CentOS/RHEL RPMs, so it should be as clean as it can get.

OK, thanks for checking.

> When I find some time, I'll try to reproduce it and will come back to you.

Thank you. For now I must confess I'm clueless about what could still cause it.

Cheers,
Willy
Re: 100% cpu , epoll_wait()
Hi Willy,

thanks for the feedback. I'm using OBS (Open Build Service) for creating
CentOS/RHEL RPMs, so it should be as clean as it can get.

When I find some time, I'll try to reproduce it and will come back to you.

Cheers,
Sebastian

-----Original Message-----
> From: Willy Tarreau
> Sent: Wed, 18 May 2016 14:44
> To: Sebastian Heid
> CC: Lukas Tribus ; HAProxy
> Subject: Re: 100% cpu , epoll_wait()
>
> Hi Sebastian,
>
> On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:
> > Hi Lukas,
> >
> > starting from around 200mbit/s in&out, haproxy processes (nbproc 6) are
> > hitting 100% cpu regularly (noticed up to 3 processes at the same time with
> > 100%), but recover again on their own after some time.
> >
> > stracing such a process yesterday showed the following:
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> >
> > Unfortunately I can't do any more debugging in this setup. HAProxy 1.5.14
> > is never anywhere near 10% cpu usage with way higher bandwidth.
>
> So far I've got good reports from people having experienced similar issues
> with recent versions, thus I'm thinking about something: are you certain
> that you did a make clean after upgrading and before rebuilding ? Sometimes
> we tend to forget it, especially after a simple "git pull". It is very
> possible that some old .o files were not properly rebuilt and still contain
> these bugs. If in doubt, you can simply keep a copy of your latest haproxy
> binary, make clean, build again and run cmp between them. It should not
> report any difference, otherwise it means there was an issue (which would
> be great news).
>
> Cheers,
> Willy
Re: 100% cpu , epoll_wait()
Hi Sebastian,

On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:
> Hi Lukas,
>
> starting from around 200mbit/s in&out, haproxy processes (nbproc 6) are
> hitting 100% cpu regularly (noticed up to 3 processes at the same time with
> 100%), but recover again on their own after some time.
>
> stracing such a process yesterday showed the following:
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
>
> Unfortunately I can't do any more debugging in this setup. HAProxy 1.5.14
> is never anywhere near 10% cpu usage with way higher bandwidth.

So far I've got good reports from people having experienced similar issues
with recent versions, thus I'm thinking about something: are you certain
that you did a make clean after upgrading and before rebuilding ? Sometimes
we tend to forget it, especially after a simple "git pull". It is very
possible that some old .o files were not properly rebuilt and still contain
these bugs. If in doubt, you can simply keep a copy of your latest haproxy
binary, make clean, build again and run cmp between them. It should not
report any difference, otherwise it means there was an issue (which would
be great news).

Cheers,
Willy
Re: 100% cpu , epoll_wait()
Hi Lukas,

On Fri, May 13, 2016 at 06:36:38PM +0200, Lukas Tribus wrote:
> Not sure if that's what you meant by the other issue, but if there are still
> buffer issues, it may also have caused the reported crash in zlib (since
> 1.6.4 but also affecting 1.6.5); that would be the thread "Crash with kernel
> error", where we are waiting for the OP to provide a stack trace.

You're absolutely right. I analysed this one (when the crash moved to glibc)
and found in another RHEL6 libc that the instruction pointer was suspiciously
close to memcpy(), and there are indeed memcpy() calls in the compression
path. So assuming we get a negative length somewhere, all of this could be
tied to a single bug having various effects depending on the version, config
and environment.

Now the question is: what could cause this ? We have Sasha's dump which will
help figure out in what state his sessions are, and see in the code 1) if
it's normal to end up in this state, and 2) if/why there is no way to leave
this state.

Willy
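To illustrate the failure mode described above, here is a hypothetical
sketch; the function and its parameters are made up for illustration, not
haproxy code. A negative length reaching memcpy() is converted to an
enormous size_t, so the copy runs far past the buffer and the crash lands
inside libc, right next to memcpy():

/* Hypothetical illustration of the failure mode, not haproxy code.
 * If 'avail' was miscomputed and went negative, 'len' can be negative
 * too; memcpy() takes a size_t, and (size_t)-1 is huge, so the copy
 * runs off the end of the buffer and segfaults inside libc. */
#include <string.h>

void copy_chunk(char *dst, const char *src, int avail, int needed)
{
    int len = needed < avail ? needed : avail;

    /* missing guard that would prevent the crash: if (len <= 0) return; */
    memcpy(dst, src, (size_t)len);
}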
Re: 100% cpu , epoll_wait()
Hi Willy,

On 13.05.2016 at 17:01, Willy Tarreau wrote:
> Hi Sebastian,
>
> On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:
> > Hi Lukas,
> >
> > starting from around 200mbit/s in&out, haproxy processes (nbproc 6) are
> > hitting 100% cpu regularly (noticed up to 3 processes at the same time with
> > 100%), but recover again on their own after some time.
> >
> > stracing such a process yesterday showed the following:
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> >
> > Unfortunately I can't do any more debugging in this setup. HAProxy 1.5.14
> > is never anywhere near 10% cpu usage with way higher bandwidth.
>
> So we still have another situation making haproxy think it can receive
> while it cannot. There's a report of another issue possibly related to this
> on 1.6, so let me try to understand it, then I'll check if it may cause
> this behaviour on 1.5 (assuming it's present there as well). Don't take
> risks for now.

Not sure if that's what you meant by the other issue, but if there are still
buffer issues, it may also have caused the reported crash in zlib (since
1.6.4 but also affecting 1.6.5); that would be the thread "Crash with kernel
error", where we are waiting for the OP to provide a stack trace.

cheers,
Lukas
Re: 100% cpu , epoll_wait()
Hi Sebastian,

On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:
> Hi Lukas,
>
> starting from around 200mbit/s in&out, haproxy processes (nbproc 6) are
> hitting 100% cpu regularly (noticed up to 3 processes at the same time with
> 100%), but recover again on their own after some time.
>
> stracing such a process yesterday showed the following:
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
>
> Unfortunately I can't do any more debugging in this setup. HAProxy 1.5.14
> is never anywhere near 10% cpu usage with way higher bandwidth.

So we still have another situation making haproxy think it can receive while
it cannot. There's a report of another issue possibly related to this on 1.6,
so let me try to understand it, then I'll check if it may cause this
behaviour on 1.5 (assuming it's present there as well). Don't take risks for
now.

Thanks,
Willy
Re: 100% cpu , epoll_wait()
Hi Lukas,

starting from around 200mbit/s in&out, haproxy processes (nbproc 6) are
hitting 100% cpu regularly (noticed up to 3 processes at the same time with
100%), but recover again on their own after some time.

stracing such a process yesterday showed the following:
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0

Unfortunately I can't do any more debugging in this setup. HAProxy 1.5.14 is
never anywhere near 10% cpu usage with way higher bandwidth.

Bye,
Sebastian

-----Original Message-----
> From: Lukas Tribus
> Sent: Wed, 11 May 2016 22:55
> To: Sebastian Heid ; HAProxy
> Subject: Re: 100% cpu , epoll_wait()
>
> Hi Sebastian,
>
> On 11.05.2016 at 16:07, Sebastian Heid wrote:
> > Hi,
> >
> > I updated from 1.5.17 to 1.5.18 today, but sadly this issue still exists
> > in the latest version in our environment. However, downgrading to 1.5.14
> > "fixed" the issue for us.
>
> Seems like a different issue then. Can you elaborate on what you are
> seeing? Sporadic 100% cpu load? Do you have to kill it or does it
> recover on its own? Can you strace it?
>
> Thanks,
> Lukas
Re: 100% cpu , epoll_wait()
Hi Sebastian,

On 11.05.2016 at 16:07, Sebastian Heid wrote:
> Hi,
>
> I updated from 1.5.17 to 1.5.18 today, but sadly this issue still exists in
> the latest version in our environment. However, downgrading to 1.5.14
> "fixed" the issue for us.

Seems like a different issue then. Can you elaborate on what you are seeing?
Sporadic 100% cpu load? Do you have to kill it or does it recover on its own?
Can you strace it?

Thanks,
Lukas
Re: 100% cpu , epoll_wait()
Hi,

I updated from 1.5.17 to 1.5.18 today, but sadly this issue still exists in
the latest version in our environment. However, downgrading to 1.5.14 "fixed"
the issue for us.

Running CentOS6, Linux XXX 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed Jan 28
21:11:36 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Build options :
  TARGET  = linux26
  CPU     = generic
  CC      = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
  OPTIONS = USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built without zlib support (USE_ZLIB not set)
Compression algorithms supported : identity
Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Bye,
Sebastian

-----Original Message-----
> From: Jim Freeman
> Sent: Fri, 22 April 2016 01:21
> To: HAProxy
> Subject: Re: 100% cpu , epoll_wait()
>
> [ Apologies for consuming yet more vertical space ]
>
> With this in .cfg :
>
> log-format {"date":"%t","lbtype":"haproxy","lbname":"%H","cip":"%ci","pid":"%pid","name_f":"%f","name_b":"%b","name_s":"%s","time_cr":"%Tq","time_dq":"%Tw","time_sc":"%Tc","time_sr":"%Tr","time_t":"%Tt","scode":"%ST","bytes_c":"%U","bytes_s":"%B","termstat":"%ts","con_act":"%ac","con_frnt":"%fc","con_back":"%bc","con_srv":"%sc","rtry":"%rc","queue_s":"%sq","queue_b":"%bq","rqst":"%r","hdrs":"%hr"}
>
> , these requests were logged with large %Tt (one request for favicon.ico,
> which gets answered?):
> =
> 4/21/16 3:06:36.268 PM
> {
>   bytes_c: 578
>   bytes_s: 2485558
>   cip: 10.107.152.81
>   con_act: 43
>   con_back: 0
>   con_frnt: 0
>   con_srv: 0
>   date: 21/Apr/2016:21:06:36.268
>   hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
>   lbname: haproxy01
>   lbtype: haproxy
>   name_b: haproxy_stats
>   name_f: haproxy_stats
>   name_s:
>   pid: 20030
>   queue_b: 0
>   queue_s: 0
>   rqst: GET /favicon.ico HTTP/1.1
>   rtry: 0
>   scode: 200
>   termstat: LR
>   time_cr: 5874
>   time_dq: 0
>   time_sc: 0
>   time_sr: 0
>   time_t: 992288
> }
> host = haproxy01.a  source = /logs/haproxy.log  sourcetype = haproxy
>
> 4/21/16 3:06:36.268 PM
> {
>   bytes_c: 577
>   bytes_s: 3091670
>   cip: 10.107.152.81
>   con_act: 198
>   con_back: 0
>   con_frnt: 1
>   con_srv: 0
>   date: 21/Apr/2016:21:06:36.268
>   hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
>   lbname: haproxy01
>   lbtype: haproxy
>   name_b: haproxy_stats
>   name_f: haproxy_stats
>   name_s:
>   pid: 20030
>   queue_b: 0
>   queue_s: 0
>   rqst: GET / HTTP/1.1
>   rtry: 0
>   scode: 200
>   termstat: LR
>   time_cr: 107
>   time_dq: 0
>   time_sc: 0
>   time_sr: 0
>   time_t: 2493
> }
> host = haproxy01.a  source = /logs/haproxy.log  sourcetype = haproxy
>
> 4/21/16 3:05:06.722 PM
> {
>   bytes_c: 577
>   bytes_s: 2448514
>   cip: 10.107.152.81
>   con_act: 1133
>   con_back: 0
>   con_frnt: 0
>   con_srv: 0
>   date: 21/Apr/2016:21:05:06.722
>   hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
>   lbname: haproxy01
>   lbtype: haproxy
>   name_b: haproxy_stats
>   name_f: haproxy_stats
>   name_s:
>   pid: 20030
>   queue_b: 0
>   queue_s: 0
>   rqst: GET / HTTP/1.1
>   rtry: 0
>   scode: 200
>   termstat: LR
>   time_cr: 126
>   time_dq: 0
>   time_sc: 0
>   time_sr: 0
>   time_t: 88490
> }
> host = haproxy01.a  source = /logs/haproxy.log  sourcetype = haproxy
> [...]
Re: 100% cpu , epoll_wait()
[ Apologies for consuming yet more vertical space ]

With this in .cfg :

log-format {"date":"%t","lbtype":"haproxy","lbname":"%H","cip":"%ci","pid":"%pid","name_f":"%f","name_b":"%b","name_s":"%s","time_cr":"%Tq","time_dq":"%Tw","time_sc":"%Tc","time_sr":"%Tr","time_t":"%Tt","scode":"%ST","bytes_c":"%U","bytes_s":"%B","termstat":"%ts","con_act":"%ac","con_frnt":"%fc","con_back":"%bc","con_srv":"%sc","rtry":"%rc","queue_s":"%sq","queue_b":"%bq","rqst":"%r","hdrs":"%hr"}

, these requests were logged with large %Tt (one request for favicon.ico,
which gets answered?):
=
4/21/16 3:06:36.268 PM
{
  bytes_c: 578
  bytes_s: 2485558
  cip: 10.107.152.81
  con_act: 43
  con_back: 0
  con_frnt: 0
  con_srv: 0
  date: 21/Apr/2016:21:06:36.268
  hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
  lbname: haproxy01
  lbtype: haproxy
  name_b: haproxy_stats
  name_f: haproxy_stats
  name_s:
  pid: 20030
  queue_b: 0
  queue_s: 0
  rqst: GET /favicon.ico HTTP/1.1
  rtry: 0
  scode: 200
  termstat: LR
  time_cr: 5874
  time_dq: 0
  time_sc: 0
  time_sr: 0
  time_t: 992288
}
host = haproxy01.a  source = /logs/haproxy.log  sourcetype = haproxy

4/21/16 3:06:36.268 PM
{
  bytes_c: 577
  bytes_s: 3091670
  cip: 10.107.152.81
  con_act: 198
  con_back: 0
  con_frnt: 1
  con_srv: 0
  date: 21/Apr/2016:21:06:36.268
  hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
  lbname: haproxy01
  lbtype: haproxy
  name_b: haproxy_stats
  name_f: haproxy_stats
  name_s:
  pid: 20030
  queue_b: 0
  queue_s: 0
  rqst: GET / HTTP/1.1
  rtry: 0
  scode: 200
  termstat: LR
  time_cr: 107
  time_dq: 0
  time_sc: 0
  time_sr: 0
  time_t: 2493
}
host = haproxy01.a  source = /logs/haproxy.log  sourcetype = haproxy

4/21/16 3:05:06.722 PM
{
  bytes_c: 577
  bytes_s: 2448514
  cip: 10.107.152.81
  con_act: 1133
  con_back: 0
  con_frnt: 0
  con_srv: 0
  date: 21/Apr/2016:21:05:06.722
  hdrs: {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
  lbname: haproxy01
  lbtype: haproxy
  name_b: haproxy_stats
  name_f: haproxy_stats
  name_s:
  pid: 20030
  queue_b: 0
  queue_s: 0
  rqst: GET / HTTP/1.1
  rtry: 0
  scode: 200
  termstat: LR
  time_cr: 126
  time_dq: 0
  time_sc: 0
  time_sr: 0
  time_t: 88490
}
host = haproxy01.a  source = /logs/haproxy.log  sourcetype = haproxy

On Thu, Apr 21, 2016 at 5:10 PM, Jim Freeman wrote:
> Another alert+followup :
>
> Cpu pegged again - connected to host and ran :
> ==
> # netstat -pantu | egrep "(^Proto|:5)"
> Proto Recv-Q Send-Q Local Address     Foreign Address       State        PID/Program name
> tcp        0      0 0.0.0.0:5         0.0.0.0:*             LISTEN       7944/haproxy
> tcp        0      0 10.33.176.98:5    10.34.157.166:53155   TIME_WAIT    -
> tcp        0 191520 10.33.176.98:5    10.107.152.81:59029   ESTABLISHED  20030/haproxy
> tcp        0      0 10.33.176.98:5    10.34.155.182:43154   TIME_WAIT    -
> tcp        0      0 10.33.176.98:5    10.34.157.165:37806   TIME_WAIT    -
>
> # the request with un-ACK'd Send-Q data looks suspicious - kill it
> # ./killcx 10.107.152.81:59029
> killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - http://killcx.sourceforge.net/
> [PARENT] checking connection with [10.107.152.81:59029]
> [PARENT] found connection with [10.33.176.98:5] (ESTABLISHED)
> [PARENT] forking child
> [CHILD]  interface not defined, will use [eth0]
> [CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
> [PARENT] sending spoofed SYN to [10.33.176.98:5] with bogus SeqNum
> [CHILD]  hooked ACK from [10.33.176.98:5]
> [CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
> [CHILD]  sending spoofed RST to [10.33.176.98:5] with SeqNum [2424084881]
> [CHILD]  sending RST to remote host as well with SeqNum [2973703732]
> [CHILD]  all done, sending USR1 signal to parent [8077] and exiting
> [PARENT] received child signal, checking results...
> => success : connection has been closed !
> ==
>
> Right after that, cpu/latency show normal.
>
> I'm unsure if this is a leading or lagging anomaly - it seems to
> follow another strangeness, where ~5 minutes prior, the cpu usage
> across several haproxy hosts drops by 40 %-points [graph attached]
>
> On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman wrote:
>> Followup: alert triggered this AM - I'll provide what bits I was able
>> to glean. [ HA-Proxy version 1.5.17 ]
>>
>> A proxy's CPU1 pegged @10:21. To isolate the connections to a
>> non-listening nanny proc, did a '-sf' reload at 10:24.
>>
>> After the reload, latencies on the proxy of interest rose by an order
>> of magnitude (historically, when this condition lingers, request
>> timings across all proxies/system often suffer substantially).
>>
>> At about 10:35 the pegged CPU resolved spontaneously (connections on
>> the nanny process were finishing - a connection triggering the
>> epoll_wait() busy loop terminated?), and timings returned to normal.
>>
>> Splunk graphs attached (if they're allowed through).
>> cpuBusy.png (y-axis => %cpuBusy)
>> latency.png (y-axis => Td = Tt - (Tq + Tw + Tc + Tr))
>>
>> If it's of any use, here's the splunk search that triggers the alert :
>> index=os sourcetype=cpu host=haproxy0* | multikv | search CPU=1 | eval
>> cpuBusy=100-pctIdle | anomalousvalue pthresh=0.02 maxanofreq=0.2
>> minsupcount=50 action=annotate cpuBusy | search cpuBusy=100
>> Anomaly_Score_Num\(cpuBusy\)>0 | stats count dc(host) as hosts | where
>> count > hosts
>>
>> On Fri, Apr 15, 2016 at 3:20 PM, Jim Freeman wrote:
>>> I have haproxy slaved to the 2nd CPU (CPU1), with frequent config changes
>>> and a '-sf' soft-stop with the now-old, non-listening process nannying
>>> old connections.
>>>
>>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>>> latencies suffer across multiple haproxy peers.
>>>
>>> An strace of the nanny haproxy process shows a tight loop of :
>>>
>>> epoll_wait(0, {}, 200, 0) = 0
>>> epoll_wait(0, {}, 200, 0) = 0
>>> epoll_wait(0, {}, 200, 0) = 0
>>>
>>> I've searched the archives and found similar but old-ish complaints
>>> about similar circumstances, but with fixes/patches mentioned.
>>>
>>> This has happened with both 1.5.3 and 1.5.17.
>>>
>>> Insights ?
>>>
>>> ===
>>>
>>> # cat /proc/version
>>> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
>>> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
>>> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
>>>
>>> # haproxy -vv
>>> HA-Proxy version 1.5.17 2016/04/13
>>> Copyright 2000-2016 Willy Tarreau
>>>
>>> Build options :
>>>   TARGET  = linux2628
>>>   CPU     = generic
>>>   CC      = gcc
>>>   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
>>>             -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>>>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
>>>
>>> Default settings :
>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>>>
>>> Encrypted password support via crypt(3): yes
>>> Built with zlib version : 1.2.7
>>> Compression algorithms supported : identity, deflate, gzip
>>> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>>> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>>> OpenSSL library supports TLS extensions : yes
>>> OpenSSL library supports SNI : yes
>>> OpenSSL library supports prefer-server-ciphers : yes
>>> Built with PCRE version : 8.30 2012-02-04
>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>> Built with transparent proxy support using: IP_TRANSPARENT
>>> IPV6_TRANSPARENT IP_FREEBIND
>>> [...]
Re: 100% cpu , epoll_wait()
Another alert+followup :

Cpu pegged again - connected to host and ran :
==
# netstat -pantu | egrep "(^Proto|:5)"
Proto Recv-Q Send-Q Local Address     Foreign Address       State        PID/Program name
tcp        0      0 0.0.0.0:5         0.0.0.0:*             LISTEN       7944/haproxy
tcp        0      0 10.33.176.98:5    10.34.157.166:53155   TIME_WAIT    -
tcp        0 191520 10.33.176.98:5    10.107.152.81:59029   ESTABLISHED  20030/haproxy
tcp        0      0 10.33.176.98:5    10.34.155.182:43154   TIME_WAIT    -
tcp        0      0 10.33.176.98:5    10.34.157.165:37806   TIME_WAIT    -

# the request with un-ACK'd Send-Q data looks suspicious - kill it
# ./killcx 10.107.152.81:59029
killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - http://killcx.sourceforge.net/
[PARENT] checking connection with [10.107.152.81:59029]
[PARENT] found connection with [10.33.176.98:5] (ESTABLISHED)
[PARENT] forking child
[CHILD]  interface not defined, will use [eth0]
[CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
[PARENT] sending spoofed SYN to [10.33.176.98:5] with bogus SeqNum
[CHILD]  hooked ACK from [10.33.176.98:5]
[CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
[CHILD]  sending spoofed RST to [10.33.176.98:5] with SeqNum [2424084881]
[CHILD]  sending RST to remote host as well with SeqNum [2973703732]
[CHILD]  all done, sending USR1 signal to parent [8077] and exiting
[PARENT] received child signal, checking results...
=> success : connection has been closed !
==

Right after that, cpu/latency show normal.

I'm unsure if this is a leading or lagging anomaly - it seems to follow
another strangeness, where ~5 minutes prior, the cpu usage across several
haproxy hosts drops by 40 %-points [graph attached]

On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman wrote:
> Followup: alert triggered this AM - I'll provide what bits I was able
> to glean. [ HA-Proxy version 1.5.17 ]
>
> A proxy's CPU1 pegged @10:21. To isolate the connections to a
> non-listening nanny proc, did a '-sf' reload at 10:24.
>
> After the reload, latencies on the proxy of interest rose by an order
> of magnitude (historically, when this condition lingers, request
> timings across all proxies/system often suffer substantially).
>
> At about 10:35 the pegged CPU resolved spontaneously (connections on
> the nanny process were finishing - a connection triggering the
> epoll_wait() busy loop terminated?), and timings returned to normal.
>
> Splunk graphs attached (if they're allowed through).
> cpuBusy.png (y-axis => %cpuBusy)
> latency.png (y-axis => Td = Tt - (Tq + Tw + Tc + Tr))
>
> If it's of any use, here's the splunk search that triggers the alert :
> index=os sourcetype=cpu host=haproxy0* | multikv | search CPU=1 | eval
> cpuBusy=100-pctIdle | anomalousvalue pthresh=0.02 maxanofreq=0.2
> minsupcount=50 action=annotate cpuBusy | search cpuBusy=100
> Anomaly_Score_Num\(cpuBusy\)>0 | stats count dc(host) as hosts | where
> count > hosts
>
> On Fri, Apr 15, 2016 at 3:20 PM, Jim Freeman wrote:
>> I have haproxy slaved to the 2nd CPU (CPU1), with frequent config changes
>> and a '-sf' soft-stop with the now-old, non-listening process nannying
>> old connections.
>>
>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>> latencies suffer across multiple haproxy peers.
>>
>> An strace of the nanny haproxy process shows a tight loop of :
>>
>> epoll_wait(0, {}, 200, 0) = 0
>> epoll_wait(0, {}, 200, 0) = 0
>> epoll_wait(0, {}, 200, 0) = 0
>>
>> I've searched the archives and found similar but old-ish complaints
>> about similar circumstances, but with fixes/patches mentioned.
>>
>> This has happened with both 1.5.3 and 1.5.17.
>>
>> Insights ?
>>
>> ===
>>
>> # cat /proc/version
>> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
>> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
>> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
>>
>> # haproxy -vv
>> HA-Proxy version 1.5.17 2016/04/13
>> Copyright 2000-2016 Willy Tarreau
>>
>> Build options :
>>   TARGET  = linux2628
>>   CPU     = generic
>>   CC      = gcc
>>   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
>>             -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
>>
>> Default settings :
>>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>>
>> Encrypted password support via crypt(3): yes
>> Built with zlib version : 1.2.7
>> Compression algorithms supported : identity, deflate, gzip
>> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>> OpenSSL library supports TLS extensions : yes
>> OpenSSL library supports SNI : yes
>> OpenSSL library supports prefer-server-ciphers : yes
>> Built with PCRE version : 8.30 2012-02-04
>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>> Built with transparent proxy support using: IP_TRANSPARENT
>> IPV6_TRANSPARENT IP_FREEBIND
>> [...]
Re: 100% cpu , epoll_wait()
On Wed, Apr 20, 2016 at 11:50:20AM +0300, Veiko Kukk wrote:
> > is not good unfortunately. I'm assuming this is with 1.5.17, that's it ? If
> > so we still have an issue :-/
>
> It is 1.6.3
>
> # haproxy -vv
> HA-Proxy version 1.6.3 2015/12/25
> Copyright 2000-2015 Willy Tarreau

OK so this one definitely still has the problem. Don't rush to 1.6.4, as it
contains an improper fix; the latest snapshot is better. I'm working on a
remaining corner case, but it's unlikely you'll hit it anyway.

Willy
Re: 100% cpu , epoll_wait()
On 20/04/16 11:43, Willy Tarreau wrote:
> On Tue, Apr 19, 2016 at 09:53:36PM +0300, Veiko Kukk wrote:
> > On 19/04/16 18:52, Willy Tarreau wrote:
> > > On Tue, Apr 19, 2016 at 04:15:08PM +0200, Willy Tarreau wrote:
> > > OK in fact it's different. Above we have a busy polling loop, which may
> > > very well be caused by the buffer space miscalculation bug and which
> > > results in a process not completing its job until a timeout strikes.
> > > The link to the other report shows normal polling with blocked signals.
> >
> > The process that was created yesterday via soft reload went to 100% cpu
> > today.
> >
> > haproxy 29388 5.0 0.0 58772 11700 ? Rs Apr17 156:44 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 1997
> >
> > Section from strace output:
> > (...)
>
> this below :
>
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
>
> is not good unfortunately. I'm assuming this is with 1.5.17, that's it ? If
> so we still have an issue :-/

It is 1.6.3

# haproxy -vv
HA-Proxy version 1.6.3 2015/12/25
Copyright 2000-2015 Willy Tarreau

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.3
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Veiko
Re: 100% cpu , epoll_wait()
On Tue, Apr 19, 2016 at 09:53:36PM +0300, Veiko Kukk wrote:
> On 19/04/16 18:52, Willy Tarreau wrote:
> > On Tue, Apr 19, 2016 at 04:15:08PM +0200, Willy Tarreau wrote:
> > OK in fact it's different. Above we have a busy polling loop, which may
> > very well be caused by the buffer space miscalculation bug and which
> > results in a process not completing its job until a timeout strikes.
> > The link to the other report shows normal polling with blocked signals.
>
> The process that was created yesterday via soft reload went to 100% cpu
> today.
>
> haproxy 29388 5.0 0.0 58772 11700 ? Rs Apr17 156:44 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 1997
>
> Section from strace output:
> (...)

this below :

> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0

is not good unfortunately. I'm assuming this is with 1.5.17, that's it ? If
so we still have an issue :-/

Thanks!
Willy
Re: 100% cpu , epoll_wait()
On 19/04/16 18:52, Willy Tarreau wrote:
> On Tue, Apr 19, 2016 at 04:15:08PM +0200, Willy Tarreau wrote:
> OK in fact it's different. Above we have a busy polling loop, which may
> very well be caused by the buffer space miscalculation bug and which
> results in a process not completing its job until a timeout strikes.
> The link to the other report shows normal polling with blocked signals.

The process that was created yesterday via soft reload went to 100% cpu
today.

haproxy 29388 5.0 0.0 58772 11700 ? Rs Apr17 156:44 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 1997

Section from strace output:

epoll_wait(0, {}, 200, 0) = 0
recvfrom(32, "\366\334\247\270<\230\3028\v\334\236K\204^p\31\6\3T\230:\23s\257\337\316\242\302]\2\246\227"..., 15368, 0, NULL, NULL) = 15368
recvfrom(32, "\366\334si\251\272Y\372\360'/\363\212\246\262w\307[\251\375\314\236whe\302\337\257\25NQ\370"..., 1024, 0, NULL, NULL) = 1024
sendto(18, "\366\334\247\270<\230\3028\v\334\236K\204^p\31\6\3T\230:\23s\257\337\316\242\302]\2\246\227"..., 16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 8016
sendto(18, "\355\265\207\360\357\3046k\364\320\330\30d\247\354\273BE\201\337\4\265#\357Z\231\231\337\365*\242\345"..., 8376, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(0, EPOLL_CTL_MOD, 18, {EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=18, u64=18}}) = 0
epoll_wait(0, {}, 200, 0) = 0
recvfrom(32, "@OR\224\335\233\263\347U\245X\376)\240\342\334\242\31\321\322\354\222\276\233\247\316-\263\370)\252U"..., 8016, 0, NULL, NULL) = 8016
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {{EPOLLOUT, {u32=53, u64=53}}}, 200, 0) = 1
sendto(53, "\274'[\24\n\264*b\306\253YA\313A\36\202a\177\317\370K:\302\230\315.\315\215\f&\351\27"..., 14032, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 14032
sendto(53, "\234CS\236wYsf\267\24\276v\325\302\267+a\303\336\250\211x\236\33\23MR_\324\214A\264"..., 2360, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 2360
recvfrom(55, "\231\16\35\337\20\203V\344\360\202n\307\2120\213\r\353\312\334\357\205\366=\\\373|\210\4-\354\32\360"..., 15368, 0, NULL, NULL) = 15368
recvfrom(55, "i\244\305N\242I\177n'4g\211\256%\26X\34il\3374\34HN\22\365\357\211Y\354\306K"..., 1024, 0, NULL, NULL) = 1024
sendto(53, "\231\16\35\337\20\203V\344\360\202n\307\2120\213\r\353\312\334\357\205\366=\\\373|\210\4-\354\32\360"..., 16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 16392
epoll_ctl(0, EPOLL_CTL_MOD, 53, {EPOLLIN|EPOLLRDHUP, {u32=53, u64=53}}) = 0
epoll_wait(0, {}, 200, 0) = 0
recvfrom(55, "\365f\303r(\1\365S\276\246c\334\216\346\226\10<}\340\227h\374\370\360\276sSs\346\351\337\370"..., 15368, 0, NULL, NULL) = 15368
recvfrom(55, "-\r\21\326\326\0\0>\346-?\375\325J\346N\336\353Jz\376\303\373?\226y}\317\257\371\304t"..., 1024, 0, NULL, NULL) = 1024
sendto(53, "\365f\303r(\1\365S\276\246c\334\216\346\226\10<}\340\227h\374\370\360\276sSs\346\351\337\370"..., 16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 16392
epoll_wait(0, {}, 200, 0) = 0
recvfrom(55, "\251\3\0200\317\217ab\223\f\306\322/}J\231\4\3b\311h\220sq\220[\225\21\372\264Dv"..., 15368, 0, NULL, NULL) = 15368
recvfrom(55, "\233.\20B\337\343\274\311\212\211\241\244\5\257\221w1{\253Kjh\23?w\357\365\377\335\261\3\215"..., 1024, 0, NULL, NULL) = 1024
sendto(53, "\251\3\0200\317\217ab\223\f\306\322/}J\231\4\3b\311h\220sq\220[\225\21\372\264Dv"..., 16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 16392
epoll_wait(0, {}, 200, 0) = 0
recvfrom(55, ",T\27\22\300\31\231t\207%j-\263}\344\25#\333\235\214*M\227\26\0215*_\312/@\351"..., 15368, 0, NULL, NULL) = 15368
recvfrom(55, "\225\256\37Qib\371\377\220l\342\20\2742\271\3360U\224\0375?ju\10\207\235J\267\35\340\367"..., 1024, 0, NULL, NULL) = 1024
sendto(53, ",T\27\22\300\31\231t\207%j-\263}\344\25#\333\235\214*M\227\26\0215*_\312/@\351"..., 16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 13312
sendto(53, "\372\265\334\263\232\2016l2\216\372\261B\26\243\252\204\220\353\f\367\215\331\232\203hI,\260\37\207\357"..., 3080, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(0, EPOLL_CTL_MOD, 53, {EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=53, u64=53}}) = 0
epoll_wait(0, {}, 200, 0) = 0
recvfrom(55, "k\33\342U\260:Z\350\3725>\211R@\20\347\326\363\203\36?\226\304\241\367\263B\242\230\6^\221"..., 13312, 0, NULL, NULL) = 13312
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
[...]
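The trace above also shows the normal non-blocking write pattern: send until
EAGAIN, then EPOLL_CTL_MOD to subscribe to EPOLLOUT. A generic sketch of that
pattern follows, to help read the strace; the epfd/fd bookkeeping here is
illustrative, not HAProxy's internals:

/* Generic sketch of the write pattern visible in the trace above, not
 * haproxy internals.  A correct loop stops treating the fd as writable
 * after EAGAIN until EPOLLOUT fires; a loop that keeps believing work
 * is pending instead ends up calling epoll_wait() with a zero timeout
 * in a tight loop, which is exactly the 100% CPU signature. */
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>

static int flush_buffer(int epfd, int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t ret = send(fd, buf + sent, len - sent,
                           MSG_DONTWAIT | MSG_NOSIGNAL | MSG_MORE);
        if (ret < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                /* socket buffer full: subscribe to EPOLLOUT and stop;
                 * do not retry until the poller reports writability */
                struct epoll_event ev;
                ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP;
                ev.data.u32 = (uint32_t)fd;
                epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
                break;
            }
            return -1;          /* real error */
        }
        sent += (size_t)ret;
    }
    return (int)sent;
}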
Re: 100% cpu , epoll_wait()
On Tue, Apr 19, 2016 at 04:15:08PM +0200, Willy Tarreau wrote:
> On Tue, Apr 19, 2016 at 02:54:35PM +0200, Lukas Tribus wrote:
> > > We use haproxy 1.6.3 (latest CentOS 6.7) and experience a similar
> > > situation after some reloads (-sf). The old haproxy process does not
> > > exit and uses 100% cpu, strace showing:
> > > epoll_wait(0, {}, 200, 0) = 0
> > > epoll_wait(0, {}, 200, 0) = 0
> > > epoll_wait(0, {}, 200, 0) = 0
> > > epoll_wait(0, {}, 200, 0) = 0
> > > epoll_wait(0, {}, 200, 0) = 0
> > > epoll_wait(0, {}, 200, 0) = 0
> > >
> > > In our case, it was a tcp backend tunnelling rsyslog messages. After
> > > restarting the local rsyslogd, the load was gone and the old haproxy
> > > instance exited. It's hard to tell how many reloads it takes to make
> > > haproxy go crazy, or what the exact reproducible test is. But it does
> > > not take hundreds of restarts, rather 10-20 (our reloads are not very
> > > frequent), to make haproxy go crazy.
> >
> > Also matches this report from December:
> > https://www.mail-archive.com/haproxy@formilux.org/msg20772.html
>
> Yep very likely. The combination of the two reports is very intriguing.
> The first one shows the signals being blocked, while the only place where
> we block them is in __signal_process_queue(), only while calling the
> handlers or performing the wakeup() calls, both of which should be
> instantaneous, and more importantly the function cannot return without
> unblocking the signals.
>
> I still have no idea what is going on, the code looks simple and clear,
> and certainly not compatible with such behaviours. I'm still digging.

OK in fact it's different. Above we have a busy polling loop, which may very
well be caused by the buffer space miscalculation bug and which results in a
process not completing its job until a timeout strikes. The link to the other
report shows normal polling with blocked signals.

Willy
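To make "busy polling loop" concrete: the strace signature throughout this
thread (epoll_wait called with a 0 ms timeout and returning 0 events, forever)
is what an event loop produces when its scheduler believes a task is runnable
on every pass while the task itself can make no progress. Here is a minimal
sketch of that loop shape, with hypothetical helpers standing in for the
scheduler internals; this is not HAProxy source:

/* Minimal sketch of the loop shape behind the strace output above;
 * the helper functions are hypothetical placeholders, not haproxy
 * source.  The poller timeout is 0 ms whenever a task is flagged
 * runnable; if a task keeps flagging itself runnable but can never
 * make progress (e.g. because free buffer space was miscomputed),
 * epoll_wait(..., 0) returns 0 events in a tight loop and the process
 * pins a CPU until a timeout finally kills the task. */
#include <sys/epoll.h>

#define MAX_EVENTS 200

/* placeholder hooks to be provided by the application */
int  tasks_runnable(void);      /* is any task flagged runnable? */
int  next_timer_ms(void);       /* delay until the next timer expires */
void process_events(struct epoll_event *ev, int n);
void run_tasks(void);           /* must clear the runnable flags */

void event_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];

    for (;;) {
        /* poll without sleeping if work is pending, else wait */
        int timeout = tasks_runnable() ? 0 : next_timer_ms();
        int n = epoll_wait(epfd, events, MAX_EVENTS, timeout);

        process_events(events, n);
        run_tasks();    /* a task that never makes progress => busy loop */
    }
}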
Re: 100% cpu , epoll_wait()
Hi guys,

On Tue, Apr 19, 2016 at 02:54:35PM +0200, Lukas Tribus wrote:
> > We use haproxy 1.6.3 (latest CentOS 6.7) and experience a similar
> > situation after some reloads (-sf). The old haproxy process does not
> > exit and uses 100% cpu, strace showing:
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> > epoll_wait(0, {}, 200, 0) = 0
> >
> > In our case, it was a tcp backend tunnelling rsyslog messages. After
> > restarting the local rsyslogd, the load was gone and the old haproxy
> > instance exited. It's hard to tell how many reloads it takes to make
> > haproxy go crazy, or what the exact reproducible test is. But it does
> > not take hundreds of restarts, rather 10-20 (our reloads are not very
> > frequent), to make haproxy go crazy.
>
> Also matches this report from December:
> https://www.mail-archive.com/haproxy@formilux.org/msg20772.html

Yep very likely. The combination of the two reports is very intriguing.
The first one shows the signals being blocked, while the only place where we
block them is in __signal_process_queue(), only while calling the handlers or
performing the wakeup() calls, both of which should be instantaneous, and
more importantly the function cannot return without unblocking the signals.

I still have no idea what is going on, the code looks simple and clear, and
certainly not compatible with such behaviours. I'm still digging.

Willy
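For reference, the block/unblock pattern being described is roughly as
follows; this is a simplified sketch under the description above, not the
actual __signal_process_queue() source, and run_queued_handlers() is a
placeholder:

/* Simplified sketch of the pattern described above, not the actual
 * haproxy __signal_process_queue() source.  All signals are blocked
 * only while the queued handlers run, and the previous mask is always
 * restored before returning, so this function by itself should never
 * leave the process with signals permanently blocked. */
#include <signal.h>

static void run_queued_handlers(void)
{
    /* placeholder for the queued handler / wakeup() calls */
}

void process_signal_queue(void)
{
    sigset_t all, old;

    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &old);   /* block all signals */

    run_queued_handlers();                  /* should be near-instantaneous */

    sigprocmask(SIG_SETMASK, &old, NULL);   /* mask restored on every return */
}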
Re: 100% cpu , epoll_wait()
Hi,

On 19.04.2016 at 09:39, Veiko Kukk wrote:
> We use haproxy 1.6.3 (latest CentOS 6.7) and experience a similar situation
> after some reloads (-sf). The old haproxy process does not exit and uses
> 100% cpu, strace showing:
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
>
> In our case, it was a tcp backend tunnelling rsyslog messages. After
> restarting the local rsyslogd, the load was gone and the old haproxy
> instance exited. It's hard to tell how many reloads it takes to make
> haproxy go crazy, or what the exact reproducible test is. But it does not
> take hundreds of restarts, rather 10-20 (our reloads are not very
> frequent), to make haproxy go crazy.

Also matches this report from December:
https://www.mail-archive.com/haproxy@formilux.org/msg20772.html

Lukas
Re: 100% cpu , epoll_wait()
On 16/04/16 01:53, Jim Freeman wrote:
> I'm suspecting that a connection to the stats port goes wonky with a '-sf'
> reload, but I'll have to wait for it to re-appear to poke further. I'll
> look first for a stats port connection handled by the pegged process, then
> use 'tcpkill' to kill just that connection (rather than the whole process,
> which may be handling other connections).

We use haproxy 1.6.3 (latest CentOS 6.7) and experience a similar situation
after some reloads (-sf). The old haproxy process does not exit and uses 100%
cpu, strace showing:

epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0

In our case, it was a tcp backend tunnelling rsyslog messages. After
restarting the local rsyslogd, the load was gone and the old haproxy instance
exited. It's hard to tell how many reloads it takes to make haproxy go crazy,
or what the exact reproducible test is. But it does not take hundreds of
restarts, rather 10-20 (our reloads are not very frequent), to make haproxy
go crazy.

$ haproxy -vv
HA-Proxy version 1.6.3 2015/12/25
Copyright 2000-2015 Willy Tarreau

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.3
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Best regards,
Veiko
Re: 100% cpu , epoll_wait()
Did a bit more digging on the most recent instance, and found that the
haproxy pid doing the hogging was handling a connection to the stats port :

listen haproxy_stats :5
    stats enable
    stats uri /
    no log

, with this 'netstat -pantlu' entry :

tcp        0  99756 10.34.176.98:5    10.255.247.189:54484   ESTABLISHED  9499/haproxy

I'm suspecting that a connection to the stats port goes wonky with a '-sf'
reload, but I'll have to wait for it to re-appear to poke further. I'll look
first for a stats port connection handled by the pegged process, then use
'tcpkill' to kill just that connection (rather than the whole process, which
may be handling other connections).

It's been happening 2 to 3 times a week, and I now have alerting around the
event - I'll post more info as I get it ...

On Fri, Apr 15, 2016 at 4:28 PM, Cyril Bonté wrote:
> Hi Jim,
>
> On 15/04/2016 at 23:20, Jim Freeman wrote:
>> I have haproxy slaved to the 2nd CPU (CPU1), with frequent config changes
>> and a '-sf' soft-stop with the now-old, non-listening process nannying
>> old connections.
>>
>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>> latencies suffer across multiple haproxy peers.
>>
>> An strace of the nanny haproxy process shows a tight loop of :
>>
>> epoll_wait(0, {}, 200, 0) = 0
>> epoll_wait(0, {}, 200, 0) = 0
>> epoll_wait(0, {}, 200, 0) = 0
>>
>> I've searched the archives and found similar but old-ish complaints
>> about similar circumstances, but with fixes/patches mentioned.
>>
>> This has happened with both 1.5.3 and 1.5.17.
>>
>> Insights ?
>
> Can you provide your configuration (without sensitive data) ?
> Are you using peers ?
>
> Also, do you have a reproducible test case that we can play with, or is it
> absolutely random ?
>
>> ===
>>
>> # cat /proc/version
>> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
>> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
>> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
>>
>> # haproxy -vv
>> HA-Proxy version 1.5.17 2016/04/13
>> Copyright 2000-2016 Willy Tarreau
>>
>> Build options :
>>   TARGET  = linux2628
>>   CPU     = generic
>>   CC      = gcc
>>   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
>>             -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
>>
>> Default settings :
>>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>>
>> Encrypted password support via crypt(3): yes
>> Built with zlib version : 1.2.7
>> Compression algorithms supported : identity, deflate, gzip
>> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>> OpenSSL library supports TLS extensions : yes
>> OpenSSL library supports SNI : yes
>> OpenSSL library supports prefer-server-ciphers : yes
>> Built with PCRE version : 8.30 2012-02-04
>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>> Built with transparent proxy support using: IP_TRANSPARENT
>> IPV6_TRANSPARENT IP_FREEBIND
>>
>> Available polling systems :
>>    epoll : pref=300, test result OK
>>     poll : pref=200, test result OK
>>   select : pref=150, test result OK
>> Total: 3 (3 usable), will use epoll.
>
> --
> Cyril Bonté
Re: 100% cpu , epoll_wait()
Hi Jim,

On 15/04/2016 at 23:20, Jim Freeman wrote:
> I have haproxy slaved to the 2nd CPU (CPU1), with frequent config changes
> and a '-sf' soft-stop with the now-old, non-listening process nannying
> old connections.
>
> Sometimes CPU1 goes to 100%, and then a few minutes later request
> latencies suffer across multiple haproxy peers.
>
> An strace of the nanny haproxy process shows a tight loop of :
>
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
> epoll_wait(0, {}, 200, 0) = 0
>
> I've searched the archives and found similar but old-ish complaints
> about similar circumstances, but with fixes/patches mentioned.
>
> This has happened with both 1.5.3 and 1.5.17.
>
> Insights ?

Can you provide your configuration (without sensitive data) ?
Are you using peers ?

Also, do you have a reproducible test case that we can play with, or is it
absolutely random ?

> ===
>
> # cat /proc/version
> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
>
> # haproxy -vv
> HA-Proxy version 1.5.17 2016/04/13
> Copyright 2000-2016 Willy Tarreau
>
> Build options :
>   TARGET  = linux2628
>   CPU     = generic
>   CC      = gcc
>   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
>             -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
>
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>
> Encrypted password support via crypt(3): yes
> Built with zlib version : 1.2.7
> Compression algorithms supported : identity, deflate, gzip
> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
> OpenSSL library supports TLS extensions : yes
> OpenSSL library supports SNI : yes
> OpenSSL library supports prefer-server-ciphers : yes
> Built with PCRE version : 8.30 2012-02-04
> PCRE library supports JIT : no (USE_PCRE_JIT not set)
> Built with transparent proxy support using: IP_TRANSPARENT
> IPV6_TRANSPARENT IP_FREEBIND
>
> Available polling systems :
>    epoll : pref=300, test result OK
>     poll : pref=200, test result OK
>   select : pref=150, test result OK
> Total: 3 (3 usable), will use epoll.

--
Cyril Bonté
100% cpu , epoll_wait()
I have haproxy slaved to the 2nd CPU (CPU1), with frequent config changes and
a '-sf' soft-stop with the now-old, non-listening process nannying old
connections.

Sometimes CPU1 goes to 100%, and then a few minutes later request latencies
suffer across multiple haproxy peers.

An strace of the nanny haproxy process shows a tight loop of :

epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 0) = 0

I've searched the archives and found similar but old-ish complaints about
similar circumstances, but with fixes/patches mentioned.

This has happened with both 1.5.3 and 1.5.17.

Insights ?

===

# cat /proc/version
Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
(gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
3.16.7-ckt25-1~bpo70+1 (2016-04-02)

# haproxy -vv
HA-Proxy version 1.5.17 2016/04/13
Copyright 2000-2016 Willy Tarreau

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
            -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
  OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.30 2012-02-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.