Re: 100% cpu , epoll_wait()

2016-06-10 Thread Willy Tarreau
Hi Holger,

On Fri, Jun 10, 2016 at 04:32:55PM +0200, Holger Just wrote:
> Hi Willy et al.,
> 
> > Thank you for this report, it helps. How often does it happen, and/or after
> > how long on average after you start it ? What's your workload ? Do you use
> > SSL, compression, TCP and/or HTTP mode, peers synchronization, etc ?
> 
> Yesterday, we upgraded from 1.5.14 to 1.5.18 and now observed exactly
> this issue in production. After rolling back to 1.5.14, it didn't occur
> anymore.
> 
> We have mostly HTTP traffic and a little TCP, about 100-200 req/s and about
> 2000 concurrent connections overall. Almost all traffic is SSL
> terminated. We use no peer synchronization and no compression.
> 
> An strace on the process reveals this (with most of the calls being
> epoll_wait):
> 
> [...]
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {{EPOLLIN, {u32=796, u64=796}}}, 200, 0) = 1
> read(796, "
> \357\275Y\231\275'b\5\216#\33\220\337'\370\312\215sG4\316\275\277y-%\v\v\211\331\342"...,
> 5872) = 1452
> read(796, 0x9fa26ec, 4420)  = -1 EAGAIN (Resource
> temporarily unavailable)
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> [...]

Thank you for the report. I'll inspect the SSL part just in case I missed
something. Don't take risks with your production, of course.

Best regards,
Willy




Re: 100% cpu , epoll_wait()

2016-06-10 Thread Holger Just
Hi Willy et al.,

> Thank you for this report, it helps. How often does it happen, and/or after
> how long on average after you start it ? What's your workload ? Do you use
> SSL, compression, TCP and/or HTTP mode, peers synchronization, etc ?

Yesterday, we upgraded from 1.5.14 to 1.5.18 and now observed exactly
this issue in production. After rolling back to 1.5.14, it didn't occur
anymore.

We have mostly HTTP traffic and a little TCP, about 100-200 req/s and about
2000 concurrent connections overall. Almost all traffic is SSL
terminated. We use no peer synchronization and no compression.

An strace on the process reveals this (with most of the calls being
epoll_wait):

[...]
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {{EPOLLIN, {u32=796, u64=796}}}, 200, 0) = 1
read(796, "
\357\275Y\231\275'b\5\216#\33\220\337'\370\312\215sG4\316\275\277y-%\v\v\211\331\342"...,
5872) = 1452
read(796, 0x9fa26ec, 4420)  = -1 EAGAIN (Resource
temporarily unavailable)
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
[...]

The strace was done after reloading using -sf. However, the process was
at 100% load even before the reload.

Since we kept the process running after the reload (it still holds some
connections), I was able to run a second strace about half an hour later,
which now shows a different behavior:

[...]
epoll_wait(0, {}, 200, 4)   = 0
epoll_wait(0, {}, 200, 7)   = 0
epoll_wait(0, {}, 200, 3)   = 0
epoll_wait(0, {}, 200, 6)   = 0
epoll_wait(0, {}, 200, 3)   = 0
epoll_wait(0, {}, 200, 3)   = 0
epoll_wait(0, {}, 200, 10)  = 0
epoll_wait(0, {}, 200, 3)   = 0
epoll_wait(0, {}, 200, 27)  = 0
epoll_wait(0, {}, 200, 6)   = 0
epoll_wait(0, {}, 200, 4)   = 0
[...]

The CPU load taken by the process is now back to more or less idle load,
without further intervention on the process.
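
To make the two traces easier to compare: the last argument to epoll_wait()
is its timeout in milliseconds, so the first trace shows the process asking
the kernel to return immediately and getting zero events back, over and over,
which is what burns the CPU. Below is a purely illustrative sketch of the kind
of event loop that produces this pattern; it is not HAProxy's actual code, and
next_expiry_ms() is a made-up stand-in for however the next timer deadline is
computed.

#include <sys/epoll.h>

#define MAXPOLLEVENTS 200            /* same as "maxpollevents = 200" above */

/* Hypothetical stand-in for the next-timer computation; the failure mode
 * discussed in this thread behaves as if this kept returning 0.           */
static int next_expiry_ms(void)
{
    return 0;
}

void event_loop(int epfd)
{
    struct epoll_event ev[MAXPOLLEVENTS];

    for (;;) {
        /* timeout 0 = "return immediately"; a positive value = "sleep until
         * an event or the timeout", which is what the second, healthy trace
         * shows (3-27 ms timeouts).                                        */
        int n = epoll_wait(epfd, ev, MAXPOLLEVENTS, next_expiry_ms());

        for (int i = 0; i < n; i++) {
            /* ... handle ev[i] ... */
        }
        /* with nothing ready (n == 0) and a timeout of 0, the loop never
         * blocks and a whole CPU is spent doing nothing, as in the first
         * trace.                                                           */
    }
}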

`haproxy -vv` of the process running into the busy-loop shows

HA-Proxy version 1.5.18 2016/05/10
Copyright 2000-2016 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1t  3 May 2016
Running on OpenSSL version : OpenSSL 1.0.1t  3 May 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.35 2014-04-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT
IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Unfortunately, since we have rolled back production to 1.5.14, we now have
little opportunity to reproduce this anymore. The process which shows the
behavior is still running for the time being, though.

Regards,
Holger



Re: 100% cpu , epoll_wait()

2016-05-23 Thread Willy Tarreau
Hi Veiko,

On Mon, May 23, 2016 at 11:39:02AM +0300, Veiko Kukk wrote:
> I can confirm that on CentOS 6 with HAproxy 1.6.5 this 100% CPU load still
> happens. Exactly the same:
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, ^CProcess 6200 detached
>  
> 
> # haproxy -vv
> HA-Proxy version 1.6.5 2016/05/10
> Copyright 2000-2016 Willy Tarreau 
(...)

Thank you for this report, it helps. How often does it happen, and/or after
how long on average after you start it ? What's your workload ? Do you use
SSL, compression, TCP and/or HTTP mode, peers synchronization, etc ?

Willy



Re: 100% cpu , epoll_wait()

2016-05-23 Thread Veiko Kukk

On 18/05/16 15:42, Willy Tarreau wrote:

Hi Sebastian,

On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:

Hi Lukas,

starting from around 200mbit/s in, haproxy processes (nbproc 6) are
hitting 100% cpu regularly (noticed up to 3 processes at the same time with
100%), but recover again on their own after some time.

stracing such a process yesterday showed the following:
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0

Unfortunately I can't do any more debugging in this setup. HAproxy 1.5.14 is
never near to 10% cpu usage with way higher bandwidth.


So far I've got good reports from people who have experienced similar issues
with recent versions, so I'm wondering about something: are you certain
that you did a make clean after upgrading and before rebuilding ? Sometimes
we tend to forget it, especially after a simple "git pull". It is very
possible that some old .o files were not properly rebuilt and still contain
these bugs. If in doubt, you can simply keep a copy of your latest haproxy
binary, make clean, build again and run cmp between them. It should not
report any difference; otherwise it means there was an issue (which would be
great news).


I can confirm that on CentOS 6 with HAproxy 1.6.5 this 100% CPU load 
still happens. Exactly the same:

epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, ^CProcess 6200 detached
 

# haproxy -vv
HA-Proxy version 1.6.5 2016/05/10
Copyright 2000-2016 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing 
-Wdeclaration-after-statement

  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.3
Compression algorithms supported : identity("identity"), 
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")

Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT 
IPV6_TRANSPARENT IP_FREEBIND


Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Veiko




Re: 100% cpu , epoll_wait()

2016-05-18 Thread Willy Tarreau
On Wed, May 18, 2016 at 03:39:34PM +0200, Sebastian Heid wrote:
> Hi Willy,
> 
> thanks for the feedback. I'm using OBS (Open Build Service) for creating
> CentOS/RHEL RPMs, so it should be as clean as it can get.

OK thanks for checking.

> When I find some time, I'll try to reproduce it and will come back to you.

Thank you. For now I must confess I'm clueless about what could still
cause it.

Cheers,
Willy




Re: 100% cpu , epoll_wait()

2016-05-18 Thread Sebastian Heid
Hi Willy,

thanks for the feedback. I'm using OBS (Open Build Service) for creating 
CentOS/RHEL RPMs, so it should be as clean as it can get.

When I find some time, I'll try to reproduce it and will come back to you.

Cheers,
Sebastian
 
-Original Message-
> From: Willy Tarreau <w...@1wt.eu>
> Sent: Wed 18 May 2016 14:44
> To: Sebastian Heid <m...@sebastian-heid.de>
> CC: Lukas Tribus <lu...@gmx.net>; HAProxy <haproxy@formilux.org>
> Subject: Re: 100% cpu , epoll_wait()
> 
> Hi Sebastian,
> 
> On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:
> > Hi Lukas,
> > 
> > starting from around 200mbit/s in, haproxy processes (nbproc 6) are
> > hitting 100% cpu regularly (noticed up to 3 processes at the same time with
> > 100%), but recover again on their own after some time.
> > 
> > stracing such a process yesterday showed the following:
> > epoll_wait(0, {}, 200, 0)   = 0
> > epoll_wait(0, {}, 200, 0)   = 0
> > epoll_wait(0, {}, 200, 0)   = 0
> > epoll_wait(0, {}, 200, 0)   = 0
> > epoll_wait(0, {}, 200, 0)   = 0
> >  
> > Unfortunately I can't do any more debugging in this setup. HAproxy 1.5.14 is
> > never near to 10% cpu usage with way higher bandwidth.
> 
> So far I've got good reports from people who have experienced similar issues
> with recent versions, so I'm wondering about something: are you certain
> that you did a make clean after upgrading and before rebuilding ? Sometimes
> we tend to forget it, especially after a simple "git pull". It is very
> possible that some old .o files were not properly rebuilt and still contain
> these bugs. If in doubt, you can simply keep a copy of your latest haproxy
> binary, make clean, build again and run cmp between them. It should not
> report any difference; otherwise it means there was an issue (which would be
> great news).
> 
> Cheers,
> Willy
> 
> 
> 



Re: 100% cpu , epoll_wait()

2016-05-18 Thread Willy Tarreau
Hi Sebastian,

On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:
> Hi Lukas,
> 
> starting from around 200mbit/s in, haproxy processes (nbproc 6) are
> hitting 100% cpu regularly (noticed up to 3 processes at the same time with
> > 100%), but recover again on their own after some time.
> 
> stracing such a process yesterday showed the following:
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
>  
> Unfortunately I can't do any more debugging in this setup. HAproxy 1.5.14 is
> never near to 10% cpu usage with way higher bandwidth.

So far I've got good reports from people who have experienced similar issues
with recent versions, so I'm wondering about something: are you certain
that you did a make clean after upgrading and before rebuilding ? Sometimes
we tend to forget it, especially after a simple "git pull". It is very
possible that some old .o files were not properly rebuilt and still contain
these bugs. If in doubt, you can simply keep a copy of your latest haproxy
binary, make clean, build again and run cmp between them. It should not
report any difference; otherwise it means there was an issue (which would be
great news).

Cheers,
Willy




Re: 100% cpu , epoll_wait()

2016-05-13 Thread Willy Tarreau
Hi Lukas,

On Fri, May 13, 2016 at 06:36:38PM +0200, Lukas Tribus wrote:
> Not sure if that's what you meant by the other issue, but if there are still
> buffer issues, they may also have caused the reported crash in zlib (seen since
> 1.6.4 but also affecting 1.6.5); that would be the thread "Crash with kernel
> error", where we are waiting for the OP to provide a stack trace.

You're absolutely right. I analysed this one (when the crash moved to glibc)
and found, in another RHEL6 libc, that the instruction pointer was suspiciously
close to memcpy(), and there are indeed memcpy() calls in the compression path.
So assuming we get a negative length somewhere, all of this could be tied to
a single bug having various effects depending on the version, config and
environment. Now the question is : what could cause this ? We have Sasha's
dump, which will help us figure out what state his sessions are in and see in
the code 1) whether it's normal to end up in this state, and 2) if/why there
is no way to leave this state.
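
To illustrate the "negative length" hypothesis (this is not HAProxy code, just
a self-contained demonstration of why a miscomputed length ends up crashing
inside glibc's memcpy()):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char dst[16], src[16] = "some data";
    int room   = 5;                /* free space as computed by buffer code  */
    int needed = 8;                /* what the producer wants to copy        */
    int len    = room - needed;    /* miscalculation: len is now -3          */

    /* memcpy() takes a size_t, so a negative int converts to a huge
     * unsigned value; the copy then runs far past the buffer and faults,
     * which is how the instruction pointer ends up inside libc's memcpy(). */
    printf("len as int: %d, len as size_t: %zu\n", len, (size_t)len);

    if (len > 0)                   /* this guard keeps the demo from crashing */
        memcpy(dst, src, (size_t)len);
    return 0;
}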

Willy



Re: 100% cpu , epoll_wait()

2016-05-13 Thread Lukas Tribus

Hi Willy,


Am 13.05.2016 um 17:01 schrieb Willy Tarreau:

Hi Sebastian,

On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:

Hi Lukas,

starting from around 200mbit/s in, haproxy processes (nbproc 6) are hitting 
100% cpu regularly (noticed up to 3 processes at the same time with 100%), but 
recover again on their own after some time.

stracing such a process yesterday showed the following:
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
  
Unfortunately I can't do any more debugging in this setup. HAproxy 1.5.14 is never near to 10% cpu usage with way higher bandwidth.

So we still have another situation making haproxy think it can receive while
it cannot. There's a report of another issue possibly related to this on 1.6,
so let me try to understand it, then I'll check if it may cause this behaviour
on 1.5 (assuming it's present there as well). Don't take risks for now.


Not sure if that's what you meant by the other issue, but if there are 
still buffer issues, they may also have caused the reported crash in zlib 
(seen since 1.6.4 but also affecting 1.6.5); that would be the thread "Crash 
with kernel error", where we are waiting for the OP to provide a stack trace.



cheers,
Lukas





Re: 100% cpu , epoll_wait()

2016-05-13 Thread Willy Tarreau
Hi Sebastian,

On Thu, May 12, 2016 at 09:58:22AM +0200, Sebastian Heid wrote:
> Hi Lukas,
> 
> starting from around 200mbit/s in, haproxy processes (nbproc 6) are 
> hitting 100% cpu regularly (noticed up to 3 processes at the same time with 
> 100%), but recover again on their own after some time.
> 
> stracing such a process yesterday showed the following:
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
> epoll_wait(0, {}, 200, 0)   = 0
>  
> Unfortunately I can't do any more debugging in this setup. HAproxy 1.5.14 is 
> never near to 10% cpu usage with way higher bandwidth.

So we still have another situation making haproxy think it can receive while
it cannot. There's a report of another issue possibly related to this on 1.6,
so let me try to understand it, then I'll check if it may cause this behaviour
on 1.5 (assuming it's present there as well). Don't take risks for now.
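
As background for that sentence: with non-blocking sockets, a reader normally
calls read() speculatively and, on EAGAIN, must mark the fd as "not ready" and
let the poller wake it up later; if that readiness state is never cleared, the
event loop keeps retrying with a zero timeout. The sketch below shows the
expected pattern against a plain epoll loop; it is only an illustration, not
HAProxy's internal polling layer.

#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

/* Read as much as possible from a non-blocking fd; when the kernel reports
 * EAGAIN, re-arm the fd for EPOLLIN so the caller can block in epoll_wait()
 * instead of spinning.  Returns 1 when waiting, 0 on EOF, -1 on error.     */
static int drain_fd(int epfd, int fd, char *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);

        if (n > 0)
            continue;                     /* got data, try again             */
        if (n == 0)
            return 0;                     /* peer closed the connection      */
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            struct epoll_event ev;

            ev.events = EPOLLIN;          /* wake us up when data arrives    */
            ev.data.fd = fd;
            epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
            return 1;                     /* nothing to do until then        */
        }
        return -1;                        /* real error                      */
    }
}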

Thanks,
Willy




Re: 100% cpu , epoll_wait()

2016-05-11 Thread Lukas Tribus

Hi Sebastian,


Am 11.05.2016 um 16:07 schrieb Sebastian Heid:

Hi,

I updated from 1.5.17 to 1.5.18 today, but sadly this issue still exists in the 
latest version in our environment. However, downgrading to 1.5.14 "fixed" the 
issue for us.


Seems like a different issue then. Can you elaborate on what you are 
seeing? Sporadic 100% cpu load? Do you have to kill it or does it 
recover on its own? Can you strace it?




Thanks,

Lukas




Re: 100% cpu , epoll_wait()

2016-05-11 Thread Sebastian Heid
Hi,

I updated from 1.5.17 to 1.5.18 today, but sadly this issue still exists in the 
latest version in our environment. However, downgrading to 1.5.14 "fixed" the 
issue for us.

Running CentOS6, Linux XXX 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed Jan 28 21:11:36 
UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Build options :
  TARGET  = linux26
  CPU = generic
  CC  = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
  OPTIONS = USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built without zlib support (USE_ZLIB not set)
Compression algorithms supported : identity
Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IP_FREEBIND

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.


Bye,
Sebastian
 
 
-Original Message-
> From: Jim Freeman <sovr...@gmail.com>
> Sent: Fri 22 April 2016 01:21
> To: HAProxy <haproxy@formilux.org>
> Subject: Re: 100% cpu , epoll_wait()
> 
> [ Apologies for consuming yet more vertical space ]
> 
> With this in .cfg :
> log-format 
> ±{"date":"%t","lbtype":"haproxy","lbname":"%H","cip":"%ci","pid":"%pid","name_f":"%f","name_b":"%b","name_s":"%s","time_cr":"%Tq","time_dq":"%Tw","time_sc":"%Tc","time_sr":"%Tr","time_t":"%Tt","scode":"%ST","bytes_c":"%U","bytes_s":"%B","termstat":"%ts","con_act":"%ac","con_frnt":"%fc","con_back":"%bc","con_srv":"%sc","rtry":"%rc","queue_s":"%sq","queue_b":"%bq","rqst":"%r","hdrs":"%hr"}
> 
> , these requests logged with large %Tt (one request for favicon.ico,
> which gets answered?):
> =
> 4/21/16
> 3:06:36.268 PM
> { [-]
> bytes_c:  578
> bytes_s:  2485558
> cip:  10.107.152.81
> con_act:  43
> con_back:  0
> con_frnt:  0
> con_srv:  0
> date:  21/Apr/2016:21:06:36.268
> hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
> lbname:  haproxy01
> lbtype:  haproxy
> name_b:  haproxy_stats
> name_f:  haproxy_stats
> name_s:  
> pid:  20030
> queue_b:  0
> queue_s:  0
> rqst:  GET /favicon.ico HTTP/1.1
> rtry:  0
> scode:  200
> termstat:  LR
> time_cr:  5874
> time_dq:  0
> time_sc:  0
> time_sr:  0
> time_t:  992288
> }
> host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy
> 
> 4/21/16
> 3:06:36.268 PM
> { [-]
> bytes_c:  577
> bytes_s:  3091670
> cip:  10.107.152.81
> con_act:  198
> con_back:  0
> con_frnt:  1
> con_srv:  0
> date:  21/Apr/2016:21:06:36.268
> hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
> lbname:  haproxy01
> lbtype:  haproxy
> name_b:  haproxy_stats
> name_f:  haproxy_stats
> name_s:  
> pid:  20030
> queue_b:  0
> queue_s:  0
> rqst:  GET / HTTP/1.1
> rtry:  0
> scode:  200
> termstat:  LR
> time_cr:  107
> time_dq:  0
> time_sc:  0
> time_sr:  0
> time_t:  2493
> }
> host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy
> 
> 4/21/16
> 3:05:06.722 PM
> { [-]
> bytes_c:  577
> bytes_s:  2448514
> cip:  10.107.152.81
> con_act:  1133
> con_back:  0
> con_frnt:  0
> con_srv:  0
> date:  21/Apr/2016:21:05:06.722
> hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
> lbname:  haproxy01
> lbtype:  haproxy
> name_b:  haproxy_stats
> name_f:  haproxy_stats
> name_s:  
> pid:  20030
> queue_b:  0
> queue_s:  0
> rqst:  GET / HTTP/1.1
> rtry:  0
> scode:  200
> termstat:  LR
> time_cr

Re: 100% cpu , epoll_wait()

2016-04-21 Thread Jim Freeman
[ Apologies for consuming yet more vertical space ]

With this in .cfg :
log-format 
±{"date":"%t","lbtype":"haproxy","lbname":"%H","cip":"%ci","pid":"%pid","name_f":"%f","name_b":"%b","name_s":"%s","time_cr":"%Tq","time_dq":"%Tw","time_sc":"%Tc","time_sr":"%Tr","time_t":"%Tt","scode":"%ST","bytes_c":"%U","bytes_s":"%B","termstat":"%ts","con_act":"%ac","con_frnt":"%fc","con_back":"%bc","con_srv":"%sc","rtry":"%rc","queue_s":"%sq","queue_b":"%bq","rqst":"%r","hdrs":"%hr"}

, these requests logged with large %Tt (one request for favicon.ico,
which gets answered?):
=
4/21/16
3:06:36.268 PM
{ [-]
bytes_c:  578
bytes_s:  2485558
cip:  10.107.152.81
con_act:  43
con_back:  0
con_frnt:  0
con_srv:  0
date:  21/Apr/2016:21:06:36.268
hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
lbname:  haproxy01
lbtype:  haproxy
name_b:  haproxy_stats
name_f:  haproxy_stats
name_s:  
pid:  20030
queue_b:  0
queue_s:  0
rqst:  GET /favicon.ico HTTP/1.1
rtry:  0
scode:  200
termstat:  LR
time_cr:  5874
time_dq:  0
time_sc:  0
time_sr:  0
time_t:  992288
}
host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy

4/21/16
3:06:36.268 PM
{ [-]
bytes_c:  577
bytes_s:  3091670
cip:  10.107.152.81
con_act:  198
con_back:  0
con_frnt:  1
con_srv:  0
date:  21/Apr/2016:21:06:36.268
hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
lbname:  haproxy01
lbtype:  haproxy
name_b:  haproxy_stats
name_f:  haproxy_stats
name_s:  
pid:  20030
queue_b:  0
queue_s:  0
rqst:  GET / HTTP/1.1
rtry:  0
scode:  200
termstat:  LR
time_cr:  107
time_dq:  0
time_sc:  0
time_sr:  0
time_t:  2493
}
host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy

4/21/16
3:05:06.722 PM
{ [-]
bytes_c:  577
bytes_s:  2448514
cip:  10.107.152.81
con_act:  1133
con_back:  0
con_frnt:  0
con_srv:  0
date:  21/Apr/2016:21:05:06.722
hdrs:  {Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.}
lbname:  haproxy01
lbtype:  haproxy
name_b:  haproxy_stats
name_f:  haproxy_stats
name_s:  
pid:  20030
queue_b:  0
queue_s:  0
rqst:  GET / HTTP/1.1
rtry:  0
scode:  200
termstat:  LR
time_cr:  126
time_dq:  0
time_sc:  0
time_sr:  0
time_t:  88490
}
host = haproxy01.a source = /logs/haproxy.log sourcetype = haproxy

On Thu, Apr 21, 2016 at 5:10 PM, Jim Freeman  wrote:
> Another alert+followup :
>
> Cpu pegged again - connected to host and ran :
> ==
> # netstat -pantu | egrep "(^Proto|:5)"
> Proto Recv-Q Send-Q Local Address   Foreign Address
> State   PID/Program name
> tcp0  0 0.0.0.0:5   0.0.0.0:*
> LISTEN  7944/haproxy
> tcp0  0 10.33.176.98:5  10.34.157.166:53155
> TIME_WAIT   -
> tcp0 191520 10.33.176.98:5  10.107.152.81:59029
> ESTABLISHED 20030/haproxy
> tcp0  0 10.33.176.98:5  10.34.155.182:43154
> TIME_WAIT   -
> tcp0  0 10.33.176.98:5  10.34.157.165:37806
> TIME_WAIT   -
>
> # the request with un-ACK'd Send-Q data looks suspicious - kill it
> # ./killcx 10.107.152.81:59029
> killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - http://killcx.sourceforge.net/
> [PARENT] checking connection with [10.107.152.81:59029]
> [PARENT] found connection with [10.33.176.98:5] (ESTABLISHED)
> [PARENT] forking child
> [CHILD]  interface not defined, will use [eth0]
> [CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
> [PARENT] sending spoofed SYN to [10.33.176.98:5] with bogus SeqNum
> [CHILD]  hooked ACK from [10.33.176.98:5]
> [CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
> [CHILD]  sending spoofed RST to [10.33.176.98:5] with SeqNum [2424084881]
> [CHILD]  sending RST to remote host as well with SeqNum [2973703732]
> [CHILD]  all done, sending USR1 signal to parent [8077] and exiting
> [PARENT] received child signal, checking results...
>  => success : connection has been closed !
> ==
>
> Right after that, cpu/latency show normal.
>
> I'm unsure if this is a leading or lagging anomaly - it seems to
> follow another strangeness, where ~5 minutes prior, the cpu usage
> across several haproxy hosts drops by 40 %-points [graph attached]
>
> On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman  wrote:
>> Followup: alert triggered this AM - I'll provide what bits I was able
>> to glean.  [ HA-Proxy version 1.5.17 ]
>>
>> A proxy's CPU1 pegged @10:21.  To isolate the connections to a
>> non-listening nanny proc, did a '-sf' reload at 10:24.
>>
>> After the reload, latencies on the proxy of interest rose by an order
>> of magnitude (historically, when this condition lingers, request
>> timings 

Re: 100% cpu , epoll_wait()

2016-04-21 Thread Jim Freeman
Another alert+followup :

Cpu pegged again - connected to host and ran :
==
# netstat -pantu | egrep "(^Proto|:5)"
Proto Recv-Q Send-Q Local Address   Foreign Address
State   PID/Program name
tcp0  0 0.0.0.0:5   0.0.0.0:*
LISTEN  7944/haproxy
tcp0  0 10.33.176.98:5  10.34.157.166:53155
TIME_WAIT   -
tcp0 191520 10.33.176.98:5  10.107.152.81:59029
ESTABLISHED 20030/haproxy
tcp0  0 10.33.176.98:5  10.34.155.182:43154
TIME_WAIT   -
tcp0  0 10.33.176.98:5  10.34.157.165:37806
TIME_WAIT   -

# the request with un-ACK'd Send-Q data looks suspicious - kill it
# ./killcx 10.107.152.81:59029
killcx v1.0.3 - (c)2009-2011 Jerome Bruandet - http://killcx.sourceforge.net/
[PARENT] checking connection with [10.107.152.81:59029]
[PARENT] found connection with [10.33.176.98:5] (ESTABLISHED)
[PARENT] forking child
[CHILD]  interface not defined, will use [eth0]
[CHILD]  setting up filter to sniff ACK on [eth0] for 5 seconds
[PARENT] sending spoofed SYN to [10.33.176.98:5] with bogus SeqNum
[CHILD]  hooked ACK from [10.33.176.98:5]
[CHILD]  found AckNum [2424084881] and SeqNum [2973703732]
[CHILD]  sending spoofed RST to [10.33.176.98:5] with SeqNum [2424084881]
[CHILD]  sending RST to remote host as well with SeqNum [2973703732]
[CHILD]  all done, sending USR1 signal to parent [8077] and exiting
[PARENT] received child signal, checking results...
 => success : connection has been closed !
==

Right after that, cpu/latency show normal.

I'm unsure if this is a leading or lagging anomaly - it seems to
follow another strangeness, where ~5 minutes prior, the cpu usage
across several haproxy hosts drops by 40 %-points [graph attached]
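
For anyone who wants to watch the same thing from inside a program rather than
via netstat: on Linux, the SIOCOUTQ ioctl (from <linux/sockios.h>) reports how
much data is still queued in a TCP socket's send buffer, which is roughly what
the Send-Q column above shows. A small illustrative sketch:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>      /* SIOCOUTQ */

/* Print how much data is still queued for transmission on a connected TCP
 * socket.  A large value that never shrinks, as in the netstat output above,
 * suggests the peer has stopped reading/acknowledging.                      */
static void report_sendq(int fd)
{
    int pending = 0;

    if (ioctl(fd, SIOCOUTQ, &pending) == 0)
        printf("fd %d: %d bytes still queued in the send buffer\n", fd, pending);
    else
        perror("ioctl(SIOCOUTQ)");
}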

On Thu, Apr 21, 2016 at 11:44 AM, Jim Freeman  wrote:
> Followup: alert triggered this AM - I'll provide what bits I was able
> to glean.  [ HA-Proxy version 1.5.17 ]
>
> A proxy's CPU1 pegged @10:21.  To isolate the connections to a
> non-listening nanny proc, did a '-sf' reload at 10:24.
>
> After the reload, latencies on the proxy of interest rose by an order
> of magnitude (historically, when this condition lingers, request
> timings across all proxies/system often suffer substantially).
>
> At about  10:35 the pegged CPU resolved spontaneously (connections on
> the nanny process were finishing - a connection triggering the
> epoll_wait() busyloop terminated?), and timings returned to normal.
>
> Splunk graphs attached (if they're allowed through).
> cpuBusy.png (y-axis => %cpuBusy)
> latency.png (y-axis => Td = Tt - (Tq + Tw + Tc + Tr)
>
> If it's of any use, here's the splunk search that triggers the alert:
> index=os sourcetype=cpu host=haproxy0* | multikv | search CPU=1 | eval
> cpuBusy=100-pctIdle | anomalousvalue pthresh=0.02 maxanofreq=0.2
> minsupcount=50 action=annotate cpuBusy | search cpuBusy=100
> Anomaly_Score_Num\(cpuBusy\)>0 | stats count dc(host) as hosts | where
> count > hosts
>
> On Fri, Apr 15, 2016 at 3:20 PM, Jim Freeman  wrote:
>> I have haproxy slaved to the 2nd cpu (CPU1), with frequent config changes
>> and a '-sf' soft-stop with the now-old non-listening process nannying
>> old connections.
>>
>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>> latencies suffer across multiple haproxy peers.
>>
>> An strace of the nanny haproxy process shows a tight loop of :
>>
>> epoll_wait(0, {}, 200, 0)   = 0
>> epoll_wait(0, {}, 200, 0)   = 0
>> epoll_wait(0, {}, 200, 0)   = 0
>>
>> I've searched the archives and found similar but old-ish complaints
>> about similar circumstances, but with fixes/patches mentioned.
>>
>> This has happened with both 1.5.3 and 1.5.17.
>>
>> Insights ?
>>
>> ===
>>
>> # cat  /proc/version
>> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
>> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
>> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
>>
>> # haproxy -vv
>> HA-Proxy version 1.5.17 2016/04/13
>> Copyright 2000-2016 Willy Tarreau 
>>
>> Build options :
>>   TARGET  = linux2628
>>   CPU = generic
>>   CC  = gcc
>>   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
>> -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
>>
>> Default settings :
>>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>>
>> Encrypted password support via crypt(3): yes
>> Built with zlib version : 1.2.7
>> Compression algorithms supported : identity, deflate, gzip
>> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>> OpenSSL library supports TLS extensions : yes
>> OpenSSL library supports SNI : yes
>> OpenSSL library supports prefer-server-ciphers : yes
>> Built with PCRE version : 8.30 2012-02-04
>> PCRE library 

Re: 100% cpu , epoll_wait()

2016-04-20 Thread Willy Tarreau
On Wed, Apr 20, 2016 at 11:50:20AM +0300, Veiko Kukk wrote:
> >is not good unfortunately. I'm assuming this is with 1.5.17, that's it ? If
> >so we still have an issue :-/
> 
> It is 1.6.3
> 
> # haproxy -vv
> HA-Proxy version 1.6.3 2015/12/25
> Copyright 2000-2015 Willy Tarreau 

OK, so this one definitely still has the problem. Don't rush too fast to 1.6.4
as it contains an improper fix. The latest snapshot is better. I'm working
on a remaining corner case, but it's unlikely you'll hit it anyway.

Willy



Re: 100% cpu , epoll_wait()

2016-04-20 Thread Veiko Kukk

On 20/04/16 11:43, Willy Tarreau wrote:

On Tue, Apr 19, 2016 at 09:53:36PM +0300, Veiko Kukk wrote:

On 19/04/16 18:52, Willy Tarreau wrote:

On Tue, Apr 19, 2016 at 04:15:08PM +0200, Willy Tarreau wrote:
OK in fact it's different. Above we have a busy polling loop, which may
very well be caused by the buffer space miscalculation bug and which results
in a process not completing its job until a timeout strikes. The link to
the other report shows a normal polling with blocked signals.


The process that was created yesterday via soft reload went to 100% cpu
today.

haproxy  29388  5.0  0.0  58772 11700 ?Rs   Apr17 156:44
/usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf
1997

Section from strace output:


(...)

this below :


epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0


is not good unfortunately. I'm assuming this is with 1.5.17, that's it ? If
so we still have an issue :-/


It is 1.6.3

# haproxy -vv
HA-Proxy version 1.6.3 2015/12/25
Copyright 2000-2015 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing 
-Wdeclaration-after-statement

  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.3
Compression algorithms supported : identity("identity"), 
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")

Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT 
IPV6_TRANSPARENT IP_FREEBIND


Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Veiko




Re: 100% cpu , epoll_wait()

2016-04-19 Thread Veiko Kukk

On 19/04/16 18:52, Willy Tarreau wrote:

On Tue, Apr 19, 2016 at 04:15:08PM +0200, Willy Tarreau wrote:
OK in fact it's different. Above we have a busy polling loop, which may
very well be caused by the buffer space miscalculation bug and which results
in a process not completing its job until a timeout strikes. The link to
the other report shows a normal polling with blocked signals.


The process that was created yesterday via soft reload went to 100% cpu 
today.


haproxy  29388  5.0  0.0  58772 11700 ?Rs   Apr17 156:44 
/usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid 
-sf 1997


Section from strace output:

epoll_wait(0, {}, 200, 0)   = 0
recvfrom(32, 
"\366\334\247\270<\230\3028\v\334\236K\204^p\31\6\3T\230:\23s\257\337\316\242\302]\2\246\227"..., 
15368, 0, NULL, NULL) = 15368
recvfrom(32, 
"\366\334si\251\272Y\372\360'/\363\212\246\262w\307[\251\375\314\236whe\302\337\257\25NQ\370"..., 
1024, 0, NULL, NULL) = 1024
sendto(18, 
"\366\334\247\270<\230\3028\v\334\236K\204^p\31\6\3T\230:\23s\257\337\316\242\302]\2\246\227"..., 
16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 8016
sendto(18, 
"\355\265\207\360\357\3046k\364\320\330\30d\247\354\273BE\201\337\4\265#\357Z\231\231\337\365*\242\345"..., 
8376, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
epoll_ctl(0, EPOLL_CTL_MOD, 18, {EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=18, 
u64=18}}) = 0

epoll_wait(0, {}, 200, 0)   = 0
recvfrom(32, 
"@OR\224\335\233\263\347U\245X\376)\240\342\334\242\31\321\322\354\222\276\233\247\316-\263\370)\252U"..., 
8016, 0, NULL, NULL) = 8016

epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {{EPOLLOUT, {u32=53, u64=53}}}, 200, 0) = 1
sendto(53, 
"\274'[\24\n\264*b\306\253YA\313A\36\202a\177\317\370K:\302\230\315.\315\215\f&\351\27"..., 
14032, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 14032
sendto(53, 
"\234CS\236wYsf\267\24\276v\325\302\267+a\303\336\250\211x\236\33\23MR_\324\214A\264"..., 
2360, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 2360
recvfrom(55, 
"\231\16\35\337\20\203V\344\360\202n\307\2120\213\r\353\312\334\357\205\366=\\\373|\210\4-\354\32\360"..., 
15368, 0, NULL, NULL) = 15368
recvfrom(55, 
"i\244\305N\242I\177n'4g\211\256%\26X\34il\3374\34HN\22\365\357\211Y\354\306K"..., 
1024, 0, NULL, NULL) = 1024
sendto(53, 
"\231\16\35\337\20\203V\344\360\202n\307\2120\213\r\353\312\334\357\205\366=\\\373|\210\4-\354\32\360"..., 
16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 16392

epoll_ctl(0, EPOLL_CTL_MOD, 53, {EPOLLIN|EPOLLRDHUP, {u32=53, u64=53}}) = 0
epoll_wait(0, {}, 200, 0)   = 0
recvfrom(55, 
"\365f\303r(\1\365S\276\246c\334\216\346\226\10<}\340\227h\374\370\360\276sSs\346\351\337\370"..., 
15368, 0, NULL, NULL) = 15368
recvfrom(55, 
"-\r\21\326\326\0\0>\346-?\375\325J\346N\336\353Jz\376\303\373?\226y}\317\257\371\304t"..., 
1024, 0, NULL, NULL) = 1024
sendto(53, 
"\365f\303r(\1\365S\276\246c\334\216\346\226\10<}\340\227h\374\370\360\276sSs\346\351\337\370"..., 
16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 16392

epoll_wait(0, {}, 200, 0)   = 0
recvfrom(55, 
"\251\3\0200\317\217ab\223\f\306\322/}J\231\4\3b\311h\220sq\220[\225\21\372\264Dv"..., 
15368, 0, NULL, NULL) = 15368
recvfrom(55, 
"\233.\20B\337\343\274\311\212\211\241\244\5\257\221w1{\253Kjh\23?w\357\365\377\335\261\3\215"..., 
1024, 0, NULL, NULL) = 1024
sendto(53, 
"\251\3\0200\317\217ab\223\f\306\322/}J\231\4\3b\311h\220sq\220[\225\21\372\264Dv"..., 
16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 16392

epoll_wait(0, {}, 200, 0)   = 0
recvfrom(55, 
",T\27\22\300\31\231t\207%j-\263}\344\25#\333\235\214*M\227\26\0215*_\312/@\351"..., 
15368, 0, NULL, NULL) = 15368
recvfrom(55, 
"\225\256\37Qib\371\377\220l\342\20\2742\271\3360U\224\0375?ju\10\207\235J\267\35\340\367"..., 
1024, 0, NULL, NULL) = 1024
sendto(53, 
",T\27\22\300\31\231t\207%j-\263}\344\25#\333\235\214*M\227\26\0215*_\312/@\351"..., 
16392, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = 13312
sendto(53, 
"\372\265\334\263\232\2016l2\216\372\261B\26\243\252\204\220\353\f\367\215\331\232\203hI,\260\37\207\357"..., 
3080, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE, NULL, 0) = -1 EAGAIN (Resource 
temporarily unavailable)
epoll_ctl(0, EPOLL_CTL_MOD, 53, {EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=53, 
u64=53}}) = 0

epoll_wait(0, {}, 200, 0)   = 0
recvfrom(55, 
"k\33\342U\260:Z\350\3725>\211R@\20\347\326\363\203\36?\226\304\241\367\263B\242\230\6^\221"..., 
13312, 0, NULL, NULL) = 13312

epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0

Re: 100% cpu , epoll_wait()

2016-04-19 Thread Willy Tarreau
On Tue, Apr 19, 2016 at 04:15:08PM +0200, Willy Tarreau wrote:
> On Tue, Apr 19, 2016 at 02:54:35PM +0200, Lukas Tribus wrote:
> > >We use haproxy 1.6.3 (latest CentOS 6.7) and experience a similar situation
> > >after some reloads (-sf). The old haproxy process does not exit and uses
> > >100% cpu, strace showing:
> > >epoll_wait(0, {}, 200, 0)   = 0
> > >epoll_wait(0, {}, 200, 0)   = 0
> > >epoll_wait(0, {}, 200, 0)   = 0
> > >epoll_wait(0, {}, 200, 0)   = 0
> > >epoll_wait(0, {}, 200, 0)   = 0
> > >epoll_wait(0, {}, 200, 0)   = 0
> > >
> > >In our case, it was a tcp backend tunnelling rsyslog messages. After
> > >restarting local rsyslogd, the load was gone and old haproxy instance
> > >exited. It's hard to tell how many reloads it takes to make haproxy go
> > >crazy or what is the exact reproducible test. But it does not take
> > >hundreds of restarts, rather 10-20 (our reloads are not very frequent) to
> > >make haproxy go crazy.
> > 
> > Also matches this report from December:
> > https://www.mail-archive.com/haproxy@formilux.org/msg20772.html
> 
> Yep very likely. The combination of the two reports is very intriguing.
> The first one shows the signals being blocked, while the only place where
> we block them is in __signal_process_queue(), and only while calling the handlers
> or performing the wakeup() calls, both of which should be instantaneous,
> and more importantly the function cannot return without unblocking the
> signals.
> 
> I still have no idea what is going on; the code looks simple and clear,
> and certainly not compatible with such behaviours. I'm still digging.

OK in fact it's different. Above we have a busy polling loop, which may
very well be caused by the buffer space miscalculation bug and which results
in a process not completing its job until a timeout strikes. The link to
the other report shows a normal polling with blocked signals.

Willy




Re: 100% cpu , epoll_wait()

2016-04-19 Thread Willy Tarreau
Hi guys,

On Tue, Apr 19, 2016 at 02:54:35PM +0200, Lukas Tribus wrote:
> >We use haproxy 1.6.3 (latest CentOS 6.7) and experience a similar situation
> >after some reloads (-sf). The old haproxy process does not exit and uses
> >100% cpu, strace showing:
> >epoll_wait(0, {}, 200, 0)   = 0
> >epoll_wait(0, {}, 200, 0)   = 0
> >epoll_wait(0, {}, 200, 0)   = 0
> >epoll_wait(0, {}, 200, 0)   = 0
> >epoll_wait(0, {}, 200, 0)   = 0
> >epoll_wait(0, {}, 200, 0)   = 0
> >
> >In our case, it was a tcp backend tunnelling rsyslog messages. After
> >restarting local rsyslogd, the load was gone and old haproxy instance
> >exited. It's hard to tell how many reloads it takes to make haproxy go
> >crazy or what is the exact reproducible test. But it does not take
> >hundreds of restarts, rather 10-20 (our reloads are not very frequent) to
> >make haproxy go crazy.
> 
> Also matches this report from December:
> https://www.mail-archive.com/haproxy@formilux.org/msg20772.html

Yep very likely. The combination of the two reports is very intriguing.
The first one shows the signals being blocked, while the only place where
we block them is in __signal_process_queue(), and only while calling the handlers
or performing the wakeup() calls, both of which should be instantaneous,
and more importantly the function cannot return without unblocking the
signals.

I still have no idea what is going on; the code looks simple and clear,
and certainly not compatible with such behaviours. I'm still digging.
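
For context, "blocking the signals around the handler calls" usually follows
the generic pattern sketched below (illustrative only, not the actual
__signal_process_queue() code): the mask is widened with sigprocmask() just
for the handler calls and is always restored before returning.

#include <signal.h>

static void process_signal_queue(void (*handler[])(int), const int *queue,
                                 int count)
{
    sigset_t all, old;

    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &old);   /* block everything briefly       */

    for (int i = 0; i < count; i++)
        handler[queue[i]](queue[i]);        /* should be near-instantaneous   */

    sigprocmask(SIG_SETMASK, &old, NULL);   /* unblocked again on the way out */
}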

Willy




Re: 100% cpu , epoll_wait()

2016-04-19 Thread Lukas Tribus

Hi,


Am 19.04.2016 um 09:39 schrieb Veiko Kukk:



We use haproxy 1.6.3 (latest CentOS 6.7) and experience a similar 
situation after some reloads (-sf). The old haproxy process does not 
exit and uses 100% cpu, strace showing:

epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0

In our case, it was a tcp backend tunnelling rsyslog messages. After 
restarting local rsyslogd, the load was gone and old haproxy instance 
exited. It's hard to tell how many reloads it takes to make haproxy go 
crazy or what is the exact reproducible test. But it does not take 
hundreds of restarts, rather 10-20 (our reloads are not very frequent) 
to make haproxy go crazy.


Also matches this report from December:
https://www.mail-archive.com/haproxy@formilux.org/msg20772.html



Lukas



Re: 100% cpu , epoll_wait()

2016-04-19 Thread Veiko Kukk


On 16/04/16 01:53, Jim Freeman wrote:

I'm suspecting that a connection to the stats port goes wonky with a
'-sf' reload, but I'll have to wait for it to re-appear to poke
further.  I'll look first for a stats port connection handled by the
pegged process, then use 'tcpkill' to kill just that connection
(rather than the whole process, which may be handling other
connections).


We use haproxy 1.6.3 (latest CentOS 6.7) and experience a similar 
situation after some reloads (-sf). The old haproxy process does not 
exit and uses 100% cpu, strace showing:

epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0

In our case, it was a tcp backend tunnelling rsyslog messages. After 
restarting local rsyslogd, the load was gone and old haproxy instance 
exited. It's hard to tell how many reloads it takes to make haproxy go 
crazy or what is the exact reproducible test. But it does not take 
hundreds of restarts, rather 10-20 (our reloads are not very frequent) to 
make haproxy go crazy.


$ haproxy -vv
HA-Proxy version 1.6.3 2015/12/25
Copyright 2000-2015 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing 
-Wdeclaration-after-statement

  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.3
Compression algorithms supported : identity("identity"), 
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")

Built with OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e-fips 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.8 2008-09-05
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT 
IPV6_TRANSPARENT IP_FREEBIND


Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Best regards,
Veiko



Re: 100% cpu , epoll_wait()

2016-04-15 Thread Jim Freeman
Did a bit more digging on the most recent instance, and found that the
haproxy pid doing the hogging was handling a connection to the stats
port :

listen haproxy_stats :5
stats enable
stats uri /
no log

, with this 'netstat -pantlu' entry :
tcp0  99756 10.34.176.98:5  10.255.247.189:54484
ESTABLISHED 9499/haproxy

I'm suspecting that a connection to the stats port goes wonky with a
'-sf' reload, but I'll have to wait for it to re-appear to poke
further.  I'll look first for a stats port connection handled by the
pegged process, then use 'tcpkill' to kill just that connection
(rather than the whole process, which may be handling other
connections).

It's been happening 2 to 3 times a week, and I now have alerting around
the event - I'll post more info as I get it ...


On Fri, Apr 15, 2016 at 4:28 PM, Cyril Bonté  wrote:
> Hi Jim,
>
> Le 15/04/2016 23:20, Jim Freeman a écrit :
>>
>> I have haproxy slaved to the 2nd cpu (CPU1), with frequent config changes
>> and a '-sf' soft-stop with the now-old non-listening process nannying
>> old connections.
>>
>> Sometimes CPU1 goes to 100%, and then a few minutes later request
>> latencies suffer across multiple haproxy peers.
>>
>> An strace of the nanny haproxy process shows a tight loop of :
>>
>> epoll_wait(0, {}, 200, 0)   = 0
>> epoll_wait(0, {}, 200, 0)   = 0
>> epoll_wait(0, {}, 200, 0)   = 0
>>
>> I've searched the archives and found similar but old-ish complaints
>> about similar circumstances, but with fixes/patches mentioned.
>>
>> This has happened with both 1.5.3 and 1.5.17.
>>
>> Insights ?
>
>
> Can you provide your configuration (without sensitive data) ?
> Are you using peers ?
>
> Also, do you have a reproducible test case that we can play with, or is it
> absolutely random ?
>
>
>
>>
>> ===
>>
>> # cat  /proc/version
>> Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
>> (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
>> 3.16.7-ckt25-1~bpo70+1 (2016-04-02)
>>
>> # haproxy -vv
>> HA-Proxy version 1.5.17 2016/04/13
>> Copyright 2000-2016 Willy Tarreau 
>>
>> Build options :
>>TARGET  = linux2628
>>CPU = generic
>>CC  = gcc
>>CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
>> -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>>OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1
>>
>> Default settings :
>>maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
>>
>> Encrypted password support via crypt(3): yes
>> Built with zlib version : 1.2.7
>> Compression algorithms supported : identity, deflate, gzip
>> Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>> Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
>> OpenSSL library supports TLS extensions : yes
>> OpenSSL library supports SNI : yes
>> OpenSSL library supports prefer-server-ciphers : yes
>> Built with PCRE version : 8.30 2012-02-04
>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>> Built with transparent proxy support using: IP_TRANSPARENT
>> IPV6_TRANSPARENT IP_FREEBIND
>>
>> Available polling systems :
>>epoll : pref=300,  test result OK
>> poll : pref=200,  test result OK
>>   select : pref=150,  test result OK
>> Total: 3 (3 usable), will use epoll.
>>
>
>
> --
> Cyril Bonté



Re: 100% cpu , epoll_wait()

2016-04-15 Thread Cyril Bonté

Hi Jim,

Le 15/04/2016 23:20, Jim Freeman a écrit :

I have haproxy slaved to the 2nd cpu (CPU1), with frequent config changes
and a '-sf' soft-stop with the now-old non-listening process nannying
old connections.

Sometimes CPU1 goes to 100%, and then a few minutes later request
latencies suffer across multiple haproxy peers.

An strace of the nanny haproxy process shows a tight loop of :

epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0
epoll_wait(0, {}, 200, 0)   = 0

I've searched the archives and found similar but old-ish complaints
about similar circumstances, but with fixes/patches mentioned.

This has happened with both 1.5.3 and 1.5.17.

Insights ?


Can you provide your configuration (without sensitive data) ?
Are you using peers ?

Also, do you have a reproducible test case that we can play with, or is 
it absolutely random ?





===

# cat  /proc/version
Linux version 3.16.0-0.bpo.4-amd64 (debian-ker...@lists.debian.org)
(gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian
3.16.7-ckt25-1~bpo70+1 (2016-04-02)

# haproxy -vv
HA-Proxy version 1.5.17 2016/04/13
Copyright 2000-2016 Willy Tarreau 

Build options :
   TARGET  = linux2628
   CPU = generic
   CC  = gcc
   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4
-Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.30 2012-02-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT
IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
   epoll : pref=300,  test result OK
poll : pref=200,  test result OK
  select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.




--
Cyril Bonté