Hi Willy et al.,

> Thank you for this report, it helps. How often does it happen, and/or after
> how long on average after you start it ? What's your workload ? Do you use
> SSL, compression, TCP and/or HTTP mode, peers synchronization, etc ?

Yesterday we upgraded from 1.5.14 to 1.5.18 and then observed exactly
this issue in production. After rolling back to 1.5.14, it has not
occurred again.

Our traffic is mostly HTTP, with a little TCP, at about 100-200 req/s
and roughly 2000 concurrent connections overall. Nearly all traffic is
SSL-terminated. We use neither peer synchronization nor compression.

An strace on the process reveals this (with most of the calls being
epoll_wait):

[...]
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {{EPOLLIN, {u32=796, u64=796}}}, 200, 0) = 1
read(796, "
\357\275Y\231\275'b\5\216#\33\220\337'\370\312\215sG4\316\275\277y-%\v\v\211\331\342"...,
5872) = 1452
read(796, 0x9fa26ec, 4420)              = -1 EAGAIN (Resource
temporarily unavailable)
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
epoll_wait(0, {}, 200, 0)               = 0
[...]
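
For anyone wanting to capture the same thing, attaching strace roughly along these lines should reproduce the output above (<pid> is a placeholder for the spinning process; the exact flags are just a suggestion):

  # show time spent in each syscall (-T) and limit the noise to the relevant calls
  strace -T -e trace=epoll_wait,read -p <pid>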

The strace was taken after reloading with -sf; however, the process was
already at 100% CPU even before the reload.
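
For completeness, the reload was the usual soft reload, roughly like this (the paths are placeholders for our actual setup):

  # start a new haproxy and tell the old process(es) to finish gracefully
  haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
          -sf $(cat /var/run/haproxy.pid)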

Since we kept the process running after the reload (it still holds some
connections), I was able to run a second strace about half an hour
later, which now shows a different behavior:

[...]
epoll_wait(0, {}, 200, 4)               = 0
epoll_wait(0, {}, 200, 7)               = 0
epoll_wait(0, {}, 200, 3)               = 0
epoll_wait(0, {}, 200, 6)               = 0
epoll_wait(0, {}, 200, 3)               = 0
epoll_wait(0, {}, 200, 3)               = 0
epoll_wait(0, {}, 200, 10)              = 0
epoll_wait(0, {}, 200, 3)               = 0
epoll_wait(0, {}, 200, 27)              = 0
epoll_wait(0, {}, 200, 6)               = 0
epoll_wait(0, {}, 200, 4)               = 0
[...]

The CPU usage of the process is now back to more or less idle, without
any further intervention on the process. Note that epoll_wait is now
being called with a non-zero timeout of a few milliseconds, whereas in
the busy-looping trace above it was always called with a timeout of 0.
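
If we run into this again, I can quantify the spinning with strace's syscall counters, roughly like this (<pid> is a placeholder):

  # count epoll_wait calls; interrupt with Ctrl-C after a few seconds to get
  # the summary table -- the busy-looping process should show a huge number
  # of calls that each return 0 events
  strace -c -e trace=epoll_wait -p <pid>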

`haproxy -vv` of the process that ran into the busy loop shows:

HA-Proxy version 1.5.18 2016/05/10
Copyright 2000-2016 Willy Tarreau <wi...@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1t  3 May 2016
Running on OpenSSL version : OpenSSL 1.0.1t  3 May 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.35 2014-04-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Unfortunately, since we have rolled production back to 1.5.14, we now
have little opportunity to reproduce this. The process that showed the
behavior is still running for the time being, though.

Regards,
Holger
