Re: haproxy consuming 100% cpu - epoll loop

2017-06-01 Thread Willy Tarreau
Hi Patrick,

On Thu, Jun 01, 2017 at 10:37:45AM -0400, Patrick Hemmer wrote:
> Ok, I'm on 1.7.5 and the issue has recurred.
> 
> Here is `show sess all`:
(...)

Great, it should help debug the issue. I'm seeing two peers connections
here, an incoming one and an outgoing one. The task is not being called in
a loop; very likely one connection tried to get rid of the other one and
couldn't, for a reason I can't spot right now. But your trace will
certainly help eliminate a number of possibilities and narrow the problem
down.
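
For readers following along: a pair of connections like this (one incoming,
one outgoing) is what a peers setup for stick-table synchronization
normally maintains between two nodes. A minimal sketch of such a section,
with names and addresses that are purely illustrative:

    # hypothetical peers section; names and addresses are illustrative only
    peers mycluster
        # the peer whose name matches the local hostname (or -L) is the local node
        peer lb1 172.28.128.16:1029
        peer lb2 172.28.128.20:1029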

Thanks very much!
Willy



Re: haproxy consuming 100% cpu - epoll loop

2017-06-01 Thread Patrick Hemmer


On 2017/5/19 00:35, Willy Tarreau wrote:
> Hi Patrick,
>
> On Thu, May 18, 2017 at 05:44:30PM -0400, Patrick Hemmer wrote:
>> On 2017/1/17 17:02, Willy Tarreau wrote:
>>> Hi Patrick,
>>>
>>> On Tue, Jan 17, 2017 at 02:33:44AM +0000, Patrick Hemmer wrote:
>>>> So on one of my local development machines haproxy started pegging the
>>>> CPU at 100%.
>>>> `strace -T` on the process just shows:
>>>>
>>>> ...
>>>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>>>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>>>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>>>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>>>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>>>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>>>> ...
>>> Hmm not good.
>>>
>>>> Opening it up with gdb, the backtrace shows:
>>>>
>>>> (gdb) bt
>>>> #0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
>>>> #1  0x7f4d1a570ebc in _do_poll (p=<optimized out>, exp=-1440976915)
>>>> at src/ev_epoll.c:125
>>>> #2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
>>>> #3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097
>>> Ok so an event is not being processed correctly.
>>>
>>>> This is haproxy 1.7.0 on CentOS/7
>>> Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
>>> and 1.7.1. One of them is responsible for the few outages on haproxy.org
>>> (the last one happened today; I left it running to get the core to
>>> confirm). Another is an issue with the condition for waking up an applet
>>> after it first failed to get a buffer, and it could be what you're
>>> seeing. The others could possibly cause memory corruption resulting in
>>> almost anything.
>>>
>>> Thus I'd strongly urge you to update this machine to 1.7.2 (which I'm
>>> going to do on haproxy.org now that I could get a core). Continue to
>>> monitor it, but I'd feel much safer after this update.
>>>
>>> Thanks for your report!
>>> Willy
>>>
>> So I just had this issue recur, this time on version 1.7.2.
> OK. If it's still doing it, capturing the output of "show sess all" on
> the CLI could help a lot.
>
> I've looked at the changelog since 1.7.2, and other bugs possibly
> responsible for this have since been fixed:
>   - c691781 ("BUG/MEDIUM: stream: fix client-fin/server-fin handling")
> => fixes a cause of 100% CPU when timeout client-fin/server-fin are
>    used
>   - 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected
>     into buffers")
> => fixes the computation of free buffer space; it's unknown if we've
>    ever hit this bug. It could cause some applets like stats to fail
>    to write and to be called again immediately, for example.
>   - 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected
>     into buffers")
> => some filters might be woken up to do nothing. Compression might trigger
>    this.
>
>   - there are a bunch of polling-related fixes which might have got rid of
> such a bad situation on certain connect() cases, possibly over unix
> sockets.
>
> There are a few other fixes in the queue that I need to backport but none
> of them is related to this. Thus if you're in an emergency, 1.7.5 could help
> by bringing the fixes above. If you can wait a few more days, I expect to
> issue 1.7.6 early next week.
>
> Cheers,
> Willy

Ok, I'm on 1.7.5 and the issue has recurred.

Here is `show sess all`:

> show sess all
0x7fe603090550: [01/Jun/2017:10:36:10.616812] id=187 proto=unix_stream source=unix:1
  flags=0x88, conn_retries=0, srv_conn=(nil), pend_pos=(nil)
  frontend=GLOBAL (id=0 mode=tcp), listener=? (id=1) addr=unix:1
  backend=<NONE> (id=-1 mode=-)
  server=<none> (id=-1)
  task=0x7fe6030856d0 (state=0x20 nice=-64 calls=2 exp=44s age=13s)
  si[0]=0x7fe603090788 (state=EST flags=0x0c endp0=CONN:0x7fe603090920 exp=<NEVER>, et=0x000)
  si[1]=0x7fe6030907b0 (state=EST flags=0x4018 endp1=APPCTX:0x7fe60309e790 exp=<NEVER>, et=0x000)
  co0=0x7fe603090920 ctrl=unix_stream xprt=RAW data=STRM target=LISTENER:0x7fe602e50310
      flags=0x0020b300 fd=4 fd.state=24 fd.cache=0 updt=1
  app1=0x7fe60309e790 st0=7 st1=1 st2=3 applet=<CLI>
  req=0x7fe603090560 (f=0xc08202 an=0x0 pipe=0 tofwd=-1 total=40)
      an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
      buf=0x7fe603095b40 data=0x7fe603095b54 o=0 p=0 req.next=0 i=0 size=16384
  res=0x7fe6030905a0 (f=0x80048000 an=0x0 pipe=0 tofwd=-1 total=6)
      an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
      buf=0x7fe602f44de0 data=0x7fe602f44df4 o=0 p=0 rsp.next=0 i=0 size=16384
0x7fe603090c80: [01/Jun/2017:10:36:24.028362] id=0 proto=tcpv4
  flags=0x6, conn_retries=0, srv_conn=(nil), pend_pos=(nil)
  frontend=lb1 (id=4294967295 mode=tcp), listener=? (id=0)
  backend=<NONE> (id=-1 mode=-) addr=172.28.128.16:58246
  server=<none> (id=-1) addr=172.28.128.20:1029
  task=0x7fe603090c00 (state=0x08 nice=0 calls=2 exp=3s age=?)
  si[0]=0x7fe603090eb8 (state=EST flags=0x4048 endp0=APPCTX:0x7fe603091050 exp=<NEVER>, et=0x000)
  si[1]=0x7fe603090ee0 (state=EST fl

Re: haproxy consuming 100% cpu - epoll loop

2017-05-18 Thread Willy Tarreau
Hi Patrick,

On Thu, May 18, 2017 at 05:44:30PM -0400, Patrick Hemmer wrote:
> 
> On 2017/1/17 17:02, Willy Tarreau wrote:
> > Hi Patrick,
> >
> > On Tue, Jan 17, 2017 at 02:33:44AM +0000, Patrick Hemmer wrote:
> >> So on one of my local development machines haproxy started pegging the
> >> CPU at 100%.
> >> `strace -T` on the process just shows:
> >>
> >> ...
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> >> ...
> > Hmm not good.
> >
> >> Opening it up with gdb, the backtrace shows:
> >>
> >> (gdb) bt
> >> #0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> >> #1  0x7f4d1a570ebc in _do_poll (p=<optimized out>, exp=-1440976915)
> >> at src/ev_epoll.c:125
> >> #2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
> >> #3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097
> > Ok so an event is not being processed correctly.
> >
> >> This is haproxy 1.7.0 on CentOS/7
> > Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
> > and 1.7.1. One of them is responsible for the few outages on haproxy.org
> > (the last one happened today; I left it running to get the core to
> > confirm). Another is an issue with the condition for waking up an applet
> > after it first failed to get a buffer, and it could be what you're
> > seeing. The others could possibly cause memory corruption resulting in
> > almost anything.
> >
> > Thus I'd strongly urge you to update this machine to 1.7.2 (which I'm
> > going to do on haproxy.org now that I could get a core). Continue to
> > monitor it, but I'd feel much safer after this update.
> >
> > Thanks for your report!
> > Willy
> >
> So I just had this issue recur, this time on version 1.7.2.

OK. If it's still doing it, capturing the output of "show sess all" on
the CLI could help a lot.
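
For reference, one common way to grab that output, assuming a stats socket
is configured in the global section (the socket path below is just an
example):

    # needs e.g. "stats socket /var/run/haproxy.sock level admin" in global
    echo "show sess all" | socat stdio /var/run/haproxy.sock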

I've looked at the changelog since 1.7.2, and other bugs possibly
responsible for this have since been fixed:
  - c691781 ("BUG/MEDIUM: stream: fix client-fin/server-fin handling")
=> fixes a cause of 100% CPU when timeout client-fin/server-fin are
   used (a config sketch follows after this list)
  - 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected
    into buffers")
=> fixes the computation of free buffer space; it's unknown if we've
   ever hit this bug. It could cause some applets like stats to fail
   to write and to be called again immediately, for example.
  - 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected
    into buffers")
=> some filters might be woken up to do nothing. Compression might trigger
   this.

  - there are a bunch of polling-related fixes which might have got rid of
such a bad situation on certain connect() cases, possibly over unix
sockets.
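
Regarding the first fix: only configurations that set the optional
client-fin/server-fin timeouts were exposed to that particular 100% CPU
case. A hypothetical sketch of such a configuration:

    # hypothetical defaults section, for illustration only
    defaults
        mode tcp
        timeout connect      5s
        timeout client      30s
        timeout server      30s
        # the two optional timeouts whose handling c691781 fixes
        timeout client-fin   5s
        timeout server-fin   5s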

There are a few other fixes in the queue that I need to backport but none
of them is related to this. Thus if you're in an emergency, 1.7.5 could help
by bringing the fixes above. If you can wait a few more days, I expect to
issue 1.7.6 early next week.

Cheers,
Willy



Re: haproxy consuming 100% cpu - epoll loop

2017-05-18 Thread Patrick Hemmer

On 2017/1/17 17:02, Willy Tarreau wrote:
> Hi Patrick,
>
> On Tue, Jan 17, 2017 at 02:33:44AM +0000, Patrick Hemmer wrote:
>> So on one of my local development machines haproxy started pegging the
>> CPU at 100%.
>> `strace -T` on the process just shows:
>>
>> ...
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
>> ...
> Hmm not good.
>
>> Opening it up with gdb, the backtrace shows:
>>
>> (gdb) bt
>> #0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #1  0x7f4d1a570ebc in _do_poll (p=<optimized out>, exp=-1440976915)
>> at src/ev_epoll.c:125
>> #2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
>> #3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097
> Ok so an event is not being processed correctly.
>
>> This is haproxy 1.7.0 on CentOS/7
> Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
> and 1.7.1. One of them is responsible for the few outages on haproxy.org
> (the last one happened today; I left it running to get the core to
> confirm). Another is an issue with the condition for waking up an applet
> after it first failed to get a buffer, and it could be what you're
> seeing. The others could possibly cause memory corruption resulting in
> almost anything.
>
> Thus I'd strongly urge you to update this machine to 1.7.2 (which I'm
> going to do on haproxy.org now that I could get a core). Continue to
> monitor it, but I'd feel much safer after this update.
>
> Thanks for your report!
> Willy
>
So I just had this issue recur, this time on version 1.7.2.

-Patrick


Re: haproxy consuming 100% cpu - epoll loop

2017-01-17 Thread Willy Tarreau
Hi Patrick,

On Tue, Jan 17, 2017 at 02:33:44AM +0000, Patrick Hemmer wrote:
> So on one of my local development machines haproxy started pegging the
> CPU at 100%.
> `strace -T` on the process just shows:
> 
> ...
> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> epoll_wait(0, {}, 200, 0)   = 0 <0.03>
> ...

Hmm not good.

> Opening it up with gdb, the backtrace shows:
> 
> (gdb) bt
> #0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x7f4d1a570ebc in _do_poll (p=<optimized out>, exp=-1440976915)
> at src/ev_epoll.c:125
> #2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
> #3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097

Ok so an event is not being processed correctly.

> This is haproxy 1.7.0 on CentOS/7

Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
and 1.7.1. One of them is responsible for the few outages on haproxy.org
(the last one happened today; I left it running to get the core to
confirm). Another is an issue with the condition for waking up an applet
after it first failed to get a buffer, and it could be what you're
seeing. The others could possibly cause memory corruption resulting in
almost anything.

Thus I'd strongly urge you to update this machine to 1.7.2 (which I'm
going to do on haproxy.org now that I could get a core). Continue to
monitor it, but I'd feel much safer after this update.
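
As a quick sanity check after upgrading, the running binary's version can be
confirmed with something like:

    haproxy -v    # or haproxy -vv for the full build options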

Thanks for your report!
Willy



haproxy consuming 100% cpu - epoll loop

2017-01-16 Thread Patrick Hemmer
So on one of my local development machines haproxy started pegging the
CPU at 100%.
`strace -T` on the process just shows:

...
epoll_wait(0, {}, 200, 0)   = 0 <0.03>
epoll_wait(0, {}, 200, 0)   = 0 <0.03>
epoll_wait(0, {}, 200, 0)   = 0 <0.03>
epoll_wait(0, {}, 200, 0)   = 0 <0.03>
epoll_wait(0, {}, 200, 0)   = 0 <0.03>
epoll_wait(0, {}, 200, 0)   = 0 <0.03>
...
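
(Reading the arguments: epfd=0, an empty returned-events set, maxevents=200,
timeout=0. A zero timeout returning zero events over and over means the
process spins in its poll loop instead of sleeping.) A trace like this is
typically captured by attaching to the live process, for example (the PID
lookup is illustrative):

    # time each syscall (-T) and show only epoll_wait on the running process
    strace -T -e trace=epoll_wait -p "$(pidof haproxy)"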

Opening it up with gdb, the backtrace shows:

(gdb) bt
#0  0x7f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x7f4d1a570ebc in _do_poll (p=<optimized out>, exp=-1440976915)
at src/ev_epoll.c:125
#2  0x7f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
#3  0x7f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2097
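
The same backtrace can be captured non-interactively from the live process
with something along the lines of:

    # attach, dump the stack, then detach automatically
    gdb -p "$(pidof haproxy)" -batch -ex bt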

This is haproxy 1.7.0 on CentOS/7


Unfortunately I'm not sure what triggered it.

-Patrick