Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Robin H. Johnson
On Thu, Apr 15, 2021 at 07:53:15PM +0000, Robin H. Johnson wrote:
> But your thought of CPU pinning was good.
> I went to confirm it in the host, and I'm not certain if the cpu-map is 
> working
> right.
Ignore me, long day and I didn't think to check each thread PID:

# ps -e -T | grep haproxy -w |awk '{print $2}' |xargs -n1 taskset -pc
pid 120689's current affinity list: 0-127
pid 120702's current affinity list: 0
pid 120706's current affinity list: 1
..
pid 120776's current affinity list: 63

-- 
Robin Hugh Johnson
E-Mail : robb...@orbis-terrarum.net
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85




Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Robin H. Johnson
On Thu, Apr 15, 2021 at 09:23:07AM +0200, Willy Tarreau wrote:
> On Thu, Apr 15, 2021 at 07:13:53AM +0000, Robin H. Johnson wrote:
> > Thanks; I will need to catch it faster or automate this, because the
> > watchdog does a MUCH better job restarting it than before, less than 30
> > seconds of 100% CPU before the watchdog reliably kills it.
> I see. Then collecting the watchdog outputs can be instructive to see
> if it always happens at the same place or not. And the core dumps will
> indicate what all threads were doing (and if some were competing on a
> lock for example).
The truncation in the log output for the crash was interesting; I couldn't see
why it was being cut off. I wish we could get a clean reproduction in our
testing environment, because the production core dumps absolutely have customer
data in them.

> > Varnish runs on the same host and is used to cache some of the backends.
> > Plenty of free memory at the moment.
> 
> I'm now thinking about something. Do you have at least as many CPUs as the
> total number of threads used by haproxy and varnish ? Otherwise there will
> be some competition and migrations will happen. If neither is bounded, you
> can even end up with two haproxy threads forced to run on the same CPU,
> which is the worst situation as one could be scheduled out with a lock
> held and the other one spinning waiting for this lock.
Single socket, AMD EPYC 7702P 64-Core Processor, 128 threads!
Shows as single NUMA node in our present configuration.
Hopefully the kernel is mostly doing the right thing, but read on.

HAProxy already pinned to the first 64 threads with:
cpu-map 1/1 0
...
cpu-map 1/64 63
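(For reference, the same mapping can be written in one line with the "auto:"
prefix supported by these versions; this is a sketch assuming the usual
one-process, 64-thread layout, not necessarily the exact config in use here:

  # global section of haproxy.cfg: bind thread N to CPU N-1
  nbthread 64
  cpu-map auto:1/1-64 0-63

Either form should give each worker thread a single-CPU affinity, which is what
the per-thread taskset check earlier in this thread confirms.)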

Varnish isn't explicitly pinned right now, but uses less than 200% CPU
overall (we know most requests aren't cacheable, so they don't get routed to
Varnish at all).

But your thought of CPU pinning was good.
I went to confirm it in the host, and I'm not certain if the cpu-map is working
right.

# pid_haproxy_leader=68839 ; pid_haproxy_follower=68848 
# taskset -pc $pid_haproxy_leader
pid 68839's current affinity list: 0-127
# taskset -pc $pid_haproxy_follower
pid 68848's current affinity list: 0

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136




Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Willy Tarreau
On Thu, Apr 15, 2021 at 07:13:53AM +0000, Robin H. Johnson wrote:
> Thanks; I will need to catch it faster or automate this, because the
> watchdog does a MUCH better job restarting it than before, less than 30
> seconds of 100% CPU before the watchdog reliably kills it.

I see. Then collecting the watchdog outputs can be instructive to see
if it always happens at the same place or not. And the core dumps will
indicate what all threads were doing (and if some were competing on a
lock for example).
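(A minimal sketch of making sure the watchdog crash actually leaves a core
behind; the paths and the systemd detail are assumptions, not taken from this
setup:

  # for a manually started process
  ulimit -c unlimited
  # write cores somewhere with enough space
  sysctl -w kernel.core_pattern='/var/crash/core.%e.%p'
  # under systemd, set the limit in the haproxy unit instead:
  #   [Service]
  #   LimitCORE=infinity
)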

> >   - please also check if your machine is not swapping, as this can have
> > terrible consequences and could explain why it only happens on certain
> > machines dealing with certain workloads. I remember having spent several
> > weeks many years ago chasing a response time issue happening only in the
> > morning, which was in fact caused by the log upload batch having swapped
> > and left a good part of the unused memory in the swap, making it very
> > difficult for the network stack to allocate buffers during send() and
> > recv(), thus causing losses and retransmits as the load grew. This was
> > never reproducible in the lab because we were not syncing logs :-)
> 512GiB RAM and no swap configured on the system at all.

:-)

> Varnish runs on the same host and is used to cache some of the backends.
> Plenty of free memory at the moment.

I'm now thinking about something. Do you have at least as many CPUs as the
total number of threads used by haproxy and varnish ? Otherwise there will
be some competition and migrations will happen. If neither is bounded, you
can even end up with two haproxy threads forced to run on the same CPU,
which is the worst situation as one could be scheduled out with a lock
held and the other one spinning waiting for this lock.

With half a TB of RAM I guess you have multiple sockets. Could you at
least pin haproxy to the CPUs of a single socket (that's the bare minimum
to do to preserve performance, as atomics and locks over UPI/QPI are a
disaster), and ideally pin varnish to another socket ?

Or maybe just enable fewer threads for haproxy if you don't need that many
and make sure the CPUs it's bound to are not used by varnish ?

In such a setup combining several high-performance processes, it's really
important to reserve resources for them, and you must also account for the
CPUs needed to deal with network interrupts (and likely for disk if varnish
uses it). Once your resources are cleanly reserved, you'll get the maximum
performance with the lowest latency.
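(A sketch of the non-haproxy side of such a reservation; the CPU ranges, unit
names and IRQ numbers below are purely illustrative:

  # keep varnish off the haproxy CPUs, e.g. in its systemd unit:
  #   [Service]
  #   CPUAffinity=64-95
  # or on an already running varnishd, for all of its threads:
  taskset -acp 64-95 <varnishd pid>
  # and steer NIC interrupts onto the remaining CPUs:
  echo 96-127 > /proc/irq/<irq number>/smp_affinity_list
)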

Willy



Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Robin H. Johnson
On Thu, Apr 15, 2021 at 08:59:35AM +0200, Willy Tarreau wrote:
> On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > > nbthread=64, nbproc=1 on both 1.8/2.x
> > 
> > It is thus surprising, if it is really a contention issue, that you never
> > observed slow down on the 1.8. There is no watchdog, but the thread
> > implementation is a bit awkward on the 1.8. 2.X are better on this point,
> > the best being the 2.4.
> 
> Agreed, I'd even say that 64 threads in 1.8 should be way slower than
> a single thread.
> 
> A few things are worth having a look at, Robin:
>   - please run "perf top" when this happens, and when you see a function
> reporting a high usage, do not hesitate to navigate through it, pressing
> enter, and "annotate function ". Then scrolling through it will
> reveal the percentage of time spent at certain points. It's very possible
> you'll find that 100% of the CPU is used in 1, 2 or 3 functions and
> that there is a logic error somewhere. Usually if you find a single one
> you'll end up spotting a lock.
Thanks; I will need to catch it faster or automate this, because the
watchdog does a MUCH better job restarting it than before, less than 30
seconds of 100% CPU before the watchdog reliably kills it.
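(A rough sketch of the kind of automation meant here, watching for a thread
stuck near 100% and grabbing a profile plus backtraces before the watchdog
fires; the threshold, output paths and single-process assumption are mine:

  #!/bin/sh
  pid=$(pidof haproxy | awk '{print $1}')
  while :; do
      before=$(awk '{print FILENAME, $14 + $15}' /proc/"$pid"/task/*/stat)
      sleep 1
      after=$(awk '{print FILENAME, $14 + $15}' /proc/"$pid"/task/*/stat)
      # pick a thread that burned >= 90 ticks (~90% of one core) in that second
      tid=$(printf '%s\n%s\n' "$before" "$after" | awk '
          { split($1, p, "/"); t = p[5] }
          (t in prev) && ($2 - prev[t] >= 90) { print t; exit }
          { prev[t] = $2 }')
      [ -z "$tid" ] && continue
      perf record -t "$tid" -o /tmp/haproxy-spin.perf.data sleep 5
      gdb -p "$pid" -batch -ex 'thread apply all bt' > /tmp/haproxy-spin.bt 2>&1
      break
  done
)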

>   - please also check if your machine is not swapping, as this can have
> terrible consequences and could explain why it only happens on certain
> machines dealing with certain workloads. I remember having spent several
> weeks many years ago chasing a response time issue happening only in the
> morning, which was in fact caused by the log upload batch having swapped
> and left a good part of the unused memory in the swap, making it very
> difficult for the network stack to allocate buffers during send() and
> recv(), thus causing losses and retransmits as the load grew. This was
> never reproducible in the lab because we were not syncing logs :-)
512GiB RAM and no swap configured on the system at all.
Varnish runs on the same host and is used to cache some of the backends.
Plenty of free memory at the moment.

-- 
Robin Hugh Johnson
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85




Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-15 Thread Willy Tarreau
On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > nbthread=64, nbproc=1 on both 1.8/2.x
> 
> It is thus surprising, if it is really a contention issue, that you never
> observed slow down on the 1.8. There is no watchdog, but the thread
> implementation is a bit awkward on the 1.8. 2.X are better on this point,
> the best being the 2.4.

Agreed, I'd even say that 64 threads in 1.8 should be way slower than
a single thread.

A few things are worth having a look at, Robin:
  - please run "perf top" when this happens, and when you see a function
reporting a high usage, do not hesitate to navigate through it, pressing
enter, and "annotate function ". Then scrolling through it will
reveal the percentage of time spent at certain points. It's very possible
you'll find that 100% of the CPU is used in 1, 2 or 3 functions and
that there is a logic error somewhere. Usually if you find a single one
you'll end up spotting a lock.

  - please also check if your machine is not swapping, as this can have
terrible consequences and could explain why it only happens on certain
machines dealing with certain workloads. I remember having spent several
weeks many years ago chasing a response time issue happening only in the
morning, which was in fact caused by the log upload batch having swapped
and left a good part of the unused memory in the swap, making it very
difficult for the network stack to allocate buffers during send() and
recv(), thus causing losses and retransmits as the load grew. This was
never reproducible in the lab because we were not syncing logs :-)
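(Concrete starting points for both items above; everything below is generic
rather than anything specific to this setup:

  # 1) live profile of the spinning process; move to the hot symbol, press
  #    Enter and choose "Annotate" to see where inside it the time goes
  perf top -p "$(pidof haproxy | awk '{print $1}')"

  # 2) quick swap sanity checks
  free -h            # swap line shows 0B if no swap is configured
  swapon --show      # empty output means no active swap device or file
  vmstat 1 5         # non-zero si/so columns would indicate active swapping
)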

Willy



Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-14 Thread Christopher Faulet

On 10/04/2021 at 00:34, Robin H. Johnson wrote:

On Fri, Apr 09, 2021 at 10:14:26PM +0200, Christopher Faulet wrote:

It seems you have a blocking call in one of your lua scripts. The threads dump
shows many threads blocked in hlua_ctx_init. Many others are executing lua.
Unfortunately, for an unknown reason, there is no stack traceback.

All of our Lua is string handling. Parsing headers, and then building
TXN variables as well as a set of applets that return responses in cases
where we don't want to go to a backend server as the response is simple
enough to generate inside the LB.


For the 2.3 and prior, the lua scripts are executed under a global lock. Thus
blocking calls in a lua script are awful, because they do not block only one
thread but all the others too. I guess the same issue exists on the 1.8, but there
is no watchdog on this version. Thus, from time to time HAProxy hangs and may report
huge latencies but, in the end, it recovers and continues to process data. It is
exactly the purpose of the watchdog, reporting hidden bugs related to spinning
loops and deadlocks.

Nothing in this Lua code should have blocking calls at all.
The Lua code has zero calls to external services, no sockets, no sleeps,
no print, no Map.new (single call in the Lua startup, not inside any
applet or fetch), no usage of other packages, no file handling, no other
IO.

I'm hoping I can get $work to agree to fully open-source the Lua, so you
can see this fact and review the code to confirm that it SHOULD be
non-blocking.


I trust you on this point. If you are not using external components, it should 
indeed be ok. So, it is probably a contention issue on the global Lua lock. If 
you are able to generate and inspect a core file, it should help you to figure 
out what really happens.




However, I may be wrong. It may be just a contention problem because you are
executing lua with 64 threads and a huge workload. In this case, you may give a
try to the 2.4 (under development). There is a way to have a separate lua
context for each thread loading the scripts with "lua-load-per-thread"
directive. Out of curiosity, on the 1.8, are you running HAProxy with several
threads or are you spawning several processes?

nbthread=64, nbproc=1 on both 1.8/2.x


It is thus surprising, if it is really a contention issue, that you never 
observed slow down on the 1.8. There is no watchdog, but the thread 
implementation is a bit awkward on the 1.8. 2.X are better on this point, the 
best being the 2.4.



Yes, we're hoping to try 2.4.x, just working on some parts to get there.



--
Christopher Faulet



Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Robin H. Johnson
On Fri, Apr 09, 2021 at 10:14:26PM +0200, Christopher Faulet wrote:
> It seems you have a blocking call in one of your lua scripts. The threads dump
> shows many threads blocked in hlua_ctx_init. Many others are executing lua.
> Unfortunately, for an unknown reason, there is no stack traceback.
All of our Lua is string handling. Parsing headers, and then building
TXN variables as well as a set of applets that return responses in cases
where we don't want to go to a backend server as the response is simple
enough to generate inside the LB.

> For the 2.3 and prior, the lua scripts are executed under a global lock. Thus
> blocking calls in a lua script are awful, because they do not block only one
> thread but all the others too. I guess the same issue exists on the 1.8, but
> there is no watchdog on this version. Thus, from time to time HAProxy hangs
> and may report huge latencies but, in the end, it recovers and continues to
> process data. It is exactly the purpose of the watchdog, reporting hidden
> bugs related to spinning loops and deadlocks.
Nothing in this Lua code should have blocking calls at all.
The Lua code has zero calls to external services, no sockets, no sleeps,
no print, no Map.new (single call in the Lua startup, not inside any
applet or fetch), no usage of other packages, no file handling, no other
IO.

I'm hoping I can get $work to agree to fully open-source the Lua, so you
can see this fact and review the code to confirm that it SHOULD be
non-blocking.

> However, I may be wrong. It may be just a contention problem because you are
> executing lua with 64 threads and a huge workload. In this case, you may give
> a try to the 2.4 (under development). There is a way to have a separate lua
> context for each thread loading the scripts with the "lua-load-per-thread"
> directive. Out of curiosity, on the 1.8, are you running HAProxy with several
> threads or are you spawning several processes?
nbthread=64, nbproc=1 on both 1.8/2.x

Yes, we're hoping to try 2.4.x, just working on some parts to get there.

-- 
Robin Hugh Johnson
E-Mail : robb...@orbis-terrarum.net
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85




Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Christopher Faulet

On 09/04/2021 at 19:26, Robin H. Johnson wrote:

Hi,

Maciej had said they were going to create a new thread, but I didn't see
one yet.

I want to start by noting problem was much worse on 2.2.8 & 2.2.9, and
that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
thanks for that initial work in fixing the issue.

As I mentioned in my other mail asking for a 1.8.30 release, we're
experiencing this problem in DigitalOcean's HAProxy instances used to
run the Spaces product.

I've been trying to dig out deeper detail as well with a debug threads
version, but I have the baseline error output from 2.3.9 here to share,
after passing redaction review. This dump was generated with vbernat's
PPA version of 2.3.9, not any internal builds.

We have struggled to reproduce the problem in testing environments; it
only turns up in the biggest regions, and plotting occurrences of the
issue over time suggests that it might have some partial
correlation w/ a weird workload input.

The dumps do suggest Lua is implicated as well, and we've got some
extensive Lua code, so it's impossible to rule it out as contributing to
the problem (We have been discussing plans to move it to SPOA instead).

The Lua code in question hasn't changed significantly in nearly 6
months, and it was problem-free on the 1.8 series (having a test suite
for the Lua code has been invaluable).



Hi,

It seems you have a blocking call in one of your lua scripts. The threads dump
shows many threads blocked in hlua_ctx_init. Many others are executing lua.
Unfortunately, for an unknown reason, there is no stack traceback.


For the 2.3 and prior, the lua scripts are executed under a global lock. Thus
blocking calls in a lua script are awful, because they do not block only one
thread but all the others too. I guess the same issue exists on the 1.8, but there
is no watchdog on this version. Thus, from time to time HAProxy hangs and may report
huge latencies but, in the end, it recovers and continues to process data. It is
exactly the purpose of the watchdog, reporting hidden bugs related to spinning
loops and deadlocks.


However, I may be wrong. It may be just a contention problem because you are
executing lua with 64 threads and a huge workload. In this case, you may give a
try to the 2.4 (under development). There is a way to have a separate lua
context for each thread loading the scripts with the "lua-load-per-thread"
directive. Out of curiosity, on the 1.8, are you running HAProxy with several
threads or are you spawning several processes?
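(For reference, a minimal sketch of the 2.4 directive mentioned above; the
script path is an assumption:

  global
      nbthread 64
      # one independent Lua state per thread, so the scripts no longer contend
      # on the single global Lua lock
      lua-load-per-thread /etc/haproxy/scripts/headers.lua
      # historical form, for comparison: one shared state under a global lock
      # lua-load /etc/haproxy/scripts/headers.lua

One practical difference to keep in mind: anything the script sets up at load
time, a Map.new for instance, then exists once per thread instead of being
shared.)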


--
Christopher Faulet



Re: Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Maciej Zdeb
Hi Robin,

On Fri, 9.04.2021 at 19:26, Robin H. Johnson wrote:

> Maciej had said they were going to create a new thread, but I didn't see
> one yet.
>
> I want to start by noting problem was much worse on 2.2.8 & 2.2.9, and
> that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
> thanks for that initial work in fixing the issue.
>

After Christopher's patch using another function (that does not allocate
memory) to dump the LUA stack trace, the situation is much better. It used 100%
cpu only once and on only one thread; however, HAProxy didn't kill itself like
in your situation. Since then the issue has not occurred again, so I don't have
anything useful to share. :(

I'm using HAProxy 2.2.11 but will upgrade to 2.3 next week; I'll report if
anything interesting pops up.


> We have struggled to reproduce the problem in testing environments, it
> only turns up at the biggest regions, and plotting occurances of the
> issue over the time dimension suggest that it might have some partial
> correlation w/ a weird workload input.
>

On our side the issue occurred on a machine with an unusual workload: hundreds
of thousands of tcp connections and very few http requests/responses. We also
use lua.

Kind regards,
Maciej


Still 100% CPU usage in 2.3.9 & 2.2.13 (Was: Re: [2.2.9] 100% CPU usage)

2021-04-09 Thread Robin H. Johnson
Hi,

Maciej had said they were going to create a new thread, but I didn't see
one yet.

I want to start by noting that the problem was much worse on 2.2.8 & 2.2.9, and
that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore: a big
thanks for that initial work in fixing the issue.

As I mentioned in my other mail asking for a 1.8.30 release, we're
experiencing this problem in DigitalOcean's HAProxy instances used to
run the Spaces product.

I've been trying to dig out deeper detail as well with a debug threads
version, but I have the baseline error output from 2.3.9 here to share,
after passing redaction review. This dump was generated with vbernat's
PPA version of 2.3.9, not any internal builds.

We have struggled to reproduce the problem in testing environments; it
only turns up in the biggest regions, and plotting occurrences of the
issue over time suggests that it might have some partial
correlation w/ a weird workload input.

The dumps do suggest Lua is implicated as well, and we've got some
extensive Lua code, so it's impossible to rule it out as contributing to
the problem (We have been discussing plans to move it to SPOA instead).

The Lua code in question hasn't changed significantly in nearly 6
months, and it was problem-free on the 1.8 series (having a test suite
for the Lua code has been invaluable).

-- 
Robin Hugh Johnson
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85


20210409_haproxy-crashlogs.REDACTED.txt.gz
Description: Binary data




Re: [2.2.9] 100% CPU usage

2021-04-02 Thread Maciej Zdeb
Hi Christopher,

Yes I know, my issues are always pretty weird. ;) Of course it's not
reproducible. :(

I'll try to collect more data and return to you. I will start a new thread
to not mix those two cases.

Kind regards,

On Fri, 2 Apr 2021 at 10:13, Christopher Faulet wrote:

> On 31/03/2021 at 13:28, Maciej Zdeb wrote:
> > Hi,
> >
> > Well it's a bit better situation than earlier because only one thread is
> looping
> > forever and the rest is working properly. I've tried to verify where
> exactly the
> > thread looped but doing "n" in gdb fixed the problem :( After quitting
> gdb
> > session all threads were idle. Before I started gdb it looped about 3h
> not
> > serving any traffic, because I've put it into maintenance as soon as I
> observed
> > abnormal cpu usage.
> >
>
> Hi Maciej,
>
> I'm puzzled. It seems to be a different issue than the first one. You
> reported
> an issue during reloads, on the old processes. There was a loop because of
> a
> deadlock, but the traces showed the watchdog was fired, probably because
> of the
> lua (this must be confirm though).
>
> Here, it is a loop on a living process, right ? Reading the trace, it is
> for now
> a bit hard to figure out what happens. If it is reproducible, you may try
> to use
> "perf top" command, selecting the right thread ID with "-t" argument.
>
> In addition, if the loop always falls in the H2 multiplexer, it could be
> good to
> print the h2C structure to have more info on its internal state.
>
> And to be complete, the output of "show activity", "show fd" and "show
> sess all"
> may help. Because it still loops with no traffic, it should be small.
> "show
> threads" may be good, but HAProxy should be compiled with
> USE_THREAD_DUMP=1 to
> have an advanced dump.
>
> Sorry to ask you so much work, it is pretty hard to track this kind of bug.
>
> Thanks !
> --
> Christopher Faulet
>


Re: [2.2.9] 100% CPU usage

2021-04-02 Thread Christopher Faulet

On 31/03/2021 at 13:28, Maciej Zdeb wrote:

Hi,

Well it's a bit better situation than earlier because only one thread is looping 
forever and the rest is working properly. I've tried to verify where exactly the 
thread looped but doing "n" in gdb fixed the problem :( After quitting gdb 
session all threads were idle. Before I started gdb it looped about 3h not 
serving any traffic, because I've put it into maintenance as soon as I observed 
abnormal cpu usage.




Hi Maciej,

I'm puzzled. It seems to be a different issue than the first one. You reported 
an issue during reloads, on the old processes. There was a loop because of a 
deadlock, but the traces showed the watchdog was fired, probably because of the 
lua (this must be confirmed though).


Here, it is a loop on a living process, right ? Reading the trace, it is for now 
a bit hard to figure out what happens. If it is reproducible, you may try to use 
"perf top" command, selecting the right thread ID with "-t" argument.

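(For example, something along these lines; the way of picking the TID is just
one option, and note that the %CPU column is averaged over the thread's
lifetime:

  # list the haproxy threads sorted by CPU usage, busiest first
  ps -Lo tid,pcpu -C haproxy --no-headers | sort -k2 -nr | head
  # then profile only the looping thread
  perf top -t <tid from the first line>
)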

In addition, if the loop always falls in the H2 multiplexer, it could be good to 
print the h2C structure to have more info on its internal state.


And to be complete, the output of "show activity", "show fd" and "show sess all" 
may help. Because it still loops with no traffic, it should be small. "show 
threads" may be good, but HAProxy should be compiled with USE_THREAD_DUMP=1 to 
have an advanced dump.
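(Those outputs can be collected in one go from the stats socket; the socket
path below is an assumption:

  for cmd in "show activity" "show fd" "show sess all" "show threads"; do
      printf '### %s\n' "$cmd"
      echo "$cmd" | socat stdio /var/run/haproxy.sock
  done > haproxy-state.txt
)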


Sorry to ask you so much work, it is pretty hard to track this kind of bug.

Thanks !
--
Christopher Faulet



Re: [2.2.9] 100% CPU usage

2021-03-31 Thread Maciej Zdeb
I forgot to mention that the backtrace is from 2.2.11, built from
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=601704962bc9d82b3b1cc97d90d2763db0ae4479

On Wed, 31 Mar 2021 at 13:28, Maciej Zdeb wrote:

> Hi,
>
> Well it's a bit better situation than earlier because only one thread is
> looping forever and the rest is working properly. I've tried to verify
> where exactly the thread looped but doing "n" in gdb fixed the problem :(
> After quitting gdb session all threads were idle. Before I started gdb it
> looped about 3h not serving any traffic, because I've put it into
> maintenance as soon as I observed abnormal cpu usage.
>
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> 0x7f2cf0df6a47 in epoll_wait (epfd=3, events=0x55d7aaa04920,
> maxevents=200, timeout=timeout@entry=39) at
> ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> 30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
> (gdb) thread 11
> [Switching to thread 11 (Thread 0x7f2c3c53d700 (LWP 20608))]
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> 149 if (unlikely(src->state != TRACE_STATE_STOPPED))
> (gdb) bt
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> #1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
> head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
> #2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
> src/mux_h2.c:3329
> #3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
> #4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
> src/mux_h2.c:3624
> #5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
> status=) at src/mux_h2.c:3583
> #6  0x55d7a4381f62 in run_tasks_from_lists 
> (budgets=budgets@entry=0x7f2c3c51a35c)
> at src/task.c:454
> #7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
> #8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
> #9  0x55d7a4339819 in run_thread_poll_loop (data=) at
> src/haproxy.c:3107
> #10 0x7f2cf1e606db in start_thread (arg=0x7f2c3c53d700) at
> pthread_create.c:463
> #11 0x7f2cf0df671f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> (gdb) bt full
> #0  trace (msg=..., cb=, a4=, a3= out>, a2=, a1=, func=,
> where=..., src=, mask=,
> level=) at include/haproxy/trace.h:149
> No locals.
> #1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
> head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
> h2s = 
> h2s_back = 
> __FUNCTION__ = "h2_resume_each_sending_h2s"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
> src/mux_h2.c:3329
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
> flags = 
> released = 
> buf = 
> conn = 0x7f2bf658b8d0
> done = 0
> sent = 0
> __FUNCTION__ = "h2_send"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> ---Type  to continue, or q  to quit---
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
> src/mux_h2.c:3624
> conn = 0x7f2bf658b8d0
> __FUNCTION__ = "h2_process"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
> status=) at src/mux_h2.c:3583
> conn = 0x7f2bf658b8d0
> tl = 
> conn_in_list = 0
> h2c = 0x7f2c18dca740
> ret = 
> __FUNCTION__ = "h2_io_cb"
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> __x = 
> __l = 
> #6  0x55d7a4381f62 in run_tasks_from_lists 
> (budgets=budgets@entry=0x7f2c3c51a35c)
> at src/task.c:454
> process = 
> tl_queues = 
> t = 0x7f2c0d3fa1c0
> budget_mask = 7 '\a'
> done = 
> queue = 
> state = 
> ---Type  to continue, or q  to quit---
> ctx = 
> __ret = 
> __n = 
> __p = 
> #7  0x55d7a438282d in process_runnable_tasks () at 

Re: [2.2.9] 100% CPU usage

2021-03-31 Thread Maciej Zdeb
Hi,

Well it's a bit better situation than earlier because only one thread is
looping forever and the rest is working properly. I've tried to verify
where exactly the thread looped but doing "n" in gdb fixed the problem :(
After quitting gdb session all threads were idle. Before I started gdb it
looped about 3h not serving any traffic, because I've put it into
maintenance as soon as I observed abnormal cpu usage.

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7f2cf0df6a47 in epoll_wait (epfd=3, events=0x55d7aaa04920,
maxevents=200, timeout=timeout@entry=39) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) thread 11
[Switching to thread 11 (Thread 0x7f2c3c53d700 (LWP 20608))]
#0  trace (msg=..., cb=, a4=, a3=, a2=, a1=, func=,
where=..., src=, mask=,
level=) at include/haproxy/trace.h:149
149 if (unlikely(src->state != TRACE_STATE_STOPPED))
(gdb) bt
#0  trace (msg=..., cb=, a4=, a3=, a2=, a1=, func=,
where=..., src=, mask=,
level=) at include/haproxy/trace.h:149
#1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
#2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
src/mux_h2.c:3329
#3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
#4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
src/mux_h2.c:3624
#5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
status=) at src/mux_h2.c:3583
#6  0x55d7a4381f62 in run_tasks_from_lists
(budgets=budgets@entry=0x7f2c3c51a35c)
at src/task.c:454
#7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
#8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
#9  0x55d7a4339819 in run_thread_poll_loop (data=) at
src/haproxy.c:3107
#10 0x7f2cf1e606db in start_thread (arg=0x7f2c3c53d700) at
pthread_create.c:463
#11 0x7f2cf0df671f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) bt full
#0  trace (msg=..., cb=, a4=, a3=, a2=, a1=, func=,
where=..., src=, mask=,
level=) at include/haproxy/trace.h:149
No locals.
#1  h2_resume_each_sending_h2s (h2c=h2c@entry=0x7f2c18dca740,
head=head@entry=0x7f2c18dcabf8) at src/mux_h2.c:3255
h2s = 
h2s_back = 
__FUNCTION__ = "h2_resume_each_sending_h2s"
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#2  0x55d7a426c8e2 in h2_process_mux (h2c=0x7f2c18dca740) at
src/mux_h2.c:3329
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#3  h2_send (h2c=h2c@entry=0x7f2c18dca740) at src/mux_h2.c:3479
flags = 
released = 
buf = 
conn = 0x7f2bf658b8d0
done = 0
sent = 0
__FUNCTION__ = "h2_send"
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
---Type  to continue, or q  to quit---
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#4  0x55d7a42734bd in h2_process (h2c=h2c@entry=0x7f2c18dca740) at
src/mux_h2.c:3624
conn = 0x7f2bf658b8d0
__FUNCTION__ = "h2_process"
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#5  0x55d7a4276678 in h2_io_cb (t=, ctx=0x7f2c18dca740,
status=) at src/mux_h2.c:3583
conn = 0x7f2bf658b8d0
tl = 
conn_in_list = 0
h2c = 0x7f2c18dca740
ret = 
__FUNCTION__ = "h2_io_cb"
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
__x = 
__l = 
#6  0x55d7a4381f62 in run_tasks_from_lists
(budgets=budgets@entry=0x7f2c3c51a35c)
at src/task.c:454
process = 
tl_queues = 
t = 0x7f2c0d3fa1c0
budget_mask = 7 '\a'
done = 
queue = 
state = 
---Type  to continue, or q  to quit---
ctx = 
__ret = 
__n = 
__p = 
#7  0x55d7a438282d in process_runnable_tasks () at src/task.c:679
tt = 0x55d7a47a6d00 
lrq = 
grq = 
t = 
max = {0, 0, 141}
max_total = 
tmp_list = 
queue = 3
max_processed = 
#8  0x55d7a4339467 in run_poll_loop () at src/haproxy.c:2942
next = 
wake = 
#9  0x55d7a4339819 in run_thread_poll_loop (data=) at
src/haproxy.c:3107
ptaf = 
ptif = 
ptdf = 
ptff = 
init_left = 0
init_mutex = pthread_mutex_t = {Type = Normal, Status = Not
acquired, Robust 

Re: [2.2.9] 100% CPU usage

2021-03-25 Thread Christopher Faulet

On 25/03/2021 at 13:38, Maciej Zdeb wrote:

Hi,

I deployed a patched (with volatile hlua_not_dumpable) HAProxy and so far so 
good, no looping. Christopher I saw new patches with hlua_traceback used 
instead, looks much cleaner to me, should I verify them instead? :)


Christopher & Willy I've forgotten to thank you for help!

Yes please, try the last 2.2 snapshot. It is really a better way to fix this 
issue because the Lua traceback is never ignored. And it is really safer to not 
allocate memory in the debugger.


So now, we should be able to figure out why the Lua fires the watchdog. Because, 
under the hood, it is the true issue :)


--
Christopher Faulet



Re: [2.2.9] 100% CPU usage

2021-03-25 Thread Maciej Zdeb
Hi,

I deployed a patched (with volatile hlua_not_dumpable) HAProxy and so far
so good, no looping. Christopher I saw new patches with hlua_traceback used
instead, looks much cleaner to me, should I verify them instead? :)

Christopher & Willy I've forgotten to thank you for help!

Kind regards,

On Wed, 24 Mar 2021 at 10:51, Maciej Zdeb wrote:

> On Wed, 24 Mar 2021 at 10:37, Christopher Faulet wrote:
>
>> However, reading the other trace Maciej sent (bussy_thread_peers.txt), it
>> seems
>> possible to stop a memory allocation from other places. Thus, I guess we
>> must
>> find a more global way to prevent the lua stack dump.
>>
>
> I'm not sure which part of trace you're referring to but I need to clarify
> that both "bussy_thread_peers.txt" and "free_thread_spoe_lua.txt" occurred
> at the same time but on different threads (free_thread_spoe_lua on thread
> 10 and bussy_thread_peers on other threads). So If I understand it
> correctly thread 10 locked itself and other threads looping to acquire lock.
>


Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Maciej Zdeb
On Wed, 24 Mar 2021 at 10:37, Christopher Faulet wrote:

> However, reading the other trace Maciej sent (bussy_thread_peers.txt), it
> seems
> possible to stop a memory allocation from other places. Thus, I guess we
> must
> find a more global way to prevent the lua stack dump.
>

I'm not sure which part of the trace you're referring to, but I need to clarify
that both "bussy_thread_peers.txt" and "free_thread_spoe_lua.txt" occurred
at the same time but on different threads (free_thread_spoe_lua on thread
10 and bussy_thread_peers on the other threads). So if I understand it
correctly, thread 10 locked itself and the other threads are looping to acquire the lock.


Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Christopher Faulet

On 24/03/2021 at 10:16, Willy Tarreau wrote:

On Wed, Mar 24, 2021 at 10:11:19AM +0100, Maciej Zdeb wrote:

Wow, that's it! :)

0x55d94949e965 <+53>: addl   $0x1,%fs:0xfffdd688
0x55d94949e96e <+62>: callq  0x55d9494782c0 
0x55d94949e973 <+67>: subl   $0x1,%fs:0xfffdd688
[...]
0x55d94949e99f <+111>: ja 0x55d94949e9b0 
0x55d94949e9a1 <+113>: mov%rbx,%rdi
0x55d94949e9a4 <+116>: callq  0x55d949477d50 
0x55d94949e9a9 <+121>: test   %rax,%rax
[...]
0x55d94949e9c1 <+145>: addl   $0x1,%fs:0xfffdd688
0x55d94949e9ca <+154>: callq  0x55d949477b50 
0x55d94949e9cf <+159>: subl   $0x1,%fs:0xfffdd688



Cool!


Ok I'll make hlua_not_dumpable volatile instead of applying compiler
barriers.


Yes, it's safer for the long term as nobody will have to think about
it anymore. We'll have to integrate this one as well.


Thanks Guys, I will push a fix to make the variable volatile.

However, reading the other trace Maciej sent (bussy_thread_peers.txt), it seems 
possible to stop a memory allocation from other places. Thus, I guess we must 
find a more global way to prevent the lua stack dump.


--
Christopher Faulet



Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Willy Tarreau
On Wed, Mar 24, 2021 at 10:11:19AM +0100, Maciej Zdeb wrote:
> Wow, that's it! :)
> 
>0x55d94949e965 <+53>: addl   $0x1,%fs:0xfffdd688
>0x55d94949e96e <+62>: callq  0x55d9494782c0 
>0x55d94949e973 <+67>: subl   $0x1,%fs:0xfffdd688
> [...]
>0x55d94949e99f <+111>: ja 0x55d94949e9b0 
>0x55d94949e9a1 <+113>: mov%rbx,%rdi
>0x55d94949e9a4 <+116>: callq  0x55d949477d50 
>0x55d94949e9a9 <+121>: test   %rax,%rax
> [...]
>0x55d94949e9c1 <+145>: addl   $0x1,%fs:0xfffdd688
>0x55d94949e9ca <+154>: callq  0x55d949477b50 
>0x55d94949e9cf <+159>: subl   $0x1,%fs:0xfffdd688
> 

Cool!

> Ok I'll make hlua_not_dumpable volatile instead of applying compiler
> barriers.

Yes, it's safer for the long term as nobody will have to think about
it anymore. We'll have to integrate this one as well.

I'm wondering how many programs do *really* benefit from having gcc have
a builtin malloc that bypasses the library-provided allocator, compared
to all the ones experiencing pain and breakage, especially when using a
non-legacy allocator or a debug-enabled one :-/

Willy



Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Maciej Zdeb
Wow, that's it! :)

   0x55d94949e965 <+53>: addl   $0x1,%fs:0xfffdd688
   0x55d94949e96e <+62>: callq  0x55d9494782c0 
   0x55d94949e973 <+67>: subl   $0x1,%fs:0xfffdd688
[...]
   0x55d94949e99f <+111>: ja 0x55d94949e9b0 
   0x55d94949e9a1 <+113>: mov%rbx,%rdi
   0x55d94949e9a4 <+116>: callq  0x55d949477d50 
   0x55d94949e9a9 <+121>: test   %rax,%rax
[...]
   0x55d94949e9c1 <+145>: addl   $0x1,%fs:0xfffdd688
   0x55d94949e9ca <+154>: callq  0x55d949477b50 
   0x55d94949e9cf <+159>: subl   $0x1,%fs:0xfffdd688

Ok I'll make hlua_not_dumpable volatile instead of applying compiler
barriers.

On Wed, 24 Mar 2021 at 10:00, Willy Tarreau wrote:

> On Wed, Mar 24, 2021 at 09:52:22AM +0100, Willy Tarreau wrote:
> > So yes, it looks like gcc decides that a function called "malloc" will
> > never use your program's global variables but that "blablalloc" may. I
> > have no explanation to this except "optimization craziness" resulting
> > in breaking valid code, since it means that if I provide my own malloc
> > function it might not work. Pfff
>
> And that's exactly it!
>
> https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#Other-Builtins
>
> By default a number of functions including malloc, strchr etc are mapped
> to builtins so it figures it knows better how these work and how to
> optimize
> around. Once built with -fno-builtin-malloc, no more problem. So the
> volatile
> is the way to go to respect the risk of a signal hitting it.
>
> I found an old report of gcc-4.5 breaking chromium and tcmalloc because of
> this, which put me on the track:
>
> https://github.com/gperftools/gperftools/issues/239
>
> Willy
>


Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Willy Tarreau
On Wed, Mar 24, 2021 at 09:52:22AM +0100, Willy Tarreau wrote:
> So yes, it looks like gcc decides that a function called "malloc" will
> never use your program's global variables but that "blablalloc" may. I
> have no explanation to this except "optimization craziness" resulting
> in breaking valid code, since it means that if I provide my own malloc
> function it might not work. Pfff

And that's exactly it!

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#Other-Builtins

By default a number of functions including malloc, strchr etc are mapped
to builtins so it figures it knows better how these work and how to optimize
around. Once built with -fno-builtin-malloc, no more problem. So the volatile
is the way to go to respect the risk of a signal hitting it.

I found an old report of gcc-4.5 breaking chromium and tcmalloc because of
this, which put me on the track:

https://github.com/gperftools/gperftools/issues/239

Willy



Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Willy Tarreau
On Wed, Mar 24, 2021 at 09:41:03AM +0100, Willy Tarreau wrote:
> This is particularly strange. Could you please disassemble hlua_alloc ?
> (dis hlua_alloc) ?
> 
> You should find something like this:
> 
>0x004476c3 <+147>:   addDWORD PTR fs:0xfffdd678,0x1
>0x004476cc <+156>:   lock add QWORD PTR [rip+0x25df8b],0x1
> # 0x6a5660 <_.44806>
>0x004476d5 <+165>:   lock add QWORD PTR [rip+0x25df8b],rbx
> # 0x6a5668 <_.44806+8>
>0x004476dd <+173>:   movrdi,rbx
>0x004476e0 <+176>:   call   0x4248b0 
>0x004476e5 <+181>:   subDWORD PTR fs:0xfffdd678,0x1
>0x004476ee <+190>:   test   rax,rax
>0x004476f1 <+193>:   jne0x44769d 
> 
> The "add DWORD..." is the ++, and the "sub DWORD..." is the --, both of
> them around the function call.
> 
> If the variable had been declared static I would have agreed with the
> possibility of reordering. But here it's global and it's not legal to
> move it when you don't know if the called function will make use of it.
> Otherwise global variables cannot be used anymore!
> 
> By the way, in your case above it should even equal 2. That's rather
> strange.
> 
> One thing you can test is simply to declare it volatile. If it changes
> anything, it will mean that gcc has some specific assumptions on this
> malloc function (i.e. it decides it will never read your global variables) :-/

And this seems to be the case :-(

I was wondering where the two lock add instructions were coming from
and figured out that I was building with DEBUG_MEM_STATS. Removing it
resulted in the add/sub being removed, but only around malloc, not
around realloc nor free.

Malloc is declared with __attribute__((malloc)) on my system so I
wondered if it could cause this. I tried adding this in the function:

extern void *malloc (size_t __size);

And it didn't change the output code. I just *renamed* the function
to blablalloc() and now the code is correct again:

22cb:   64 83 04 25 00 00 00addl   $0x1,%fs:0x0
22d2:   00 01 
22cf: R_X86_64_TPOFF32  hlua_not_dumpable
22d4:   48 89 dfmov%rbx,%rdi
22d7:   e8 00 00 00 00  callq  22dc 
22d8: R_X86_64_PLT32blablalloc-0x4
22dc:   64 83 2c 25 00 00 00subl   $0x1,%fs:0x0
22e3:   00 01 
22e0: R_X86_64_TPOFF32  hlua_not_dumpable
22e5:   48 85 c0test   %rax,%rax
22e8:   75 be   jne22a8 

So yes, it looks like gcc decides that a function called "malloc" will
never use your program's global variables but that "blablalloc" may. I
have no explanation to this except "optimization craziness" resulting
in breaking valid code, since it means that if I provide my own malloc
function it might not work. Pfff
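(A self-contained repro sketch of that effect, with made-up names rather than
the HAProxy code itself:

  /* repro.c: compare  gcc -O2 -S repro.c
   * with              gcc -O2 -fno-builtin-malloc -S repro.c
   * and look at where the counter updates end up relative to the call.
   */
  #include <stdlib.h>

  __thread unsigned int not_dumpable;   /* stand-in for hlua_not_dumpable */

  void *guarded_alloc(size_t n)
  {
      not_dumpable++;        /* meant to be observable from a signal handler */
      void *p = malloc(n);   /* treated as a builtin that gcc assumes cannot
                                read or write this global */
      not_dumpable--;
      return p;
  }
)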

At least the volatile solves it and is cleaner since it indicates it
should assume it doesn't know it (e.g. for use from a signal handler
which is what we're doing by the way):

diff --git a/include/haproxy/hlua.h b/include/haproxy/hlua.h
index 29656258e..46fbba6ec 100644
--- a/include/haproxy/hlua.h
+++ b/include/haproxy/hlua.h
@@ -50,7 +50,7 @@ void hlua_applet_tcp_fct(struct appctx *ctx);
 void hlua_applet_http_fct(struct appctx *ctx);
 struct task *hlua_process_task(struct task *task, void *context, unsigned 
short state);
 
-extern THREAD_LOCAL unsigned int hlua_not_dumpable;
+extern THREAD_LOCAL volatile unsigned int hlua_not_dumpable;
 #else /* USE_LUA */
 
 / For use when Lua is disabled /
diff --git a/src/hlua.c b/src/hlua.c
index 5bef67a48..6ccce72dd 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -237,7 +237,7 @@ struct hlua_mem_allocator {
 static struct hlua_mem_allocator hlua_global_allocator;
 
  /* > 0 if lua is in a non-rentrant part, thus with a non-dumpable stack */
-THREAD_LOCAL unsigned int hlua_not_dumpable = 0;
+THREAD_LOCAL unsigned int volatile hlua_not_dumpable = 0;
 
 /* These functions converts types between HAProxy internal args or
  * sample and LUA types. Another function permits to check if the

Willy



Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Willy Tarreau
On Wed, Mar 24, 2021 at 08:55:33AM +0100, Maciej Zdeb wrote:
> After reading I wasn't sure anymore I even tested properly patched package.
> :)

Hehe, I know that this happens quite a lot when starting to play with
different binaries.

> Fortunately I have a core file so I verified if hlua_not_dumpable
> variable exists. I don't know how it is possible that hlua_not_dumpable is
> 0 in this state (bt), if Christopher theory about lacking compiler barriers
> is wrong.
> 
> (gdb) p hlua_not_dumpable
> $1 = 0
> 
> (gdb) bt
> #0  __lll_lock_wait_private () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x7f52fb16c34b in __GI___libc_malloc (bytes=bytes@entry=41) at
> malloc.c:3063
> #2  0x55d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0
> , ptr=, osize=4, nsize=41) at
> src/hlua.c:8210
> #3  0x55d949622271 in luaM_realloc_ ()
> #4  0x55d949621dbe in luaC_newobj ()
> #5  0x55d949626f4d in internshrstr ()
> #6  0x55d9496271fc in luaS_new ()
> #7  0x55d94961bd03 in lua_pushstring ()
> #8  0x55d94962d48a in luaL_traceback ()
> #9  0x55d9495a1b62 in ha_task_dump (buf=buf@entry=0x7f5247c7f628,
> task=0x7f5234057ba0, pfx=pfx@entry=0x55d949672f19 ' ' )
> at src/debug.c:227
> #10 0x55d9495a1fd7 in ha_thread_dump (buf=0x7f5247c7f628,
> thr=, calling_tid=7) at src/debug.c:91
> #11 0x55d9495a2236 in debug_handler (sig=, si= out>, arg=) at src/debug.c:847
> #12 
> #13 _int_malloc (av=av@entry=0x7f523420, bytes=bytes@entry=56) at
> malloc.c:4100
> #14 0x7f52fb16c35d in __GI___libc_malloc (bytes=bytes@entry=56) at
> malloc.c:3065
> #15 0x55d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0
> , ptr=, osize=5, nsize=56) at
> src/hlua.c:8210

This is particularly strange. Could you please disassemble hlua_alloc ?
(dis hlua_alloc) ?
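(A non-interactive way to produce that dump, in case it is more convenient;
binary and core paths are assumptions:

  gdb -batch -ex 'disassemble hlua_alloc' /usr/sbin/haproxy
  # or against the core, to be sure it matches the binary that was running:
  gdb -batch -ex 'disassemble hlua_alloc' /usr/sbin/haproxy /path/to/core
)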

You should find something like this:

   0x004476c3 <+147>:   addDWORD PTR fs:0xfffdd678,0x1
   0x004476cc <+156>:   lock add QWORD PTR [rip+0x25df8b],0x1# 
0x6a5660 <_.44806>
   0x004476d5 <+165>:   lock add QWORD PTR [rip+0x25df8b],rbx# 
0x6a5668 <_.44806+8>
   0x004476dd <+173>:   movrdi,rbx
   0x004476e0 <+176>:   call   0x4248b0 
   0x004476e5 <+181>:   subDWORD PTR fs:0xfffdd678,0x1
   0x004476ee <+190>:   test   rax,rax
   0x004476f1 <+193>:   jne0x44769d 

The "add DWORD..." is the ++, and the "sub DWORD..." is the --, both of
them around the function call.

If the variable had been declared static I would have agreed with the
possibility of reordering. But here it's global and it's not legal to
move it when you don't know if the called function will make use of it.
Otherwise global variables cannot be used anymore!

By the way, in your case above it should even equal 2. That's rather
strange.

One thing you can test is simply to declare it volatile. If it changes
anything, it will mean that gcc has some specific assumptions on this
malloc function (i.e. it decides it will never read your global variables) :-/

Willy



Re: [2.2.9] 100% CPU usage

2021-03-24 Thread Maciej Zdeb
Hi,

On Tue, 23 Mar 2021 at 18:36, Willy Tarreau wrote:

> > It is most probably because of compiler optimizations. Some compiler
> > barriers are necessary to avoid instructions reordering. It is the
> purpose
> > of attached patches. Sorry to ask you it again, but could you make some
> > tests ?
>
 Sure, I'll verify!

I don't believe in it at all. free(), malloc() etc are free to manipulate
> global variables so the compiler cannot reorder these operations around
> them. We're probably facing other corner cases (or the same but not
> totally addressed maybe).
>

After reading that I wasn't sure anymore that I had even tested a properly
patched package. :) Fortunately I have a core file, so I verified that the
hlua_not_dumpable variable exists. I don't know how it is possible that
hlua_not_dumpable is 0 in this state (bt) if Christopher's theory about
missing compiler barriers is wrong.

(gdb) p hlua_not_dumpable
$1 = 0

(gdb) bt
#0  __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x7f52fb16c34b in __GI___libc_malloc (bytes=bytes@entry=41) at
malloc.c:3063
#2  0x55d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0
, ptr=, osize=4, nsize=41) at
src/hlua.c:8210
#3  0x55d949622271 in luaM_realloc_ ()
#4  0x55d949621dbe in luaC_newobj ()
#5  0x55d949626f4d in internshrstr ()
#6  0x55d9496271fc in luaS_new ()
#7  0x55d94961bd03 in lua_pushstring ()
#8  0x55d94962d48a in luaL_traceback ()
#9  0x55d9495a1b62 in ha_task_dump (buf=buf@entry=0x7f5247c7f628,
task=0x7f5234057ba0, pfx=pfx@entry=0x55d949672f19 ' ' )
at src/debug.c:227
#10 0x55d9495a1fd7 in ha_thread_dump (buf=0x7f5247c7f628,
thr=, calling_tid=7) at src/debug.c:91
#11 0x55d9495a2236 in debug_handler (sig=, si=, arg=) at src/debug.c:847
#12 
#13 _int_malloc (av=av@entry=0x7f523420, bytes=bytes@entry=56) at
malloc.c:4100
#14 0x7f52fb16c35d in __GI___libc_malloc (bytes=bytes@entry=56) at
malloc.c:3065
#15 0x55d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0
, ptr=, osize=5, nsize=56) at
src/hlua.c:8210


Re: [2.2.9] 100% CPU usage

2021-03-23 Thread Willy Tarreau
On Tue, Mar 23, 2021 at 04:12:41PM +0100, Christopher Faulet wrote:
> On 23/03/2021 at 11:14, Maciej Zdeb wrote:
> > Hi Christopher,
> > 
> > Bad news, patches didn't help. Attaching stacktraces, now it looks that
> > spoe that executes lua scripts (free_thread_spue_lua.txt) tried to
> > malloc twice. :(
> > 
> 
> It is most probably because of compiler optimizations. Some compiler
> barriers are necessary to avoid instructions reordering. It is the purpose
> of attached patches. Sorry to ask you it again, but could you make some
> tests ?

I don't believe in it at all. free(), malloc() etc are free to manipulate
global variables so the compiler cannot reorder these operations around
them. We're probably facing other corner cases (or the same but not
totally addressed maybe).

Willy



Re: [2.2.9] 100% CPU usage

2021-03-23 Thread Christopher Faulet

On 23/03/2021 at 11:14, Maciej Zdeb wrote:

Hi Christopher,

Bad news, patches didn't help. Attaching stacktraces, now it looks that spoe 
that executes lua scripts (free_thread_spue_lua.txt) tried to malloc twice. :(




It is most probably because of compiler optimizations. Some compiler barriers
are necessary to avoid instruction reordering. It is the purpose of the attached
patches. Sorry to ask it of you again, but could you make some tests ?


Thanks !
--
Christopher Faulet
>From b4f55500b514e5bfcdaba938cbd2b0ba3cfb2f62 Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Tue, 23 Mar 2021 15:17:22 +0100
Subject: [PATCH] BUG/MEDIUM: hlua: Use compiler barrier to protect
 hlua_not_dumpable increments

In hlua_alloc() function, the hlua_not_dumpable variable is incremented
before any call to realloc() and decremented just after. To be sure it
really works, we must prevent any instruction reordering. Thus compiler
barriers have been added to do so.

The patch fixes a bug in the commit a61789a1d ("MEDIUM: lua: Use a
per-thread counter to track some non-reentrant parts of lua"). It must be
backported as far as 2.0.
---
 src/hlua.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/hlua.c b/src/hlua.c
index 5bef67a48b..9fc30336f1 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -8196,7 +8196,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 		if (ptr)
 			zone->allocated -= osize;
 		hlua_not_dumpable++;
+		__ha_compiler_barrier();
 		free(ptr);
+		__ha_compiler_barrier();
 		hlua_not_dumpable--;
 		return NULL;
 	}
@@ -8207,7 +8209,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 			return NULL;
 
 		hlua_not_dumpable++;
+		__ha_compiler_barrier();
 		ptr = malloc(nsize);
+		__ha_compiler_barrier();
 		hlua_not_dumpable--;
 		if (ptr)
 			zone->allocated += nsize;
@@ -8219,7 +8223,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 		return NULL;
 
 	hlua_not_dumpable++;
+	__ha_compiler_barrier();
 	ptr = realloc(ptr, nsize);
+	__ha_compiler_barrier();
 	hlua_not_dumpable--;
 	if (ptr)
 		zone->allocated += nsize - osize;
-- 
2.30.2

>From 8eee5c4ba6eb92ceece0414b716c075c81d689ab Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Tue, 23 Mar 2021 15:16:52 +0100
Subject: [PATCH] BUG/MEDIUM: hlua: Use compiler barrier to protect
 hlua_not_dumpable increments

In hlua_alloc() function, the hlua_not_dumpable variable is incremented
before any call to realloc() and decremented just after. To be sure it
really works, we must prevent any instruction reordering. Thus compiler
barriers have been added to do so.

The patch fixes a bug in the commit a61789a1d ("MEDIUM: lua: Use a
per-thread counter to track some non-reentrant parts of lua"). It must be
backported as far as 2.0.
---
 src/hlua.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/hlua.c b/src/hlua.c
index 990080b8c..ba7f93a99 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -8245,7 +8245,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 		if (ptr)
 			zone->allocated -= osize;
 		hlua_not_dumpable++;
+		__ha_compiler_barrier();
 		free(ptr);
+		__ha_compiler_barrier();
 		hlua_not_dumpable--;
 		return NULL;
 	}
@@ -8256,7 +8258,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 			return NULL;
 
 		hlua_not_dumpable++;
+		__ha_compiler_barrier();
 		ptr = malloc(nsize);
+		__ha_compiler_barrier();
 		hlua_not_dumpable--;
 		if (ptr)
 			zone->allocated += nsize;
@@ -8268,7 +8272,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 		return NULL;
 
 	hlua_not_dumpable++;
+	__ha_compiler_barrier();
 	ptr = realloc(ptr, nsize);
+	__ha_compiler_barrier();
 	hlua_not_dumpable--;
 	if (ptr)
 		zone->allocated += nsize - osize;
-- 
2.30.2

>From 35c1b30ef2d725a37a82f0a293b2bd0bd7c473e2 Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Tue, 23 Mar 2021 15:10:46 +0100
Subject: [PATCH] BUG/MEDIUM: hlua: Use compiler barrier to protect
 hlua_not_dumpable increments

In hlua_alloc() function, the hlua_not_dumpable variable is incremented
before any call to realloc() and decremented just after. To be sure it
really works, we must prevent any instruction reordering. Thus compiler
barriers have been added to do so.

The patch fixes a bug in the commit a61789a1d ("MEDIUM: lua: Use a
per-thread counter to track some non-reentrant parts of lua"). It must be
backported as far as 2.0.
---
 src/hlua.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/hlua.c b/src/hlua.c
index 962195a60..d47973dcf 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -8639,7 +8639,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 	 */
 	if (likely(~zone->limit == 0)) {
 		hlua_not_dumpable++;
+		__ha_compiler_barrier();
 		ptr = realloc(ptr, nsize);
+		__ha_compiler_barrier();
 		hlua_not_dumpable--;
 		return ptr;
 	}
@@ -8657,7 +8659,9 @@ static void 

Re: [2.2.9] 100% CPU usage

2021-03-23 Thread Maciej Zdeb
Hi Christopher,

Bad news, the patches didn't help. Attaching stacktraces; now it looks like the
spoe that executes lua scripts (free_thread_spue_lua.txt) tried to malloc
twice. :(

Kind regards,

On Mon, 22 Mar 2021 at 08:39, Maciej Zdeb wrote:

> Hi Christopher,
>
> Thanks! I'm building a patched version and will return with feedback!
>
> Kind regards,
>
> On Fri, 19 Mar 2021 at 16:40, Christopher Faulet wrote:
>
>> On 16/03/2021 at 13:46, Maciej Zdeb wrote:
>> > Sorry for spam. In the last message I said that the old process (after
>> reload)
>> > is consuming cpu for lua processing and that's not true, it is
>> processing other
>> > things also.
>> >
>> > I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and
>> maybe 2.4
>> > branch. For each version I need a week or two to be sure the issue does
>> not
>> > occur. :(
>> >
>> > If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm
>> if there
>> > is any relation between infinite loops and custom configuration:
>> > - lua scripts (mainly used for header generation/manipulation),
>> > - spoe (used for sending metadata about each request to external
>> service),
>> > - peers (we have a cluster of 12 HAProxy servers connected to each
>> other).
>> >
>> Hi Maciej,
>>
>> I've read more carefully your backtraces, and indeed, it seems to be
>> related to
>> lua processing. I don't know if the watchdog is triggered because of the
>> lua or
>> if it is just a side-effect. But the lua execution is interrupted inside
>> the
>> memory allocator. And malloc/realloc are not async-signal-safe.
>> Unfortunately,
>> when the lua stack is dumped, the same allocator is also used. At this
>> stage,
>> because a lock was not properly released, HAProxy enter in a deadlock.
>>
>> On other threads, we loop in the watchdog, waiting for the hand to dump
>> the
>> thread information and that explains the 100% CPU usage you observe.
>>
>> So, to prevent this situation, the lua stack must not be dumped if it was
>> interrupted inside an unsafe part. It is the easiest way we found to
>> workaround
>> this bug. And because it is pretty rare, it should be good enough.
>>
>> However, I'm unable to reproduce the bug. Could you test attached patches
>> please
>> ? I attached patched for the 2.4, 2.3 and 2.2. Because you experienced
>> this bug
>> on the 2.2, it is probably easier to test patches for this version.
>>
>> If fixed, it could be good to figure out why the watchdog is triggered on
>> your
>> old processes.
>>
>> --
>> Christopher Faulet
>>
>
[Switching to thread 10 (Thread 0x7f5246c9d700 (LWP 15807))]
#0  __lll_lock_wait_private () at 
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95  ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or 
directory.
(gdb) bt full
#0  __lll_lock_wait_private () at 
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
No locals.
#1  0x7f52fb16c34b in __GI___libc_malloc (bytes=bytes@entry=41) at 
malloc.c:3063
ignore1 = 
ignore2 = 
ignore3 = 
ar_ptr = 0x7f523420
victim = 
hook = 
tbytes = 
tc_idx = 
__x = 
ignore1 = 
ignore2 = 
ignore3 = 
ignore = 
#2  0x55d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0 
, ptr=, osize=4, nsize=41) at 
src/hlua.c:8210
zone = 0x55d9498e90d0 
#3  0x55d949622271 in luaM_realloc_ ()
No symbol table info available.
#4  0x55d949621dbe in luaC_newobj ()
No symbol table info available.
#5  0x55d949626f4d in internshrstr ()
No symbol table info available.
#6  0x55d9496271fc in luaS_new ()
No symbol table info available.
#7  0x55d94961bd03 in lua_pushstring ()
No symbol table info available.
#8  0x55d94962d48a in luaL_traceback ()
No symbol table info available.
#9  0x55d9495a1b62 in ha_task_dump (buf=buf@entry=0x7f5247c7f628, 
task=0x7f5234057ba0, pfx=pfx@entry=0x55d949672f19 ' ' ) at 
src/debug.c:227
s = 
appctx = 
hlua = 0x7f52341bdc60
#10 0x55d9495a1fd7 in ha_thread_dump (buf=0x7f5247c7f628, thr=, calling_tid=7) at src/debug.c:91
thr_bit = 512
p = 74138721
n = 
stuck = 0
#11 0x55d9495a2236 in debug_handler (sig=, si=, arg=) at src/debug.c:847
sig = 
---Type  to continue, or q  to quit---
si = 
arg = 
#12 
No locals.
#13 _int_malloc (av=av@entry=0x7f523420, bytes=bytes@entry=56) at 
malloc.c:4100
p = 
iters = 
nb = 64
idx = 5
bin = 
victim = 0x7f52341cbff0
size = 
victim_index = 
remainder = 0x7f52341cc030
remainder_size = 
block = 
bit = 
map = 
fwd = 
bck = 
tcache_unsorted_count = 0
tcache_nb = 64
tc_idx = 2
return_cached = 0
__PRETTY_FUNCTION__ = "_int_malloc"
#14 0x7f52fb16c35d in __GI___libc_malloc (bytes=bytes@entry=56) at 
malloc.c:3065
ar_ptr = 

Re: [2.2.9] 100% CPU usage

2021-03-22 Thread Maciej Zdeb
Hi Christopher,

Thanks! I'm building a patched version and will return with feedback!

Kind regards,

Fri, 19 Mar 2021 at 16:40 Christopher Faulet 
wrote:

> On 16/03/2021 at 13:46, Maciej Zdeb wrote:
> > Sorry for spam. In the last message I said that the old process (after
> reload)
> > is consuming cpu for lua processing and that's not true, it is
> processing other
> > things also.
> >
> > I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and
> maybe 2.4
> > branch. For each version I need a week or two to be sure the issue does
> not
> > occur. :(
> >
> > If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if
> there
> > is any relation between infinite loops and custom configuration:
> > - lua scripts (mainly used for header generation/manipulation),
> > - spoe (used for sending metadata about each request to external
> service),
> > - peers (we have a cluster of 12 HAProxy servers connected to each
> other).
> >
> Hi Maciej,
>
> I've read more carefully your backtraces, and indeed, it seems to be
> related to
> lua processing. I don't know if the watchdog is triggered because of the
> lua or
> if it is just a side-effect. But the lua execution is interrupted inside
> the
> memory allocator. And malloc/realloc are not async-signal-safe.
> Unfortunately,
> when the lua stack is dumped, the same allocator is also used. At this
> stage,
> because a lock was not properly released, HAProxy enter in a deadlock.
>
> On other threads, we loop in the watchdog, waiting for the hand to dump
> the
> thread information and that explains the 100% CPU usage you observe.
>
> So, to prevent this situation, the lua stack must not be dumped if it was
> interrupted inside an unsafe part. It is the easiest way we found to
> workaround
> this bug. And because it is pretty rare, it should be good enough.
>
> However, I'm unable to reproduce the bug. Could you test attached patches
> please
> ? I attached patched for the 2.4, 2.3 and 2.2. Because you experienced
> this bug
> on the 2.2, it is probably easier to test patches for this version.
>
> If fixed, it could be good to figure out why the watchdog is triggered on
> your
> old processes.
>
> --
> Christopher Faulet
>


Re: [2.2.9] 100% CPU usage

2021-03-19 Thread Christopher Faulet

On 16/03/2021 at 13:46, Maciej Zdeb wrote:
Sorry for spam. In the last message I said that the old process (after reload) 
is consuming cpu for lua processing and that's not true, it is processing other 
things also.


I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and maybe 2.4 
branch. For each version I need a week or two to be sure the issue does not 
occur. :(


If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if there 
is any relation between infinite loops and custom configuration:

- lua scripts (mainly used for header generation/manipulation),
- spoe (used for sending metadata about each request to external service),
- peers (we have a cluster of 12 HAProxy servers connected to each other).


Hi Maciej,

I've read your backtraces more carefully, and indeed, it seems to be related to 
lua processing. I don't know if the watchdog is triggered because of the lua or 
if it is just a side-effect. But the lua execution is interrupted inside the 
memory allocator, and malloc/realloc are not async-signal-safe. Unfortunately, 
when the lua stack is dumped, the same allocator is also used. At this stage, 
because a lock was not properly released, HAProxy enters a deadlock.


On the other threads, we loop in the watchdog, waiting for our turn to dump the 
thread information, and that explains the 100% CPU usage you observe.


So, to prevent this situation, the lua stack must not be dumped if it was 
interrupted inside an unsafe part. It is the easiest way we found to work around 
this bug. And because it is pretty rare, it should be good enough.
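
To illustrate the idea (this is only a sketch, not the attached patch; dump_lua_state() and its buffer arguments are invented names, and the real code writes into HAProxy's internal buffers), the dump path simply refuses to walk the Lua stack while the allocator's critical section is marked as active:

#include <lua.h>
#include <lauxlib.h>
#include <stdio.h>

__thread unsigned int hlua_not_dumpable;   /* incremented around realloc() in hlua_alloc() */

void dump_lua_state(lua_State *L, char *buf, size_t len)
{
	if (hlua_not_dumpable) {
		/* the thread was interrupted inside malloc/realloc: building a
		 * traceback would re-enter the allocator and may deadlock */
		snprintf(buf, len, "Lua stack not dumpable (interrupted in allocator)\n");
		return;
	}
	luaL_traceback(L, L, NULL, 0);                 /* pushes a traceback string */
	snprintf(buf, len, "%s\n", lua_tostring(L, -1));
	lua_pop(L, 1);
}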


However, I'm unable to reproduce the bug. Could you test the attached patches, 
please? I attached patches for 2.4, 2.3 and 2.2. Because you experienced this bug 
on 2.2, it is probably easier to test the patches for this version.


If fixed, it could be good to figure out why the watchdog is triggered on your 
old processes.


--
Christopher Faulet
>From a61789a1d62fd71c751189faf5371740dd375f33 Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Fri, 19 Mar 2021 15:16:28 +0100
Subject: [PATCH 1/2] MEDIUM: lua: Use a per-thread counter to track some
 non-reentrant parts of lua

Some parts of the Lua are non-reentrant. We must be sure to carefully track
these parts to not dump the lua stack when it is interrupted inside such
parts. For now, we only identified the custom lua allocator. If the thread
is interrupted during the memory allocation, we must not try to print the
lua stack which also allocates memory. Indeed, realloc() is not
async-signal-safe.

In this patch we introduce a thread-local counter. It is incremented before
entering in a non-reentrant part and decremented when exiting. It is only
performed in hlua_alloc() for now.
---
 include/haproxy/hlua.h |  1 +
 src/hlua.c | 13 +++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/haproxy/hlua.h b/include/haproxy/hlua.h
index 6aca17f39..36629e637 100644
--- a/include/haproxy/hlua.h
+++ b/include/haproxy/hlua.h
@@ -50,6 +50,7 @@ void hlua_applet_tcp_fct(struct appctx *ctx);
 void hlua_applet_http_fct(struct appctx *ctx);
 struct task *hlua_process_task(struct task *task, void *context, unsigned int state);
 
+extern THREAD_LOCAL unsigned int hlua_not_dumpable;
 #else /* USE_LUA */
 
 / For use when Lua is disabled /
diff --git a/src/hlua.c b/src/hlua.c
index 42d5be2a5..962195a60 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -274,6 +274,9 @@ struct hlua_mem_allocator {
 
 static struct hlua_mem_allocator hlua_global_allocator THREAD_ALIGNED(64);
 
+ /* > 0 if lua is in a non-rentrant part, thus with a non-dumpable stack */
+THREAD_LOCAL unsigned int hlua_not_dumpable = 0;
+
 /* These functions converts types between HAProxy internal args or
  * sample and LUA types. Another function permits to check if the
  * LUA stack contains arguments according with an required ARG_T
@@ -8634,8 +8637,12 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 	/* a limit of ~0 means unlimited and boot complete, so there's no need
 	 * for accounting anymore.
 	 */
-	if (likely(~zone->limit == 0))
-		return realloc(ptr, nsize);
+	if (likely(~zone->limit == 0)) {
+		hlua_not_dumpable++;
+		ptr = realloc(ptr, nsize);
+		hlua_not_dumpable--;
+		return ptr;
+	}
 
 	if (!ptr)
 		osize = 0;
@@ -8649,7 +8656,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 			return NULL;
 	} while (!_HA_ATOMIC_CAS(&zone->allocated, &old, new));
 
+	hlua_not_dumpable++;
 	ptr = realloc(ptr, nsize);
+	hlua_not_dumpable--;
 
 	if (unlikely(!ptr && nsize)) // failed
 		_HA_ATOMIC_SUB(&zone->allocated, nsize - osize);
-- 
2.30.2

>From 83926a04febe94f332c6f45c98773286678346d3 Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Fri, 19 Mar 2021 15:41:08 +0100
Subject: [PATCH 2/2] BUG/MEDIUM: debug/lua: Don't dump the lua stack if not
 dumpable

When we try to dump the stack of a lua 

Re: [2.2.9] 100% CPU usage

2021-03-17 Thread Maciej Zdeb
Hi Christopher,

That's good news! If you need me to test a patch then let me know.

On my side I'm preparing to update HAProxy to 2.3 and solving some simple
issues like lacking new lines on the end of configuration. ;)

Kind regards,

Wed, 17 Mar 2021 at 10:49 Christopher Faulet 
wrote:

> On 16/03/2021 at 13:46, Maciej Zdeb wrote:
> > Sorry for spam. In the last message I said that the old process (after
> reload)
> > is consuming cpu for lua processing and that's not true, it is
> processing other
> > things also.
> >
> > I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and
> maybe 2.4
> > branch. For each version I need a week or two to be sure the issue does
> not
> > occur. :(
> >
> > If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if
> there
> > is any relation between infinite loops and custom configuration:
> > - lua scripts (mainly used for header generation/manipulation),
> > - spoe (used for sending metadata about each request to external
> service),
> > - peers (we have a cluster of 12 HAProxy servers connected to each
> other).
> >
> Hi Maciej,
>
> Just to let you know we identified a potential AB/BA lock problem when
> HAProxy
> is stopping a listener. At least on the 2.2. The listeners code has
> changed in
> the 2.3 but we have no idea if the bug can be triggered too on 2.3 or 2.4.
> However, we have not found the time to investigate more for now. So, stay
> tuned.
>
> --
> Christopher Faulet
>
>


Re: [2.2.9] 100% CPU usage

2021-03-17 Thread Christopher Faulet

On 16/03/2021 at 13:46, Maciej Zdeb wrote:
Sorry for spam. In the last message I said that the old process (after reload) 
is consuming cpu for lua processing and that's not true, it is processing other 
things also.


I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and maybe 2.4 
branch. For each version I need a week or two to be sure the issue does not 
occur. :(


If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if there 
is any relation between infinite loops and custom configuration:

- lua scripts (mainly used for header generation/manipulation),
- spoe (used for sending metadata about each request to external service),
- peers (we have a cluster of 12 HAProxy servers connected to each other).


Hi Maciej,

Just to let you know we identified a potential AB/BA lock problem when HAProxy 
is stopping a listener. At least on the 2.2. The listeners code has changed in 
the 2.3 but we have no idea if the bug can be triggered too on 2.3 or 2.4. 
However, we have not found the time to investigate more for now. So, stay tuned.
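
For context, an AB/BA problem is the classic deadlock where two threads take the same two locks in opposite order. A generic, self-contained illustration (nothing to do with the actual listener code):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *worker_ab(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_a);      /* takes A, then B */
	usleep(1000);                     /* widen the race window */
	pthread_mutex_lock(&lock_b);
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

static void *worker_ba(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_b);      /* takes B, then A: opposite order */
	usleep(1000);
	pthread_mutex_lock(&lock_a);
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker_ab, NULL);
	pthread_create(&t2, NULL, worker_ba, NULL);
	pthread_join(t1, NULL);           /* with the sleeps above, this usually hangs */
	pthread_join(t2, NULL);
	return 0;
}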


--
Christopher Faulet



Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Willy Tarreau
On Tue, Mar 16, 2021 at 01:46:48PM +0100, Maciej Zdeb wrote:
> In the last message I said that the old process (after
> reload) is consuming cpu for lua processing and that's not true, it is
> processing other things also.
> 
> I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and maybe
> 2.4 branch. For each version I need a week or two to be sure the issue does
> not occur. :(

Don't worry, we know how it is. On the developer side it's annoying to
have such bugs, but on the other hand, when bugs take one week to become
visible, it's likely that there aren't that many left, so this gives some
incentive to find the root cause :-)

The perf top you sent indicates that *something* is waking a task in
loops. Unfortunately we don't know what. In 2.4 we have more tools to
observe this, so if you still observe it I may even indicate you some
gdb commands or suggest a few patches to figure exactly what is
happening.

> If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if
> there is any relation between infinite loops and custom configuration:
> - lua scripts (mainly used for header generation/manipulation),
> - spoe (used for sending metadata about each request to external service),
> - peers (we have a cluster of 12 HAProxy servers connected to each other).

Any of the 3 could be indirectly responsible, as the stopping state
sometimes has to be taken care of in special ways. The peers are a bit
less likely because they're used a lot, though that doesn't mean they're
immune to this. But Lua and SPOE are a bit less commonly deployed and
could still be facing some rare stupid corner cases that have not yet
been identified. Obviously I have no idea which ones :-/

Cheers,
Willy



Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Maciej Zdeb
Sorry for the spam. In the last message I said that the old process (after
reload) is consuming CPU for Lua processing; that's not true, it is
processing other things as well.

I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and maybe
2.4 branch. For each version I need a week or two to be sure the issue does
not occur. :(

If 2.3 and 2.4 behave the same way 2.2 does, I'll try to confirm whether
there is any relation between infinite loops and custom configuration:
- lua scripts (mainly used for header generation/manipulation),
- spoe (used for sending metadata about each request to external service),
- peers (we have a cluster of 12 HAProxy servers connected to each other).

Kind regards,

Tue, 16 Mar 2021 at 13:05 Maciej Zdeb  wrote:

> Below is the output from perf top - it happens during reload (all threads
> on an old process spike and use 100% cpu, processing lua) and after 15-30
> seconds old process exits. It is probably a different bug, because it
> happens on an old process and I have no idea how it could affect the new
> process. Maybe through peers protocol? Still both issues are related to lua
> scripts (on my initial email in low_cpu_thread.txt it hung on lua script
> responsible for removing http headers on hlua_http_get_headers function).
>
> Samples: 770K of event 'cycles:ppp', Event count (approx.): 336817908209
> Overhead  Shared Object     Symbol
>   36.39%  haproxy           [.] process_runnable_tasks
>   34.82%  haproxy           [.] run_tasks_from_lists
>   10.04%  haproxy           [.] __task_wakeup
>    5.63%  libc-2.27.so      [.] __memset_avx2_erms
>    1.87%  haproxy           [.] hlua_action
>    0.79%  haproxy           [.] hlua_ctx_init
>    0.77%  haproxy           [.] stream_release_buffers
>    0.58%  libcrypto.so.1.1  [.] rsaz_1024_sqr_avx2
>    0.56%  haproxy           [.] stream_int_notify
>    0.49%  haproxy           [.] task_run_applet
>    0.42%  haproxy           [.] hlua_sample_fetch_wrapper
>    0.38%  haproxy           [.] spoe_handle_appctx
>    0.38%  [kernel]          [k] do_syscall_64
>    0.37%  haproxy           [.] eb32sc_insert
>    0.32%  [vdso]            [.] __vdso_clock_gettime
>    0.27%  haproxy           [.] hlua_ctx_destroy
>    0.22%  haproxy           [.] stktable_get_entry
>    0.20%  haproxy           [.] process_stream
>    0.19%  [kernel]          [k] audit_filter_rules.constprop.13
>    0.17%  haproxy           [.] si_applet_wake_cb
>    0.14%  libcrypto.so.1.1  [.] rsaz_1024_mul_avx2
>
> Tue, 16 Mar 2021 at 12:00 Maciej Zdeb  wrote:
>
>> Sure, patch from Christopher attached. :)
>>
>> Tue, 16 Mar 2021 at 10:58 Willy Tarreau  wrote:
>>
>>> Hi Maciej,
>>>
>>> On Tue, Mar 16, 2021 at 10:55:11AM +0100, Maciej Zdeb wrote:
>>> > Hi,
>>> >
>>> > I'm returning with bad news, the patch did not help and the issue
>>> occurred
>>> > today (on patched 2.2.10). It is definitely related to reloads,
>>> however it
>>> > is very rare issue it worked flawlessly the whole week.
>>>
>>> OK. Just as a reminder (because I can't find it here), can you please
>>> remind us what this patch was ?
>>>
>>> Thanks,
>>> Willy
>>>
>>


Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Maciej Zdeb
Below is the output from perf top - it happens during reload (all threads
on an old process spike and use 100% CPU, processing Lua) and after 15-30
seconds the old process exits. It is probably a different bug, because it
happens on an old process and I have no idea how it could affect the new
process. Maybe through the peers protocol? Still, both issues are related to
Lua scripts (in my initial email, in low_cpu_thread.txt, it hung on a Lua
script responsible for removing HTTP headers in the hlua_http_get_headers
function).

Samples: 770K of event 'cycles:ppp', Event count (approx.): 336817908209
Overhead  Shared Object     Symbol
  36.39%  haproxy           [.] process_runnable_tasks
  34.82%  haproxy           [.] run_tasks_from_lists
  10.04%  haproxy           [.] __task_wakeup
   5.63%  libc-2.27.so      [.] __memset_avx2_erms
   1.87%  haproxy           [.] hlua_action
   0.79%  haproxy           [.] hlua_ctx_init
   0.77%  haproxy           [.] stream_release_buffers
   0.58%  libcrypto.so.1.1  [.] rsaz_1024_sqr_avx2
   0.56%  haproxy           [.] stream_int_notify
   0.49%  haproxy           [.] task_run_applet
   0.42%  haproxy           [.] hlua_sample_fetch_wrapper
   0.38%  haproxy           [.] spoe_handle_appctx
   0.38%  [kernel]          [k] do_syscall_64
   0.37%  haproxy           [.] eb32sc_insert
   0.32%  [vdso]            [.] __vdso_clock_gettime
   0.27%  haproxy           [.] hlua_ctx_destroy
   0.22%  haproxy           [.] stktable_get_entry
   0.20%  haproxy           [.] process_stream
   0.19%  [kernel]          [k] audit_filter_rules.constprop.13
   0.17%  haproxy           [.] si_applet_wake_cb
   0.14%  libcrypto.so.1.1  [.] rsaz_1024_mul_avx2

Tue, 16 Mar 2021 at 12:00 Maciej Zdeb  wrote:

> Sure, patch from Christopher attached. :)
>
> Tue, 16 Mar 2021 at 10:58 Willy Tarreau  wrote:
>
>> Hi Maciej,
>>
>> On Tue, Mar 16, 2021 at 10:55:11AM +0100, Maciej Zdeb wrote:
>> > Hi,
>> >
>> > I'm returning with bad news, the patch did not help and the issue
>> occurred
>> > today (on patched 2.2.10). It is definitely related to reloads, however
>> it
>> > is very rare issue it worked flawlessly the whole week.
>>
>> OK. Just as a reminder (because I can't find it here), can you please
>> remind us what this patch was ?
>>
>> Thanks,
>> Willy
>>
>


Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Maciej Zdeb
Sure, patch from Christopher attached. :)

Tue, 16 Mar 2021 at 10:58 Willy Tarreau  wrote:

> Hi Maciej,
>
> On Tue, Mar 16, 2021 at 10:55:11AM +0100, Maciej Zdeb wrote:
> > Hi,
> >
> > I'm returning with bad news, the patch did not help and the issue
> occurred
> > today (on patched 2.2.10). It is definitely related to reloads, however
> it
> > is very rare issue it worked flawlessly the whole week.
>
> OK. Just as a reminder (because I can't find it here), can you please
> remind us what this patch was ?
>
> Thanks,
> Willy
>


0001-BUG-MEDIUM-lists-Avoid-an-infinite-loop-in-MT_LIST_T.patch
Description: Binary data


Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Willy Tarreau
Hi Maciej,

On Tue, Mar 16, 2021 at 10:55:11AM +0100, Maciej Zdeb wrote:
> Hi,
> 
> I'm returning with bad news, the patch did not help and the issue occurred
> today (on patched 2.2.10). It is definitely related to reloads, however it
> is very rare issue it worked flawlessly the whole week.

OK. Just as a reminder (because I can't find it here), can you please
remind us what this patch was ?

Thanks,
Willy



Re: [2.2.9] 100% CPU usage

2021-03-16 Thread Maciej Zdeb
Hi,

I'm returning with bad news: the patch did not help and the issue occurred
today (on patched 2.2.10). It is definitely related to reloads, but it is a
very rare issue; it worked flawlessly the whole week.

Tue, 9 Mar 2021 at 09:17 Willy Tarreau  wrote:

> On Tue, Mar 09, 2021 at 09:04:43AM +0100, Maciej Zdeb wrote:
> > Hi,
> >
> > After applying the patch, the issue did not occur, however I'm still not
> > sure it is fixed. Unfortunately I don't have a reliable way to trigger
> it.
>
> OK. If it's related, it's very possible that some of the issues we've
> identified there recently are at least responsible for a part of the
> problem. In short, if the CPUs are too fair, some contention can last
> long because two steps are required to complete such operations. We've
> added calls to cpu_relax() in 2.4 and some are queued in 2.3 already.
> After a while, some of them should also be backported to 2.2 as they
> significantly improved the situation with many threads.
>
> > Fri, 5 Mar 2021 at 22:07 Willy Tarreau  wrote:
> >
> > > Note, before 2.4, a single thread can execute Lua scripts at once,
> > > with the others waiting behind, and if the Lua load is important, maybe
> > > this can happen (but I've never experienced it yet, and the premption
> > > interval is short enough not to cause issues in theory).
> > >
> > I'm not sure if related but during every reload, for a couple seconds all
> > 12 threads on OLD pid are using 100% cpu and then one after one core
> return
> > to normal usage, finally the old haproxy process exits. I have no idea
> why
> > it is behaving like that.
>
> It could be related but it's hard to tell. It is also possible that
> for any reason the old threads constantly believe they have something
> to do, for example a health check that doesn't get stopped and that
> keep ringing.
>
> > > Maciej, if this happens often,
> > > would you be interested in running one machine on 2.4-dev11 ?
> >
> > It is a very rare issue and of course it occurs only in production
> > environment. :(
>
> Obviously!
>
> > I'm very willing to test the 2.4 version, especially with
> > that tasty lua optimization for multiple threads, but I can't do it on
> > production until it's stable.
>
> This makes sense. We'll try to issue 2.3 with some thread fixes this
> week, maybe it will be a possible step for you then.
>
> Cheers,
> Willy
>


Re: [2.2.9] 100% CPU usage

2021-03-09 Thread Willy Tarreau
On Tue, Mar 09, 2021 at 09:04:43AM +0100, Maciej Zdeb wrote:
> Hi,
> 
> After applying the patch, the issue did not occur, however I'm still not
> sure it is fixed. Unfortunately I don't have a reliable way to trigger it.

OK. If it's related, it's very possible that some of the issues we've
identified there recently are at least responsible for a part of the
problem. In short, if the CPUs are too fair, some contention can last
long because two steps are required to complete such operations. We've
added calls to cpu_relax() in 2.4 and some are queued in 2.3 already.
After a while, some of them should also be backported to 2.2 as they
significantly improved the situation with many threads.
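
For readers unfamiliar with the pattern: the idea is that a spinning thread inserts a tiny pause between CAS attempts so the thread actually holding the resource gets a chance to finish its two-step update. A rough sketch, assuming x86 and C11 atomics (HAProxy uses its own atomic macros and relax wrappers, so this is illustrative only):

#include <stdatomic.h>
#include <immintrin.h>        /* _mm_pause(), x86 */

void spin_lock(atomic_int *l)
{
	int expected = 0;

	while (!atomic_compare_exchange_weak(l, &expected, 1)) {
		expected = 0;     /* a failed CAS overwrote it with the current value */
		_mm_pause();      /* cpu_relax(): give the current holder a chance to finish */
	}
}

void spin_unlock(atomic_int *l)
{
	atomic_store(l, 0);
}

/* usage: static atomic_int lk; spin_lock(&lk); ...critical section...; spin_unlock(&lk); */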

> Fri, 5 Mar 2021 at 22:07 Willy Tarreau  wrote:
> 
> > Note, before 2.4, a single thread can execute Lua scripts at once,
> > with the others waiting behind, and if the Lua load is important, maybe
> > this can happen (but I've never experienced it yet, and the premption
> > interval is short enough not to cause issues in theory).
> >
> I'm not sure if related but during every reload, for a couple seconds all
> 12 threads on OLD pid are using 100% cpu and then one after one core return
> to normal usage, finally the old haproxy process exits. I have no idea why
> it is behaving like that.

It could be related but it's hard to tell. It is also possible that
for any reason the old threads constantly believe they have something
to do, for example a health check that doesn't get stopped and that
keeps ringing.

> > Maciej, if this happens often,
> > would you be interested in running one machine on 2.4-dev11 ?
> 
> It is a very rare issue and of course it occurs only in production
> environment. :(

Obviously!

> I'm very willing to test the 2.4 version, especially with
> that tasty lua optimization for multiple threads, but I can't do it on
> production until it's stable.

This makes sense. We'll try to issue 2.3 with some thread fixes this
week, maybe it will be a possible step for you then.

Cheers,
Willy



Re: [2.2.9] 100% CPU usage

2021-03-09 Thread Maciej Zdeb
Hi,

After applying the patch, the issue did not occur, however I'm still not
sure it is fixed. Unfortunately I don't have a reliable way to trigger it.

Fri, 5 Mar 2021 at 22:07 Willy Tarreau  wrote:

> Note, before 2.4, a single thread can execute Lua scripts at once,
> with the others waiting behind, and if the Lua load is important, maybe
> this can happen (but I've never experienced it yet, and the premption
> interval is short enough not to cause issues in theory).
>
I'm not sure if it's related, but during every reload, for a couple of seconds
all 12 threads on the OLD pid use 100% CPU, then one after another the cores
return to normal usage, and finally the old haproxy process exits. I have no
idea why it is behaving like that.


> Maciej, if this happens often,
> would you be interested in running one machine on 2.4-dev11 ?

It is a very rare issue and of course it occurs only in the production
environment. :( I'm very willing to test the 2.4 version, especially with
that tasty Lua optimization for multiple threads, but I can't do it in
production until it's stable.


> We'd
> need to have a quick look at your config (off-list if needed) to
> figure what Lua parts could run in multi-thread.
>
I'll provide configs (off the list) shortly!

Kind regards,


Re: [2.2.9] 100% CPU usage

2021-03-05 Thread Willy Tarreau
On Fri, Mar 05, 2021 at 12:00:52PM +0100, Christopher Faulet wrote:
> On 05/03/2021 at 11:35, Maciej Zdeb wrote:
> > Hi Christopher,
> > 
> > Thanks, I'll check but it'll take a couple days because the issue is
> > quite rare. I'll return with feedback!
> > 
> > Maybe the patch is not backported to 2.2 because of commit message that
> > states only 2.3 branch?
> > 
> 
> That's it. And it was finally backported in 2.2 and 2.1.

Note, before 2.4, only a single thread can execute Lua scripts at once,
with the others waiting behind, and if the Lua load is significant, maybe
this can happen (but I've never experienced it yet, and the preemption
interval is short enough not to cause issues in theory). However, the
trace shows an issue on setjmp(), which doesn't make much sense in
theory, unless we consider that it's triggered there because it's the
first syscall after waiting too long. Maciej, if this happens often,
would you be interested in running one machine on 2.4-dev11? We'd
need to have a quick look at your config (off-list if needed) to
figure out what Lua parts could run in multi-thread.
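
Conceptually, the pre-2.4 model looks like the sketch below (illustrative only, not HAProxy's implementation; assumes the Lua 5.3 C API and that gL was created elsewhere with luaL_newstate()): one shared lua_State behind a single lock, so only one thread runs Lua at a time while the others queue.

#include <lua.h>
#include <lauxlib.h>
#include <pthread.h>

static pthread_mutex_t lua_lock = PTHREAD_MUTEX_INITIALIZER;
static lua_State *gL;                      /* single shared Lua state */

int run_lua_snippet(const char *chunk)
{
	int status;

	pthread_mutex_lock(&lua_lock);     /* every other thread waits here */
	status = luaL_dostring(gL, chunk); /* load and run the chunk */
	pthread_mutex_unlock(&lua_lock);
	return status;
}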

Cheers,
Willy



Re: [2.2.9] 100% CPU usage

2021-03-05 Thread Christopher Faulet

On 05/03/2021 at 11:35, Maciej Zdeb wrote:

Hi Christopher,

Thanks, I'll check but it'll take a couple days because the issue is quite rare. 
I'll return with feedback!


Maybe the patch is not backported to 2.2 because of commit message that states 
only 2.3 branch?




That's it. And it was finally backported in 2.2 and 2.1.

--
Christopher Faulet



Re: [2.2.9] 100% CPU usage

2021-03-05 Thread Maciej Zdeb
Hi Christopher,

Thanks, I'll check but it'll take a couple of days because the issue is quite
rare. I'll return with feedback!

Maybe the patch was not backported to 2.2 because of the commit message, which
mentions only the 2.3 branch?

Kind regards,

Thu, 4 Mar 2021 at 22:34 Christopher Faulet 
wrote:

> On 04/03/2021 at 14:01, Maciej Zdeb wrote:
> > Hi,
> >
> > Sometimes after HAProxy reload it starts to loop infinitely, for example
> 9 of 10
> > threads using 100% CPU (gdb sessions attached). I've also dumped the
> core file
> > from gdb.
> >
> Hi Maciej,
>
> The 2.2.10 is out. But I'm afraid that a fix is missing. Could you test
> with the
> attached patch please ? On top of the 2.2.9 or 2.2.10, as you want.
>
> Thanks,
> --
> Christopher Faulet
>


Re: [2.2.9] 100% CPU usage

2021-03-04 Thread Christopher Faulet

On 04/03/2021 at 14:01, Maciej Zdeb wrote:

Hi,

Sometimes after HAProxy reload it starts to loop infinitely, for example 9 of 10 
threads using 100% CPU (gdb sessions attached). I've also dumped the core file 
from gdb.



Hi Maciej,

The 2.2.10 is out. But I'm afraid that a fix is missing. Could you test with the 
attached patch, please? On top of 2.2.9 or 2.2.10, as you want.


Thanks,
--
Christopher Faulet
>From 6406528c25ec73ff0a47696ac2decde95867fdda Mon Sep 17 00:00:00 2001
From: Olivier Houchard 
Date: Thu, 18 Feb 2021 23:55:30 +0100
Subject: [PATCH] BUG/MEDIUM: lists: Avoid an infinite loop in
 MT_LIST_TRY_ADDQ().

In MT_LIST_TRY_ADDQ(), deal with the "prev" field of the element before the
"next". If the element is the first in the list, then its next will
already have been locked when we locked list->prev->next, so locking it
again will fail, and we'll start over and over.

This should be backported to 2.3.

(cherry picked from commit 5567f41d0ab61dd6843535edc8081407d599024d)
Signed-off-by: Christopher Faulet 
(cherry picked from commit 6f682bea6ab08830d17ef3e973be6cc4d2474e69)
Signed-off-by: Christopher Faulet 
---
 include/haproxy/list.h | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/haproxy/list.h b/include/haproxy/list.h
index 548c9b0f52..6233b2c79f 100644
--- a/include/haproxy/list.h
+++ b/include/haproxy/list.h
@@ -300,28 +300,28 @@
 			__ha_barrier_store();  \
 			continue;  \
 		}  \
-		n2 = _HA_ATOMIC_XCHG(&el->next, MT_LIST_BUSY); \
-		if (n2 != el) { /* element already linked */   \
-			if (n2 != MT_LIST_BUSY)\
-el->next = n2; \
+		p2 = _HA_ATOMIC_XCHG(&el->prev, MT_LIST_BUSY); \
+		if (p2 != el) {\
+			if (p2 != MT_LIST_BUSY)\
+el->prev = p2; \
 			p->next = n;   \
 			__ha_barrier_store();  \
 			lh->prev = p;  \
 			__ha_barrier_store();  \
-			if (n2 == MT_LIST_BUSY)\
+			if (p2 == MT_LIST_BUSY)\
 continue;  \
 			break; \
 		}  \
-		p2 = _HA_ATOMIC_XCHG(&el->prev, MT_LIST_BUSY); \
-		if (p2 != el) {\
-			if (p2 != MT_LIST_BUSY)\
-el->prev = p2; \
+		n2 = _HA_ATOMIC_XCHG(&el->next, MT_LIST_BUSY); \
+		if (n2 != el) { /* element already linked */   \
+			if (n2 != MT_LIST_BUSY)\
+el->next = n2; \
 			p->next = n;   \
-			el->next = el; \
+			el->prev = el; \
 			__ha_barrier_store();  \
 			lh->prev = p;  \
 			__ha_barrier_store();  \
-			if (p2 == MT_LIST_BUSY)\
+			if (n2 == MT_LIST_BUSY)\
 continue;  \
 			break; \
 		}  \
-- 
2.29.2



[2.2.9] 100% CPU usage

2021-03-04 Thread Maciej Zdeb
Hi,

Sometimes after HAProxy reload it starts to loop infinitely, for example 9
of 10 threads using 100% CPU (gdb sessions attached). I've also dumped the
core file from gdb.

# haproxy -v
HA-Proxy version 2.2.9-a947cc2 2021/02/06 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2
2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.9.html
Running on: Linux 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC
2019 x86_64

thread with high CPU is looping on:
809 if (HA_ATOMIC_CAS(&threads_to_dump, &old, all_threads_mask))
(gdb)
811 ha_thread_relax();
(gdb)
809 if (HA_ATOMIC_CAS(&threads_to_dump, &old, all_threads_mask))
(gdb)
811 ha_thread_relax();

it tried to panic but is failing (from bt full):
#0  0x7fa73309cf37 in sched_yield () at
../sysdeps/unix/syscall-template.S:78
No locals.
#1  0x559253cb7fce in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#2  ha_thread_dump_all_to_trash () at src/debug.c:811
old = 
#3  0x559253cb802f in ha_panic () at src/debug.c:268
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x7fa73309cf37 in sched_yield () at ../sysdeps/unix/syscall-template.S:78
78  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt full
#0  0x7fa73309cf37 in sched_yield () at 
../sysdeps/unix/syscall-template.S:78
No locals.
#1  0x559253cb7fce in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#2  ha_thread_dump_all_to_trash () at src/debug.c:811
old = 
#3  0x559253cb802f in ha_panic () at src/debug.c:268
No locals.
#4  0x559253d23f08 in wdt_handler (sig=14, si=, 
arg=) at src/wdt.c:119
n = 
p = 
thr = 0
#5  
No locals.
#6  0x7fa73309cf37 in sched_yield () at 
../sysdeps/unix/syscall-template.S:78
No locals.
#7  0x559253cb9be5 in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#8  debug_handler (sig=, si=, arg=) at src/debug.c:859
sig = 
si = 
arg = 
#9  
No locals.
#10 0x559253bc642f in hlua_action (rule=0x5592596b5ba0, px=0x559259686260, 
sess=, s=0x55925b8f2fb0, flags=2) at src/hlua.c:6711
__pl_r = 4294967300
ret = 
arg = 
dir = 0
act_ret = 0
error = 
#11 0x559253c353c3 in http_req_get_intercept_rule 
(px=px@entry=0x559259686260, rules=rules@entry=0x5592596862b0, 
s=s@entry=0x55925b8f2fb0) at src/http_ana.c:2884
sess = 
txn = 0x55925b8f3790
rule = 0x5592596b5ba0
rule_ret = HTTP_RULE_RES_CONT
act_opts = 2
#12 0x559253c382f8 in http_process_req_common (s=s@entry=0x55925b8f2fb0, 
req=req@entry=0x55925b8f2fc0, an_bit=an_bit@entry=16, px=) at 
src/http_ana.c:501
sess = 
txn = 
msg = 
---Type  to continue, or q  to quit---
htx = 
rule = 
verdict = 
conn = 
#13 0x559253c48669 in process_stream (t=, 
context=0x55925b8f2fb0, state=) at src/stream.c:1781
max_loops = 
ana_list = 409648
ana_back = 409648
flags = 
srv = 
s = 0x55925b8f2fb0
sess = 
rqf_last = 
rpf_last = 2147483648
rq_prod_last = 
rq_cons_last = 0
rp_cons_last = 8
rp_prod_last = 0
req_ana_back = 
req = 0x55925b8f2fc0
res = 0x55925b8f3020
si_f = 0x55925b8f32e8
si_b = 0x55925b8f3340
rate = 
#14 0x559253d0a643 in run_tasks_from_lists 
(budgets=budgets@entry=0x7ffc669f811c) at src/task.c:483
process = 
tl_queues = 
t = 0x55925b8f34a0
budget_mask = 6 '\006'
done = 
queue = 
state = 
ctx = 
__ret = 
__n = 
__p = 
#15 0x559253d0b05d in process_runnable_tasks () at src/task.c:679
tt = 0x55925412f7c0 
lrq = 
grq = 
t = 
---Type  to continue, or q  to quit---
max = {0, 37, 25}
max_total = 
tmp_list = 
queue = 3
max_processed = 
#16 0x559253cc1df7 in run_poll_loop () at src/haproxy.c:2939
next = 
wake = 
#17 0x559253cc21a9 in run_thread_poll_loop (data=) at 
src/haproxy.c:3104
ptaf = 
ptif = 
ptdf = 
ptff = 
init_left = 0
init_mutex = pthread_mutex_t = {Type = Normal, Status = Not acquired, 
Robust = No, Shared = No, Protocol = None}
init_cond = pthread_cond_t = {Threads known to still execute a wait 
function = 0, Clock ID = CLOCK_REALTIME, Shared = No}
#18 0x559253b95840 in main (argc=, argv=0x7ffc669f8658) at 
src/haproxy.c:3803
blocked_sig = {__val = {1844674406710583, 18446744073709551615 
}}
old_sig = {__val = {0, 13925524187434950144, 0, 140722030216352, 8, 32, 
12884901897, 7, 48, 94086972698648, 80, 18446744073709409504, 0, 206158430211, 
0, 0}}
i = 
err = 
retry = 
limit =