Hi Christopher,
Bad news, patches didn't help. Attaching stacktraces, now it looks that
spoe that executes lua scripts (free_thread_spue_lua.txt) tried to malloc
twice. :(
Kind regards,
pon., 22 mar 2021 o 08:39 Maciej Zdeb <mac...@zdeb.pl> napisał(a):
> Hi Christopher,
>
> Thanks! I'm building a patched version and will return with feedback!
>
> Kind regards,
>
> pt., 19 mar 2021 o 16:40 Christopher Faulet <cfau...@haproxy.com>
> napisał(a):
>
>> Le 16/03/2021 à 13:46, Maciej Zdeb a écrit :
>> > Sorry for spam. In the last message I said that the old process (after
>> reload)
>> > is consuming cpu for lua processing and that's not true, it is
>> processing other
>> > things also.
>> >
>> > I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and
>> maybe 2.4
>> > branch. For each version I need a week or two to be sure the issue does
>> not
>> > occur. :(
>> >
>> > If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm
>> if there
>> > is any relation between infinite loops and custom configuration:
>> > - lua scripts (mainly used for header generation/manipulation),
>> > - spoe (used for sending metadata about each request to external
>> service),
>> > - peers (we have a cluster of 12 HAProxy servers connected to each
>> other).
>> >
>> Hi Maciej,
>>
>> I've read more carefully your backtraces, and indeed, it seems to be
>> related to
>> lua processing. I don't know if the watchdog is triggered because of the
>> lua or
>> if it is just a side-effect. But the lua execution is interrupted inside
>> the
>> memory allocator. And malloc/realloc are not async-signal-safe.
>> Unfortunately,
>> when the lua stack is dumped, the same allocator is also used. At this
>> stage,
>> because a lock was not properly released, HAProxy enter in a deadlock.
>>
>> On other threads, we loop in the watchdog, waiting for the hand to dump
>> the
>> thread information and that explains the 100% CPU usage you observe.
>>
>> So, to prevent this situation, the lua stack must not be dumped if it was
>> interrupted inside an unsafe part. It is the easiest way we found to
>> workaround
>> this bug. And because it is pretty rare, it should be good enough.
>>
>> However, I'm unable to reproduce the bug. Could you test attached patches
>> please
>> ? I attached patched for the 2.4, 2.3 and 2.2. Because you experienced
>> this bug
>> on the 2.2, it is probably easier to test patches for this version.
>>
>> If fixed, it could be good to figure out why the watchdog is triggered on
>> your
>> old processes.
>>
>> --
>> Christopher Faulet
>>
>
[Switching to thread 10 (Thread 0x7f5246c9d700 (LWP 15807))]
#0 __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or
directory.
(gdb) bt full
#0 __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
No locals.
#1 0x00007f52fb16c34b in __GI___libc_malloc (bytes=bytes@entry=41) at
malloc.c:3063
ignore1 = <optimized out>
ignore2 = <optimized out>
ignore3 = <optimized out>
ar_ptr = 0x7f5234000020
victim = <optimized out>
hook = <optimized out>
tbytes = <optimized out>
tc_idx = <optimized out>
__x = <optimized out>
ignore1 = <optimized out>
ignore2 = <optimized out>
ignore3 = <optimized out>
ignore = <optimized out>
#2 0x000055d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0
<hlua_global_allocator>, ptr=<optimized out>, osize=4, nsize=41) at
src/hlua.c:8210
zone = 0x55d9498e90d0 <hlua_global_allocator>
#3 0x000055d949622271 in luaM_realloc_ ()
No symbol table info available.
#4 0x000055d949621dbe in luaC_newobj ()
No symbol table info available.
#5 0x000055d949626f4d in internshrstr ()
No symbol table info available.
#6 0x000055d9496271fc in luaS_new ()
No symbol table info available.
#7 0x000055d94961bd03 in lua_pushstring ()
No symbol table info available.
#8 0x000055d94962d48a in luaL_traceback ()
No symbol table info available.
#9 0x000055d9495a1b62 in ha_task_dump (buf=buf@entry=0x7f5247c7f628,
task=0x7f5234057ba0, pfx=pfx@entry=0x55d949672f19 ' ' <repeats 13 times>) at
src/debug.c:227
s = <optimized out>
appctx = <optimized out>
hlua = 0x7f52341bdc60
#10 0x000055d9495a1fd7 in ha_thread_dump (buf=0x7f5247c7f628, thr=<optimized
out>, calling_tid=7) at src/debug.c:91
thr_bit = 512
p = 74138721
n = <optimized out>
stuck = 0
#11 0x000055d9495a2236 in debug_handler (sig=<optimized out>, si=<optimized
out>, arg=<optimized out>) at src/debug.c:847
sig = <optimized out>
---Type <return> to continue, or q <return> to quit---
si = <optimized out>
arg = <optimized out>
#12 <signal handler called>
No locals.
#13 _int_malloc (av=av@entry=0x7f5234000020, bytes=bytes@entry=56) at
malloc.c:4100
p = <optimized out>
iters = <optimized out>
nb = 64
idx = 5
bin = <optimized out>
victim = 0x7f52341cbff0
size = <optimized out>
victim_index = <optimized out>
remainder = 0x7f52341cc030
remainder_size = <optimized out>
block = <optimized out>
bit = <optimized out>
map = <optimized out>
fwd = <optimized out>
bck = <optimized out>
tcache_unsorted_count = 0
tcache_nb = 64
tc_idx = 2
return_cached = 0
__PRETTY_FUNCTION__ = "_int_malloc"
#14 0x00007f52fb16c35d in __GI___libc_malloc (bytes=bytes@entry=56) at
malloc.c:3065
ar_ptr = 0x7f5234000020
victim = <optimized out>
hook = <optimized out>
tbytes = <optimized out>
tc_idx = <optimized out>
__x = <optimized out>
ignore1 = <optimized out>
ignore2 = <optimized out>
ignore3 = <optimized out>
ignore = <optimized out>
#15 0x000055d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0
<hlua_global_allocator>, ptr=<optimized out>, osize=5, nsize=56) at
src/hlua.c:8210
zone = 0x55d9498e90d0 <hlua_global_allocator>
#16 0x000055d949622271 in luaM_realloc_ ()
No symbol table info available.
#17 0x000055d949621dbe in luaC_newobj ()
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
#18 0x000055d9496277f3 in luaH_new ()
No symbol table info available.
#19 0x000055d94961c221 in lua_createtable ()
No symbol table info available.
#20 0x000055d94949fab6 in hlua_txn_new (L=L@entry=0x7f52341cb638,
s=s@entry=0x7f52340576b0, p=p@entry=0x55d94f40bb90, dir=dir@entry=0,
flags=flags@entry=1) at src/hlua.c:5298
htxn = <optimized out>
#21 0x000055d9494ae723 in hlua_txn_new (flags=1, dir=0, p=0x55d94f40bb90,
s=0x7f52340576b0, L=0x7f52341cb638) at src/hlua.c:6406
htxn = <optimized out>
htxn = <optimized out>
#22 hlua_sample_fetch_wrapper (arg_p=0x55d94f4404b0, smp=0x7f5246c7be10,
kw=<optimized out>, private=0x55d94a6f91a0) at src/hlua.c:6436
fcn = <optimized out>
stream = 0x7f52340576b0
error = <optimized out>
#23 0x000055d9495449ed in sample_process (px=<optimized out>, sess=<optimized
out>, strm=strm@entry=0x7f52340576b0, opt=opt@entry=2, expr=0x55d94f440470,
p=0x7f5246c7be10, p@entry=0x0)
at src/sample.c:1063
conv_expr = <optimized out>
#24 0x000055d949583fe6 in spoe_encode_message (s=s@entry=0x7f52340576b0,
ctx=ctx@entry=0x7f5234057d40, msg=msg@entry=0x55d94f43d2c0, dir=dir@entry=0,
buf=buf@entry=0x7f5246c7a078,
end=end@entry=0x7f52341cb58c "") at src/flt_spoe.c:2200
smp = <optimized out>
arg = 0x55d94f440440
ret = <optimized out>
#25 0x000055d949587d51 in spoe_encode_messages (type=<optimized out>, dir=0,
messages=0x55d94f43d038, ctx=0x7f5234057d40, s=0x7f52340576b0) at
src/flt_spoe.c:2246
conf = <optimized out>
agent = <optimized out>
msg = 0x55d94f43d2c0
end = 0x7f52341cb58c ""
p = 0x7f52341c7b2e ""
conf = <optimized out>
agent = <optimized out>
msg = <optimized out>
p = <optimized out>
end = <optimized out>
#26 spoe_process_messages (s=s@entry=0x7f52340576b0,
ctx=ctx@entry=0x7f5234057d40, messages=0x55d94f43d038, dir=0,
type=type@entry=1) at src/flt_spoe.c:2687
conf = <optimized out>
agent = 0x55d94f43cf90
ret = 1
#27 0x000055d949589774 in spoe_process_event (s=0x7f52340576b0,
ctx=0x7f5234057d40, ev=ev@entry=SPOE_EV_ON_HTTP_REQ_FE) at src/flt_spoe.c:2803
conf = <optimized out>
agent = <optimized out>
dir = 0
---Type <return> to continue, or q <return> to quit---
ret = <optimized out>
#28 0x000055d949589994 in spoe_chn_pre_analyze (s=<optimized out>,
filter=<optimized out>, chn=0x7f52340576c0, an_bit=<optimized out>) at
src/flt_spoe.c:3287
ctx = <optimized out>
ret = 1
#29 0x000055d949590cdf in flt_pre_analyze (s=s@entry=0x7f52340576b0,
chn=chn@entry=0x7f52340576c0, an_bit=an_bit@entry=16) at src/filters.c:757
filter = 0x7f5234057cf0
ret = <optimized out>
#30 0x000055d94952fac2 in process_stream (t=<optimized out>,
context=0x7f52340576b0, state=<optimized out>) at src/stream.c:1781
max_loops = <optimized out>
ana_list = 409648
ana_back = 409648
flags = <optimized out>
srv = <optimized out>
s = 0x7f52340576b0
sess = <optimized out>
rqf_last = <optimized out>
rpf_last = 2147483648
rq_prod_last = <optimized out>
rq_cons_last = 0
rp_cons_last = 8
rp_prod_last = 0
req_ana_back = <optimized out>
req = 0x7f52340576c0
res = 0x7f5234057720
si_f = 0x7f52340579e8
si_b = 0x7f5234057a40
rate = <optimized out>
#31 0x000055d9495f3233 in run_tasks_from_lists
(budgets=budgets@entry=0x7f5246c7a35c) at src/task.c:483
process = <optimized out>
tl_queues = <optimized out>
t = 0x7f5234057ba0
budget_mask = 7 '\a'
done = <optimized out>
queue = <optimized out>
state = <optimized out>
ctx = <optimized out>
__ret = <optimized out>
__n = <optimized out>
__p = <optimized out>
#32 0x000055d9495f3c4d in process_runnable_tasks () at src/task.c:679
tt = 0x55d949a18c40 <task_per_thread+1152>
---Type <return> to continue, or q <return> to quit---
lrq = <optimized out>
grq = <optimized out>
t = <optimized out>
max = {0, 105, 36}
max_total = <optimized out>
tmp_list = <optimized out>
queue = 3
max_processed = <optimized out>
#33 0x000055d9495aa517 in run_poll_loop () at src/haproxy.c:2942
next = <optimized out>
wake = <optimized out>
#34 0x000055d9495aa8c9 in run_thread_poll_loop (data=<optimized out>) at
src/haproxy.c:3107
ptaf = <optimized out>
ptif = <optimized out>
ptdf = <optimized out>
ptff = <optimized out>
init_left = 0
init_mutex = pthread_mutex_t = {Type = Normal, Status = Not acquired,
Robust = No, Shared = No, Protocol = None}
init_cond = pthread_cond_t = {Threads known to still execute a wait
function = 0, Clock ID = CLOCK_REALTIME, Shared = No}
#35 0x00007f52fc2606db in start_thread (arg=0x7f5246c9d700) at
pthread_create.c:463
pd = 0x7f5246c9d700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139991351678720,
-824606276042670708, 139991351534848, 0, 9, 140728503817168,
732650204576824716, 732838492675980684}, mask_was_saved = 0}},
priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup =
0x0, canceltype = 0}}}
not_first_call = <optimized out>
#36 0x00007f52fb1f6a3f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.
(gdb) gcore
warning: target file /proc/15798/cmdline contained unexpected null characters
Saved corefile core.15798
(gdb) bt full
#0 0x00007f52fb1d8f37 in sched_yield () at
../sysdeps/unix/syscall-template.S:78
No locals.
#1 0x000055d9495a062e in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#2 ha_thread_dump_all_to_trash () at src/debug.c:815
old = <optimized out>
#3 0x000055d9495a068f in ha_panic () at src/debug.c:272
No locals.
#4 0x000055d94960cbe8 in wdt_handler (sig=14, si=<optimized out>,
arg=<optimized out>) at src/wdt.c:119
n = <optimized out>
p = <optimized out>
thr = 0
#5 <signal handler called>
No locals.
#6 0x00007f52fb1d8f37 in sched_yield () at
../sysdeps/unix/syscall-template.S:78
No locals.
#7 0x000055d9495a2275 in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#8 debug_handler (sig=<optimized out>, si=<optimized out>, arg=<optimized
out>) at src/debug.c:863
sig = <optimized out>
si = <optimized out>
arg = <optimized out>
#9 <signal handler called>
No locals.
#10 _int_malloc (av=av@entry=0x7f52fb4c0c40 <main_arena>,
bytes=bytes@entry=160) at malloc.c:4100
p = <optimized out>
iters = <optimized out>
nb = 176
idx = 12
bin = <optimized out>
victim = 0x55d94fd92f70
size = <optimized out>
victim_index = <optimized out>
remainder = 0x55d94fd93020
remainder_size = <optimized out>
block = <optimized out>
bit = <optimized out>
map = <optimized out>
fwd = <optimized out>
bck = <optimized out>
tcache_unsorted_count = 0
---Type <return> to continue, or q <return> to quit---
tcache_nb = 176
tc_idx = 9
return_cached = 0
__PRETTY_FUNCTION__ = "_int_malloc"
#11 0x00007f52fb16c35d in __GI___libc_malloc (bytes=bytes@entry=160) at
malloc.c:3065
ar_ptr = 0x7f52fb4c0c40 <main_arena>
victim = <optimized out>
hook = <optimized out>
tbytes = <optimized out>
tc_idx = <optimized out>
__x = <optimized out>
ignore1 = <optimized out>
ignore2 = <optimized out>
ignore3 = <optimized out>
ignore = <optimized out>
#12 0x000055d9495c85ca in pool_alloc_area (size=160) at
include/haproxy/pool-os.h:41
No locals.
#13 __pool_refill_alloc (pool=pool@entry=0x55d9499ee300 <pool_base_start+384>,
avail=22596, avail@entry=0) at src/pool.c:202
ptr = <optimized out>
free_list = <optimized out>
failed = 0
size = 160
limit = 0
allocated = 22596
allocated_orig = 22596
#14 0x000055d94957814c in pool_alloc_dirty (pool=0x55d9499ee300
<pool_base_start+384>) at include/haproxy/pool.h:305
p = 0x584000
p = <optimized out>
#15 pool_alloc (pool=0x55d9499ee300 <pool_base_start+384>) at
include/haproxy/pool.h:321
p = <optimized out>
p = <optimized out>
#16 __stksess_new (t=t@entry=0x55d94a7a2e20, key=0x0) at src/stick_table.c:267
ts = <optimized out>
#17 0x000055d9495781c5 in stksess_new (t=0x55d94a7a2e20, key=key@entry=0x0) at
src/stick_table.c:291
ts = <optimized out>
#18 0x000055d9495ad9b7 in peer_treat_updatemsg (p=p@entry=0x55d94a79a660,
updt=1, exp=1, msg_cur=msg_cur@entry=0x7ffde87ad1d0, msg_end=0x55d94a6e5911 "
to kill the process.\n", msg_len=17, totl=20,
appctx=<optimized out>, appctx=<optimized out>) at src/peers.c:1571
si = 0x55d94a757360
st = 0x55d94fb68850
ts = <optimized out>
newts = <optimized out>
---Type <return> to continue, or q <return> to quit---
update = <optimized out>
expire = 4003
data_type = <optimized out>
data_ptr = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__pl_r = <optimized out>
__pl_i = <optimized out>
ret = <optimized out>
__pl_i = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
#19 0x000055d9495b120a in peer_treat_awaited_msg (totl=20, msg_len=<optimized
out>, msg_end=<optimized out>, msg_cur=0x7ffde87ad1d0, msg_head=0x7ffde87ad231
"\n\205\021\200", peer=0x55d94a79a660,
---Type <return> to continue, or q <return> to quit---
appctx=0x55d94a7544f0) at src/peers.c:2142
update = <optimized out>
expire = <optimized out>
si = 0x55d94a757360
s = <optimized out>
peers = <optimized out>
si = <optimized out>
s = <optimized out>
peers = <optimized out>
st = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
st = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
update = <optimized out>
expire = <optimized out>
#20 peer_io_handler (appctx=0x55d94a7544f0) at src/peers.c:2658
msg_len = <optimized out>
msg_end = <optimized out>
msg_head = "\n\205\021\200\000I)"
msg_cur = 0x55d94a6e5908 " is about to kill the process.\n"
totl = 20
si = 0x55d94a757360
s = <optimized out>
---Type <return> to continue, or q <return> to quit---
curpeers = 0x55d94a7935d0
curpeer = 0x55d94a79a660
reql = <optimized out>
repl = <optimized out>
maj_ver = <optimized out>
min_ver = <optimized out>
prev_state = 7
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
__x = <optimized out>
__l = <optimized out>
#21 0x000055d9495ef3bf in task_run_applet (t=0x55d94a757570,
context=0x55d94a7544f0, state=<optimized out>) at src/applet.c:88
app = 0x55d94a7544f0
si = 0x55d94a757360
rate = <optimized out>
#22 0x000055d9495f321e in run_tasks_from_lists
(budgets=budgets@entry=0x7ffde87ad36c) at src/task.c:485
process = <optimized out>
tl_queues = <optimized out>
t = 0x55d94a757570
budget_mask = 7 '\a'
done = <optimized out>
queue = <optimized out>
state = <optimized out>
ctx = <optimized out>
__ret = <optimized out>
__n = <optimized out>
__p = <optimized out>
#23 0x000055d9495f3c4d in process_runnable_tasks () at src/task.c:679
tt = 0x55d949a187c0 <task_per_thread>
lrq = <optimized out>
grq = <optimized out>
t = <optimized out>
max = {0, 159, 0}
max_total = <optimized out>
tmp_list = <optimized out>
queue = 3
max_processed = <optimized out>
---Type <return> to continue, or q <return> to quit---
#24 0x000055d9495aa517 in run_poll_loop () at src/haproxy.c:2942
next = <optimized out>
wake = <optimized out>
#25 0x000055d9495aa8c9 in run_thread_poll_loop (data=<optimized out>) at
src/haproxy.c:3107
ptaf = <optimized out>
ptif = <optimized out>
ptdf = <optimized out>
ptff = <optimized out>
init_left = 0
init_mutex = pthread_mutex_t = {Type = Normal, Status = Not acquired,
Robust = No, Shared = No, Protocol = None}
init_cond = pthread_cond_t = {Threads known to still execute a wait
function = 0, Clock ID = CLOCK_REALTIME, Shared = No}
#26 0x000055d94947c990 in main (argc=<optimized out>, argv=0x7ffde87ad8a8) at
src/haproxy.c:3806
blocked_sig = {__val = {18446744067199990583, 18446744073709551615
<repeats 15 times>}}
old_sig = {__val = {0, 9271568505485978368, 139994407574496,
140728503817968, 8, 32, 12884901897, 7, 48, 94391744880664, 80,
18446744073709409488, 0, 206158430211, 0, 0}}
i = <optimized out>
err = <optimized out>
retry = <optimized out>
limit = {rlim_cur = 4000364, rlim_max = 6000000}
errmsg =
"\000Øzèý\177\000\000ÚgZIÙU\000\000\060Ô\214IÙU\000\000ðÖzèý\177\000\000\b\000\000\000\000\000\000\000ÏÕGIÙU\000\000°AnJÙU\000\000\000G\rûR\177\000\000\017",
'\000' <repeats 34 times>
pidfd = <optimized out>
(gdb)