Hi Christopher,

Bad news, patches didn't help. Attaching stacktraces, now it looks that
spoe that executes lua scripts (free_thread_spue_lua.txt) tried to malloc
twice. :(

Kind regards,

pon., 22 mar 2021 o 08:39 Maciej Zdeb <mac...@zdeb.pl> napisał(a):

> Hi Christopher,
>
> Thanks! I'm building a patched version and will return with feedback!
>
> Kind regards,
>
> pt., 19 mar 2021 o 16:40 Christopher Faulet <cfau...@haproxy.com>
> napisał(a):
>
>> Le 16/03/2021 à 13:46, Maciej Zdeb a écrit :
>> > Sorry for spam. In the last message I said that the old process (after
>> reload)
>> > is consuming cpu for lua processing and that's not true, it is
>> processing other
>> > things also.
>> >
>> > I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and
>> maybe 2.4
>> > branch. For each version I need a week or two to be sure the issue does
>> not
>> > occur. :(
>> >
>> > If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm
>> if there
>> > is any relation between infinite loops and custom configuration:
>> > - lua scripts (mainly used for header generation/manipulation),
>> > - spoe (used for sending metadata about each request to external
>> service),
>> > - peers (we have a cluster of 12 HAProxy servers connected to each
>> other).
>> >
>> Hi Maciej,
>>
>> I've read more carefully your backtraces, and indeed, it seems to be
>> related to
>> lua processing. I don't know if the watchdog is triggered because of the
>> lua or
>> if it is just a side-effect. But the lua execution is interrupted inside
>> the
>> memory allocator. And malloc/realloc are not async-signal-safe.
>> Unfortunately,
>> when the lua stack is dumped, the same allocator is also used. At this
>> stage,
>> because a lock was not properly released, HAProxy enter in a deadlock.
>>
>> On other threads, we loop in the watchdog, waiting for the hand to dump
>> the
>> thread information and that explains the 100% CPU usage you observe.
>>
>> So, to prevent this situation, the lua stack must not be dumped if it was
>> interrupted inside an unsafe part. It is the easiest way we found to
>> workaround
>> this bug. And because it is pretty rare, it should be good enough.
>>
>> However, I'm unable to reproduce the bug. Could you test attached patches
>> please
>> ? I attached patched for the 2.4, 2.3 and 2.2. Because you experienced
>> this bug
>> on the 2.2, it is probably easier to test patches for this version.
>>
>> If fixed, it could be good to figure out why the watchdog is triggered on
>> your
>> old processes.
>>
>> --
>> Christopher Faulet
>>
>
[Switching to thread 10 (Thread 0x7f5246c9d700 (LWP 15807))]
#0  __lll_lock_wait_private () at 
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95      ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or 
directory.
(gdb) bt full
#0  __lll_lock_wait_private () at 
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
No locals.
#1  0x00007f52fb16c34b in __GI___libc_malloc (bytes=bytes@entry=41) at 
malloc.c:3063
        ignore1 = <optimized out>
        ignore2 = <optimized out>
        ignore3 = <optimized out>
        ar_ptr = 0x7f5234000020
        victim = <optimized out>
        hook = <optimized out>
        tbytes = <optimized out>
        tc_idx = <optimized out>
        __x = <optimized out>
        ignore1 = <optimized out>
        ignore2 = <optimized out>
        ignore3 = <optimized out>
        ignore = <optimized out>
#2  0x000055d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0 
<hlua_global_allocator>, ptr=<optimized out>, osize=4, nsize=41) at 
src/hlua.c:8210
        zone = 0x55d9498e90d0 <hlua_global_allocator>
#3  0x000055d949622271 in luaM_realloc_ ()
No symbol table info available.
#4  0x000055d949621dbe in luaC_newobj ()
No symbol table info available.
#5  0x000055d949626f4d in internshrstr ()
No symbol table info available.
#6  0x000055d9496271fc in luaS_new ()
No symbol table info available.
#7  0x000055d94961bd03 in lua_pushstring ()
No symbol table info available.
#8  0x000055d94962d48a in luaL_traceback ()
No symbol table info available.
#9  0x000055d9495a1b62 in ha_task_dump (buf=buf@entry=0x7f5247c7f628, 
task=0x7f5234057ba0, pfx=pfx@entry=0x55d949672f19 ' ' <repeats 13 times>) at 
src/debug.c:227
        s = <optimized out>
        appctx = <optimized out>
        hlua = 0x7f52341bdc60
#10 0x000055d9495a1fd7 in ha_thread_dump (buf=0x7f5247c7f628, thr=<optimized 
out>, calling_tid=7) at src/debug.c:91
        thr_bit = 512
        p = 74138721
        n = <optimized out>
        stuck = 0
#11 0x000055d9495a2236 in debug_handler (sig=<optimized out>, si=<optimized 
out>, arg=<optimized out>) at src/debug.c:847
        sig = <optimized out>
---Type <return> to continue, or q <return> to quit---
        si = <optimized out>
        arg = <optimized out>
#12 <signal handler called>
No locals.
#13 _int_malloc (av=av@entry=0x7f5234000020, bytes=bytes@entry=56) at 
malloc.c:4100
        p = <optimized out>
        iters = <optimized out>
        nb = 64
        idx = 5
        bin = <optimized out>
        victim = 0x7f52341cbff0
        size = <optimized out>
        victim_index = <optimized out>
        remainder = 0x7f52341cc030
        remainder_size = <optimized out>
        block = <optimized out>
        bit = <optimized out>
        map = <optimized out>
        fwd = <optimized out>
        bck = <optimized out>
        tcache_unsorted_count = 0
        tcache_nb = 64
        tc_idx = 2
        return_cached = 0
        __PRETTY_FUNCTION__ = "_int_malloc"
#14 0x00007f52fb16c35d in __GI___libc_malloc (bytes=bytes@entry=56) at 
malloc.c:3065
        ar_ptr = 0x7f5234000020
        victim = <optimized out>
        hook = <optimized out>
        tbytes = <optimized out>
        tc_idx = <optimized out>
        __x = <optimized out>
        ignore1 = <optimized out>
        ignore2 = <optimized out>
        ignore3 = <optimized out>
        ignore = <optimized out>
#15 0x000055d94949e9a9 in hlua_alloc (ud=0x55d9498e90d0 
<hlua_global_allocator>, ptr=<optimized out>, osize=5, nsize=56) at 
src/hlua.c:8210
        zone = 0x55d9498e90d0 <hlua_global_allocator>
#16 0x000055d949622271 in luaM_realloc_ ()
No symbol table info available.
#17 0x000055d949621dbe in luaC_newobj ()
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
#18 0x000055d9496277f3 in luaH_new ()
No symbol table info available.
#19 0x000055d94961c221 in lua_createtable ()
No symbol table info available.
#20 0x000055d94949fab6 in hlua_txn_new (L=L@entry=0x7f52341cb638, 
s=s@entry=0x7f52340576b0, p=p@entry=0x55d94f40bb90, dir=dir@entry=0, 
flags=flags@entry=1) at src/hlua.c:5298
        htxn = <optimized out>
#21 0x000055d9494ae723 in hlua_txn_new (flags=1, dir=0, p=0x55d94f40bb90, 
s=0x7f52340576b0, L=0x7f52341cb638) at src/hlua.c:6406
        htxn = <optimized out>
        htxn = <optimized out>
#22 hlua_sample_fetch_wrapper (arg_p=0x55d94f4404b0, smp=0x7f5246c7be10, 
kw=<optimized out>, private=0x55d94a6f91a0) at src/hlua.c:6436
        fcn = <optimized out>
        stream = 0x7f52340576b0
        error = <optimized out>
#23 0x000055d9495449ed in sample_process (px=<optimized out>, sess=<optimized 
out>, strm=strm@entry=0x7f52340576b0, opt=opt@entry=2, expr=0x55d94f440470, 
p=0x7f5246c7be10, p@entry=0x0)
    at src/sample.c:1063
        conv_expr = <optimized out>
#24 0x000055d949583fe6 in spoe_encode_message (s=s@entry=0x7f52340576b0, 
ctx=ctx@entry=0x7f5234057d40, msg=msg@entry=0x55d94f43d2c0, dir=dir@entry=0, 
buf=buf@entry=0x7f5246c7a078,
    end=end@entry=0x7f52341cb58c "") at src/flt_spoe.c:2200
        smp = <optimized out>
        arg = 0x55d94f440440
        ret = <optimized out>
#25 0x000055d949587d51 in spoe_encode_messages (type=<optimized out>, dir=0, 
messages=0x55d94f43d038, ctx=0x7f5234057d40, s=0x7f52340576b0) at 
src/flt_spoe.c:2246
        conf = <optimized out>
        agent = <optimized out>
        msg = 0x55d94f43d2c0
        end = 0x7f52341cb58c ""
        p = 0x7f52341c7b2e ""
        conf = <optimized out>
        agent = <optimized out>
        msg = <optimized out>
        p = <optimized out>
        end = <optimized out>
#26 spoe_process_messages (s=s@entry=0x7f52340576b0, 
ctx=ctx@entry=0x7f5234057d40, messages=0x55d94f43d038, dir=0, 
type=type@entry=1) at src/flt_spoe.c:2687
        conf = <optimized out>
        agent = 0x55d94f43cf90
        ret = 1
#27 0x000055d949589774 in spoe_process_event (s=0x7f52340576b0, 
ctx=0x7f5234057d40, ev=ev@entry=SPOE_EV_ON_HTTP_REQ_FE) at src/flt_spoe.c:2803
        conf = <optimized out>
        agent = <optimized out>
        dir = 0
---Type <return> to continue, or q <return> to quit---
        ret = <optimized out>
#28 0x000055d949589994 in spoe_chn_pre_analyze (s=<optimized out>, 
filter=<optimized out>, chn=0x7f52340576c0, an_bit=<optimized out>) at 
src/flt_spoe.c:3287
        ctx = <optimized out>
        ret = 1
#29 0x000055d949590cdf in flt_pre_analyze (s=s@entry=0x7f52340576b0, 
chn=chn@entry=0x7f52340576c0, an_bit=an_bit@entry=16) at src/filters.c:757
        filter = 0x7f5234057cf0
        ret = <optimized out>
#30 0x000055d94952fac2 in process_stream (t=<optimized out>, 
context=0x7f52340576b0, state=<optimized out>) at src/stream.c:1781
        max_loops = <optimized out>
        ana_list = 409648
        ana_back = 409648
        flags = <optimized out>
        srv = <optimized out>
        s = 0x7f52340576b0
        sess = <optimized out>
        rqf_last = <optimized out>
        rpf_last = 2147483648
        rq_prod_last = <optimized out>
        rq_cons_last = 0
        rp_cons_last = 8
        rp_prod_last = 0
        req_ana_back = <optimized out>
        req = 0x7f52340576c0
        res = 0x7f5234057720
        si_f = 0x7f52340579e8
        si_b = 0x7f5234057a40
        rate = <optimized out>
#31 0x000055d9495f3233 in run_tasks_from_lists 
(budgets=budgets@entry=0x7f5246c7a35c) at src/task.c:483
        process = <optimized out>
        tl_queues = <optimized out>
        t = 0x7f5234057ba0
        budget_mask = 7 '\a'
        done = <optimized out>
        queue = <optimized out>
        state = <optimized out>
        ctx = <optimized out>
        __ret = <optimized out>
        __n = <optimized out>
        __p = <optimized out>
#32 0x000055d9495f3c4d in process_runnable_tasks () at src/task.c:679
        tt = 0x55d949a18c40 <task_per_thread+1152>
---Type <return> to continue, or q <return> to quit---
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 105, 36}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = 3
        max_processed = <optimized out>
#33 0x000055d9495aa517 in run_poll_loop () at src/haproxy.c:2942
        next = <optimized out>
        wake = <optimized out>
#34 0x000055d9495aa8c9 in run_thread_poll_loop (data=<optimized out>) at 
src/haproxy.c:3107
        ptaf = <optimized out>
        ptif = <optimized out>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = pthread_mutex_t = {Type = Normal, Status = Not acquired, 
Robust = No, Shared = No, Protocol = None}
        init_cond = pthread_cond_t = {Threads known to still execute a wait 
function = 0, Clock ID = CLOCK_REALTIME, Shared = No}
#35 0x00007f52fc2606db in start_thread (arg=0x7f5246c9d700) at 
pthread_create.c:463
        pd = 0x7f5246c9d700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139991351678720, 
-824606276042670708, 139991351534848, 0, 9, 140728503817168, 
732650204576824716, 732838492675980684}, mask_was_saved = 0}},
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 
0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#36 0x00007f52fb1f6a3f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.
(gdb) gcore
warning: target file /proc/15798/cmdline contained unexpected null characters
Saved corefile core.15798
(gdb) bt full
#0  0x00007f52fb1d8f37 in sched_yield () at 
../sysdeps/unix/syscall-template.S:78
No locals.
#1  0x000055d9495a062e in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#2  ha_thread_dump_all_to_trash () at src/debug.c:815
        old = <optimized out>
#3  0x000055d9495a068f in ha_panic () at src/debug.c:272
No locals.
#4  0x000055d94960cbe8 in wdt_handler (sig=14, si=<optimized out>, 
arg=<optimized out>) at src/wdt.c:119
        n = <optimized out>
        p = <optimized out>
        thr = 0
#5  <signal handler called>
No locals.
#6  0x00007f52fb1d8f37 in sched_yield () at 
../sysdeps/unix/syscall-template.S:78
No locals.
#7  0x000055d9495a2275 in ha_thread_relax () at include/haproxy/thread.h:233
No locals.
#8  debug_handler (sig=<optimized out>, si=<optimized out>, arg=<optimized 
out>) at src/debug.c:863
        sig = <optimized out>
        si = <optimized out>
        arg = <optimized out>
#9  <signal handler called>
No locals.
#10 _int_malloc (av=av@entry=0x7f52fb4c0c40 <main_arena>, 
bytes=bytes@entry=160) at malloc.c:4100
        p = <optimized out>
        iters = <optimized out>
        nb = 176
        idx = 12
        bin = <optimized out>
        victim = 0x55d94fd92f70
        size = <optimized out>
        victim_index = <optimized out>
        remainder = 0x55d94fd93020
        remainder_size = <optimized out>
        block = <optimized out>
        bit = <optimized out>
        map = <optimized out>
        fwd = <optimized out>
        bck = <optimized out>
        tcache_unsorted_count = 0
---Type <return> to continue, or q <return> to quit---
        tcache_nb = 176
        tc_idx = 9
        return_cached = 0
        __PRETTY_FUNCTION__ = "_int_malloc"
#11 0x00007f52fb16c35d in __GI___libc_malloc (bytes=bytes@entry=160) at 
malloc.c:3065
        ar_ptr = 0x7f52fb4c0c40 <main_arena>
        victim = <optimized out>
        hook = <optimized out>
        tbytes = <optimized out>
        tc_idx = <optimized out>
        __x = <optimized out>
        ignore1 = <optimized out>
        ignore2 = <optimized out>
        ignore3 = <optimized out>
        ignore = <optimized out>
#12 0x000055d9495c85ca in pool_alloc_area (size=160) at 
include/haproxy/pool-os.h:41
No locals.
#13 __pool_refill_alloc (pool=pool@entry=0x55d9499ee300 <pool_base_start+384>, 
avail=22596, avail@entry=0) at src/pool.c:202
        ptr = <optimized out>
        free_list = <optimized out>
        failed = 0
        size = 160
        limit = 0
        allocated = 22596
        allocated_orig = 22596
#14 0x000055d94957814c in pool_alloc_dirty (pool=0x55d9499ee300 
<pool_base_start+384>) at include/haproxy/pool.h:305
        p = 0x584000
        p = <optimized out>
#15 pool_alloc (pool=0x55d9499ee300 <pool_base_start+384>) at 
include/haproxy/pool.h:321
        p = <optimized out>
        p = <optimized out>
#16 __stksess_new (t=t@entry=0x55d94a7a2e20, key=0x0) at src/stick_table.c:267
        ts = <optimized out>
#17 0x000055d9495781c5 in stksess_new (t=0x55d94a7a2e20, key=key@entry=0x0) at 
src/stick_table.c:291
        ts = <optimized out>
#18 0x000055d9495ad9b7 in peer_treat_updatemsg (p=p@entry=0x55d94a79a660, 
updt=1, exp=1, msg_cur=msg_cur@entry=0x7ffde87ad1d0, msg_end=0x55d94a6e5911 " 
to kill the process.\n", msg_len=17, totl=20,
    appctx=<optimized out>, appctx=<optimized out>) at src/peers.c:1571
        si = 0x55d94a757360
        st = 0x55d94fb68850
        ts = <optimized out>
        newts = <optimized out>
---Type <return> to continue, or q <return> to quit---
        update = <optimized out>
        expire = 4003
        data_type = <optimized out>
        data_ptr = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __pl_r = <optimized out>
        __pl_i = <optimized out>
        ret = <optimized out>
        __pl_i = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
#19 0x000055d9495b120a in peer_treat_awaited_msg (totl=20, msg_len=<optimized 
out>, msg_end=<optimized out>, msg_cur=0x7ffde87ad1d0, msg_head=0x7ffde87ad231 
"\n\205\021\200", peer=0x55d94a79a660,
---Type <return> to continue, or q <return> to quit---
    appctx=0x55d94a7544f0) at src/peers.c:2142
        update = <optimized out>
        expire = <optimized out>
        si = 0x55d94a757360
        s = <optimized out>
        peers = <optimized out>
        si = <optimized out>
        s = <optimized out>
        peers = <optimized out>
        st = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        st = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        update = <optimized out>
        expire = <optimized out>
#20 peer_io_handler (appctx=0x55d94a7544f0) at src/peers.c:2658
        msg_len = <optimized out>
        msg_end = <optimized out>
        msg_head = "\n\205\021\200\000I)"
        msg_cur = 0x55d94a6e5908 " is about to kill the process.\n"
        totl = 20
        si = 0x55d94a757360
        s = <optimized out>
---Type <return> to continue, or q <return> to quit---
        curpeers = 0x55d94a7935d0
        curpeer = 0x55d94a79a660
        reql = <optimized out>
        repl = <optimized out>
        maj_ver = <optimized out>
        min_ver = <optimized out>
        prev_state = 7
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
        __x = <optimized out>
        __l = <optimized out>
#21 0x000055d9495ef3bf in task_run_applet (t=0x55d94a757570, 
context=0x55d94a7544f0, state=<optimized out>) at src/applet.c:88
        app = 0x55d94a7544f0
        si = 0x55d94a757360
        rate = <optimized out>
#22 0x000055d9495f321e in run_tasks_from_lists 
(budgets=budgets@entry=0x7ffde87ad36c) at src/task.c:485
        process = <optimized out>
        tl_queues = <optimized out>
        t = 0x55d94a757570
        budget_mask = 7 '\a'
        done = <optimized out>
        queue = <optimized out>
        state = <optimized out>
        ctx = <optimized out>
        __ret = <optimized out>
        __n = <optimized out>
        __p = <optimized out>
#23 0x000055d9495f3c4d in process_runnable_tasks () at src/task.c:679
        tt = 0x55d949a187c0 <task_per_thread>
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 159, 0}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = 3
        max_processed = <optimized out>
---Type <return> to continue, or q <return> to quit---
#24 0x000055d9495aa517 in run_poll_loop () at src/haproxy.c:2942
        next = <optimized out>
        wake = <optimized out>
#25 0x000055d9495aa8c9 in run_thread_poll_loop (data=<optimized out>) at 
src/haproxy.c:3107
        ptaf = <optimized out>
        ptif = <optimized out>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = pthread_mutex_t = {Type = Normal, Status = Not acquired, 
Robust = No, Shared = No, Protocol = None}
        init_cond = pthread_cond_t = {Threads known to still execute a wait 
function = 0, Clock ID = CLOCK_REALTIME, Shared = No}
#26 0x000055d94947c990 in main (argc=<optimized out>, argv=0x7ffde87ad8a8) at 
src/haproxy.c:3806
        blocked_sig = {__val = {18446744067199990583, 18446744073709551615 
<repeats 15 times>}}
        old_sig = {__val = {0, 9271568505485978368, 139994407574496, 
140728503817968, 8, 32, 12884901897, 7, 48, 94391744880664, 80, 
18446744073709409488, 0, 206158430211, 0, 0}}
        i = <optimized out>
        err = <optimized out>
        retry = <optimized out>
        limit = {rlim_cur = 4000364, rlim_max = 6000000}
        errmsg = 
"\000Øzèý\177\000\000ÚgZIÙU\000\000\060Ô\214IÙU\000\000ðÖzèý\177\000\000\b\000\000\000\000\000\000\000ÏÕGIÙU\000\000°AnJÙU\000\000\000G\rûR\177\000\000\017",
 '\000' <repeats 34 times>
        pidfd = <optimized out>
(gdb)

Reply via email to