> On 8 May 2018, at 00:33, PiBa-NL <[email protected]> wrote: > > Hi List, Thierry, > > Actually this is not limited to restarts, and also happens with 1.9dev. It > now happens while haproxy was running for a while and no restart was > attempted while running/debugging in my NetBeans IDE.. > > Root cause imo is that hlua_socket_receive_yield and hlua_socket_release both > try and acquire the same lock. > > > For debugging purposes ive added some code in hlua_socket_receive_yield(..) > before the stream_int_notify: > > struct channel *ic2 = si_ic(si); > struct channel *oc2 = si_oc(si); > ha_warning("hlua_socket_receive_yield calling notify peer:%9x > si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, > oc2->flags, ic2->flags); > stream_int_notify(&s->si[0]); > > And: > static void hlua_socket_release(struct appctx *appctx) > { > struct xref *peer; > if (appctx->ctx.hlua_cosocket.xref.peer > 1) > ha_warning("hlua_socket_release peer: %9x %9x\n", > appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer); > else > ha_warning("hlua_socket_release peer: %9x 0\n", > appctx->ctx.hlua_cosocket.xref); > > > And also added code in xref_get_peer_and_lock(..): > static inline struct xref *xref_get_peer_and_lock(struct xref *xref) > { > if (xref->peer == 1) { > printf(" xref_get_peer_and_lock xref->peer == 1 \n"); > } > > > This produces the logging: > > [WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify peer: > 2355590 si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024 > [WARNING] 127/001127 (36579) : hlua_socket_release peer: 1 0 > xref_get_peer_and_lock xref->peer == 1 > > When xref_get_peer_and_lock is called with a parameter xref->peer value of 1 > then it looks like it keeps swapping 1 and 1 until it is not 1, that never > happens.. > > As for the oc2.flags it contains the CF_SHUTW_NOW.. of which im still not > 100% when that flag is exactly set to get a foolproof reproduction.. but it > happens on pretty much a daily basis for me in production and in test i can > now usually trigger it after a few testruns with no actual traffic passing > along within the first minute of running (healthchecks are performed on > several backend, and a mail or 2 is send by the lua code during this startup > period..).. with the full production config.. > > Below the stacktrace that comes with it.. > > xref_get_peer_and_lock (xref=0x802355590) at > P:\Git\haproxy\include\common\xref.h:37 > hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595 > si_applet_release (si=0x8023514c8) at > P:\Git\haproxy\include\proto\stream_interface.h:233 > stream_int_shutw_applet (si=0x8023514c8) at > P:\Git\haproxy\src\stream_interface.c:1504 > si_shutw (si=0x8023514c8) at > P:\Git\haproxy\include\proto\stream_interface.h:320 > stream_int_notify (si=0x8023514c8) at > P:\Git\haproxy\src\stream_interface.c:465 > hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at > P:\Git\haproxy\src\hlua.c:1789 > ?? () at null: > ?? () at null: > lua_resume () at null: > hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at > P:\Git\haproxy\src\hlua.c:1022 > hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556 > process_runnable_tasks () at P:\Git\haproxy\src\task.c:232 > run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401 > run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463 > main (argc=4, argv=0x7fffffffea80) at P:\Git\haproxy\src\haproxy.c:3053 > > I don't yet have a any idea about the direction of a possible fix.. :(.. > Issue is that probably the hlua_socket_release should happen, but it doesnt > know what socket / peer it should release at that point.. its in the local > peer variable of the hlua_socket_receive_yield funtion.. Should it be > 'unlocked' before calling stream_int_notify?
This part of the code is tricky. Just a precision: - I write the “socket” session an haproxy classic session which is created by Lua code for performing network I/O. - I write “Lua” session for a classic HAProxy session which executes “Lua”, and this “Lua” code communicates with the “socket” session. This lock have two goal: maintain the link between the Lua session and the “socket” session in a coherent state and ensure the access of the “socket” buffer from the Lua session. When the lock is acquired, the “socket” session can’t be cleaned. I resume: - “Lua” session is executed - hlua_socket_receive_yield() acquire a lock - hlua_socket_receive_yield() call stream_int_notify() NOTE: the function hlua_socket_receive_yield() can’t close the “socket” session - stream_int_notify() is executed from “Lua” session. - stream_int_notify() try to close the “socket” session - stream_int_notify() try to acquire the lock which is already acquired by the caller function hlua_socket_receive_yield(). It seems a dead lock, but you observe a loop. The end of the function “hlua_socket_receive_yield()” seems suspect. You’re right: the lock can be released before the stream_int_notify(). Unfortunately, the stream_int_notify() is applied to the variable “s” which is a “socket” session. And this session can be cleared between the unlock and the stream_int_notify(). A first idea is moving the stream_int_notify() from direct function to a task. It is not a simple job, and this need more brainstorm. BR, Thierry > Does anyone dare to take a stab at a creating a patch ? If so thanks in > advance ;) > > Regards, > PiBa-NL (Pieter) > > > Op 3-5-2018 om 1:30 schreef PiBa-NL: >> Hi List, >> >> Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid> >> parameter) one of the processes seems to get into a 'hanging' state >> consuming 100% cpu.. >> >> In this configuration i'm using 'nbthread 1' not sure if this is related to >> the corrupted task-tree from my other lua issue.?. >> https://www.mail-archive.com/[email protected]/msg29801.html .?. >> >> Also i'm using my new smtpmailqueue and serverhealthchecker lua scripts (can >> be found on github.).. these probably 'contribute' to triggering the >> condition. >> >> Anything i can check / provide.? >> >> (cant really share the config itself a.t.m. as its from our production env, >> but it has like 15 backends with 1 server each, a little header >> rewriting/insertion but nothing big..) >> >> GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD] >> Copyright (C) 2017 Free Software Foundation, Inc. >> License GPLv3+: GNU GPL version 3 or later >> <http://gnu.org/licenses/gpl.html> >> This is free software: you are free to change and redistribute it. >> There is NO WARRANTY, to the extent permitted by law. Type "show copying" >> and "show warranty" for details. >> This GDB was configured as "x86_64-portbld-freebsd11.1". >> Type "show configuration" for configuration details. >> For bug reporting instructions, please see: >> <http://www.gnu.org/software/gdb/bugs/>. >> Find the GDB manual and other documentation resources online at: >> <http://www.gnu.org/software/gdb/documentation/>. >> For help, type "help". >> Type "apropos word" to search for commands related to "word"... >> Reading symbols from /usr/local/sbin/haproxy...done. >> Attaching to program: /usr/local/sbin/haproxy, process 68580 >> Reading symbols from /lib/libcrypt.so.5...(no debugging symbols >> found)...done. >> Reading symbols from /lib/libz.so.6...(no debugging symbols found)...done. >> Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done. >> Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols >> found)...done. >> Reading symbols from /lib/libcrypto.so.8...(no debugging symbols >> found)...done. >> Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging symbols >> found)...done. >> Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done. >> Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done. >> Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols >> found)...done. >> [Switching to LWP 100340 of process 68580] >> 0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at >> include/common/xref.h:34 >> 34 include/common/xref.h: No such file or directory. >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) next >> 45 in include/common/xref.h >> (gdb) next >> 46 in include/common/xref.h >> (gdb) next >> 34 in include/common/xref.h >> (gdb) next >> 37 in include/common/xref.h >> (gdb) bt full >> #0 xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37 >> local = 0x1 >> remote = 0x7fffffffe210 >> #1 0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at >> src/hlua.c:1583 >> peer = 0x0 >> #2 0x0000000000546519 in si_applet_release (si=0x802551dc8) at >> include/proto/stream_interface.h:233 >> appctx = 0x80254b600 >> #3 0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at >> src/stream_interface.c:1502 >> ic = 0x802551b90 >> oc = 0x802551bd0 >> #4 0x0000000000543809 in si_shutw (si=0x802551dc8) at >> include/proto/stream_interface.h:320 >> No locals. >> #5 0x0000000000543320 in stream_int_notify (si=0x802551dc8) at >> src/stream_interface.c:465 >> conn = 0x0 >> ic = 0x802551b90 >> oc = 0x802551bd0 >> #6 0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8, >> status=1, ctx=0) at src/hlua.c:1770 >> socket = 0x8026a0028 >> wanted = -1 >> hlua = 0x8024d6780 >> appctx = 0x80254b600 >> len = 25 >> nblk = 1 >> blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\" >> width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\" width=\"10%\"><a >> name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px >> href=\"#SMTP-Mailroute_XXXX-25_E"... >> len1 = 25 >> blk2 = 0x100000005 <error: Cannot access memory at address >> 0x100000005> >> len2 = 3 >> skip_at_end = 2 >> oc = 0x802551bd0 >> si = 0x802551dc8 >> s = 0x802551b80 >> peer = 0x80254b680 >> #7 0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so >> ---Type <return> to continue, or q <return> to quit--- >> No symbol table info available. >> #8 0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so >> No symbol table info available. >> #9 0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so >> No symbol table info available. >> #10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780, yield_allowed=1) >> at src/hlua.c:1014 >> ret = 39132616 >> msg = 0x802647f00 "" >> #11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at >> src/hlua.c:5523 >> hlua = 0x8024d6780 >> status = HLUA_E_OK >> #12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231 >> t = 0x80242d460 >> i = 32767 >> max_processed = 199 >> rq_next = 0x0 >> local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0, 0x80254b600, >> 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x100000000, 0x0, 0x0, >> 0x0, >> 0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758} >> local_tasks_count = 0 >> final_tasks_count = 0 >> #13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399 >> next = 588656154 >> exp = 588656154 >> #14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at >> src/haproxy.c:2461 >> start_lock = 0 >> ptif = 0x8a0908 <per_thread_init_list> >> ptdf = 0x202 >> #15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at >> src/haproxy.c:3050 >> tids = 0x802446068 >> threads = 0x8024efd60 >> i = 1 >> err = 0 >> retry = 200 >> limit = {rlim_cur = 2285, rlim_max = 2285} >> ---Type <return> to continue, or q <return> to quit--- >> errmsg = >> "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000" >> pidfd = 26 >> (gdb) info locals >> tids = 0x802446068 >> threads = 0x8024efd60 >> i = 1 >> err = 0 >> retry = 200 >> limit = {rlim_cur = 2285, rlim_max = 2285} >> errmsg = >> "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000" >> pidfd = 26 >> >

