> On 8 May 2018, at 00:33, PiBa-NL <[email protected]> wrote:
> 
> Hi List, Thierry,
> 
> Actually this is not limited to restarts, and also happens with 1.9dev. It 
> now happens while haproxy was running for a while and no restart was 
> attempted while running/debugging in my NetBeans IDE..
> 
> Root cause imo is that hlua_socket_receive_yield and hlua_socket_release both 
> try and acquire the same lock.
> 
> 
> For debugging purposes ive added some code in hlua_socket_receive_yield(..) 
> before the stream_int_notify:
> 
>     struct channel *ic2 = si_ic(si);
>     struct channel *oc2 = si_oc(si);
>     ha_warning("hlua_socket_receive_yield calling notify peer:%9x  
> si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, 
> oc2->flags, ic2->flags);
>     stream_int_notify(&s->si[0]);
> 
> And:
> static void hlua_socket_release(struct appctx *appctx)
> {
>     struct xref *peer;
>     if (appctx->ctx.hlua_cosocket.xref.peer > 1)
>         ha_warning("hlua_socket_release peer: %9x %9x\n", 
> appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
>     else
>         ha_warning("hlua_socket_release peer: %9x 0\n", 
> appctx->ctx.hlua_cosocket.xref);
> 
> 
> And also added code in xref_get_peer_and_lock(..):
> static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
> {
>     if (xref->peer == 1) {
>         printf("  xref_get_peer_and_lock xref->peer == 1 \n");
>     }
> 
> 
> This produces the logging:
> 
> [WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify peer: 
>  2355590  si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
> [WARNING] 127/001127 (36579) : hlua_socket_release peer:         1 0
>   xref_get_peer_and_lock xref->peer == 1 
> 
> When xref_get_peer_and_lock is called with a parameter xref->peer value of 1 
> then it looks like it keeps swapping 1 and 1 until it is not 1, that never 
> happens..
> 
> As for the oc2.flags it contains the CF_SHUTW_NOW.. of which im still not 
> 100% when that flag is exactly set to get a foolproof reproduction.. but it 
> happens on pretty much a daily basis for me in production and in test i can 
> now usually trigger it after a few testruns with no actual traffic passing 
> along within the first minute of running (healthchecks are performed on 
> several backend, and a mail or 2 is send by the lua code during this startup 
> period..).. with the full production config..
> 
> Below the stacktrace that comes with it..
> 
> xref_get_peer_and_lock (xref=0x802355590) at 
> P:\Git\haproxy\include\common\xref.h:37
> hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
> si_applet_release (si=0x8023514c8) at 
> P:\Git\haproxy\include\proto\stream_interface.h:233
> stream_int_shutw_applet (si=0x8023514c8) at 
> P:\Git\haproxy\src\stream_interface.c:1504
> si_shutw (si=0x8023514c8) at 
> P:\Git\haproxy\include\proto\stream_interface.h:320
> stream_int_notify (si=0x8023514c8) at 
> P:\Git\haproxy\src\stream_interface.c:465
> hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at 
> P:\Git\haproxy\src\hlua.c:1789
> ?? () at null:
> ?? () at null:
> lua_resume () at null:
> hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at 
> P:\Git\haproxy\src\hlua.c:1022
> hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
> process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
> run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
> run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
> main (argc=4, argv=0x7fffffffea80) at P:\Git\haproxy\src\haproxy.c:3053
> 
> I don't yet have a any idea about the direction of a possible fix.. :(..
> Issue is that probably the hlua_socket_release should happen, but it doesnt 
> know what socket / peer it should release at that point.. its in the local 
> peer variable of the hlua_socket_receive_yield funtion.. Should it be 
> 'unlocked' before calling stream_int_notify?


This part of the code is tricky.

Just a precision:

 - I write the “socket” session an haproxy classic session which is created by
   Lua code for performing network I/O.

 - I write “Lua” session for a classic HAProxy session which executes “Lua”, and
   this “Lua” code communicates with the “socket” session.

This lock have two goal: maintain the link between the Lua session and the 
“socket”
session in a coherent state and ensure the access of the “socket” buffer from 
the
Lua session. When the lock is acquired, the “socket” session can’t be cleaned.

I resume:

 - “Lua” session is executed

 - hlua_socket_receive_yield() acquire a lock

 - hlua_socket_receive_yield() call stream_int_notify()
   NOTE: the function hlua_socket_receive_yield() can’t close the “socket” 
session

 - stream_int_notify() is executed from “Lua” session.

 - stream_int_notify() try to close the “socket” session

 - stream_int_notify() try to acquire the lock which is already acquired by the
   caller function hlua_socket_receive_yield().

It seems a dead lock, but you observe a loop.

The end of the function “hlua_socket_receive_yield()” seems suspect. You’re 
right:
the lock can be released before the stream_int_notify(). Unfortunately, the
stream_int_notify() is applied to the variable “s” which is a “socket” session. 
And
this session can be cleared between the unlock and the stream_int_notify().

A first idea is moving the stream_int_notify() from direct function to a task. 
It is
not a simple job, and this need more brainstorm.

BR,
Thierry



> Does anyone dare to take a stab at a creating a patch ? If so thanks in 
> advance ;)
> 
> Regards,
> PiBa-NL (Pieter)
> 
> 
> Op 3-5-2018 om 1:30 schreef PiBa-NL:
>> Hi List, 
>> 
>> Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid> 
>> parameter) one of the processes seems to get into a 'hanging' state 
>> consuming 100% cpu.. 
>> 
>> In this configuration i'm using 'nbthread 1' not sure if this is related to 
>> the corrupted task-tree from my other lua issue.?. 
>> https://www.mail-archive.com/[email protected]/msg29801.html .?. 
>> 
>> Also i'm using my new smtpmailqueue and serverhealthchecker lua scripts (can 
>> be found on github.).. these probably 'contribute' to triggering the 
>> condition. 
>> 
>> Anything i can check / provide.? 
>> 
>> (cant really share the config itself a.t.m. as its from our production env, 
>> but it has like 15 backends with 1 server each, a little header 
>> rewriting/insertion but nothing big..) 
>> 
>> GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD] 
>> Copyright (C) 2017 Free Software Foundation, Inc. 
>> License GPLv3+: GNU GPL version 3 or later 
>> <http://gnu.org/licenses/gpl.html> 
>> This is free software: you are free to change and redistribute it. 
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying" 
>> and "show warranty" for details. 
>> This GDB was configured as "x86_64-portbld-freebsd11.1". 
>> Type "show configuration" for configuration details. 
>> For bug reporting instructions, please see: 
>> <http://www.gnu.org/software/gdb/bugs/>. 
>> Find the GDB manual and other documentation resources online at: 
>> <http://www.gnu.org/software/gdb/documentation/>. 
>> For help, type "help". 
>> Type "apropos word" to search for commands related to "word"... 
>> Reading symbols from /usr/local/sbin/haproxy...done. 
>> Attaching to program: /usr/local/sbin/haproxy, process 68580 
>> Reading symbols from /lib/libcrypt.so.5...(no debugging symbols 
>> found)...done. 
>> Reading symbols from /lib/libz.so.6...(no debugging symbols found)...done. 
>> Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done. 
>> Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols 
>> found)...done. 
>> Reading symbols from /lib/libcrypto.so.8...(no debugging symbols 
>> found)...done. 
>> Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging symbols 
>> found)...done. 
>> Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done. 
>> Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done. 
>> Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols 
>> found)...done. 
>> [Switching to LWP 100340 of process 68580] 
>> 0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at 
>> include/common/xref.h:34 
>> 34      include/common/xref.h: No such file or directory. 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) next 
>> 45      in include/common/xref.h 
>> (gdb) next 
>> 46      in include/common/xref.h 
>> (gdb) next 
>> 34      in include/common/xref.h 
>> (gdb) next 
>> 37      in include/common/xref.h 
>> (gdb) bt full 
>> #0  xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37 
>>         local = 0x1 
>>         remote = 0x7fffffffe210 
>> #1  0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at 
>> src/hlua.c:1583 
>>         peer = 0x0 
>> #2  0x0000000000546519 in si_applet_release (si=0x802551dc8) at 
>> include/proto/stream_interface.h:233 
>>         appctx = 0x80254b600 
>> #3  0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at 
>> src/stream_interface.c:1502 
>>         ic = 0x802551b90 
>>         oc = 0x802551bd0 
>> #4  0x0000000000543809 in si_shutw (si=0x802551dc8) at 
>> include/proto/stream_interface.h:320 
>> No locals. 
>> #5  0x0000000000543320 in stream_int_notify (si=0x802551dc8) at 
>> src/stream_interface.c:465 
>>         conn = 0x0 
>>         ic = 0x802551b90 
>>         oc = 0x802551bd0 
>> #6  0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8, 
>> status=1, ctx=0) at src/hlua.c:1770 
>>         socket = 0x8026a0028 
>>         wanted = -1 
>>         hlua = 0x8024d6780 
>>         appctx = 0x80254b600 
>>         len = 25 
>>         nblk = 1 
>>         blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\" 
>> width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\" width=\"10%\"><a 
>> name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px 
>> href=\"#SMTP-Mailroute_XXXX-25_E"... 
>>         len1 = 25 
>>         blk2 = 0x100000005 <error: Cannot access memory at address 
>> 0x100000005> 
>>         len2 = 3 
>>         skip_at_end = 2 
>>         oc = 0x802551bd0 
>>         si = 0x802551dc8 
>>         s = 0x802551b80 
>>         peer = 0x80254b680 
>> #7  0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so 
>> ---Type <return> to continue, or q <return> to quit--- 
>> No symbol table info available. 
>> #8  0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so 
>> No symbol table info available. 
>> #9  0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so 
>> No symbol table info available. 
>> #10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780, yield_allowed=1) 
>> at src/hlua.c:1014 
>>         ret = 39132616 
>>         msg = 0x802647f00 "" 
>> #11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at 
>> src/hlua.c:5523 
>>         hlua = 0x8024d6780 
>>         status = HLUA_E_OK 
>> #12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231 
>>         t = 0x80242d460 
>>         i = 32767 
>>         max_processed = 199 
>>         rq_next = 0x0 
>>         local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0, 0x80254b600, 
>> 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x100000000, 0x0, 0x0, 
>> 0x0, 
>>           0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758} 
>>         local_tasks_count = 0 
>>         final_tasks_count = 0 
>> #13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399 
>>         next = 588656154 
>>         exp = 588656154 
>> #14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at 
>> src/haproxy.c:2461 
>>         start_lock = 0 
>>         ptif = 0x8a0908 <per_thread_init_list> 
>>         ptdf = 0x202 
>> #15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at 
>> src/haproxy.c:3050 
>>         tids = 0x802446068 
>>         threads = 0x8024efd60 
>>         i = 1 
>>         err = 0 
>>         retry = 200 
>>         limit = {rlim_cur = 2285, rlim_max = 2285} 
>> ---Type <return> to continue, or q <return> to quit--- 
>>         errmsg = 
>> "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>>         pidfd = 26 
>> (gdb) info locals 
>> tids = 0x802446068 
>> threads = 0x8024efd60 
>> i = 1 
>> err = 0 
>> retry = 200 
>> limit = {rlim_cur = 2285, rlim_max = 2285} 
>> errmsg = 
>> "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>> pidfd = 26 
>> 
> 


Reply via email to