Thanks very much Gustaf,

Looking at the build output, I see that "-DSYSTEM_MALLOC" was already in
place during the build of the binary which is now crashing.

So I have *removed* -DSYSTEM_MALLOC from the default build flags and created
a new build... so far I've not been able to get that to crash in test
(whereas it was crashing every 3-4 requests when built with
-DSYSTEM_MALLOC).

I don't fully understand the implications of this. Is it a suitable
solution which we could use in production?



On Mon, 14 Dec 2020 at 20:36, Gustaf Neumann <neum...@wu.ac.at> wrote:

> Dear David,
>
> the crash looks like a problem in the OpenSSL memory management.
>
> In general, I would not believe that this is a problem in the NaviServer
> code itself, but rather one of the interplay of the various memory
> management options of OpenSSL, NaviServer and Tcl. We use these functions
> under heavy load on many servers, but we are careful to use the same malloc
> implementation everywhere (actually Google's TCMalloc).
>
> OpenSSL:
> ======
>
> In general, OpenSSL supports configuration of its memory management
> routines. However, the memory management interface of OpenSSL changed with
> the release of OpenSSL 1.1.0. As a consequence, when compiling NaviServer
> with newer versions of OpenSSL, the native OpenSSL memory routines are used.
> The commit [1] says: "Registering our own functions does not seem
> necessary". So, if one compiles a version of NaviServer between 4.99.15 and
> 4.99.20 with newer versions of OpenSSL, a problem might arise when
> the native OpenSSL malloc implementation is not fully thread-safe, or when
> different malloc implementations are mixed.
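
To illustrate the "mix of different malloc implementations" concern: with
real allocators, releasing a block through a different allocator than the one
that produced it is undefined behaviour, and the resulting heap damage often
only shows up later, in an unrelated allocation. The following is a minimal
sketch, not NaviServer or OpenSSL code. The two "families" (a_alloc/a_free
and b_alloc/b_free) and the tag values are hypothetical names invented for
this example; each block carries a tag so that a cross-family free is
detected and refused instead of corrupting the heap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header placed in front of every block. The union with max_align_t keeps
 * the pointer handed to the caller suitably aligned for any type. */
typedef union {
    uint32_t    tag;
    max_align_t align;
} block_header;

/* Hypothetical tags identifying two independent allocator families. */
#define TAG_FAMILY_A 0xA0A0A0A0u
#define TAG_FAMILY_B 0xB0B0B0B0u

/* Allocate size bytes and remember which family owns the block. */
static void *tagged_alloc(uint32_t tag, size_t size)
{
    block_header *h = malloc(sizeof(block_header) + size);

    if (h == NULL) {
        return NULL;
    }
    h->tag = tag;
    return h + 1;               /* user data starts right after the header */
}

/* Free a block. Returns 0 on success, or -1 (without freeing) when the
 * block was allocated by a different family. */
static int tagged_free(uint32_t tag, void *ptr)
{
    block_header *h;

    if (ptr == NULL) {
        return 0;
    }
    h = (block_header *)ptr - 1;
    if (h->tag != tag) {
        return -1;              /* allocator mismatch detected */
    }
    h->tag = 0;
    free(h);
    return 0;
}

void *a_alloc(size_t n) { return tagged_alloc(TAG_FAMILY_A, n); }
int   a_free(void *p)   { return tagged_free(TAG_FAMILY_A, p); }
void *b_alloc(size_t n) { return tagged_alloc(TAG_FAMILY_B, n); }
int   b_free(void *p)   { return tagged_free(TAG_FAMILY_B, p); }
```

In this sketch, a block obtained from a_alloc() and passed to b_free() yields
-1 and stays allocated; production allocators generally perform no such
check, which is why aligning all subsystems on one allocator matters.
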
>
> NaviServer:
> =======
>
> When NaviServer is compiled with -DSYSTEM_MALLOC, ns_malloc() uses
> malloc() etc., otherwise it uses Tcl's ckalloc() and friends.
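
The compile-time switch described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the actual NaviServer
source: app_malloc(), app_free(), alt_alloc() and alt_free() are hypothetical
names, and alt_alloc() merely stands in for a non-system allocator (the role
played by Tcl's ckalloc() in a real build). It counts its calls so that the
selected code path can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a non-system allocator. It forwards to malloc()
 * here only to keep the sketch self-contained. */
size_t alt_alloc_calls = 0;

void *alt_alloc(size_t size)
{
    alt_alloc_calls++;
    return malloc(size);
}

void alt_free(void *ptr)
{
    free(ptr);
}

/* The allocator behind app_malloc()/app_free() is fixed at compile time:
 * building with -DSYSTEM_MALLOC selects plain malloc()/free(). */
#ifdef SYSTEM_MALLOC
void *app_malloc(size_t size) { return malloc(size); }
void  app_free(void *ptr)     { free(ptr); }
int   app_uses_system_malloc(void) { return 1; }
#else
void *app_malloc(size_t size) { return alt_alloc(size); }
void  app_free(void *ptr)     { alt_free(ptr); }
int   app_uses_system_malloc(void) { return 0; }
#endif

/* Allocate, touch and release a block through the selected allocator.
 * Returns 0 on success and -1 if the allocation failed. */
int app_roundtrip(size_t size)
{
    void *p = app_malloc(size);

    if (p == NULL) {
        return -1;
    }
    memset(p, 0, size);
    app_free(p);
    return 0;
}
```

Compiling this file once with -DSYSTEM_MALLOC and once without gives two
binaries whose app_malloc() calls end up in different allocators, although
the calling code is identical.
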
>
> Tcl:
> ===
> There is also a patch [2] for making Tcl use the system malloc internally
> instead of Tcl's own multi-threaded version.
>
> In October there was also a small patch for NaviServer for cases where Tcl
> and NaviServer are compiled with different memory allocators [3].
>
> My first attempt would be to compile NaviServer with SYSTEM_MALLOC and
> check whether you still experience a problem. The next recommendation
> would be to check which malloc versions are used by which subsystems and
> align these if necessary.
>
> I will look into reviving the configuration of OpenSSL to allow
> configuring its malloc implementation, as was possible before OpenSSL 1.1.0.
>
> -gn
>
> [1]
> https://bitbucket.org/naviserver/naviserver/commits/896a4e3765f91b048ccbf570e5afe21b1bb1a41f
> [2] https://github.com/gustafn/install-ns
> [3]
> https://bitbucket.org/naviserver/naviserver/commits/caab40365f0429a44740db1927e9f459d733db3f
>
> On 14.12.20 18:07, David Osborne wrote:
>
> Hi,
>
> We're building some NaviServer instances (4.99.19) on Debian Buster
> (v10.7).
> One of the instances is a revproxy instance which uses connchans to speak
> to a back end.
>
> We're seeing very frequent signal 11 crashes of NaviServer with this
> combination.
> (We also see this infrequently with 4.99.18 running on Debian Stretch (v9))
>
> Because of the increased frequency I've managed to take a core dump, and
> the issue appears to occur when SSL_CTX_new is called
> from Ns_TLS_CtxClientCreate.
>
> I realise I don't have gdb properly configured, but I'm wondering whether
> the backtrace as it is could shed any light on what's going on, or is it
> still too opaque?
>
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `/usr/lib/naviserver/bin/nsd -u nsd -g nsd -b 0.0.0.0:80,0.0.0.0:443 -i -t /etc/'.
> Program terminated with signal SIGABRT, Aborted.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> [Current thread is 1 (Thread 0x7f4405ddf700 (LWP 13613))]
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f4407936535 in __GI_abort () at abort.c:79
> #2  0x00007f440847cfe6 in Panic (fmt=<optimized out>) at log.c:928
> #3  0x00007f44080fbc4a in Tcl_PanicVA () from /lib/x86_64-linux-gnu/libtcl8.6.so
> #4  0x00007f44080fbdb9 in Tcl_Panic () from /lib/x86_64-linux-gnu/libtcl8.6.so
> #5  0x00007f44084bbc74 in Abort (signal=<optimized out>) at unix.c:1115
> #6  <signal handler called>
> #7  malloc_consolidate (av=av@entry=0x7f43bc000020) at malloc.c:4486
> #8  0x00007f4407996a58 in _int_malloc (av=av@entry=0x7f43bc000020, bytes=bytes@entry=1024) at malloc.c:3695
> #9  0x00007f440799856a in __GI___libc_malloc (bytes=1024) at malloc.c:3057
> #10 0x00007f4407c63559 in CRYPTO_zalloc () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
> #11 0x00007f4407df7699 in SSL_CTX_new () from /lib/x86_64-linux-gnu/libssl.so.1.1
> #12 0x00007f44084b4d85 in Ns_TLS_CtxClientCreate (interp=interp@entry=0x7f43bc009ee0, cert=cert@entry=0x0, caFile=caFile@entry=0x0, caPath=caPath@entry=0x0, verify=verify@entry=false, ctxPtr=ctxPtr@entry=0x7f4405dde7c0) at tls.c:116
> #13 0x00007f44084687a4 in ConnChanOpenObjCmd (clientData=<optimized out>, interp=0x7f43bc009ee0, objc=<optimized out>, objv=<optimized out>) at connchan.c:1010
> #14 0x00007f44084a7eb8 in Ns_SubcmdObjv (subcmdSpec=subcmdSpec@entry=0x7f4405dde990, clientData=0x7f43bc047870, interp=0x7f43bc009ee0, objc=13, objv=0x7f43bc017ff8) at tclobjv.c:1849
> #15 0x00007f4408469d45 in NsTclConnChanObjCmd (clientData=<optimized out>, interp=<optimized out>, objc=<optimized out>, objv=<optimized out>) at connchan.c:1761
> #16 0x00007f440802ffb7 in TclNRRunCallbacks () from /lib/x86_64-linux-gnu/libtcl8.6.so
> #17 0x00007f44080313af in ?? () from /lib/x86_64-linux-gnu/libtcl8.6.so
> #18 0x00007f4408030d13 in Tcl_EvalEx () from /lib/x86_64-linux-gnu/libtcl8.6.so
> #19 0x00007f44084a9164 in NsTclFilterProc (arg=0x55af6a3e9880, conn=0x55af6a502480, why=NS_FILTER_PRE_AUTH) at tclrequest.c:535
> #20 0x00007f4408478370 in NsRunFilters (conn=conn@entry=0x55af6a502480, why=why@entry=NS_FILTER_PRE_AUTH) at filter.c:160
> #21 0x00007f440848654d in ConnRun (connPtr=connPtr@entry=0x55af6a502480) at queue.c:2450
> #22 0x00007f4408485b33 in NsConnThread (arg=0x55af6a4a0090) at queue.c:2157
> #23 0x00007f44081b2bb1 in NsThreadMain (arg=0x55af6a354f50) at thread.c:230
> #24 0x00007f44081b3af9 in ThreadMain (arg=<optimized out>) at pthread.c:836
> #25 0x00007f44078f5fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
> #26 0x00007f4407a0d4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>
> --
> Regards,
> David
>
> _______________________________________________
> naviserver-devel mailing list
> naviserver-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/naviserver-devel
>


-- 

*David Osborne | Software Engineer*
Qcode Software, Castle House, Fairways Business Park, Inverness, IV2 6AA
*Email:* da...@qcode.co.uk | *Phone:* 01463 896 484
www.qcode.co.uk