On 15.12.20 13:18, David Osborne wrote:
So I have *removed* -DSYSTEM_MALLOC from the default build flags and created a new build... so far I've not been able to get that to crash in test (whereas it was crashing every 3-4 requests when built with -DSYSTEM_MALLOC)

Ah, if it was crashing that reliably, it should be easy to debug. Reference [3] below points to exactly such a fix, for a case where a mixup of memory allocators could lead to a crash.
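
To illustrate why such a mixup is dangerous, here is a minimal sketch (not NaviServer code). Every allocator keeps its own bookkeeping in front of the block it hands out, so a block must go back to the allocator family it came from. The names tagged_alloc, tagged_free and the two FAMILY_* tags are hypothetical and only stand in for "system malloc" and "Tcl's allocator"; the sketch makes the mismatch detectable instead of letting it corrupt the heap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tags standing in for two allocator families
 * (e.g. system malloc vs. Tcl's allocator); not real NaviServer names. */
enum alloc_family { FAMILY_SYSTEM = 0x5359, FAMILY_TCL = 0x5443 };

/* Header stored in front of every block. The max_align_t member keeps
 * the pointer handed to the caller suitably aligned. */
typedef union {
    enum alloc_family family;
    max_align_t       align;
} block_header;

/* Allocate size bytes and remember which family the block belongs to. */
void *tagged_alloc(enum alloc_family family, size_t size)
{
    block_header *h;

    assert(family == FAMILY_SYSTEM || family == FAMILY_TCL);
    h = malloc(sizeof(block_header) + size);
    if (h == NULL) {
        return NULL;
    }
    h->family = family;
    return h + 1;
}

/* Release a block. Returns 0 on success, or -1 (leaving the block
 * untouched) when it was allocated by the other family. */
int tagged_free(enum alloc_family family, void *ptr)
{
    block_header *h;

    if (ptr == NULL) {
        return 0;
    }
    h = (block_header *)ptr - 1;
    if (h->family != family) {
        return -1;
    }
    free(h);
    return 0;
}
```

With real allocators there is no such tag check: the wrong free() simply misreads the other allocator's metadata, and the damage typically shows up later in an unrelated allocation.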

I would recommend trying the head version; it should not crash at all, with or without SYSTEM_MALLOC set.

I don't fully understand the implications of this - Is it a suitable solution which we could use in production?

In our production environment we always use a configuration in which all mallocs are based on the system malloc, and we use TCmalloc as the system malloc. See [4] for a comparison of malloc implementations with NaviServer + Tcl. The choice has memory and performance implications, which are irrelevant for small sites but make a difference on large and busy sites.

On the deployment side, when using Tcl with SYSTEM_MALLOC, you can't use the stock (Debian) version of Tcl. We compile and install Tcl with --prefix=/usr/local/ns/ so that the Tcl version lives in the /usr/local/ns tree. Whenever we produce new binaries of NaviServer, we also produce new binaries of Tcl.

Everything clear?
-g

[4] https://next-scripting.org/2.3.0/doc/misc/thread-mallocs/index1

On Mon, 14 Dec 2020 at 20:36, Gustaf Neumann <neum...@wu.ac.at> wrote:

    Dear David,

    the crash looks like a problem in the OpenSSL memory management.

    In general, I would not believe that this is a problem in the
    NaviServer code itself, but rather one of the interplay of the
    various memory management options of OpenSSL, NaviServer and Tcl.
    We use these functions under heavy load on many servers, but we
    are careful to use the same malloc implementation everywhere
    (actually Google's TCmalloc).

    OpenSSL:
    ======

    In general, OpenSSL supports configuration of its memory
    management routines. However, the memory management interface of
    OpenSSL changed with the release of OpenSSL 1.1.0. As a
    consequence, when NaviServer is compiled with newer versions of
    OpenSSL, the native OpenSSL memory routines are used. The commit
    [1] says: "Registering our own functions does not seem
    necessary". So, if one compiles a version of NaviServer between
    4.99.15 and 4.99.20 with newer versions of OpenSSL, a problem
    might arise when the native OpenSSL malloc implementation is not
    fully thread-safe, or when different malloc implementations get
    mixed.

    NaviServer:
    =======

    When NaviServer is compiled with -DSYSTEM_MALLOC, ns_malloc() uses
    malloc() etc., otherwise it uses Tcl's ckalloc() and friends.
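
A compile-time switch of this kind can be sketched as follows. This is not the NaviServer source: all demo_* names are hypothetical, and demo_ckalloc()/demo_ckfree() are stand-ins for Tcl's allocator so that the sketch builds without tcl.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for Tcl's allocator (assumption: the real build would call
 * ckalloc()/ckfree() from tcl.h here). */
void *demo_ckalloc(size_t size) { return malloc(size); }
void  demo_ckfree(void *ptr)    { free(ptr); }

#ifdef SYSTEM_MALLOC
/* Built with -DSYSTEM_MALLOC: go straight to the system allocator. */
# define DEMO_ALLOCATOR_NAME "system malloc"
void *demo_ns_malloc(size_t size) { return malloc(size); }
void  demo_ns_free(void *ptr)     { free(ptr); }
#else
/* Default build: delegate to the (stand-in) Tcl allocator. */
# define DEMO_ALLOCATOR_NAME "Tcl allocator (stand-in)"
void *demo_ns_malloc(size_t size) { return demo_ckalloc(size); }
void  demo_ns_free(void *ptr)     { demo_ckfree(ptr); }
#endif

/* Reports which branch was compiled in. */
const char *demo_allocator_name(void) { return DEMO_ALLOCATOR_NAME; }

/* Copies a string with the selected allocator; the result must be
 * released with demo_ns_free(), never with a different free routine. */
char *demo_ns_strdup(const char *s)
{
    size_t len;
    char  *copy;

    assert(s != NULL);
    len  = strlen(s) + 1;
    copy = demo_ns_malloc(len);
    if (copy != NULL) {
        memcpy(copy, s, len);
    }
    return copy;
}
```

The point of routing every allocation through one pair of wrappers is that the choice is made once per build; trouble starts when another library in the same process was built with a different choice.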

    Tcl:
    ===
    There also exists a patch [2] for making Tcl use the system
    malloc internally instead of Tcl's own multi-threaded version.

    In October there was also a small patch for NaviServer for cases
    where Tcl and NaviServer are compiled with different memory
    allocators [3].

    My first attempt would be to compile NaviServer with SYSTEM_MALLOC
    and check whether you still experience the problem. The next
    recommendation would be to check which malloc versions are used
    by which subsystems, and to align these if necessary.

    I will look into reviving the configuration of OpenSSL's malloc
    implementation, as it was possible before OpenSSL 1.1.0.
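
For orientation, a rough sketch of what such a registration could look like with the interface introduced in OpenSSL 1.1.0. There, CRYPTO_set_mem_functions() takes three callbacks that also receive the caller's source file and line. To keep the sketch buildable without the OpenSSL headers, the setter is passed in as a function pointer of the same shape; all demo_* names are hypothetical, and demo_recording_setter() is only a test double, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Callback shapes expected by CRYPTO_set_mem_functions() since
 * OpenSSL 1.1.0. */
typedef void *(*demo_malloc_fn)(size_t num, const char *file, int line);
typedef void *(*demo_realloc_fn)(void *addr, size_t num,
                                 const char *file, int line);
typedef void  (*demo_free_fn)(void *addr, const char *file, int line);

/* Same shape as CRYPTO_set_mem_functions() itself. */
typedef int (*demo_setter_fn)(demo_malloc_fn, demo_realloc_fn, demo_free_fn);

/* Wrappers that forward to the system allocator and ignore the
 * file/line diagnostics. */
void *demo_ssl_malloc(size_t num, const char *file, int line)
{
    (void)file; (void)line;
    return malloc(num);
}

void *demo_ssl_realloc(void *addr, size_t num, const char *file, int line)
{
    (void)file; (void)line;
    return realloc(addr, num);
}

void demo_ssl_free(void *addr, const char *file, int line)
{
    (void)file; (void)line;
    free(addr);
}

/* Hands the wrappers to the given setter; returns the setter's result
 * (1 on success, 0 on failure, following the OpenSSL convention). */
int demo_register_allocator(demo_setter_fn setter)
{
    assert(setter != NULL);
    return setter(demo_ssl_malloc, demo_ssl_realloc, demo_ssl_free);
}

/* Test double: records what was registered instead of calling OpenSSL. */
demo_malloc_fn  demo_registered_malloc  = NULL;
demo_realloc_fn demo_registered_realloc = NULL;
demo_free_fn    demo_registered_free    = NULL;

int demo_recording_setter(demo_malloc_fn m, demo_realloc_fn r, demo_free_fn f)
{
    demo_registered_malloc  = m;
    demo_registered_realloc = r;
    demo_registered_free    = f;
    return 1;
}
```

In a real build one would pass CRYPTO_set_mem_functions itself as the setter. According to the OpenSSL documentation this has to happen before OpenSSL performs its first allocation; otherwise the call fails and returns 0.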

    -gn

    [1] https://bitbucket.org/naviserver/naviserver/commits/896a4e3765f91b048ccbf570e5afe21b1bb1a41f
    [2] https://github.com/gustafn/install-ns
    [3] https://bitbucket.org/naviserver/naviserver/commits/caab40365f0429a44740db1927e9f459d733db3f

    On 14.12.20 18:07, David Osborne wrote:
    Hi,

    We're building some NaviServer instances (4.99.19) on Debian
    Buster (v10.7).
    One of the instances is a revproxy instance which uses connchans
    to speak to a back end.

    We're seeing very frequent signal 11 crashes of NaviServer with
    this combination.
    (We also see this infrequently with 4.99.18 running on Debian
    Stretch (v9))

    Because of the increased frequency I've managed to take a core
    dump and the issue appears to be when calling SSL_CTX_new
    after Ns_TLS_CtxClientCreate.

    I realise I don't have gdb properly configured, but wondering if
    the backtrace as it is could shed any light on what's going on or
    is it still too opaque?

    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `/usr/lib/naviserver/bin/nsd -u nsd -g nsd -b 0.0.0.0:80,0.0.0.0:443 -i -t /etc/'.
    Program terminated with signal SIGABRT, Aborted.
    #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
    50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
    [Current thread is 1 (Thread 0x7f4405ddf700 (LWP 13613))]
    (gdb) bt
    #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
    #1  0x00007f4407936535 in __GI_abort () at abort.c:79
    #2  0x00007f440847cfe6 in Panic (fmt=<optimized out>) at log.c:928
    #3  0x00007f44080fbc4a in Tcl_PanicVA () from /lib/x86_64-linux-gnu/libtcl8.6.so
    #4  0x00007f44080fbdb9 in Tcl_Panic () from /lib/x86_64-linux-gnu/libtcl8.6.so
    #5  0x00007f44084bbc74 in Abort (signal=<optimized out>) at unix.c:1115
    #6  <signal handler called>
    #7  malloc_consolidate (av=av@entry=0x7f43bc000020) at malloc.c:4486
    #8  0x00007f4407996a58 in _int_malloc (av=av@entry=0x7f43bc000020, bytes=bytes@entry=1024) at malloc.c:3695
    #9  0x00007f440799856a in __GI___libc_malloc (bytes=1024) at malloc.c:3057
    #10 0x00007f4407c63559 in CRYPTO_zalloc () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
    #11 0x00007f4407df7699 in SSL_CTX_new () from /lib/x86_64-linux-gnu/libssl.so.1.1
    #12 0x00007f44084b4d85 in Ns_TLS_CtxClientCreate (interp=interp@entry=0x7f43bc009ee0, cert=cert@entry=0x0, caFile=caFile@entry=0x0, caPath=caPath@entry=0x0,
        verify=verify@entry=false, ctxPtr=ctxPtr@entry=0x7f4405dde7c0) at tls.c:116
    #13 0x00007f44084687a4 in ConnChanOpenObjCmd (clientData=<optimized out>, interp=0x7f43bc009ee0, objc=<optimized out>, objv=<optimized out>)
        at connchan.c:1010
    #14 0x00007f44084a7eb8 in Ns_SubcmdObjv (subcmdSpec=subcmdSpec@entry=0x7f4405dde990, clientData=0x7f43bc047870, interp=0x7f43bc009ee0, objc=13,
        objv=0x7f43bc017ff8) at tclobjv.c:1849
    #15 0x00007f4408469d45 in NsTclConnChanObjCmd (clientData=<optimized out>, interp=<optimized out>, objc=<optimized out>, objv=<optimized out>)
        at connchan.c:1761
    #16 0x00007f440802ffb7 in TclNRRunCallbacks () from /lib/x86_64-linux-gnu/libtcl8.6.so
    #17 0x00007f44080313af in ?? () from /lib/x86_64-linux-gnu/libtcl8.6.so
    #18 0x00007f4408030d13 in Tcl_EvalEx () from /lib/x86_64-linux-gnu/libtcl8.6.so
    #19 0x00007f44084a9164 in NsTclFilterProc (arg=0x55af6a3e9880, conn=0x55af6a502480, why=NS_FILTER_PRE_AUTH) at tclrequest.c:535
    #20 0x00007f4408478370 in NsRunFilters (conn=conn@entry=0x55af6a502480, why=why@entry=NS_FILTER_PRE_AUTH) at filter.c:160
    #21 0x00007f440848654d in ConnRun (connPtr=connPtr@entry=0x55af6a502480) at queue.c:2450
    #22 0x00007f4408485b33 in NsConnThread (arg=0x55af6a4a0090) at queue.c:2157
    #23 0x00007f44081b2bb1 in NsThreadMain (arg=0x55af6a354f50) at thread.c:230
    #24 0x00007f44081b3af9 in ThreadMain (arg=<optimized out>) at pthread.c:836
    #25 0x00007f44078f5fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
    #26 0x00007f4407a0d4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

    --
    Regards,
    David
    _______________________________________________
    naviserver-devel mailing list
    naviserver-devel@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/naviserver-devel



--

*David Osborne | Software Engineer*
Qcode Software, Castle House, Fairways Business Park, Inverness, IV2 6AA
*Email:* da...@qcode.co.uk | *Phone:* 01463 896 484
www.qcode.co.uk

