Hi:

I forgot one task, the Transmit_Task(). It is message based.

PR

On 05/02/2021 06:51 AM, Jameson Nash wrote:
You might want to try compiling your application with Clang TSan. That said, since libuv is a single-threaded library (per uv_loop_t), any access from multiple threads is generally just forbidden.
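
For example, something along these lines (a sketch only; adjust the file list and libraries to your actual build, and -fsanitize=thread works with both clang and gcc):

    clang -fsanitize=thread -g -O1 -o pexd main.c network_io.c transmit.c rx_fsm.c async.c misc.c -luv -lpthread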

On Sun, May 2, 2021 at 2:24 AM Paul Romero <pa...@rcom-software.com> wrote:

    Hi:

    A trace giving a more refined picture of the problem and more
    details is attached
    to this note. It dumps what I consider to be the significant
    fields of the uv_loop_t structure,
    the Poll_Loop in my code, and the dynamically allocated
    uv_poll_t structure surrounding
    the essential Libuv API routine calls. I think the format of the
    dump is fairly intuitive
    but please inform me if you disagree.

    My comments about what is happening are identified by the string
    "COMMENT:" and the
    API calls are identified by the string "MPOLL". The trace is
    redacted to eliminate
    irrelevant events.

    Please let me know if you see something amiss about the data
    structure fields at
    any point.

    My server code is based on a simple event-oriented architecture and
    uses the following 5 concurrent tasks; a simplified sketch of the
    Libuv side follows the list.

    1) main(): Handles incoming connections and Linux signals. (Libuv based.)
    2) IO_Task(): Handles incoming data from the network. (Libuv based.)
    3) Protocol_Task(): Does most of the work. (Message based.)
    4) DB_Task(): Does database insertions. (Message based.)
    5) Timer_Task(): Handles timer management. (Mostly Libuv based.)
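
    For orientation, the Libuv side of that split looks roughly like the
    sketch below (simplified, not the actual code; the real IO_Task() also
    registers IO_Task_Async_Handle on Poll_Loop before running the loop):

        #include <uv.h>

        static uv_loop_t Connect_Loop;   /* driven by main() */
        static uv_loop_t Poll_Loop;      /* driven by IO_Task() */

        static void IO_Task(void *arg)
        {
            (void) arg;
            /* All uv_poll_init()/uv_poll_start()/uv_poll_stop() calls are
             * made from this thread. */
            uv_run(&Poll_Loop, UV_RUN_DEFAULT);
        }

        int main(void)
        {
            uv_thread_t io_thread;

            uv_loop_init(&Connect_Loop);
            uv_loop_init(&Poll_Loop);
            uv_thread_create(&io_thread, IO_Task, NULL);
            /* Listen/accept handles and signal handles live on Connect_Loop. */
            uv_run(&Connect_Loop, UV_RUN_DEFAULT);
            return 0;
        }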

    I can send you my server code and a test tool if it would be
    useful to you.
    However, it may take a few days to put it in an easily usable form.

    As previously mentioned, uv_accept() executes in the main() task and
    the rest of the polling-related Libuv API routines execute in the
    IO_Task().

    Finally, remember the problem is intermittent. Generally, it takes
    about 45 minutes to 3 hours of constant concurrent connections for
    the crash to occur.

    Best Regards,

    Paul R.




    On 05/01/2021 12:01 AM, Paul Romero wrote:
    Hi:

    I may have overlooked the following factor.

    We know that Libuv is not reentrant and not really multi-task
    safe. I suspect, but
    don't know, that the internal Libuv data structures attached to
    the uv_loop_t structure,
    Poll_Loop in my case, are not protected by internal Libuv Mutexes.

    1) uv_accept() and uv_fileno() execute in the main() thread.

    2) uv_poll_init(), uv_poll_start() and uv_poll_stop(), execute in
    the IO_Task() thread.

    That means uv_accept() and the Libuv API routines in 2, above, can
    execute concurrently. Is it possible that such concurrency could
    damage the internal Libuv data structures attached to the uv_loop_t
    structure? (Again, Poll_Loop in my case.)
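
    For reference, the only cross-thread call into Libuv that I intend to
    make is uv_async_send(), which is documented as thread safe. The
    hand-off looks roughly like this (a simplified sketch of what
    send_async_msg() and IO_Task_Async_Handle do; the function names here
    are illustrative, not the real ones):

        #include <uv.h>

        static uv_loop_t  Poll_Loop;              /* owned by IO_Task() */
        static uv_async_t IO_Task_Async_Handle;   /* lives on Poll_Loop */

        /* Runs on the IO_Task() thread: drain IO_Task_Input_Q here and make
         * all uv_poll_init()/uv_poll_start()/uv_poll_stop() calls from this
         * thread only. */
        static void io_task_async_cb(uv_async_t *handle)
        {
            (void) handle;
        }

        /* On the IO_Task() thread, before uv_run(&Poll_Loop, UV_RUN_DEFAULT): */
        static void io_task_setup(void)
        {
            uv_loop_init(&Poll_Loop);
            uv_async_init(&Poll_Loop, &IO_Task_Async_Handle, io_task_async_cb);
        }

        /* From main(), after uv_accept()/uv_fileno(): queue the request
         * under a mutex, then poke the IO_Task() loop. */
        static void request_poll_config(void)
        {
            uv_async_send(&IO_Task_Async_Handle);
        }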

    Best Regards,

    Paul R.

    NOTE: I did a few experiments and noticed that calling uv_close()
    while holding a Mutex often prevents its callback routine from
    executing.

    On 04/30/2021 04:59 PM, Paul Romero wrote:

    Hi:

    I think I've isolated the problem and it appears to be in
    uv_poll_stop().

    In particular, if uv_accept() is called immediately after
    uv_poll_stop(), and it reallocates the same socket descriptor just
    freed by uv_poll_stop(), then uv_poll_init() fails with the error
    EEXIST or EBADF. This must have something to do with the uv_loop_t
    handle passed to uv_poll_init(), called Poll_Loop in my code.

    I had to instrument my code to keep track of Libuv API calls to
    see this.

    The explanatory items are as follows.

    A - The sequence of calls that produces the problem.
    B - Commented syslog() diagnostics.
    C - The relevant code segments.
    D - The gdb stack backtrace.

    Best Regards,

    Paul R.


    A) The sequence of relevant calls is as follows and the problem only
    occurs with this exact sequence of events. (A diagnostic sketch
    follows the sequence.)
    ****************************************************************************************

    1) uv_poll_stop(): The uv_poll_t handle is bound to socket descriptor N.
       This should free socket descriptor N.
    2) uv_accept(): Bound to socket descriptor N as set by uv_fileno().
       Socket descriptor N should be newly allocated.
    3) uv_poll_init(): Returns EBADF with socket descriptor N and Poll_Loop.watchers[N] is NULL.
       Something is wrong!
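
    One debug-only check I could add, to confirm item 3 at run time, is a
    peek at the same watchers array the attached trace dumps, just before
    the uv_poll_init() call. It reads Libuv-internal loop fields, so it is
    a diagnostic sketch only:

        #include <syslog.h>
        #include <uv.h>

        /* Diagnostic only: watchers[] and nwatchers are private uv_loop_t
         * fields, but they are exactly what the attached trace prints. */
        static void check_watcher_slot(uv_loop_t *loop, int fd)
        {
            if( (unsigned int) fd < loop->nwatchers && loop->watchers[fd] != NULL )
                syslog(LOG_WARNING,
                       "IO_TASK: fd %d already has a watcher before uv_poll_init()", fd);
        }

    Calling check_watcher_slot(&Poll_Loop, cdesc->fd) right before the
    uv_poll_init() call in poll_proxy() would show whether the stale state
    is in the watchers array or somewhere else.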

    B) The commented syslog() diagnostic sequence from the instrumented code.
    **************************************************************************
    * Stop Polling - Connection 1 Fd 21. This frees the FD.

    Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 21] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ

    * Stop Polling - Connection 0 Fd 20. This frees the FD.

    Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 0 fd 20] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ

    * New Connection - Connection 1 Fd 20. This is fine since uv_poll_stop() was called
      for Fd 20, which freed it, and it is newly allocated with uv_accept() and
      obtained with uv_fileno().

    Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 20] - IO_TASK - uv_accept() + uv_fileno()

    * However, we crash because uv_poll_init() returned EBADF!

    Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 20] - IO_TASK - uv_poll_init() - T_POLL_CONFIG_REQ
    Apr 30 15:46:08 mobius pexd[30963]: IO_TASK: POLL_PROXY - Polling Initialization Error -9, EBADF

    C) The edited associated code segments are as follows.
    ********************************************************
    1) Prelude to uv_poll_stop() in poll_proxy() in the IO_Task() thread. (network_io.c)
    ------------------------------------------------------------------------------------
                case T_POLL_DISMANTLE_REQ:
                    ppdata = (uv_handle_t **) &msg->buffer[0];
                    poll_handle = (uv_poll_t *) *ppdata;

                    syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ\n",
                           cdesc->index, cdesc->fd);

                    if( (r = uv_poll_stop(poll_handle)) )
                      {
                        fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Poll Stop Error %d, %s",
                             r, uv_err_name(r) );
                      }
                    //
                    // The callback routine notifies the Protocol_Task() when everything is done.
                    //
                    poll_handle->data = (void *) cdesc;
                    uv_close((uv_handle_t *) poll_handle, async_close_callback);


    2) Connection acceptance in make_incoming_connection() in the main() thread. (main.c)
    ---------------------------------------------------------------------------------------
            conn_handle = (uv_tcp_t *) VALLOC(sizeof(uv_tcp_t));
            uv_tcp_init(&Connect_Loop, conn_handle);

            if((r = uv_accept(listen_handle, (uv_stream_t *) conn_handle)) == 0)
              {
                //
                // A new connection occurred.
                //
                uv_os_fd_t fd;

                ENTER_MUTEX(&Protocol_D_Mutex);
                nsock = N_Sockets++;
                EXIT_MUTEX(&Protocol_D_Mutex);
                //
                // Fetch the socket descriptor from the connection handle.
                //
                uv_fileno((const uv_handle_t*) conn_handle, &fd);
                //
                // Allocate the connection descriptor.
                //
                cdesc = ALLOC_CONN_DESC(fd);
                if( !cdesc )
                  {
                    syslog(LOG_DEBUG, "MAIN_TASK: CONNECTION REJECTION - No Connection Descriptors, N =  %d", nsock);

                    ENTER_MUTEX(&Protocol_D_Mutex);
                    N_Sockets--;
                    close(fd);
                    EXIT_MUTEX(&Protocol_D_Mutex);
                    uv_close((uv_handle_t *) conn_handle, forced_close_callback);
                    return;
                  }

                syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_accept() + uv_fileno()\n", cdesc->index, cdesc->fd);
                //
                // Save the connection handle and start polling.
                //
                cdesc->conn_handle = (uv_tcp_t *) conn_handle;

                syslog(LOG_INFO, "MAIN_TASK: NEW CONNECTION ESTABLISHED - CONN %d FD %d", cdesc->index, cdesc->fd);
                //
                // Set up epoll() plumbing by sending a message to IO_Task().
                //
                bzero((void *) &poll_data, sizeof(ASYNC_DATA));
                poll_data.type = T_POLL_CONFIG_REQ;
                poll_data.async_handle = &IO_Task_Async_Handle;
                poll_data.async_q = &IO_Task_Input_Q;
                poll_data.object_handle = NULL;
                //
                // The T_POLL_CONFIG_RSP message will be sent to the Protocol_Task() which is in S_IDLE state.
                //
                send_async_msg(cdesc, &poll_data);
              }

    3) Prelude to uv_poll_init() in poll_proxy() in the IO_Task() thread. (network_io.c)
    ------------------------------------------------------------------------------------
                case T_POLL_CONFIG_REQ:
                    //
                    // Main process request to start Libuv/epoll() operation.
                    //
                    poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));

                    syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_poll_init() - T_POLL_CONFIG_REQ\n",
                           cdesc->index, cdesc->fd);

                    if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
                      {
                        fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s",
                             r, uv_err_name(r));
                      }



    D) Stack backtrace.

    (gdb) bt
    #0  0x00007f62c47b0267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
    #1  0x00007f62c47b1eca in __GI_abort () at abort.c:89
    #2  0x000000000040d411 in fatal_error (log_prio=2,
        fmt=0x424e28 "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s") at misc.c:135
    #3  0x0000000000405b6d in poll_proxy (handle=0x641f80 <IO_Task_Async_Handle>) at network_io.c:291
    #4  0x00000000004114cb in uv__async_io (loop=0xc26800 <Poll_Loop>, w=<optimized out>, events=<optimized out>) at src/unix/async.c:163
    #5  0x000000000041d255 in uv__io_poll (loop=loop@entry=0xc26800 <Poll_Loop>, timeout=<optimized out>) at src/unix/linux-core.c:462
    #6  0x0000000000411c48 in uv_run (loop=0xc26800 <Poll_Loop>, mode=UV_RUN_DEFAULT) at src/unix/core.c:385
    #7  0x0000000000405835 in IO_Task (arg=0x0) at network_io.c:100
    #8  0x00007f62c4b4c6aa in start_thread (arg=0x7f62bbfff700) at pthread_create.c:333
    #9  0x00007f62c4881eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

    At this point cdesc->fd is 20 and Poll_Loop.watchers[20] is NULL.
    On 04/30/2021 11:48 AM, Paul Romero wrote:
    Hi:

    My guess is that there is a subtle, probably timing-related, problem
    in uv_poll_stop(), simply because I can't think of any other relevant
    Libuv routines that are explicitly called after uv_poll_start()
    succeeds. What do you think? Of course, there could be a problem in
    the way connections are rejected. All that code is in the
    make_incoming_connection() callback routine in main.c.

    Best Regards,

    Paul R.


    On 04/29/2021 03:14 PM, Paul Romero wrote:
    Hi:

    I reattached the following files with some confusing debugging stuff
    removed; a dump using them follows.

    * main.c
    * network_io.c
    * rx_fsm.c
    * async.c

    Again this is the relevant code segment.

        case T_POLL_CONFIG_REQ:
            //
            // Main process request to start Libuv/epoll() operation.
            //
            poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));

            if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
              {
                fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s",
                     r, uv_err_name(r));
              }

    This is the crash diagnostic.

    IO_TASK: POLL_PROXY -  Polling Initialization Error -17, EEXIST
    Aborted (core dumped)

    And this is the gdb backtrace, which is typical of the problem.

    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `./pexd'.
    Program terminated with signal SIGABRT, Aborted.
    #0  0x00007ff1da90b267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
    55    ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
    (gdb) bt
    #0  0x00007ff1da90b267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
    #1  0x00007ff1da90ceca in __GI_abort () at abort.c:89
    #2  0x000000000040d379 in fatal_error (log_prio=2,
        fmt=0x424ce0 "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s") at misc.c:135
    #3  0x0000000000405b21 in poll_proxy (handle=0x641f80 <IO_Task_Async_Handle>) at network_io.c:287
    #4  0x000000000041142b in uv__async_io (loop=0xc26800 <Poll_Loop>, w=<optimized out>, events=<optimized out>) at src/unix/async.c:163
    #5  0x000000000041d1b5 in uv__io_poll (loop=loop@entry=0xc26800 <Poll_Loop>, timeout=<optimized out>) at src/unix/linux-core.c:462
    #6  0x0000000000411ba8 in uv_run (loop=0xc26800 <Poll_Loop>, mode=UV_RUN_DEFAULT) at src/unix/core.c:385
    #7  0x000000000040580f in IO_Task (arg=0x0) at network_io.c:100
    #8  0x00007ff1daca76aa in start_thread (arg=0x7ff1d5a7c700) at pthread_create.c:333
    #9  0x00007ff1da9dceed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

    At this point cdesc->fd is 22 and Poll_Loop.watchers[22] = 0x17ef458, a valid address?

    On 04/29/2021 09:09 AM, Paul Romero wrote:
    Hi:

    The most common symptom of the problem is that uv_poll_init() fails,
    which means the problem must be related to the uv_loop_t Poll_Loop.
    The relevant code segment is as follows.

       poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));

       if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
         {
           fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s",
                r, uv_err_name(r));
         }

    The following files illustrate the problem and are attached
    to this message. Please advise me
    how to attach files to the list directly.

    * trace_badf.txt: A stack trace when the error code is EBADF.
    * trace_eexist.txt: A stack trace when the error code is EEXIST.
    * network_io.c: The IO_Task() thread code. The problem occurs in the poll_proxy() routine.
    * main.c: The main() process/thread code where most initialization occurs.
    * transmit.c: The Transmit_Task() thread.
    * rx_fsm.c: The heart of the protocol task code.


    You are right that only the IO_Task() performs polling-related
    operations. However, please note that the Transmit_Task() thread uses
    the underlying TCP socket descriptor to send data to the network with
    Linux send(). (See the TX_DATA() routine.) Also, USE_SSL and USE_SCRAM
    are not defined.
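
    For what it is worth, the send() path is plain Berkeley sockets; the
    shape of it is roughly as follows (a hypothetical sketch, not the real
    TX_DATA(), which is in the attached transmit.c):

        #include <errno.h>
        #include <sys/socket.h>
        #include <sys/types.h>

        /* Try to write len bytes on a TCP socket; returns the number of
         * bytes actually queued, or -1 on a hard error.  On EAGAIN or
         * EWOULDBLOCK it stops early and the caller retries later. */
        static ssize_t tx_data(int fd, const unsigned char *buf, size_t len)
        {
            size_t off = 0;

            while( off < len )
              {
                ssize_t n = send(fd, buf + off, len - off, MSG_NOSIGNAL);

                if( n >= 0 )
                    off += (size_t) n;
                else if( errno == EINTR )
                    continue;
                else if( errno == EAGAIN || errno == EWOULDBLOCK )
                    break;
                else
                    return -1;
              }
            return (ssize_t) off;
        }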

    I will send more information if I am able to reproduce the problem at
    another point. Also, I have seen the issue you mentioned before and
    suspect it may be relevant.
    Please let me know if you need anything else.


    Best Regards,

    Paul R.








    On 04/28/2021 08:52 PM, Jameson Nash wrote:
    I'm assuming from your description that you've ensured all
    libuv calls happen from exactly one thread (except
    uv_async_send). It sounds like you're describing the issue
    that
    
https://github.com/libuv/libuv/commit/c9406ba0e3d67907c1973a71968b89a6bd83c63c
    was intended to fix, which was included in v1.41.0.

    Note that the Poll_Loop.watchers[N] is not always required
    to be NULL, so perhaps you've found a case where it is still
    incorrectly expecting that? You might want to file an issue
    with code to reproduce the problem, or at least an `rr`
    trace dump of the code that hits the problem (note that `rr`
    could potentially upload all of your hard drive, so be
    careful what you are submitting).

    On Wed, Apr 28, 2021 at 11:13 PM pa...@rcom-software.com wrote:

        Hi Folks:

        I am experiencing an intermittent problem with
        uv_poll_init() after the successful establishment and
        release
        of multiple concurrent TCP connections. I am not sure if
        this problem is due to a bug, which may
        be corrected in another Libuv release, or if I am doing
        something wrong when releasing the poll handle.
        Do you have insight about the cause? The details I consider most
        important follow.

        The uv_poll_t callback function reads incoming TCP data
        as follows.

            if(events & UV_READABLE)
              {
                CONN_DESC *cdesc = (CONN_DESC *) poll_handle->data;
                n = recv(cdesc->fd, cdesc->rx_buf, RX_BUF_SIZE, MSG_DONTWAIT);

        NOTE: I am using Libuv version 1.41.0 running on Ubuntu
        Linux version 15.04.

        The problem is that uv_poll_init() fails, normally with
        the EEXIST or EBADF error code, and
        my investigation indicates the uv_loop_t
        Poll_Loop.watchers[N] field is not NULL when it should be,
        where N is the TCP socket descriptor. It occurs immediately
        after the uv_poll_t poll_handle is allocated.
        (There is exactly one TCP socket descriptor per poll
        handle.) The call to uv_poll_init() is as follows
        and the socket descriptor is obtained with uv_fileno().

            if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
              {
                fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s",
                     r, uv_err_name(r));
              }

        It occurs in the IO_Task() thread when multiple TCP socket
        descriptors are in use. The IO_Task() releases the poll_handle
        with the following code sequence when it is notified, via a Libuv
        async message from the Protocol_Task() thread, that polling should
        stop.

            if( (r = uv_poll_stop(poll_handle)) )
              {
                fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Poll Stop Error %d, %s",
                     r, uv_err_name(r) );
              }

            poll_handle->data = (void *) cdesc;
            uv_close((uv_handle_t *) poll_handle, async_close_callback);

        The actual release occurs in async_close_callback() as follows, and
        the Protocol_Task() releases the TCP socket descriptor with a
        normal Linux close() after it receives the T_DISMANTLE_RSP message.

            VFREE((UCHAR *) handle);
            //
            // Send a notification message to the Protocol_Task.
            //
            msg = MSG_ALLOC(0, FALSE);
            msg->class = C_NOTIFY;
            msg->type = T_DISMANTLE_RSP;
            msg->info = 0;

            SendProtoMsg(cdesc, msg);

        I think the underlying cause is that if a new TCP connection uses
        the same TCP socket descriptor as one released with uv_poll_stop()
        and uv_close(), the call to uv_poll_init() occurs before the socket
        closure has propagated into the uv_loop_t Poll_Loop.
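
        If that is the cause, one way to close the window would be to
        refuse to issue a new T_POLL_CONFIG_REQ for a descriptor number
        whose previous poll handle has not finished closing. Roughly (a
        sketch only; Pending_Close[], MAX_POLL_FD and the helper names are
        hypothetical, and in real code the flag would be guarded by
        Protocol_D_Mutex):

            #define MAX_POLL_FD 1024

            /* Set when uv_close() is issued for the old poll handle and
             * cleared in async_close_callback(), i.e. only once the handle
             * has really left Poll_Loop. */
            static int Pending_Close[MAX_POLL_FD];

            /* IO_Task() thread, T_POLL_DISMANTLE_REQ case, just before uv_close(): */
            static void mark_poll_closing(int fd)  { Pending_Close[fd] = 1; }

            /* IO_Task() thread, in async_close_callback(), after VFREE(handle): */
            static void mark_poll_closed(int fd)   { Pending_Close[fd] = 0; }

            /* main() thread, after uv_fileno() returns the (possibly reused)
             * fd: defer or reject the connection instead of sending
             * T_POLL_CONFIG_REQ while this returns 0. */
            static int poll_config_allowed(int fd)
            {
                return (fd >= 0 && fd < MAX_POLL_FD && !Pending_Close[fd]);
            }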

        Best Regards,

        Paul Romero




--


Paul Romero
-----------
RCOM Communications Software
EMAIL: pa...@rcom-software.com
PHONE: (510)482-2769



