[ 
https://issues.apache.org/jira/browse/PROTON-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468176#comment-16468176
 ] 

Alan Conway edited comment on PROTON-1842 at 5/9/18 1:07 AM:
-------------------------------------------------------------

Another note: the latest threaderciser shows the race with flags "-listen 
-connect -close-listen", so the only things racing here are IO events 
from connection errors and proactor-generated wakes. There are no user wakes 
involved.

I am seeing a race between pn_proactor_done() (user thread) deciding to finalize 
a connection, and an epoll thread waking up to process it. The epoll thread is 
racing to lock the context mutex while the user thread is deleting it. I have 
not seen a crash, but it clearly could crash with the right timing.

Speculating: we need to bring back something like the ee->mutex to sync around 
epoll mods and waits. The variables in 

bool pconnection_is_final(pconnection_t *pc) {
  return !pc->current_arm && !pc->timer_armed && !pc->context.wake_ops;
}

need to be synchronized around epoll events. Right now it seems that 
is_final can return true concurrently with epoll_wait returning the same pc, 
which suggests that current_arm is not properly synced.
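
To make the invariant concrete, here is a minimal sketch of what "synchronized 
around epoll events" could look like. It is not Proton's epoll.c and not a 
proposed patch: sketch_pc_t and every sketch_* function are hypothetical names, 
only the three flags from pconnection_is_final() are modelled, and no real 
epoll calls are made. The assumption is that every write to current_arm and 
every is_final check happens under one mutex, and that the epoll side clears 
current_arm only once it has finished touching the connection.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pconnection_t: only the fields read by
 * pconnection_is_final() are modelled. */
typedef struct {
  pthread_mutex_t mutex;  /* stand-in for the mutex guarding arm state */
  uint32_t current_arm;   /* non-zero while an epoll event may be pending */
  bool timer_armed;
  int wake_ops;
} sketch_pc_t;

void sketch_pc_init(sketch_pc_t *pc) {
  pthread_mutex_init(&pc->mutex, NULL);
  pc->current_arm = 0;
  pc->timer_armed = false;
  pc->wake_ops = 0;
}

/* Caller must hold pc->mutex. */
bool sketch_is_final(const sketch_pc_t *pc) {
  return !pc->current_arm && !pc->timer_armed && !pc->wake_ops;
}

/* Record that the connection is armed in epoll; a real implementation
 * would issue the epoll modification while holding the same lock. */
void sketch_arm(sketch_pc_t *pc, uint32_t events) {
  assert(events != 0);
  pthread_mutex_lock(&pc->mutex);
  pc->current_arm = events;
  pthread_mutex_unlock(&pc->mutex);
}

/* Epoll-thread side: after handling the event, clear current_arm under
 * the lock, as the last access to pc made by this thread. */
void sketch_on_epoll_event(sketch_pc_t *pc) {
  pthread_mutex_lock(&pc->mutex);
  pc->current_arm = 0;
  pthread_mutex_unlock(&pc->mutex);
}

/* User-thread side (analogue of the pn_proactor_done() decision): the
 * caller may free pc only if this returns true. */
bool sketch_done(sketch_pc_t *pc) {
  pthread_mutex_lock(&pc->mutex);
  bool final = sketch_is_final(pc);
  pthread_mutex_unlock(&pc->mutex);
  return final;
}
```

With this ordering, sketch_done() cannot report a connection as final while an 
event for it is still armed or being processed. Whether the real code can hold 
a lock across its epoll modifications and waits without hurting throughput is 
a separate question that this sketch does not answer.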


> [c] Dispatch/Proton crashes when opening/closing connections
> ------------------------------------------------------------
>
>                 Key: PROTON-1842
>                 URL: https://issues.apache.org/jira/browse/PROTON-1842
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: proton-c
>    Affects Versions: proton-c-0.22.0
>            Reporter: Chuck Rolke
>            Priority: Major
>         Attachments: helloworld.cpp, race.tsan, race.vg
>
>
> Using proton cpp example code that is modified to open and close connections 
> by the thousands in the main loop and having the event loop short circuit any 
> messaging with:
> {{  void on_connection_open(proton::connection& c) {}}
> {{      c.close();}}
> {{  }}}
> and then directing this client example to a dispatch router 1.1.0. Eventually 
> (after 100,000 to 1,000,000 connection open/closes) the router crashes with:
> {{qdrouterd: /home/chug/git/qpid-proton/c/src/proactor/epoll.c:466: 
> wake_pop_front: Assertion `p->wakes_in_progress' failed.}}
> and with:
> {{qdrouterd: /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2014: 
> proactor_do_epoll: Assertion `ee->type == PCONNECTION_TIMER' failed.}}
> This issue seems to happen only with qpid-dispatch accepting the open/close 
> event stream. Proton cpp example _server_direct_ and c example _direct_ work 
> properly with the same open/close event stream mounting into the 10s of 
> millions of connections.
> A core dump backtrace with the PCONNECTION_TIMER failure reads as:
> {{(gdb) bt}}
> {{#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51}}
> {{#1  0x00007f795c712c41 in __GI_abort () at abort.c:79}}
> {{#2  0x00007f795c709f7a in __assert_fail_base (fmt=0x7f795c85a260 
> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
> assertion=assertion@entry=0x7f795d72e15a "ee->type == PCONNECTION_TIMER", }}
> {{    file=file@entry=0x7f795d72de98 
> "/home/chug/git/qpid-proton/c/src/proactor/epoll.c", line=line@entry=2014, }}
> {{    function=function@entry=0x7f795d72e320 <__PRETTY_FUNCTION__.6307> 
> "proactor_do_epoll") at assert.c:92}}
> {{#3  0x00007f795c709ff2 in __GI___assert_fail (assertion=0x7f795d72e15a 
> "ee->type == PCONNECTION_TIMER", file=0x7f795d72de98 
> "/home/chug/git/qpid-proton/c/src/proactor/epoll.c", line=2014, }}
> {{    function=0x7f795d72e320 <__PRETTY_FUNCTION__.6307> "proactor_do_epoll") 
> at assert.c:101}}
> {{#4  0x00007f795d72d29f in proactor_do_epoll (p=0x26b7310, can_block=true) 
> at /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2014}}
> {{#5  0x00007f795d72d30e in pn_proactor_wait (p=0x26b7310) at 
> /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2030}}
> {{#6  0x00007f795dbe89ad in thread_run (arg=0x26be750) at 
> /home/chug/git/qpid-dispatch/src/server.c:946}}
> {{#7  0x00007f795d50e50b in start_thread (arg=0x7f794f486700) at 
> pthread_create.c:465}}
> {{#8  0x00007f795c7d216f in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95}}
>  


