[ https://issues.apache.org/jira/browse/PROTON-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468176#comment-16468176 ]
Alan Conway edited comment on PROTON-1842 at 5/9/18 1:07 AM:
-------------------------------------------------------------

Another note: the latest threaderciser shows the race with the flags "-listen -connect -close-listen", so the only things racing here are IO events from connection errors and proactor-generated wakes. There are no user wakes involved.

I am seeing a race between pn_proactor_done() (user thread) deciding to finalize a connection, and an epoll thread waking up to process it. The epoll thread is racing to lock the context mutex while the user thread is deleting it. I'm not seeing a crash, but it's clear that it could crash with the right timing.

Speculating: we need to bring back something like the ee->mutex to sync around epoll mods and waits. The variables in

    pconnection_is_final(pconnection_t *pc) {
      return !pc->current_arm && !pc->timer_armed && !pc->context.wake_ops;
    }

need to be synchronized around epoll events, because right now it seems that is_final can return true concurrently with epoll_wait returning the same pc, so it seems like current_arm is not properly synced.

was (Author: aconway):
Another note: the latest threaderciser shows the race with the flags "-listen -connect -close-listen", so the only things racing here are IO events from connection errors and proactor-generated wakes. There are no user wakes involved.

I am seeing a race between pn_proactor_done() (user thread) deciding to finalize a connection, and an epoll thread waking up to process it. The epoll thread is racing to lock the context mutex while the user thread is deleting it. I'm not seeing a crash, but it's clear that it could crash with the right timing.
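To make the speculation in the comment concrete, here is a minimal sketch of the general idea: every read and write of the state behind the finality check goes through one mutex. It is not the Proton code and not a proposed patch. All names (sketch_proactor_t, sketch_pconn_t, sketch_arm, sketch_event_done, sketch_is_final) are hypothetical, and the lock is assumed to be owned by something that outlives the connection, so that the epoll thread never locks a mutex inside an object that is being deleted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock owner that outlives every connection (not a Proton type). */
typedef struct sketch_proactor {
  pthread_mutex_t arm_mutex;
} sketch_proactor_t;

/* Hypothetical stand-in for pconnection_t: only the fields named in the comment. */
typedef struct sketch_pconn {
  sketch_proactor_t *proactor;
  uint32_t current_arm; /* epoll events currently armed, 0 when disarmed */
  bool timer_armed;
  int wake_ops;
} sketch_pconn_t;

void sketch_proactor_init(sketch_proactor_t *p) {
  int rc = pthread_mutex_init(&p->arm_mutex, NULL);
  assert(rc == 0);
  (void)rc;
}

void sketch_pconn_init(sketch_pconn_t *pc, sketch_proactor_t *p) {
  pc->proactor = p;
  pc->current_arm = 0;
  pc->timer_armed = false;
  pc->wake_ops = 0;
}

/* Record that the connection is armed. In real code the epoll_ctl() call
 * would be assumed to sit inside this same critical section. */
void sketch_arm(sketch_pconn_t *pc, uint32_t events) {
  pthread_mutex_lock(&pc->proactor->arm_mutex);
  pc->current_arm = events;
  pthread_mutex_unlock(&pc->proactor->arm_mutex);
}

/* Epoll thread: clear the arm only once it has finished with the event. */
void sketch_event_done(sketch_pconn_t *pc) {
  pthread_mutex_lock(&pc->proactor->arm_mutex);
  pc->current_arm = 0;
  pthread_mutex_unlock(&pc->proactor->arm_mutex);
}

/* User thread: the finality check reads all three fields under the lock. */
bool sketch_is_final(sketch_pconn_t *pc) {
  pthread_mutex_lock(&pc->proactor->arm_mutex);
  bool final = !pc->current_arm && !pc->timer_armed && !pc->wake_ops;
  pthread_mutex_unlock(&pc->proactor->arm_mutex);
  return final;
}
```

With this shape, the finality check cannot observe a half-updated arm state, and it stays false for as long as an armed event has not been marked done. Whether that is sufficient for the real proactor depends on where epoll_wait delivery and the disarm actually happen, which the sketch does not model.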
> [c] Dispatch/Proton crashes when opening/closing connections
> ------------------------------------------------------------
>
> Key: PROTON-1842
> URL: https://issues.apache.org/jira/browse/PROTON-1842
> Project: Qpid Proton
> Issue Type: Bug
> Components: proton-c
> Affects Versions: proton-c-0.22.0
> Reporter: Chuck Rolke
> Priority: Major
> Attachments: helloworld.cpp, race.tsan, race.vg
>
>
> Using proton cpp example code that is modified to open and close connections
> by the thousands in the main loop and having the event loop short circuit any
> messaging with:
> {{ void on_connection_open(proton::connection& c) {}}
> {{ c.close();}}
> {{ }}}
> and then directing this client example to a dispatch router 1.1.0. Eventually
> (after 100,000 to 1,000,000 connection open/closes) the router crashes with:
> {{qdrouterd: /home/chug/git/qpid-proton/c/src/proactor/epoll.c:466: wake_pop_front: Assertion `p->wakes_in_progress' failed.}}
> and with:
> {{qdrouterd: /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2014: proactor_do_epoll: Assertion `ee->type == PCONNECTION_TIMER' failed.}}
> This issue seems to happen only with qpid-dispatch accepting the open/close
> event stream. Proton cpp example _server_direct_ and c example _direct_ work
> properly with the same open/close event stream mounting into the 10s of
> millions of connections.
> A core dump backtrace with the PCONNECTION_TIMER failure reads as:
> {{(gdb) bt}}
> {{#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51}}
> {{#1 0x00007f795c712c41 in __GI_abort () at abort.c:79}}
> {{#2 0x00007f795c709f7a in __assert_fail_base (fmt=0x7f795c85a260 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f795d72e15a "ee->type == PCONNECTION_TIMER", }}
> {{ file=file@entry=0x7f795d72de98 "/home/chug/git/qpid-proton/c/src/proactor/epoll.c", line=line@entry=2014, }}
> {{ function=function@entry=0x7f795d72e320 <__PRETTY_FUNCTION__.6307> "proactor_do_epoll") at assert.c:92}}
> {{#3 0x00007f795c709ff2 in __GI___assert_fail (assertion=0x7f795d72e15a "ee->type == PCONNECTION_TIMER", file=0x7f795d72de98 "/home/chug/git/qpid-proton/c/src/proactor/epoll.c", line=2014, }}
> {{ function=0x7f795d72e320 <__PRETTY_FUNCTION__.6307> "proactor_do_epoll") at assert.c:101}}
> {{#4 0x00007f795d72d29f in proactor_do_epoll (p=0x26b7310, can_block=true) at /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2014}}
> {{#5 0x00007f795d72d30e in pn_proactor_wait (p=0x26b7310) at /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2030}}
> {{#6 0x00007f795dbe89ad in thread_run (arg=0x26be750) at /home/chug/git/qpid-dispatch/src/server.c:946}}
> {{#7 0x00007f795d50e50b in start_thread (arg=0x7f794f486700) at pthread_create.c:465}}
> {{#8 0x00007f795c7d216f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95}}