[jira] [Commented] (PROTON-1999) [c] Crash in pn_connection_finalize

2019-02-03 Thread Cliff Jansen (JIRA)


[ https://issues.apache.org/jira/browse/PROTON-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759665#comment-16759665 ]

Cliff Jansen commented on PROTON-1999:
--

Jeremy: I did not fully understand the scenario you have laid out, but I think 
you have answered your own question:

   "Both threads are manipulating reference counts of objects, and I suspect a 
race condition. "

90% of the threading information in cpp/docs/mt.md is aimed at preventing this 
kind of cross-thread reference counting, which never ends well.

What trips most people up is that connection::~connection() is just as 
thread-unsafe as proton::open_session(). Usually this happens because some shared 
object passed between threads has an embedded proton::connection object. Calls 
to connection::foo() are carefully scrutinized for obeying "the rules", but the 
destructor is often left to the whims of a "shared pointer last reference", 
which releases that last reference in the correct thread only 50% of the time.

But it can also happen if a user object with an embedded Proton object is 
copied in the wrong thread, or the Proton object is passed by value (copy!) 
instead of by reference to a method in the wrong thread.

In your specific case, I wonder if activities (destructors?) that are happening 
in a non-container thread need to be passed to a relevant work_queue instead, 
before the container::stop() is invoked.  That's mostly a guess at this point 
and may be at best a "treat the symptom" suggestion as opposed to good design 
advice.
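
One way to read that suggestion, purely as a sketch (the names below are 
illustrative, not from the attached example): have the non-container thread hand 
its Proton cleanup to the connection's work queue and wait for it, and only then 
call container::stop().

{code}
// Sketch only: release Proton objects in the container thread via the
// work_queue before stopping the container from an application thread.
// app_state, snd and wq are illustrative names, not from the attached example.
#include <proton/container.hpp>
#include <proton/sender.hpp>
#include <proton/work_queue.hpp>
#include <future>

struct app_state {
    proton::sender snd;               // embedded Proton object, bound to the container thread
    proton::work_queue* wq = nullptr; // cached from a handler callback
};

void shutdown(app_state& s, proton::container& c) {
    if (s.wq) {
        std::promise<void> done;
        s.wq->add([&s, &done] {
            s.snd.close();
            s.snd = proton::sender(); // drop our reference in the container thread
            done.set_value();
        });
        done.get_future().wait();     // block until the cleanup has run
    }
    c.stop();                         // only stop once nothing is left to destroy off-thread
}
{code}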

> [c] Crash in pn_connection_finalize
> ---
>
> Key: PROTON-1999
> URL: https://issues.apache.org/jira/browse/PROTON-1999
> Project: Qpid Proton
>  Issue Type: Bug
>  Components: cpp-binding, proton-c
>Affects Versions: proton-c-0.26.0
> Environment: Linux 64-bits (Ubuntu 16.04 and Oracle Linux 7.4)
>Reporter: Olivier Delbeke
>Assignee: Cliff Jansen
>Priority: Major
> Attachments: call_stack.txt, example2.cpp, log.txt, main.cpp, 
> run_qpid-broker.sh
>
>
> Here is my situation: I have several proton::containers (~20).
> Each one has its own proton::messaging_handler and handles one
> proton::connection to a local qpid-broker (everything runs on the same Linux
> machine).
> 20 x (one container with one handler with one connection with one link)
> Some containers/connections/handlers work in send mode; they have one link
> that is a proton::sender.
> Some containers/connections/handlers work in receive mode; they have one
> link that is a proton::receiver. Each time they receive an input message,
> they do some processing on it, and finally add a "sender->send()" task to the
> work queue of some of the sender handlers (by calling work_queue()->add([=] {
> sender->send(msg); }) as shown in the multi-threading examples).
> This works fine for some time (tens of thousands of messages, several minutes
> or hours), but eventually crashes, either with a SEGFAULT (when the
> qpid-proton lib is compiled in release mode) or with an assert (in debug
> mode) at qpid-proton/c/src/core/engine.c line 483,
> assert(!conn->transport->referenced), in function pn_connection_finalize().
> The Proton logs (activated with export PN_TRACE_FRM=1) do not show anything
> abnormal (no loss of connection, no rejection of messages, no timeouts, ...).
> As the connection is not closed, I wonder why pn_connection_finalize() would
> be called in the first place.
> I attached the logs and the call trace.
> Happens on 0.26.0 but also reproduced with the latest master (Jan 28, 2019).
>  
>  






[jira] [Commented] (PROTON-1999) [c] Crash in pn_connection_finalize

2019-02-03 Thread Cliff Jansen (JIRA)


[ https://issues.apache.org/jira/browse/PROTON-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759664#comment-16759664 ]

Cliff Jansen commented on PROTON-1999:
--

Olivier Delbeke: the mapping mechanism between Proton-C objects and the extra 
state in their C++ counterparts is not thread safe (pn_XXX_attachments()).

Perhaps that's an implementation detail that could be changed, but the current 
C++ philosophy is to superimpose minimal additional locking on top of the 
Proton-C threading model. We try to piggy-back naturally on where threading is 
safe or unsafe in Proton-C, with the addition of (rare) extra locking only on 
work_queue::add() and the methods on proton::container.




[jira] [Commented] (PROTON-1999) [c] Crash in pn_connection_finalize

2019-02-01 Thread Jeremy (JIRA)


[ https://issues.apache.org/jira/browse/PROTON-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758319#comment-16758319 ]

Jeremy commented on PROTON-1999:


Hello [~ODelbeke] and [~cliffjansen],

[~ODelbeke]: Before going into my analysis, can you please attach the gdb 
stacks for the other threads as well? Specifically, what is happening in the 
main thread.

In fact, we are facing the same random crashes, even though we are using a 
pointer to a work queue. I've been debugging this for a couple of days now, and I 
suspect the problem comes from Proton's memory management. When we don't have 
exceptions, everything runs smoothly; as soon as exceptions start occurring, we 
start getting segfaults. On Proton container errors, we stop the container and 
join the thread, while in the meantime the main thread propagates the Proton 
error by throwing it as an exception (interrupting the normal flow and rolling 
back). We took care of enforcing the following order of construction/destruction 
of Proton objects through a RAII object we created (a sketch follows the two 
lists below):

Construction:
 * Create the handler
 * Create the container
 * Run the container in a new thread (we only call run in the new thread)
 * Use the handler, which can store proton objects (sender, receiver, trackers, 
and pointers to deliveries)

Destruction:
 * Release stored Proton objects from the handler (sender.close(), 
receiver.close(), empty the queues of trackers and deliveries)
 * Join the thread, i.e. wait for the run method to exit
 * Destroy the container
 * Destroy the handler
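
A minimal sketch of that ordering, with SenderHandler standing in for our real 
handler (its release() method is assumed to close the sender/receiver and empty 
the tracker/delivery queues):

{code}
// Illustrative sketch of the construction/destruction order listed above.
// SenderHandler stands in for our real handler; release() is assumed to close
// the sender/receiver and empty the queues of trackers and deliveries.
#include <proton/container.hpp>
#include <proton/messaging_handler.hpp>
#include <memory>
#include <string>
#include <thread>

class SenderHandler : public proton::messaging_handler {
    std::string url_;
  public:
    explicit SenderHandler(const std::string& url) : url_(url) {}
    void release() { /* sender.close(), receiver.close(), drop trackers/deliveries */ }
};

class proton_runner {
    std::unique_ptr<SenderHandler> handler_;        // 1. create the handler
    std::unique_ptr<proton::container> container_;  // 2. create the container
    std::thread runner_;                            // 3. run() only in a new thread

  public:
    explicit proton_runner(const std::string& url)
        : handler_(new SenderHandler(url)),
          container_(new proton::container(*handler_)) {
        runner_ = std::thread([this] { container_->run(); });
    }

    ~proton_runner() {
        handler_->release();                     // 1. release stored Proton objects
        // The container is assumed to have stopped already (e.g. on error, as
        // described above); otherwise run() would not return and join() would block.
        if (runner_.joinable()) runner_.join();  // 2. wait for run() to exit
        container_.reset();                      // 3. destroy the container
        handler_.reset();                        // 4. destroy the handler
    }
};
{code}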

Even then, the segfaults persisted.

Scenario:

We have 3 threads: the Main thread, the proton container thread for the sender, 
and a proton container thread for a broker.

In the proton handler, we have a send message method which looks just like the 
above examples, with the additional twist that our send message can throw an 
exception in the Main thread. We want to keep the tracker for further 
processing later. The code looks like this:
{code}
void SenderHandler::send(proton::message m)
{
    ...
    std::promise<proton::tracker> messageWillBeSent;
    m_senderWorkQueue->add([&] {
        messageWillBeSent.set_value(m_sender.send(m_messageToSend));
    });
    auto tracker = messageWillBeSent.get_future().get();

    waitForTrackerSettle(timeout); // checks for errors in proton, and throws an
                                   // exception if an error did occur in proton
}
{code}
 In our case, we are simulating a problem with the broker. Therefore, the send 
will hit an exception in the waitForTrackerSettle method.

The main thread then starts to unwind, beginning with the destruction of the 
tracker. The Proton container thread, which hit an error and propagated it to 
the main thread, was meanwhile finishing the run method and exiting. Both 
threads are manipulating reference counts of objects, and I suspect a race 
condition. Taking a look at the reference counting mechanism in Proton 
([object.c|https://github.com/apache/qpid-proton/blob/0.26.0/c/src/core/object/object.c]), 
I see that the operations on reference counters are not atomic. In C++, 
shared_ptr reference counter operations are known to be atomic 
([shared_ptr_base.h|https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/shared_ptr_base.h]). 
I strongly suspect that this is not safe.

We get these cores randomly, along with stacks that look exactly like the one 
you attached (with the main thread waiting on thread.join()). Replying to 
[~ODelbeke]: "However, I still don't really understand why it solves the 
problem." We noticed that the smallest change in the code results in a 
different stack (sometimes the destructor of the connection, other times the 
destructors of trackers, senders, ...). I'm not sure the result you're getting 
now is not random.

[~cliffjansen] I think you might better understand the inner workings of 
Proton's memory management model. Were race conditions on the reference 
counters factored into the design of Proton's memory management?

I will be testing a Proton patch locally that replaces Proton's plain int 
reference counter with std::atomic.
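
As a standalone illustration of the kind of lost update I mean (this is not 
Proton code, just a demonstration that unsynchronized increments on a plain int 
race while std::atomic does not):

{code}
// Standalone demonstration, not Proton code: two threads bump a plain int and
// an atomic int the same number of times; the plain counter loses updates.
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    int plain_count = 0;
    std::atomic<int> atomic_count{0};

    auto bump = [&] {
        for (int i = 0; i < 1000000; ++i) {
            ++plain_count;            // unsynchronized read-modify-write: a data race
            atomic_count.fetch_add(1, std::memory_order_relaxed); // atomic read-modify-write
        }
    };

    std::thread t1(bump), t2(bump);
    t1.join();
    t2.join();

    // plain_count is usually well below 2000000; atomic_count is always exactly 2000000.
    std::cout << "plain=" << plain_count << " atomic=" << atomic_count.load() << '\n';
    return 0;
}
{code}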

[jira] [Commented] (PROTON-1999) [c] Crash in pn_connection_finalize

2019-01-30 Thread Olivier Delbeke (JIRA)


[ https://issues.apache.org/jira/browse/PROTON-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756102#comment-16756102 ]

Olivier Delbeke commented on PROTON-1999:
-

Thank you so much.

Calling sender.work_queue() from one of the callbacks and saving the pointer 
does indeed seem to improve things, surprisingly. I need to do more tests to be 
completely sure, as it's not that easy to reproduce, but it looks good so far. 
My understanding was that this call was a simple getter and therefore 
thread-safe, but it seems that I was wrong.




[jira] [Commented] (PROTON-1999) [c] Crash in pn_connection_finalize

2019-01-29 Thread Cliff Jansen (JIRA)


[ https://issues.apache.org/jira/browse/PROTON-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755400#comment-16755400 ]

Cliff Jansen commented on PROTON-1999:
--

In example2.cpp,

  sender.work_queue()

is not thread-safe, and hence neither is send_message().

Take a look at the examples where a pointer to the work_queue is carefully 
managed: NULL on init, set from within the handler callback, and your 
send_message() would use:

  my_work_queue_ptr_->add( the_work );

Please give that a try and report back.
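
A minimal sketch of that pattern, loosely along the lines of the multithreaded 
client example (sender_handler and the member names below are illustrative, not 
from example2.cpp):

{code}
// Illustrative sketch of the pattern above: the work_queue pointer starts out
// NULL, is set from a handler callback (container thread), and every user
// thread goes through that cached pointer under a lock.
#include <proton/message.hpp>
#include <proton/messaging_handler.hpp>
#include <proton/sender.hpp>
#include <proton/work_queue.hpp>
#include <mutex>

class sender_handler : public proton::messaging_handler {
    std::mutex lock_;
    proton::sender sender_;
    proton::work_queue* work_queue_ = nullptr;   // NULL on init

    void on_sender_open(proton::sender& s) override {
        std::lock_guard<std::mutex> g(lock_);
        sender_ = s;
        work_queue_ = &s.work_queue();           // set from within the handler callback
    }

  public:
    // Called from any user thread.
    bool send_message(const proton::message& m) {
        std::lock_guard<std::mutex> g(lock_);
        if (!work_queue_) return false;          // sender not open yet
        // Capture 'this' and a copy of the message; the proton::sender itself
        // is only touched inside the lambda, i.e. in the container thread.
        return work_queue_->add([this, m]() { sender_.send(m); });
    }
};
{code}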




[jira] [Commented] (PROTON-1999) [c] Crash in pn_connection_finalize

2019-01-29 Thread Olivier Delbeke (JIRA)


[ https://issues.apache.org/jira/browse/PROTON-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755095#comment-16755095 ]

Olivier Delbeke commented on PROTON-1999:
-

I just attached minimal sample code that reproduces the issue, and a script to 
start the qpid broker.
