[ https://issues.apache.org/jira/browse/PROTON-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin Ross reassigned PROTON-2411:
-----------------------------------

    Assignee: Andrew Stitcher

> Simultaneous idle timeout sequencing errors
> -------------------------------------------
>
>                 Key: PROTON-2411
>                 URL: https://issues.apache.org/jira/browse/PROTON-2411
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: proton-c
>    Affects Versions: proton-c-0.34.0
>            Reporter: Jaap Wiggelinkhuizen
>            Assignee: Andrew Stitcher
>            Priority: Critical
>             Fix For: proton-c-0.36.0
>
>         Attachments: p2411_0.diff
>
>
> In our mission-critical software we use Qpid Proton 0.34.0 in our C++ client
> software together with Qpid Dispatch Router 1.16.0. We updated to these
> versions fairly recently; previously we used Proton 0.25.0 and Dispatch 1.3.0.
> Our application runs on several VMs with a router on each VM. All clients
> connect to the local router only, and the routers connect to each other in a
> hub-and-spoke pattern. In both the client and the router configuration we
> have configured an idle timeout of 30 seconds.
> On July 4th we were confronted with an incident in production where many of
> our client processes reported problems regarding the idle timeouts. These
> client processes had already been running stably for more than 3 weeks. The
> problem appeared in two flavors:
> # Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout
> expired”
> # epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing
> error”
> On each VM at least 3 processes showed one of these problems within a total
> time window of less than a minute. So far we have not found any cause in the
> underlying hardware, hypervisor, network or operating system.
> Although we don’t know the root cause of the problems, we can handle the
> first situation by using the proper reconnect settings (by mistake we treated
> on_transport_error() as fatal; we will correct that so that only
> on_transport_close() is treated as fatal). The second situation, however, is
> stranger because it results in an abort within Proton itself. The comments
> in epoll_timer.c explain that this error occurs when a connection timer is
> moved backwards a second time. We don’t understand how this can suddenly
> happen.
>
> Last Sunday the problem occurred again on two more production sites where our
> software had been operational for just over 3 weeks. Again it happened on
> all VMs within a short timeframe. Interestingly, so far it has only occurred
> on Sunday mornings. Maybe it has something to do with how long the software
> has been running and the fact that on Sunday mornings there is less
> messaging traffic, i.e. more heartbeats?
>
> Unfortunately we haven't been able to reproduce the issue at our test
> facilities and hence cannot provide a reproducer.