Bug closed due to lack of activity, please feel free to reopen if needed.

** Changed in: neutron
       Status: New => Won't Fix
https://bugs.launchpad.net/bugs/1853071

Title:
  AMQP disconnects, q-reports-plugin queue grows, leading to DBDeadlocks while trying to update agent heartbeats

Status in neutron:
  Won't Fix

Bug description:

Since upgrading to Rocky, we have seen this issue pop up in several environments, small and large. First we see various AMQP/Rabbit-related errors: missed heartbeats from neutron-server to rabbitmq, then repeated errors such as Socket closed, Broken pipe, etc. This continues for a while until all agents report as dead. On the agent side, we see RPC timeouts when trying to report state. Meanwhile, the q-reports-plugin queue in rabbit grows to 10k+ messages, presumably because neutron-server cannot connect to Rabbit and process them. Eventually, some time later, we see "DBDeadlock: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')" errors while neutron-server is trying to update stale agent heartbeats.

Examples of the various AMQP-related errors - all slightly different:

2019-11-18 07:38:55,200.200 22488 ERROR oslo.messaging._drivers.impl_rabbit [req-cba0d0fa-8e5a-42f1-a93b-4bb398b22275 - - - - -] [eba472a9-021d-4738-801b-7944aad3e3af] AMQP server 127.0.0.1:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed

2019-11-18 07:40:22,454.454 22489 ERROR oslo.messaging._drivers.impl_rabbit [-] [6eb8d074-02c7-4add-8d91-768cbae60fdc] AMQP server on 127.0.0.1:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.: ConnectionForced: Too many heartbeats missed

2019-11-18 07:40:22,586.586 22489 ERROR oslo.messaging._drivers.impl_rabbit [req-0b9f092f-27f2-4be1-bdf5-2c5208e54321 - - - - -] [4b43df2c-cc3e-4442-807c-dcfd4057cb3d] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.: error: [Errno 32] Broken pipe

2019-11-18 07:42:06,010.010 22487 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 32] Broken pipe

2019-11-18 07:58:26,692.692 22489 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed

2019-11-18 07:58:26,696.696 22489 ERROR oslo.messaging._drivers.impl_rabbit [-] [84273ffb-1610-44b1-aff7-d5e4606b7f59] AMQP server on 127.0.0.1:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.: RecoverableConnectionError: <AMQPError: unknown error>
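While this is happening, the q-reports-plugin depth can be watched from the neutron-server host. A minimal sketch, assuming the RabbitMQ management plugin is enabled on its default port 15672 with the default "/" vhost and guest credentials (all of which are deployment-specific):

    import time

    import requests

    # Assumed endpoint: RabbitMQ management API on the neutron-server host,
    # default "/" vhost (URL-encoded as %2F); adjust host, vhost and
    # credentials for your deployment.
    URL = "http://127.0.0.1:15672/api/queues/%2F/q-reports-plugin"

    while True:
        queue = requests.get(URL, auth=("guest", "guest")).json()
        # "messages" is the ready + unacknowledged depth; "consumers" should
        # stay above zero while neutron-server RPC workers are draining the
        # queue.
        print("depth=%s consumers=%s"
              % (queue.get("messages"), queue.get("consumers")))
        time.sleep(10)

A depth that keeps climbing while the consumer count sits at or near zero would match the symptom above, where neutron-server stops draining the queue.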
Along with the following Broken pipe stacktrace in oslo.messaging: http://paste.openstack.org/show/786312/

This continues for some time (30 minutes to 1 hour), all agents report as dead, and we see the following errors in the rabbitmq broker logs - first missed-heartbeat errors, then handshake_timeout errors:

2019-11-18 07:41:01.448 [error] <0.6126.71> closing AMQP connection <0.6126.71> (127.0.0.1:39817 -> 127.0.0.1:5672 - neutron-server:22487:ee468e25-42d7-45b8-aea0-4f6fb58a9034): missed heartbeats from client, timeout: 60s

2019-11-18 07:41:07.665 [error] <0.18727.72> closing AMQP connection <0.18727.72> (127.0.0.1:51762 -> 127.0.0.1:5672): {handshake_timeout,frame_header}

Eventually, after the q-reports-plugin queue has grown, neutron-server reports the following DBDeadlock stacktrace:

2019-11-18 08:51:14,505.505 22493 ERROR oslo_db.api [req-231004a2-d988-47b3-9730-d6b5276fdcf8 - - - - -] DB exceeded retry limit.: DBDeadlock: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE agents SET heartbeat_timestamp=%s WHERE agents.id = %s'] [parameters: (datetime.datetime(2019, 11, 18, 8, 50, 23, 804716), '223c754e-9d7f-4df3-b5a5-9be4eb8692b0')] (Background on this error at: http://sqlalche.me/e/e3q8)

Full stacktrace here: http://paste.openstack.org/show/786313/

The only way we have found to recover is to stop neutron-server and rabbitmq, kill any neutron workers still dangling (which there usually are), and then restart both. The problem then manifests again days or a week later.

Rabbitmq is on the same host as neutron-server, so it is all localhost communication, and we are unsure why it cannot heartbeat or connect. The subsequent DBDeadlock also leads me to think there is some synchronization issue once neutron-server gets overwhelmed with outstanding RPC messages.
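For context on the "DB exceeded retry limit." line above: that message is logged by oslo.db's wrap_db_retry helper once its deadlock retries are exhausted, after which the DBDeadlock is re-raised to the caller. A minimal, self-contained sketch of that retry behaviour - simulated_heartbeat_update and the forced DBDeadlock are purely illustrative stand-ins for the real "UPDATE agents SET heartbeat_timestamp=..." statement shown in the stacktrace:

    from oslo_db import api as oslo_db_api
    from oslo_db import exception as db_exc

    attempts = []

    # Hypothetical stand-in for the real heartbeat update.
    @oslo_db_api.wrap_db_retry(max_retries=3, retry_on_deadlock=True,
                               retry_interval=0.1)
    def simulated_heartbeat_update(agent_id):
        attempts.append(agent_id)
        # Force the exception oslo.db maps MySQL error 1205
        # ("Lock wait timeout exceeded") to.
        raise db_exc.DBDeadlock()

    try:
        simulated_heartbeat_update("223c754e-9d7f-4df3-b5a5-9be4eb8692b0")
    except db_exc.DBDeadlock:
        # Once the retries are used up, wrap_db_retry logs
        # "DB exceeded retry limit." and re-raises - the message seen in
        # the neutron-server log above.
        print("gave up after %d attempts" % len(attempts))

If the agents rows stay locked by another transaction for longer than the whole retry budget, the update fails in exactly this way, which would fit a neutron-server that has fallen far behind on its RPC backlog.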