Reviewed:  https://review.opendev.org/667421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Submitter: Zuul
Branch:    master
commit 03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Author: Lee Yarwood <[email protected]>
Date:   Tue Jun 25 18:20:24 2019 +0100

    libvirt: Add a rbd_connect_timeout configurable

    Previously the initial call to connect to a RBD cluster via the RADOS
    API could hang indefinitely if network or other environment-related
    issues were encountered. When this happened during a call to
    update_available_resource, the local n-cpu service could report as UP
    while never being able to break out of a subsequent RPC timeout loop,
    as documented in this bug.

    This change adds a simple timeout configurable to be used when
    initially connecting to the cluster [1][2][3]. The default timeout of
    5 seconds is small enough to ensure that, if the hang is encountered,
    the n-cpu service can be marked as DOWN before an RPC timeout is seen.

    [1] http://docs.ceph.com/docs/luminous/rados/api/python/#rados.Rados.connect
    [2] http://docs.ceph.com/docs/mimic/rados/api/python/#rados.Rados.connect
    [3] http://docs.ceph.com/docs/nautilus/rados/api/python/#rados.Rados.connect

    Closes-bug: #1834048
    Change-Id: I67f341bf895d6cc5d503da274c089d443295199e

** Changed in: nova
   Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1834048

Title:
  Nova waits indefinitely on ceph client hangs due to network problems

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========
  Requested to be filed by sean-k-mooney as "not a ceph problem".

  During what looks like the update_available_resource process, queries
  to ceph are made to check available space, etc. In cases where there
  is packet loss between the compute node and ceph, the ceph client may
  hang for up to 30 seconds per dropped request.
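For operators, the new option is set in nova.conf. A minimal fragment, assuming the option lives in the [libvirt] section (inferred from the commit's "libvirt:" prefix; the name and 5-second default come from the commit message above):

```ini
[libvirt]
# Seconds to wait for the initial RADOS connection before giving up,
# instead of hanging indefinitely on network problems.
rbd_connect_timeout = 5
```

The value is ultimately handed to rados.Rados.connect(), whose timeout parameter is documented in the Ceph links [1][2][3] above.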
  This freezes up nova's queue, and enough sequential failures
  eventually show up as a "too many missed heartbeats" rabbitmq error,
  which interrupts and restarts the cycle over again. As suggested by
  Sean, it might be best to put a configurable timeout on ceph calls
  during this process to ensure nova doesn't lock up/flap, and so that
  ceph backend network issues are reported for debugging.

  Steps to reproduce
  ==================
  1. Introduce a silent failure of the ceph client: one-way packet loss
     via mismatched LACP MTU across switches, bad triangular routing,
     flapping links, etc.
  2. Observe the symptom of nova hanging long enough to miss 60 seconds
     of rabbitmq heartbeats, with debug output showing a hang on
     update_available_resource at
     /var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/resource_tracker.py:704

  Expected result
  ===============
  nova alerting of a ceph connection timeout

  Actual result
  =============
  nova hangs for 60 seconds while being in the "up" state, flapping for
  a couple of seconds every 60 seconds as it hits the rabbitmq error and
  reconnects, but it is in a non-functional state and ignores all
  instructions on the message bus.

  Environment
  ===========
  nova==18.1.0 rocky

  Logs & Configs
  ==============
  No direct logs other than rabbitmq's complaints of timeouts as a
  symptom.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1834048/+subscriptions
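The fix passes the configured timeout straight to rados.Rados.connect(), per the Ceph docs linked above. As a generic, self-contained illustration of the failure mode and the remedy, the sketch below bounds an arbitrary blocking connect call with a deadline so the caller can report DOWN instead of hanging; the helper name and the worker-thread approach are illustrative only, not nova's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def connect_with_timeout(connect_fn, timeout=5.0):
    """Run a potentially hanging connect call, bounded by a timeout.

    connect_fn is executed in a worker thread; the caller waits at most
    `timeout` seconds for a result, then raises instead of blocking
    indefinitely (the analogue of nova's rbd_connect_timeout).
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(connect_fn)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise RuntimeError(
            "connect attempt exceeded %.1fs timeout" % timeout)
    finally:
        # Do not block waiting for a hung worker; note the worker thread
        # itself may linger until the blocking call eventually returns.
        pool.shutdown(wait=False)
```

This keeps the service loop responsive: a hung backend surfaces as a prompt, loggable error rather than a 60-second messagebus stall.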

