Reviewed:  https://review.opendev.org/667421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Submitter: Zuul
Branch:    master
commit 03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Author: Lee Yarwood <[email protected]>
Date:   Tue Jun 25 18:20:24 2019 +0100

    libvirt: Add a rbd_connect_timeout configurable

    Previously the initial call to connect to a RBD cluster via the RADOS
    API could hang indefinitely if network or other environment-related
    issues were encountered. When this happened during a call to
    update_available_resource, the local n-cpu service could report as UP
    while never being able to break out of a subsequent RPC timeout loop,
    as documented in this bug.

    This change adds a simple timeout configurable to be used when
    initially connecting to the cluster [1][2][3]. The default timeout of
    5 seconds is small enough to ensure that, if the hang is encountered,
    the n-cpu service can be marked as DOWN before an RPC timeout is seen.

    [1] http://docs.ceph.com/docs/luminous/rados/api/python/#rados.Rados.connect
    [2] http://docs.ceph.com/docs/mimic/rados/api/python/#rados.Rados.connect
    [3] http://docs.ceph.com/docs/nautilus/rados/api/python/#rados.Rados.connect

    Closes-bug: #1834048
    Change-Id: I67f341bf895d6cc5d503da274c089d443295199e

** Changed in: nova
   Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1834048

Title:
  Nova waits indefinitely on ceph client hangs due to network problems

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========
  Requested to be filed by sean-k-mooney as "not a ceph problem".

  During what looks like the update_available_resource process, queries
  to ceph are made to check available space, etc. In cases where there
  is packet loss between the compute node and ceph, the ceph client may
  hang for up to 30 seconds per dropped request.
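For operators, the new option is set in nova.conf. A minimal fragment, assuming the option lives in the [libvirt] section (inferred from the commit's "libvirt:" prefix; the name and 5-second default come from the commit message above):

```ini
[libvirt]
# Seconds to wait for the initial RADOS connection before giving up,
# instead of hanging indefinitely on network problems.
rbd_connect_timeout = 5
```

The value is ultimately handed to rados.Rados.connect(), whose timeout parameter is documented in the Ceph links [1][2][3] above.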
  This freezes up nova's queue, and enough sequential failures
  eventually show up as a "too many missed heartbeats" rabbitmq error,
  which interrupts and restarts the cycle over again. As suggested by
  Sean, it might be best to put a configurable timeout on ceph calls
  during this process to ensure nova doesn't lock up/flap, and so that
  ceph backend network issues are reported for debugging.

  Steps to reproduce
  ==================
  1. Introduce a silent failure of the ceph client: one-way packet loss
     via mismatched LACP MTU across switches, bad triangular routing,
     flapping links, etc.
  2. Observe the symptom of nova hanging long enough to miss 60 seconds
     of rabbitmq heartbeats, with debug output showing a hang on
     update_available_resource at
     /var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/resource_tracker.py:704

  Expected result
  ===============
  nova alerting of a ceph connection timeout

  Actual result
  =============
  nova hangs for 60 seconds while being in the "up" state, flapping for
  a couple of seconds every 60 seconds as it hits the rabbitmq error and
  reconnects, but it is in a non-functional state and ignores all
  instructions on the message bus.

  Environment
  ===========
  nova==18.1.0 rocky

  Logs & Configs
  ==============
  No direct logs other than rabbitmq's complaints of timeouts as a
  symptom.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1834048/+subscriptions
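The fix passes the configured timeout straight to rados.Rados.connect(), per the Ceph docs linked above. As a generic, self-contained illustration of the failure mode and the remedy, the sketch below bounds an arbitrary blocking connect call with a deadline so the caller can report DOWN instead of hanging; the helper name and the worker-thread approach are illustrative only, not nova's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def connect_with_timeout(connect_fn, timeout=5.0):
    """Run a potentially hanging connect call, bounded by a timeout.

    connect_fn is executed in a worker thread; the caller waits at most
    `timeout` seconds for a result, then raises instead of blocking
    indefinitely (the analogue of nova's rbd_connect_timeout).
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(connect_fn)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise RuntimeError(
            "connect attempt exceeded %.1fs timeout" % timeout)
    finally:
        # Do not block waiting for a hung worker; note the worker thread
        # itself may linger until the blocking call eventually returns.
        pool.shutdown(wait=False)
```

This keeps the service loop responsive: a hung backend surfaces as a prompt, loggable error rather than a 60-second messagebus stall.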

