Public bug reported:

nova.servicegroup.drivers.db.DbDriver._report_state() is called every service.report_interval seconds from a timer in order to periodically report the service state. It calls self.conductor_api.service_update().
If this ends up calling nova.conductor.rpcapi.ConductorAPI.service_update(), it will do an RPC call() to nova-conductor. If anything happens to the RPC server (failover, switchover, etc.), by default the RPC code will wait 60 seconds for a response, blocking the timer-based calling of _report_state() in the meantime. This is long enough for the status in the database to become stale enough that other services consider this service to be "down".

Arguably, since we're going to call service_update() again in service.report_interval seconds, there's no reason to wait the full 60 seconds. Instead, it would make sense to set the RPC timeout for the service_update() call to something slightly less than service.report_interval seconds.

I've also submitted a related bug report (https://bugs.launchpad.net/bugs/1368917) to improve the handling of RPC connection loss in general, but I expect that will take a while to deal with, whereas this particular case can be handled much more easily.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: compute

https://bugs.launchpad.net/bugs/1368989

Title:
  service_update() should not set an RPC timeout longer than service.report_interval

Status in OpenStack Compute (Nova):
  New
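As a rough illustration of the proposal in the bug description, the sketch below computes an RPC timeout slightly below service.report_interval and compares the resulting worst-case gap between two successful state reports. It is not nova code: the helper names, the one-second margin, and the 60-second "down" threshold are assumptions made for the example, and it further assumes that the timer retries immediately after a timed-out call and that the retry succeeds at once.

```python
# Illustrative sketch only; these helpers are hypothetical and not part of nova.

DEFAULT_RPC_TIMEOUT = 60  # seconds; the default RPC wait described in the report


def capped_rpc_timeout(report_interval, default_timeout=DEFAULT_RPC_TIMEOUT,
                       margin=1):
    """Return a timeout slightly below report_interval.

    The result never exceeds default_timeout and never drops below 1 second.
    The margin of 1 second is an arbitrary choice for this example.
    """
    return max(1, min(default_timeout, report_interval - margin))


def worst_case_gap(report_interval, rpc_timeout):
    """Gap between two successful reports when one call in between times out.

    Assumed timeline: a report succeeds at t=0, the next attempt starts at
    t=report_interval and blocks for rpc_timeout before failing, and the
    retry that follows immediately succeeds.
    """
    return report_interval + rpc_timeout


def considered_down(gap, service_down_time=60):
    """True if the last report is older than the (assumed) down threshold."""
    return gap > service_down_time
```

With a report interval of 10 seconds, the default 60-second wait gives a gap of 70 seconds, which exceeds the assumed threshold, while capped_rpc_timeout(10) returns 9 and keeps the gap at 19 seconds. How the capped value would be handed to the RPC layer is deliberately left out here.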