All,

This was the core issue - setting consecutive_build_service_disable_threshold = 0 in nova.conf (on both controllers and compute nodes) solved it. The auto-disable was being triggered by neutron dropping vif-plugging requests (and/or responses) because CPU usage on the neutron endpoints was pegged at 100% for too long. We increased our rpc_response_timeout value and the issue appears to be resolved for the time being. We could probably safely remove the consecutive_build_service_disable_threshold override at this point, but we would rather have intermittent build failures than compute nodes falling over in the future.
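For anyone else chasing this, the relevant settings ended up looking roughly like the below (a sketch - the [compute] section placement is per the Pike docs as I recall, and the 300-second timeout is just an example value; tune for your environment):

    # nova.conf (controllers and compute nodes)
    [compute]
    # 0 turns off the Pike auto-disable-after-consecutive-build-failures behavior
    consecutive_build_service_disable_threshold = 0

    # neutron.conf (and nova.conf) - oslo.messaging option, default is 60
    [DEFAULT]
    # raised so vif-plugging RPCs survive the neutron endpoints being pegged
    rpc_response_timeout = 300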

Slightly related: we are noticing that the neutron endpoints are using noticeably more CPU time than they did in the past with a similar workload (we run linuxbridge with vxlan). We believe this is tied to our application of KPTI for Meltdown mitigation across the various hosts in our cluster (the timeline matches). Has anyone else experienced similar impacts, or can anyone suggest anything to try to lessen the impact?

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net

On 2018-01-31 04:47 PM, Chris Apsey wrote:
That looks promising.  I'll report back to confirm the solution.

Thanks!

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net

On 2018-01-31 04:40 PM, Matt Riedemann wrote:
On 1/31/2018 3:16 PM, Chris Apsey wrote:
All,

Running in to a strange issue I haven't seen before.

Randomly, the nova-compute services on compute nodes are disabling themselves (as if someone ran 'openstack compute service set --disable hostX nova-compute'). When this happens, the node continues to report itself as 'up' - the service is just disabled. As a result, if enough of these occur, we get scheduling errors due to lack of available resources (which makes sense). Re-enabling them works just fine and they continue on as if nothing happened. I looked through the logs and I can find the API calls where we re-enable the services (PUT /v2.1/os-services/enable), but I do not see any API calls where the services are getting disabled initially.
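(For reference, the commands we use to spot and re-enable the affected nodes - standard openstackclient, hostX is a placeholder:

    # State stays 'up' while Status flips to 'disabled'; --long also shows the Disabled Reason
    openstack compute service list --service nova-compute --long

    # bring an affected node back
    openstack compute service set --enable hostX nova-compute
)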

Is anyone aware of any cases where compute nodes will automatically disable their nova-compute service on their own, or has anyone seen this before and might know a root cause?  We have plenty of spare vcpus and RAM on each node - like less than 25% utilization (both in absolute terms and in terms of applied ratios).

We're seeing follow-on errors regarding RabbitMQ messages getting lost and vif-plug failures, but we think those are a symptom, not a cause.

Currently running pike on Xenial.

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net


This is actually a feature added in Pike:

https://review.openstack.org/#/c/463597/

This came up in discussion with operators at the Forum in Boston.

The vif-plug failures are likely the reason those computes are getting disabled.

There is a config option, "consecutive_build_service_disable_threshold", which you can set to 0 to disable the auto-disable behavior, as some have experienced issues with it:

https://bugs.launchpad.net/nova/+bug/1742102
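A minimal nova.conf sketch of that (section and default per the Pike docs as I recall - I believe the default is 10; double-check for your release):

    [compute]
    # after this many consecutive build failures the compute service disables itself;
    # 0 turns the auto-disable behavior off entirely
    consecutive_build_service_disable_threshold = 0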
