Re: [Openstack-operators] [nova] nova-compute automatically disabling itself?

2018-06-07 Thread Matt Riedemann

On 2/6/2018 6:44 PM, Matt Riedemann wrote:

On 2/6/2018 2:14 PM, Chris Apsey wrote:
but we would rather have intermittent build failures than 
compute nodes falling over in the future.


Note that once a compute has a successful build, the consecutive build 
failures counter is reset. So if your limit is the default (10) and you 
have 10 failures in a row, the compute service is auto-disabled. But if 
you have say 5 failures and then a pass, it's reset to 0 failures.


Obviously if you're doing a pack-first scheduling strategy rather than 
spreading instances across the deployment, a burst of failures could 
easily disable a compute, especially if that host is overloaded like you 
saw. I'm not sure if rescheduling is helping you or not - that would be 
useful information since we consider the need to reschedule off a failed 
compute host as a bad thing. At the Forum in Boston when this idea came 
up, it was specifically for the case that operators in the room didn't 
want a bad compute to become a "black hole" in their deployment causing 
lots of reschedules until they get that one fixed.


Just an update on this. There is a change merged in Rocky [1] which is 
also going through backports to Queens and Pike. If you've already 
disabled the "consecutive_build_service_disable_threshold" config option 
then it's a no-op. If you haven't, 
"consecutive_build_service_disable_threshold" is now used to count build 
failures but no longer auto-disables the compute service when the 
configured threshold is met (10 by default). The build failure count is 
then used by a new weigher (enabled by default) to sort hosts with build 
failures to the back of the list of candidate hosts for new builds. Once 
there is a successful build on a given host, the failure count is reset. 
The idea here is that hosts which are failing are given lower priority 
during scheduling.


[1] https://review.openstack.org/#/c/572195/
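For illustration, the weigher behavior described above can be sketched roughly like this (illustrative code only, not nova's actual implementation; the function and parameter names are made up):

```python
def sort_hosts_by_build_failures(hosts, failure_counts):
    """Order candidate hosts for scheduling so that hosts with
    recent build failures drop to the back of the list.

    hosts: list of host names.
    failure_counts: dict of host name -> consecutive build
    failure count (hosts with no failures may be absent).
    """
    # Python's sort is stable, so hosts with equal failure
    # counts keep their original relative order.
    return sorted(hosts, key=lambda h: failure_counts.get(h, 0))
```

In nova itself this is done by a scheduler weigher whose weight can be tuned in configuration, but the effect is the same: failing hosts remain eligible, they are just deprioritized until they have a successful build again.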

--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [nova] nova-compute automatically disabling itself?

2018-02-06 Thread Matt Riedemann

On 2/6/2018 2:14 PM, Chris Apsey wrote:
but we would rather have intermittent build failures than compute 
nodes falling over in the future.


Note that once a compute has a successful build, the consecutive build 
failures counter is reset. So if your limit is the default (10) and you 
have 10 failures in a row, the compute service is auto-disabled. But if 
you have say 5 failures and then a pass, it's reset to 0 failures.
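A minimal model of that counter logic, as I read the paragraph above (illustrative only, not nova's code; the class and attribute names are invented):

```python
class BuildFailureTracker:
    """Tracks consecutive build failures for one compute service."""

    def __init__(self, threshold=10):
        self.threshold = threshold  # default matches nova's 10
        self.failures = 0
        self.disabled = False

    def record_build(self, success):
        if success:
            # Any successful build resets the counter to zero.
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                # Threshold reached: the service is auto-disabled.
                self.disabled = True
```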


Obviously if you're doing a pack-first scheduling strategy rather than 
spreading instances across the deployment, a burst of failures could 
easily disable a compute, especially if that host is overloaded like you 
saw. I'm not sure if rescheduling is helping you or not - that would be 
useful information since we consider the need to reschedule off a failed 
compute host as a bad thing. At the Forum in Boston when this idea came 
up, it was specifically for the case that operators in the room didn't 
want a bad compute to become a "black hole" in their deployment causing 
lots of reschedules until they get that one fixed.


--

Thanks,

Matt



Re: [Openstack-operators] [nova] nova-compute automatically disabling itself?

2018-02-06 Thread Chris Apsey

All,

This was the core issue - setting 
consecutive_build_service_disable_threshold = 0 in nova.conf (on 
controllers and compute nodes) solved this.  It was being triggered by 
neutron dropping requests (and/or responses) for vif-plugging due to CPU 
usage on the neutron endpoints being pegged at 100% for too long.  We 
increased our rpc_response_timeout value and this issue appears to be 
resolved for the time being.  We can probably safely remove the 
consecutive_build_service_disable_threshold option at this point, but we 
would rather have intermittent build failures than compute nodes 
falling over in the future.
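For reference, the settings mentioned above look roughly like this in nova.conf (the timeout value is illustrative, pick one that fits your deployment; the threshold option lives in the [compute] section):

```ini
[DEFAULT]
# oslo.messaging RPC timeout in seconds; 120 is just an example.
rpc_response_timeout = 120

[compute]
# 0 disables the auto-disable behavior entirely.
consecutive_build_service_disable_threshold = 0
```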


Slightly related, we are noticing that neutron endpoints are using 
noticeably more CPU time recently than in the past w/ a similar workload 
(we run linuxbridge w/ vxlan).  We believe this is tied to our 
application of KPTI for meltdown mitigation across the various hosts in 
our cluster (the timeline matches).  Has anyone else experienced similar 
impacts or can suggest anything to try to lessen the impact?


---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net

On 2018-01-31 04:47 PM, Chris Apsey wrote:

That looks promising.  I'll report back to confirm the solution.

Thanks!

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net

On 2018-01-31 04:40 PM, Matt Riedemann wrote:

On 1/31/2018 3:16 PM, Chris Apsey wrote:

All,

Running into a strange issue I haven't seen before.

Randomly, the nova-compute services on compute nodes are disabling 
themselves (as if someone ran openstack compute service set --disable 
hostX nova-compute).  When this happens, the node continues to report 
itself as 'up' - the service is just disabled.  As a result, if 
enough of these occur, we get scheduling errors due to lack of 
available resources (which makes sense).  Re-enabling them works just 
fine and they continue on as if nothing happened.  I looked through 
the logs and I can find the API calls where we re-enable the services 
(PUT /v2.1/os-services/enable), but I do not see any API calls where 
the services are getting disabled initially.


Is anyone aware of any cases where compute nodes will automatically 
disable their nova-compute service on their own, or has anyone seen 
this before and might know a root cause?  We have plenty of spare 
vcpus and RAM on each node - like less than 25% utilization (both in 
absolute terms and in terms of applied ratios).


We're seeing follow-on errors regarding rmq messages getting lost and 
vif-plug failures, but we think those are a symptom, not a cause.


Currently running pike on Xenial.

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net




This is actually a feature added in Pike:

https://review.openstack.org/#/c/463597/

This came up in discussion with operators at the Forum in Boston.

The vif-plug failures are likely the reason those computes are getting 
disabled.


There is a config option "consecutive_build_service_disable_threshold"
which you can set to disable the auto-disable behavior as some have
experienced issues with it:

https://bugs.launchpad.net/nova/+bug/1742102






Re: [Openstack-operators] [nova] nova-compute automatically disabling itself?

2018-01-31 Thread Chris Apsey

That looks promising.  I'll report back to confirm the solution.

Thanks!

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net

On 2018-01-31 04:40 PM, Matt Riedemann wrote:

On 1/31/2018 3:16 PM, Chris Apsey wrote:

All,

Running into a strange issue I haven't seen before.

Randomly, the nova-compute services on compute nodes are disabling 
themselves (as if someone ran openstack compute service set --disable 
hostX nova-compute).  When this happens, the node continues to report 
itself as 'up' - the service is just disabled.  As a result, if enough 
of these occur, we get scheduling errors due to lack of available 
resources (which makes sense).  Re-enabling them works just fine and 
they continue on as if nothing happened.  I looked through the logs 
and I can find the API calls where we re-enable the services (PUT 
/v2.1/os-services/enable), but I do not see any API calls where the 
services are getting disabled initially.


Is anyone aware of any cases where compute nodes will automatically 
disable their nova-compute service on their own, or has anyone seen 
this before and might know a root cause?  We have plenty of spare 
vcpus and RAM on each node - like less than 25% utilization (both in 
absolute terms and in terms of applied ratios).


We're seeing follow-on errors regarding rmq messages getting lost and 
vif-plug failures, but we think those are a symptom, not a cause.


Currently running pike on Xenial.

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net




This is actually a feature added in Pike:

https://review.openstack.org/#/c/463597/

This came up in discussion with operators at the Forum in Boston.

The vif-plug failures are likely the reason those computes are getting 
disabled.


There is a config option "consecutive_build_service_disable_threshold"
which you can set to disable the auto-disable behavior as some have
experienced issues with it:

https://bugs.launchpad.net/nova/+bug/1742102




Re: [Openstack-operators] [nova] nova-compute automatically disabling itself?

2018-01-31 Thread Eric Fried
There's [1], but I would have expected you to see error logs like [2] if
that's what you're hitting.

[1]
https://github.com/openstack/nova/blob/master/nova/conf/compute.py#L627-L645
[2]
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1714-L1716

efried

On 01/31/2018 03:16 PM, Chris Apsey wrote:
> All,
> 
> Running into a strange issue I haven't seen before.
> 
> Randomly, the nova-compute services on compute nodes are disabling
> themselves (as if someone ran openstack compute service set --disable
> hostX nova-compute).  When this happens, the node continues to report
> itself as 'up' - the service is just disabled.  As a result, if enough
> of these occur, we get scheduling errors due to lack of available
> resources (which makes sense).  Re-enabling them works just fine and
> they continue on as if nothing happened.  I looked through the logs and
> I can find the API calls where we re-enable the services (PUT
> /v2.1/os-services/enable), but I do not see any API calls where the
> services are getting disabled initially.
> 
> Is anyone aware of any cases where compute nodes will automatically
> disable their nova-compute service on their own, or has anyone seen this
> before and might know a root cause?  We have plenty of spare vcpus and
> RAM on each node - like less than 25% utilization (both in absolute
> terms and in terms of applied ratios).
> 
> We're seeing follow-on errors regarding rmq messages getting lost and
> vif-plug failures, but we think those are a symptom, not a cause.
> 
> Currently running pike on Xenial.
> 
> ---
> v/r
> 
> Chris Apsey
> bitskr...@bitskrieg.net
> https://www.bitskrieg.net
> 



Re: [Openstack-operators] [nova] nova-compute automatically disabling itself?

2018-01-31 Thread Matt Riedemann

On 1/31/2018 3:16 PM, Chris Apsey wrote:

All,

Running into a strange issue I haven't seen before.

Randomly, the nova-compute services on compute nodes are disabling 
themselves (as if someone ran openstack compute service set --disable 
hostX nova-compute).  When this happens, the node continues to report 
itself as 'up' - the service is just disabled.  As a result, if enough 
of these occur, we get scheduling errors due to lack of available 
resources (which makes sense).  Re-enabling them works just fine and 
they continue on as if nothing happened.  I looked through the logs and 
I can find the API calls where we re-enable the services (PUT 
/v2.1/os-services/enable), but I do not see any API calls where the 
services are getting disabled initially.


Is anyone aware of any cases where compute nodes will automatically 
disable their nova-compute service on their own, or has anyone seen this 
before and might know a root cause?  We have plenty of spare vcpus and 
RAM on each node - like less than 25% utilization (both in absolute 
terms and in terms of applied ratios).


We're seeing follow-on errors regarding rmq messages getting lost and 
vif-plug failures, but we think those are a symptom, not a cause.


Currently running pike on Xenial.

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators



This is actually a feature added in Pike:

https://review.openstack.org/#/c/463597/

This came up in discussion with operators at the Forum in Boston.

The vif-plug failures are likely the reason those computes are getting 
disabled.


There is a config option "consecutive_build_service_disable_threshold" 
which you can set to disable the auto-disable behavior as some have 
experienced issues with it:


https://bugs.launchpad.net/nova/+bug/1742102
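If a compute does get auto-disabled, it can be spotted and re-enabled from the CLI (hostX is a placeholder host name; admin credentials and a running cloud are required):

```shell
# Show nova-compute services with their enabled/disabled status.
openstack compute service list --service nova-compute

# Re-enable a service that was auto-disabled.
openstack compute service set --enable hostX nova-compute
```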

--

Thanks,

Matt
