On 08/02/18 15:04, Chris Friesen wrote:
On 08/02/2018 01:04 PM, melanie witt wrote:

The problem is an infamous one: your users try to boot instances and they get "No Valid Host" and an instance in ERROR state. They contact support, and now support is trying to determine why NoValidHost happened. In the past, they would turn on DEBUG log level on the nova-scheduler, try another request, and take a look at the scheduler logs.

At a previous Summit[1] there were some operators that said they just always ran nova-scheduler with debug logging enabled in order to deal with this issue, but that it was a pain [...]

I would go a bit further and say it's likely to be unacceptable on a large cluster. It's expensive to deal with all those logs and to manually comb through them to troubleshoot this type of issue, which can happen frequently with some setups. Secondarily, there are performance and security concerns with leaving debug enabled all the time.

As to "defining the problem", I think it's what Melanie said. It's about asking for X and the system saying, "sorry, can't give you X" with no further detail or even means of discovering it.

More generally, any time a service fails to deliver a resource that it is primarily designed to deliver, it seems to me that, at this stage, that should probably be taken a bit more seriously than just "check the log file, maybe there's something in there?" From the user's perspective, if nova fails to produce an instance, or cinder fails to produce a volume, or neutron fails to build a subnet, that's kind of a big deal, right?

In such cases, would it be possible to generate a detailed exception object that contains all the necessary info to ascertain why that specific failure occurred? Ideally the operator should be able to correlate those exceptions with the associated objects, e.g. the instance in ERROR state in this case, so that, given the failed instance ID, they can quickly remedy the user's problem without reading megabytes of log files. If there's a way to make this error handling generic across services to some extent, that seems like it would be great for operators.
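To make that concrete, here is a rough sketch in Python of what I mean by a correlated failure record. Every name in it (FaultRecord, FaultStore, the fields, the example numbers) is hypothetical and not part of any existing Nova/Cinder/Neutron API; the point is only the shape of the data and the lookup key.

    # Hypothetical sketch only -- these names do not exist in any OpenStack service.
    # The idea: when a request fails, persist a structured fault record keyed by
    # the resource it was supposed to produce, so an operator can look it up by
    # that ID instead of grepping debug logs.

    import uuid
    from dataclasses import dataclass, field
    from datetime import datetime, timezone


    @dataclass
    class FaultRecord:
        resource_id: str     # e.g. the instance UUID left in ERROR
        resource_type: str   # "instance", "volume", "subnet", ...
        service: str         # which service raised it, e.g. "nova-scheduler"
        code: str            # machine-readable cause, e.g. "NoValidHost"
        summary: str         # one-line human-readable explanation
        details: dict = field(default_factory=dict)  # per-filter/per-host breakdown, etc.
        created_at: datetime = field(
            default_factory=lambda: datetime.now(timezone.utc))
        fault_id: str = field(default_factory=lambda: str(uuid.uuid4()))


    class FaultStore:
        """Trivial in-memory stand-in for whatever backend (DB table,
        notification sink, ...) would actually hold these records."""

        def __init__(self):
            self._by_resource = {}

        def record(self, fault: FaultRecord) -> None:
            self._by_resource.setdefault(fault.resource_id, []).append(fault)

        def for_resource(self, resource_id: str) -> list:
            return self._by_resource.get(resource_id, [])


    if __name__ == "__main__":
        store = FaultStore()
        instance_id = "11111111-2222-3333-4444-555555555555"  # the failed instance's UUID
        store.record(FaultRecord(
            resource_id=instance_id,
            resource_type="instance",
            service="nova-scheduler",
            code="NoValidHost",
            summary="No host passed all scheduler filters",
            details={"hosts_considered": 120,
                     "rejected_by": {"RamFilter": 90, "ComputeFilter": 30}},
        ))
        # Support can now go straight from the failed instance ID to the cause.
        for fault in store.for_resource(instance_id):
            print(fault.code, "-", fault.summary, fault.details)

The key design point is the lookup key: the record is indexed by the ID of the resource the user asked for, so support never has to correlate timestamps across log files to find out why that particular request failed.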

Such a framework might eventually hook into internal ticketing systems and maintenance reporting, or provide a starting point for self-healing mechanisms, but initially the aim would just be to provide the operator with the bare minimum info necessary for more efficient break-fix.
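Just to illustrate the "hook into ticketing" direction rather than propose anything concrete: the same hypothetical store from the sketch above could notify operator-registered hooks whenever a fault is recorded. The webhook URL and everything else here is made up.

    # Illustration only: forwarding the hypothetical FaultRecord (from the
    # previous sketch) to operator-defined hooks, e.g. a ticketing webhook.

    import json
    import urllib.request


    def ticket_webhook(fault, url="https://tickets.example.internal/api/faults"):
        """Open a ticket for the fault via a (fictional) internal webhook."""
        payload = json.dumps({
            "resource_id": fault.resource_id,
            "code": fault.code,
            "summary": fault.summary,
        }).encode("utf-8")
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # fire-and-forget, good enough for the sketch


    class HookedFaultStore(FaultStore):
        """FaultStore (from the sketch above) that also notifies hooks on every new fault."""

        def __init__(self, hooks=()):
            super().__init__()
            self._hooks = list(hooks)

        def record(self, fault):
            super().record(fault)
            for hook in self._hooks:
                hook(fault)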

It could be a big investment, but it also doesn't seem like "optional" functionality from a large operator's perspective. "Enable debug and try again" is just not good enough IMHO.

--
Michael Glasgow

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
