On 08/02/18 15:04, Chris Friesen wrote:
On 08/02/2018 01:04 PM, melanie witt wrote:

The problem is an infamous one: your users try to boot instances and they get "No Valid Host" and an instance in ERROR state. They contact support, and now support is trying to determine why NoValidHost happened. In the past, they would turn on DEBUG log level on the nova-scheduler, try another request, and take a look at the scheduler logs.

At a previous Summit[1] there were some operators that said they just always ran nova-scheduler with debug logging enabled in order to deal with this issue, but that it was a pain [...]

I would go a bit further and say it's likely to be unacceptable on a large cluster. It's expensive to deal with all those logs and to manually comb through them to troubleshoot this type of issue, which can happen frequently with some setups. Secondarily, there are performance and security concerns with leaving debug enabled all the time.

As to "defining the problem", I think it's what Melanie said. It's about asking for X and the system saying, "sorry, can't give you X" with no further detail or even means of discovering it.

More generally, any time a service fails to deliver a resource that it is primarily designed to deliver, it seems to me that, at this stage, that should probably be taken a bit more seriously than just "check the log file, maybe there's something in there?" From the user's perspective, if nova fails to produce an instance, or cinder fails to produce a volume, or neutron fails to build a subnet, that's kind of a big deal, right?

In such cases, would it be possible to generate a detailed exception object that contains all the necessary info to ascertain why that specific failure occurred? Ideally the operator should be able to correlate those exceptions with the associated objects, e.g. the instance in ERROR state in this case, so that, given the failed instance ID, they can quickly remedy the user's problem without reading megabytes of log files. If there's a way to make this error handling generic across services to some extent, that seems like it would be great for operators.
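To make that concrete, here is a rough sketch in Python of what I mean by a correlated failure record. Every name in it (FaultRecord, FaultStore, the fields, the example numbers) is hypothetical and not part of any existing Nova/Cinder/Neutron API; the point is only the shape of the data and the lookup key.

    # Hypothetical sketch only -- these names do not exist in any OpenStack service.
    # The idea: when a request fails, persist a structured fault record keyed by
    # the resource it was supposed to produce, so an operator can look it up by
    # that ID instead of grepping debug logs.

    import uuid
    from dataclasses import dataclass, field
    from datetime import datetime, timezone


    @dataclass
    class FaultRecord:
        resource_id: str     # e.g. the instance UUID left in ERROR
        resource_type: str   # "instance", "volume", "subnet", ...
        service: str         # which service raised it, e.g. "nova-scheduler"
        code: str            # machine-readable cause, e.g. "NoValidHost"
        summary: str         # one-line human-readable explanation
        details: dict = field(default_factory=dict)  # per-filter/per-host breakdown, etc.
        created_at: datetime = field(
            default_factory=lambda: datetime.now(timezone.utc))
        fault_id: str = field(default_factory=lambda: str(uuid.uuid4()))


    class FaultStore:
        """Trivial in-memory stand-in for whatever backend (DB table,
        notification sink, ...) would actually hold these records."""

        def __init__(self):
            self._by_resource = {}

        def record(self, fault: FaultRecord) -> None:
            self._by_resource.setdefault(fault.resource_id, []).append(fault)

        def for_resource(self, resource_id: str) -> list:
            return self._by_resource.get(resource_id, [])


    if __name__ == "__main__":
        store = FaultStore()
        instance_id = "11111111-2222-3333-4444-555555555555"  # the failed instance's UUID
        store.record(FaultRecord(
            resource_id=instance_id,
            resource_type="instance",
            service="nova-scheduler",
            code="NoValidHost",
            summary="No host passed all scheduler filters",
            details={"hosts_considered": 120,
                     "rejected_by": {"RamFilter": 90, "ComputeFilter": 30}},
        ))
        # Support can now go straight from the failed instance ID to the cause.
        for fault in store.for_resource(instance_id):
            print(fault.code, "-", fault.summary, fault.details)

The key design point is the lookup key: the record is indexed by the ID of the resource the user asked for, so support never has to correlate timestamps across log files to find out why that particular request failed.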

Such a framework might eventually hook into internal ticketing systems and maintenance reporting, or provide a starting point for self-healing mechanisms, but initially the aim would just be to provide the operator with the bare minimum info necessary for more efficient break-fix.
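Just to illustrate the "hook into ticketing" direction rather than propose anything concrete: the same hypothetical store from the sketch above could notify operator-registered hooks whenever a fault is recorded. The webhook URL and everything else here is made up.

    # Illustration only: forwarding the hypothetical FaultRecord (from the
    # previous sketch) to operator-defined hooks, e.g. a ticketing webhook.

    import json
    import urllib.request


    def ticket_webhook(fault, url="https://tickets.example.internal/api/faults"):
        """Open a ticket for the fault via a (fictional) internal webhook."""
        payload = json.dumps({
            "resource_id": fault.resource_id,
            "code": fault.code,
            "summary": fault.summary,
        }).encode("utf-8")
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # fire-and-forget, good enough for the sketch


    class HookedFaultStore(FaultStore):
        """FaultStore (from the sketch above) that also notifies hooks on every new fault."""

        def __init__(self, hooks=()):
            super().__init__()
            self._hooks = list(hooks)

        def record(self, fault):
            super().record(fault)
            for hook in self._hooks:
                hook(fault)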

It could be a big investment, but it also doesn't seem like "optional" functionality from a large operator's perspective. "Enable debug and try again" is just not good enough IMHO.

--
Michael Glasgow

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
