Re: [Engine-devel] Autorecovery feature plan for review

Yaniv Kaul Wed, 15 Feb 2012 23:35:22 -0800

On 02/16/2012 09:29 AM, Moran Goldboim wrote:

On 02/16/2012 12:38 AM, Itamar Heim wrote:
On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,

A short summary from the call today, please correct me if I forgot or
misunderstood something.

Ayal argued that the failed host/storagedomain should be reactivated
by a periodically executed job, he would prefer if the engine could
[try to] correct the problem right on discovery.
Livnat's point was that this is hard to implement and it is OK if we
move it to Nonoperational state and periodically check it again.

There was a little arguing if we call the current behavior a bug or a
missing behavior, I believe this is not quite important.

I did not fully understand the last few sentences from Livnat, did we
manage to agree on a change in the plan?
A couple of points that we agreed upon:
1. no need for new mechanism, just initiate this from the
   monitoring context.
   Preferably, if not difficult, evaluate the monitoring data, if
   host should remain in non-op then don't bother running initVdsOnUp
2. configuration of when to call initVdsOnUp is orthogonal to
   auto-init behaviour and if introduced should be on by default and
   user should be able to configure this either on or off for the host
   in general (no lower granularity) and can only be configured via
   the API.
   When disabled initVdsOnUp would be called only when admin activates
   the host/storage and any error would keep it inactive (I still
   don't understand why this is at all needed but whatever).
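The two points above can be sketched roughly as follows. This is only an illustration in Python, not engine code: `Host`, `HostStatus`, the `auto_recovery` flag, `on_monitoring_cycle` and the stubbed `init_vds_on_up` are all hypothetical names, and the real initVdsOnUp does far more than return a boolean.

```python
from dataclasses import dataclass
from enum import Enum


class HostStatus(Enum):
    UP = "up"
    NON_OPERATIONAL = "non_operational"


@dataclass
class Host:
    name: str
    status: HostStatus
    # Hypothetical per-host switch; point 2 says it should default to on.
    auto_recovery: bool = True


def init_vds_on_up(host):
    """Stand-in for the engine's initVdsOnUp; this stub always succeeds."""
    return True


def on_monitoring_cycle(host, monitoring_ok, init=init_vds_on_up):
    """Called from the monitoring context for a host on each cycle."""
    if host.status is not HostStatus.NON_OPERATIONAL:
        return host.status
    if not host.auto_recovery:
        # Disabled: only an explicit admin activation would retry.
        return host.status
    if not monitoring_ok:
        # Monitoring data says the host should stay non-op, so do not
        # bother running the init sequence at all.
        return host.status
    if init(host):
        host.status = HostStatus.UP
    return host.status
```

No separate recovery job appears here: the check piggybacks on the monitoring cycle, and a failing init simply leaves the host non-operational until the next cycle.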
Also a note from Moran on the call was to check if we can unify the
non-operational and Error statuses of the host.
It was mentioned on the call that the reason for having ERROR state is
for recovery (time out of the error state) but since we are about to
recover from non-operational status as well there is no reason to have
two different statuses.
they are not exactly the same.
or should i say, error is supposed to be when reason isn't related to
host being non-operational.
what is error state?
a host will go into error state if it fails to run 3 (configurable)
VMs, that succeeded running on other host on retry.
i.e., something is wrong with that host, failing to launch VMs.
as it happens, it already "auto recovers" for this mode after a
certain period of time.
why? because the host will fail to run virtual machines, and will be
the least loaded, so it will be the first target selected to run
them, which will continue to fail.
so there is a negative scoring mechanism on number of errors, till
host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error
mode if the VM failed to launch on all hosts per number of retries. i
think this wasn't needed and user just got an error in audit log)
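A rough sketch of the negative scoring described above, for illustration only. `HostErrorTracker` and its fields are hypothetical names and not the engine's actual classes. The threshold of 3 is the configurable default mentioned here, and the 1800-second timeout is an assumption based on the "30 min (?)" recollection elsewhere in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostErrorTracker:
    max_failures: int = 3             # "3 (configurable)" per the thread
    recovery_seconds: float = 1800.0  # assumed value, see lead-in
    failures: int = 0
    error_since: Optional[float] = None

    def record_failed_run(self, succeeded_elsewhere, now):
        """Count a failed VM launch against this host."""
        if not succeeded_elsewhere:
            # The VM failed on other hosts too, so this host is not blamed.
            return
        self.failures += 1
        if self.failures >= self.max_failures and self.error_since is None:
            self.error_since = now

    def in_error(self, now):
        """True while the host is taken out; resets once the timeout passes."""
        if self.error_since is None:
            return False
        if now - self.error_since >= self.recovery_seconds:
            self.failures = 0
            self.error_since = None
            return False
        return True
```

A scheduler would skip any host for which `in_error(now)` is true, which is what keeps the least-loaded but broken host from being picked over and over.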
i can see two reasons a host will go into error state:
1. monitoring didn't detect an issue yet, and host would
   have/will/should go into non-operational mode.
   if host will go into non-operational mode, and will auto recover with
   the above flow, i guess it is fine.
2. cause for failure isn't something we monitor for (upgraded to a
   bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007
iirc), so could be it mostly compensated for the first reason which
is now better monitored.
i wonder how often error state is seen due to a reason which isn't
monitored already.
moran - do you have examples of when you see error state of hosts?
usually it happened when there was a problematic/misconfigured
vdsm/libvirt which failed to run vms (nothing we can recover from) -
i haven't faced the issue of "host is too loaded"; that status has some
other symptoms, however the behaviour in that state is very much the
same - waiting for 30 min (?) and then moving it to activated.
Moran.

'host is too loaded' is the only transient state where a
temporary 'error' state makes sense, but at the same time, it can also
fit the 'non operational' state description.
From my experience, the problem with KVM/libvirt/VDSM mis-configured is
never temporary (= magically solved by itself, without concrete user
intervention). IMHO, it should move the host to an error state that
it would not automatically recover from.
Regardless, consolidating the names of the states ('inactive, detached,
non operational, maintenance, error, unknown' ...) would be nice too.
Probably can't be done for all, of course.

Y.

_______________________________________________
Engine-devel mailing list
Engine-devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/engine-devel


