Re: [DISCUSS] KVM HA with IPMI Fencing

Ronald van Zantvoort Mon, 19 Oct 2015 02:25:08 -0700

On 19/10/15 11:18, Ronald van Zantvoort wrote:

On 16/10/15 00:21, ilya wrote:

I noticed several attempts to address the issue with KVM HA in Jira and
Dev ML. As we all know, there are many ways to solve the same problem;
on our side, we've given it some thought as well, and it's on our to-do
list.


Specifically a mail thread "KVM HA is broken, let's fix it"
JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8643

We propose the following solution that in our understanding should cover
all use cases and provide a fencing mechanism.

NOTE: The proposed IPMI fencing is just a script. If you are using HP
hardware with iLO, it could be an iLO executable with specific
parameters. In theory, this can be *any* action script, not just IPMI.

Please take a few minutes to read this through, to avoid duplicate
efforts...


Proposed FS below:
----------------

https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+HA+with+IPMI+Fencing



Hi Ilja, thanks for the design; I've put a comment in 8943, here it is
verbatim as my 5c in the discussion:


Well, that completely clobbered up the readability LOL

Let's try again, but see
https://issues.apache.org/jira/browse/CLOUDSTACK-8943 for the better
markup ;)

[~ilya.mailing.li...@gmail.com]: Thanks for the design document. I can't
comment in Confluence, so here goes:

* When to fence; [~sweller]: Of course you're right that it should be
highly unlikely that your storage completely disappears from the
cluster. Be that as it may, as you yourself note, first of all if you're
using NFS without HA that likelihood increases manyfold. Secondly,
defining it as an unlikely disastrous event seems no reason not to take
it into account; making it a catastrophic event by 'fencing' all
affected hypervisors will not serve anyone as it would be unexpected and
unwelcome.
* The entire concept of fencing exists to absolutely ensure state.
Specifically in this regard the state of the block devices and their
data. [~shadowsor]: For that same reason it's not reasonable to 'just
assume' the VMs are gone. There's a ton of failure domains that could
cause an agent to disconnect from the manager but still have the same
VMs running, and there's nothing stopping CloudStack from starting the
same VM twice on the same block devices, with disastrous results. That's
why you *need* to *know* the VMs are *very definitely* not running
anymore, which is exactly what fencing is supposed to do.
* For this, IPMI fencing is a nice and very often used option;
absolutely ensuring a hypervisor has died, and ergo the running VMs. It
will however not fix the case of the mass rebooting hypervisors (but
will rather quite likely make it even more of an adventure if not
addressed properly).

Now, with all that in mind, I'd like to make the following comments
regarding [~ilya.mailing.li...@gmail.com]'s design.

* First off, the IPMI implementation: there is IMHO no need to define
IPMI (Executable,Start,Stop,Reboot,Blink,Test). IPMI is a protocol, all
these are standard commands. For example, using the venerable `ipmitool`
gives you `chassis power (on,status,off,reset)`, `chassis identify` etc.
which will work on any IPMI device; only authentication details (User,
Pass, Proto) differ. There's bound to be some library that does it
without having to resort to (possibly numerous) different (versions of)
external binaries.
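
To illustrate the point about standard commands, here is a minimal sketch
that maps the proposed actions onto `ipmitool` invocations. The helper
names (`ipmi_command`, `run_ipmi`), the action keys, and the example
address and credentials are hypothetical; only the `ipmitool` flags
(`-I`, `-H`, `-U`, `-P`) and the `chassis` subcommands are the tool's
own. It is not how CloudStack does or should do it.

```python
import subprocess

# Standard ipmitool chassis subcommands; the dictionary keys are
# illustrative names, not part of any CloudStack API.
ACTIONS = {
    "status": ["chassis", "power", "status"],
    "start": ["chassis", "power", "on"],
    "stop": ["chassis", "power", "off"],
    "reboot": ["chassis", "power", "reset"],
    "blink": ["chassis", "identify"],
}


def ipmi_command(host, user, password, action, interface="lanplus"):
    """Build the ipmitool argument list for one chassis action."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    # Only the authentication details differ per device.
    base = ["ipmitool", "-I", interface, "-H", host, "-U", user, "-P", password]
    return base + ACTIONS[action]


def run_ipmi(host, user, password, action, timeout=30):
    """Run ipmitool and report whether it exited with status 0."""
    cmd = ipmi_command(host, user, password, action)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0
```

For example, `ipmi_command("10.0.0.5", "admin", "secret", "reboot")`
builds the argument list for a chassis reset without running anything.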

* Secondly you're assuming that hypervisors can access the IPMI
interfaces of their cluster/pod peers; although I'm not against this
assumption per se, I'm also not convinced we're servicing everybody by
forcing that assumption to be true; some kind of IPMI agent/proxy comes
to mind, or even relegating the task back to the manager or some
SystemVM. Also bear in mind that you need access to those IPMI
interfaces to ensure cluster functionality, so a failure domain should
be in maintenance state if any of the fence devices can't be reached.

* Thirdly your proposed testing algorithm needs more discussion; after
all, it directly hits the fundamental, principal reasons for *why* to
fence a host, and that's a lot more than just 'these disks still get
writes'. In fact, by the time you're checking this, you're probably
already assuming something's very wrong with the hypervisor, so why not
just fence it then? The decision to fence should lie with the first
notification that something is (very) wrong with the hypervisor, and
only limited attempts should be made to get it out. Say it can't reach
its storage and that gets you your HA actions; why check for the disks
first? Try to get the storage back up like 3 times, or for 90 sec or so,
then fence the fucker and HA the VMs immediately after confirmation. In
fact, that's exactly what it's doing now, with the side note that
confirmation can only reasonably follow after the hypervisor is done
rebooting.
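
The "try a few times, then fence, then HA" flow above could be sketched
roughly as follows. Everything here is an assumption for illustration:
the function name, the three callables, and the way "3 times" and
"90 sec" are combined (whichever limit is hit first). It is not the
agent's actual logic.

```python
import time


def handle_storage_failure(check_storage, fence, start_ha,
                           max_attempts=3, deadline_s=90.0, pause_s=5.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Retry storage a limited number of times, then fence and start HA.

    check_storage, fence and start_ha are hypothetical callables supplied
    by the caller; fence() must return True only once the host is
    confirmed down.
    """
    started = clock()
    for attempt in range(1, max_attempts + 1):
        if check_storage():
            return "recovered"
        if clock() - started >= deadline_s:
            break  # time budget exhausted, stop retrying
        if attempt < max_attempts:
            sleep(pause_s)
    # Limited attempts failed: fence first, HA only after confirmation.
    if not fence():
        return "fence-unconfirmed"
    start_ha()
    return "fenced"
```

Passing `clock` and `sleep` as parameters is only there so the flow can
be exercised without waiting; HA is never started unless `fence()`
confirms the host is down.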

* Finally as mentioned you're not solving the 'o look, my storage is
gone, let's fence' * (N) problem; in the case of a failing NFS:
** Every host will start IPMI resetting every other hypervisor; by
then there's a good chance every hypervisor in all connected clusters
is rebooting, leaving a state where there are no hypervisors in the
cluster to fence others; that in turn should lead to the cluster falling
into maintenance state, which will lead to even more bells & whistles
going off.
** They'll come back, find the NFS still gone, and continue resetting
each other like there's no tomorrow.
** Support staff already panicking over the NFS/network outage now
has to deal with entire clusters of hypervisors in perpetual reboot as
well as clusters which are completely unreachable because there's no one
left to check state; this all while the outage might simply require the
revert of some inadvertent network ACL snafu.

Although I well understand [~sweller]'s concerns regarding agent
complexity in this regard, quorum is the standard way of solving that
problem. On the other hand, once the Agents start talking to each other
and the Manager over some standard messaging API/bus this problem might
well be solved for you; getting, say, Gossip or Paxos or any other
clustering/quorum protocol shouldn't be that hard considering the amount
of Java software already doing just that out there.

** Another idea would be to introduce some other kind of storage
monitoring, for example by a SystemVM or something.
** If you insist on the 'clusters fence themselves' paradigm, you
could maybe also introduce a constraint that a node is only allowed to
fence others if it is itself healthy; ergo if it doesn't have all
storages available, it doesn't get to fence others whose storage isn't
available.
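
A rough sketch of the last two ideas, under loudly stated assumptions:
the function names and the `reachable` data layout are made up for
illustration, and combining the "only healthy nodes may fence" rule with
a simple majority vote is one possible reading of the quorum suggestion,
not a Gossip or Paxos implementation.

```python
def is_healthy(reachable_storage, required_storage):
    """A node is healthy only if it can reach every required storage."""
    return set(required_storage) <= set(reachable_storage)


def may_fence(node, target, reachable, required):
    """Only a healthy node may vote to fence a different node."""
    return node != target and is_healthy(reachable[node], required)


def fence_decision(target, reachable, required):
    """Fence only if a strict majority of the cluster is healthy and agrees.

    reachable maps each node name to the set of storages it can reach
    (hypothetical layout). A healthy target is never fenced.
    """
    if is_healthy(reachable[target], required):
        return False
    votes = sum(1 for node in reachable
                if may_fence(node, target, reachable, required))
    return votes > len(reachable) // 2
```

With a failing NFS that every host has lost, no node is healthy, so
nobody gets to vote and nothing is fenced; with a single host that lost
its storage, the healthy majority can still fence it.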
