Re: [DISCUSS] KVM HA with IPMI Fencing

Ronald van Zantvoort Mon, 19 Oct 2015 02:19:06 -0700

On 16/10/15 00:21, ilya wrote:

I noticed several attempts to address the issue with KVM HA in Jira and
Dev ML. As we all know, there are many ways to solve the same problem;
on our side, we've given it some thought as well - and it's on our to-do
list.


Specifically a mail thread "KVM HA is broken, let's fix it"
JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8643

We propose the following solution that in our understanding should cover
all use cases and provide a fencing mechanism.

NOTE: The proposed IPMI fencing is just a script. If you are using HP
hardware with iLO, it could be an iLO executable with specific
parameters. In theory, this can be *any* action script, not just IPMI.

Please take a few minutes to read this through, to avoid duplicate efforts...


Proposed FS below:
----------------

https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+HA+with+IPMI+Fencing

Hi Ilja, thanks for the design; I've put a comment in 8943, here it is
verbatim as my 5c in the discussion:

ilya musayev: Thanks for the design document. I can't comment in
Confluence, so here goes:

When to fence; Simon Weller: Of course you're right that it should be
highly unlikely that your storage completely disappears from the
cluster. Be that as it may, as you yourself note, first of all if you're
using NFS without HA that likelihood increases manyfold. Secondly,
defining it as an unlikely disastrous event seems no reason not to take
it into account; making it a catastrophic event by 'fencing' all
affected hypervisors will not serve anyone as it would be unexpected and
unwelcome.

The entire concept of fencing exists to absolutely ensure state,
specifically in this regard the state of the block devices and their
data. Marcus Sorensen: For that same reason it's not reasonable to 'just
assume' the VMs are gone. There's a ton of failure domains that could
cause an agent to disconnect from the manager but still have the same
VMs running, and there's nothing stopping CloudStack from starting the
same VM twice on the same block devices, with disastrous results. That's
why you need to know the VMs are very definitely not running anymore,
which is exactly what fencing is supposed to do.

For this, IPMI fencing is a nice and very often used option; it
absolutely ensures a hypervisor has died, and ergo the running VMs. It
will however not fix the case of the mass rebooting hypervisors (but
rather quite likely make it even more of an adventure if not addressed
properly).

Now, with all that in mind, I'd like to make the following comments
regarding ilya musayev's design.

First off, the IPMI implementation: There is IMHO no need to define IPMI
(Executable,Start,Stop,Reboot,Blink,Test). IPMI is a protocol, all these
are standard commands. For example, using the venerable `ipmitool` gives
you `chassis power (on,status,poweroff,identify,reset)` etc. which will
work on any IPMI device; only authentication details (User, Pass, Proto)
differ.
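
To make the "standard commands" point concrete, here is a minimal Python
sketch that builds the `ipmitool` command line for a few chassis
actions. The function names (`ipmitool_argv`, `run_ipmi`), the action
names in the table, and the default `lanplus` interface are assumptions
made for illustration and are not part of the proposed design; only
host and credentials vary per device.

```python
import subprocess

# Hypothetical generic action names mapped to stock ipmitool chassis
# subcommands. Nothing here is vendor specific.
CHASSIS_ACTIONS = {
    "status": ["chassis", "power", "status"],
    "on": ["chassis", "power", "on"],
    "off": ["chassis", "power", "off"],
    "reset": ["chassis", "power", "reset"],
    "identify": ["chassis", "identify"],
}


def ipmitool_argv(host, user, password, action, interface="lanplus"):
    """Build the argument list for one ipmitool chassis action."""
    if action not in CHASSIS_ACTIONS:
        raise ValueError("unsupported action: %s" % action)
    base = ["ipmitool", "-I", interface, "-H", host, "-U", user, "-P", password]
    return base + CHASSIS_ACTIONS[action]


def run_ipmi(host, user, password, action, timeout=30):
    """Run the action and return ipmitool's stdout.

    Assumes ipmitool is installed and the BMC is reachable; raises
    CalledProcessError on a non-zero exit and TimeoutExpired on timeout.
    """
    result = subprocess.run(
        ipmitool_argv(host, user, password, action),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout
```

Note that passing the password with `-P` exposes it in the process list;
`ipmitool` can also read it from the environment with `-E`.
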
There's bound to be some library that does it without having to resort
to (possibly numerous) different (versions of) external binaries.

Secondly, you're assuming that hypervisors can access the IPMIs of their
cluster/pod peers; although I'm not against this assumption per se, I'm
also not convinced we're servicing everybody by forcing that assumption
to be true; some kind of IPMI agent/proxy comes to mind, or even
relegating the task back to the manager or some SystemVM. Also bear in
mind that you need access to those IPMIs to ensure cluster
functionality, so a failure domain should be in maintenance state if any
of the fence devices can't be reached.

Thirdly, your proposed testing algorithm needs more discussion; after
all, it directly hits the fundamental reasons for why to fence a host,
and that's a lot more than just 'these disks still get writes'. In fact,
by the time you're checking this, you're probably already assuming
something's very wrong with the hypervisor, so why not just fence it
then? The decision to fence should lie with the first notification that
something is (very) wrong with the hypervisor, and only limited attempts
should be made to get it out. Say it can't reach its storage and that
gets you your HA actions; why check for the disks first? Try to get the
storage back up like 3 times, or for 90 sec or so, then fence the fucker
and HA the VMs immediately after confirmation.
In fact, that's exactly what it's doing now, with the side note that
confirmation can only reasonably follow after the hypervisor is done
rebooting.

Finally, as mentioned, you're not solving the 'o look, my storage is
gone, let's fence' * (N) problem; in the case of a failing NFS:

- Every host will start IPMI resetting every other hypervisor; by then
  there's a good chance every hypervisor in all connected clusters is
  rebooting, leaving a state where there are no hypervisors in the
  cluster to fence others; that in turn should lead to the cluster
  falling into maintenance state, which will lead to even more bells &
  whistles going off.
- They'll come back, find the NFS still gone, and continue resetting
  each other like there's no tomorrow.
- Support staff already panicking over the NFS/network outage now has to
  deal with entire clusters of hypervisors in perpetual reboot as well
  as clusters which are completely unreachable because there's no one
  left to check state; this all while the outage might simply require
  the revert of some inadvertent network ACL snafu.

Although I well understand Simon Weller's concerns regarding agent
complexity in this regard, quorum is the standard way of solving that
problem. On the other hand, once the Agents start talking to each other
and the Manager over some standard messaging API/bus this problem might
well be solved for you; getting, say, Gossip or Paxos or any other
clustering/quorum protocol shouldn't be that hard considering the amount
of Java software already doing just that out there.

Another idea would be to introduce some other kind of storage
monitoring, for example by a SystemVM or something.

If you insist on the 'clusters fence themselves' paradigm, you could
maybe also introduce a constraint that a node is only allowed to fence
others if it is itself healthy; ergo if it doesn't have all storages
available, it doesn't get to fence others whose storage isn't available.
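
The 'only a healthy node may fence' constraint, combined with the
limited-retry rule mentioned earlier, could be sketched roughly as
follows in Python. Every name here (`NodeView`, `may_fence`,
`MAX_RECOVERY_ATTEMPTS`) is hypothetical and none of this is existing
CloudStack code; it only illustrates the decision rule, not how the
health data would be collected.

```python
from dataclasses import dataclass, field

# Assumption: mirrors the "like 3 times" suggestion in the comment.
MAX_RECOVERY_ATTEMPTS = 3


@dataclass
class NodeView:
    """What one hypervisor is believed to see (illustrative only)."""
    name: str
    storage_ok: dict = field(default_factory=dict)  # pool name -> reachable?
    failed_recovery_attempts: int = 0

    def healthy(self):
        # Healthy means: at least one pool known, and all of them reachable.
        return bool(self.storage_ok) and all(self.storage_ok.values())


def may_fence(fencer, target):
    """Return True only if `fencer` is allowed to fence `target`."""
    if fencer.name == target.name:
        return False  # a node never fences itself through this path
    if not fencer.healthy():
        return False  # a node missing storage does not get to fence others
    if target.healthy():
        return False  # nothing is wrong with the target
    # Fence only after the limited recovery attempts have been used up.
    return target.failed_recovery_attempts >= MAX_RECOVERY_ATTEMPTS


good = NodeView("kvm1", {"nfs1": True})
bad = NodeView("kvm2", {"nfs1": False}, failed_recovery_attempts=3)
print(may_fence(good, bad))  # True: healthy node, target out of retries
print(may_fence(bad, good))  # False: unhealthy node may not fence
```

With this rule, a cluster-wide NFS outage leaves every node unhealthy,
so nobody fences anybody and the reset storm described above does not
start.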
