On 19/10/15 11:18, Ronald van Zantvoort wrote:
On 16/10/15 00:21, ilya wrote:
I noticed several attempts to address the issue with KVM HA in Jira and
Dev ML. As we all know, there are many ways to solve the same problem.
On our side, we've given it some thought as well, and it's on our to-do
list.
Specifically a mail thread "KVM HA is broken, let's fix it"
JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
JIRA: https://issues.apache.org/jira/browse/CLOUDSTACK-8643
We propose the following solution that in our understanding should cover
all use cases and provide a fencing mechanism.
NOTE: The proposed IPMI fencing is just a script. If you are using HP
hardware with ILO, it could be an ILO executable with specific
parameters. In theory, this can be *any* action script, not just IPMI.
Please take a few minutes to read this through, to avoid duplicate
efforts...
Proposed FS below:
----------------
https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+HA+with+IPMI+Fencing
Hi Ilja, thanks for the design; I've put a comment in 8943, here it is
verbatim as my 5c in the discussion:
Well, that completely clobbered up the readability LOL
Let's try again, but see
https://issues.apache.org/jira/browse/CLOUDSTACK-8943 for the better
markup ;)
[~ilya.mailing.li...@gmail.com]: Thanks for the design document. I can't
comment in Confluence, so here goes:
* When to fence; [~sweller]: Of course you're right that it should be
highly unlikely that your storage completely disappears from the
cluster. Be that as it may, as you yourself note, first of all if you're
using NFS without HA that likelihood increases manyfold. Secondly,
defining it as an unlikely disastrous event seems no reason not to take
it into account; making it a catastrophic event by 'fencing' all
affected hypervisors will not serve anyone as it would be unexpected and
unwelcome.
* The entire concept of fencing exists to absolutely ensure state.
Specifically in this regard the state of the block devices and their
data. [~shadowsor]: For that same reason it's not reasonable to 'just
assume' VM's gone. There's a ton of failure domains that could cause an
agent to disconnect from the manager but still have the same VM's
running, and there's nothing stopping CloudStack from starting the same
VM twice on the same block devices, with disastrous results. That's why
you *need* to *know* the VM's are *very definitely* not running anymore,
which is exactly what fencing is supposed to do.
* For this, IPMI fencing is a nice and very often used option;
absolutely ensuring a hypervisor has died, and ergo the running VM's. It
will however not fix the case of the mass rebooting hypervisors (but
rather quite likely make it even more of an adventure if not addressed
properly).
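
To make the "know, don't assume" point concrete, here is a minimal
sketch of fence-then-confirm. It assumes the `ipmitool` CLI over the
`lanplus` interface; the function names, the injectable `run` callable
and any host/credential values are hypothetical illustrations, not
anything from CloudStack or the design document.

```python
import subprocess
from typing import Callable, List


def ipmi_argv(host: str, user: str, password: str, *command: str) -> List[str]:
    """Build an ipmitool command line; only the auth details differ per host."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, *command]


def run_ipmitool(argv: List[str]) -> str:
    """Run ipmitool and return its stdout (raises if the call fails)."""
    result = subprocess.run(argv, capture_output=True, text=True,
                            check=True, timeout=30)
    return result.stdout


def fence_and_confirm(host: str, user: str, password: str,
                      run: Callable[[List[str]], str] = run_ipmitool) -> bool:
    """Power the host off, then report success only if the BMC says it is off."""
    run(ipmi_argv(host, user, password, "chassis", "power", "off"))
    status = run(ipmi_argv(host, user, password, "chassis", "power", "status"))
    # ipmitool reports "Chassis Power is on" or "Chassis Power is off".
    return "chassis power is off" in status.lower()
```

HA would only restart the VM's elsewhere when `fence_and_confirm`
returns True. A real implementation would probably poll the status a
few times rather than check once, and would keep the password off the
command line.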
Now, with all that in mind, I'd like to make the following comments
regarding [~ilya.mailing.li...@gmail.com] 's design.
* First off, the IPMI implementation: there is IMHO no need to define
IPMI (Executable,Start,Stop,Reboot,Blink,Test). IPMI is a protocol, and
all these are standard commands. For example, using the venerable
`ipmitool` gives you `chassis power (on,off,status,cycle,reset)`,
`chassis identify` etc., which will work on any IPMI device; only
authentication details (User, Pass, Proto) differ. There's bound to be
some library that does it without
having to resort to (possibly numerous) different (versions of) external
binaries.
* Secondly you're assuming that hypervisors can access the IPMI's of
their cluster/pod peers; although I'm not against this assumption per
se, I'm also not convinced we're servicing everybody by forcing that
assumption to be true; some kind of IPMI agent/proxy comes to mind, or
even relegating the task back to the manager or some SystemVM. Also bear
in mind that you need access to those IPMI's to ensure cluster
functionality, so a failure domain should be in maintenance state if any
of the fence devices can't be reached.
* Thirdly your proposed testing algorithm needs more discussion; after
all, it directly hits the fundamental principal reasons for *why* to
fence a host, and that's a lot more than just 'these disks still get
writes'. In fact, by the time you're checking this, you're probably
already assuming something's very wrong with the hypervisor, so why not
just fence it then? The decision to fence should lie with the first
notification that something is (very) wrong with the hypervisor, and only
limited attempts should be made to get it out. Say it can't reach its
storage and that gets you your HA actions; why check for the disks
first? Try to get the storage back up like 3 times, or for 90 sec or so,
then fence the fucker and HA the VM's immediately after confirmation. In
fact, that's exactly what it's doing now, with the side note that
confirmation can only reasonably follow after the hypervisor is done
rebooting.
* Finally as mentioned you're not solving the 'o look, my storage is
gone, let's fence' * (N) problem; in the case of a failing NFS:
** Every host will start IPMI resetting every other hypervisor; by
then there's a good chance every hypervisor in all connected clusters
is rebooting, leaving a state where there are no hypervisors in the
cluster to fence others; that in turn should lead to the cluster falling
into maintenance state, which will lead to even more bells & whistles
going off.
** They'll come back, find the NFS still gone, and continue resetting
each other like there's no tomorrow
** Support staff already panicking over the NFS/network outage now
has to deal with entire clusters of hypervisors in perpetual reboot as
well as clusters which are completely unreachable because there's no one
left to check state; this all while the outage might simply require the
revert of some inadvertent network ACL snafu
Although I well understand [~sweller]'s concerns regarding agent
complexity in this regard, quorum is the standard way of solving that
problem. On the other hand, once the Agents start talking to each other
and the Manager over some standard messaging API/bus this problem might
well be solved for you; getting, say, Gossip or Paxos or any other
clustering/quorum protocol working shouldn't be that hard considering
the amount of Java software out there already doing just that.
** Another idea would be to introduce some other kind of storage
monitoring, for example by a SystemVM or something.
** If you insist on the 'clusters fence themselves' paradigm, you
could maybe also introduce a constraint that a node is only allowed to
fence others if it is itself healthy; ergo if it doesn't have all storages
available, it doesn't get to fence others whose storage isn't available.
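
The decision rules discussed in the list above can be sketched roughly
as follows. This is only an illustration under assumptions: the function
names, the simple majority vote standing in for a real quorum protocol,
and the retry count of 3 (taken from the "like 3 times" suggestion) are
all hypothetical; delays between retries and the 90 sec deadline are
left out.

```python
from typing import Callable, Iterable


def storage_recovered(check_storage: Callable[[], bool], attempts: int = 3) -> bool:
    """Give storage a limited number of chances to come back before escalating."""
    # any() stops at the first successful check.
    return any(check_storage() for _ in range(attempts))


def may_fence(own_storage_ok: bool, peer_votes: Iterable[bool]) -> bool:
    """A node may fence a peer only if it is healthy itself and a strict
    majority of voters (itself included) agree the peer is down."""
    if not own_storage_ok:
        return False  # an unhealthy node never gets to fence others
    votes = [True, *peer_votes]  # this node's own vote plus its peers'
    return sum(votes) * 2 > len(votes)
```

With this constraint, a cluster-wide NFS outage leaves every node with
`own_storage_ok == False`, so nobody fences anybody and the mass reboot
loop never starts; a single host losing its storage is still outvoted
and fenced by its healthy peers.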