On Fri, 2017-12-01 at 16:21 -0600, Ken Gaillot wrote:
> On Thu, 2017-11-30 at 11:58 +0000, Adam Spiers wrote:
> > Ken Gaillot <kgail...@redhat.com> wrote:
> > > On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:
> > > > Hi all,
> > > >
> > > > A colleague has been valiantly trying to help me belatedly learn about the intricacies of startup fencing, but I'm still not fully understanding some of the finer points of the behaviour.
> > > >
> > > > The documentation on the "startup-fencing" option[0] says
> > > >
> > > >     Advanced Use Only: Should the cluster shoot unseen nodes? Not using the default is very unsafe!
> > > >
> > > > and that it defaults to TRUE, but doesn't elaborate any further:
> > > >
> > > >     https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-cluster-options.html
> > > >
> > > > Let's imagine the following scenario:
> > > >
> > > > - We have a 5-node cluster, with all nodes running cleanly.
> > > >
> > > > - The whole cluster is shut down cleanly.
> > > >
> > > > - The whole cluster is then started up again. (Side question: what happens if the last node to shut down is not the first to start up? How will the cluster ensure it has the most recent version of the CIB? Without that, how would it know whether the last man standing was shut down cleanly or not?)
> > >
> > > Of course, the cluster can't know what CIB version nodes it doesn't see have, so if a set of nodes is started with an older version, it will go with that.
> >
> > Right, that's what I expected.
> >
> > > However, a node can't do much without quorum, so it would be difficult to get into a situation where CIB changes were made with quorum before shutdown, but none of those nodes are present at the next start-up with quorum.
> > >
> > > In any case, when a new node joins a cluster, the nodes do compare CIB versions. If the new node has a newer CIB, the cluster will use it. If other changes have been made since then, the newest CIB wins, so one or the other's changes will be lost.
> >
> > Ahh, that's interesting. Based on reading
> >
> >     https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/ch03.html#_cib_properties
> >
> > whichever node has the highest (admin_epoch, epoch, num_updates) tuple will win, so normally in this scenario it would be the epoch which decides it, i.e. whichever node had the most changes since the last time the conflicting nodes shared the same config -- right?
>
> Correct ... assuming the code for that is working properly, which I haven't confirmed :)
>
> > And if that would choose the wrong node, admin_epoch can be set manually to override that decision?
>
> Correct again, with the same caveat.
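As an aside, overriding that decision just means bumping admin_epoch on the node whose CIB you want to win. Something like this should do it, though I haven't tested this exact invocation, and the value 42 is arbitrary -- it only needs to be higher than everyone else's:

    # show the current (admin_epoch, epoch, num_updates) tuple,
    # which lives in the attributes of the top-level <cib> tag
    cibadmin --query | head -n 1

    # force this node's CIB to win the next version comparison
    cibadmin --modify --xml-text '<cib admin_epoch="42"/>'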
> > > Whether missing nodes were shut down cleanly or not relates to your next question ...
> > >
> > > > - 4 of the nodes boot up fine and rejoin the cluster within the dc-deadtime interval, forming a quorum, but the 5th doesn't.
> > > >
> > > > IIUC, with startup-fencing enabled, this will result in that 5th node automatically being fenced. If I'm right, is that really *always* necessary?
> > >
> > > It's always safe. :-) As you mentioned, if the missing node was the last one alive in the previous run, the cluster can't know whether it shut down cleanly or not. Even if the node was known to shut down cleanly in the last run, the cluster still can't know whether the node was started since then and is now merely unreachable. So, fencing is necessary to ensure it's not accessing resources.
> >
> > I get that, but I was questioning the "necessary to ensure it's not accessing resources" part of this statement. My point is that sometimes this might be overkill, because sometimes we might be able to discern through other methods that there are no resources we need to worry about potentially conflicting with what we want to run. That's why I gave the stateless clones example.
> >
> > > The same scenario is why a single node can't have quorum at start-up in a cluster with "two_node" set. Both nodes have to see each other at least once before they can assume it's safe to do anything.
> >
> > Yep.
> >
> > > > Let's suppose further that the cluster configuration is such that no stateful resources which could potentially conflict with other nodes will ever get launched on that 5th node. For example it might only host stateless clones, or resources with requires="nothing" set, or it might not even host any resources at all due to some temporary constraints which have been applied.
> > > >
> > > > In those cases, what is to be gained from fencing? The only thing I can think of is that using (say) IPMI to power-cycle the node *might* fix whatever issue was preventing it from joining the cluster. Are there any other reasons for fencing in this case? It wouldn't help avoid any data corruption, at least.
> > >
> > > Just because constraints are telling the node it can't run a resource doesn't mean the node isn't malfunctioning and running it anyway. If the node can't tell us it's OK, we have to assume it's not.
> >
> > Sure, but even if it *is* running it, if it's not conflicting with anything or doing any harm, is it really always better to fence regardless?
>
> There's a resource meta-attribute, "requires", that says what a resource needs in order to start. If the resource can't do any harm when it runs awry, you can set requires="quorum" (or even "nothing").
>
> So, that's sort of a way to let the cluster know that, but it doesn't currently do what you're suggesting, since start-up fencing is purely about the node and not about the resources. I suppose if the cluster had no resources requiring fencing (or, to push it further, no such resources that will be probed on that node), we could disable start-up fencing, but that's not done currently.
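For reference, setting that meta-attribute is a one-liner. Something like this should work, though it's untested and "my-stateless-clone" is a stand-in for a real resource name:

    # declare that the resource only needs quorum, not fencing, to start
    crm_resource --resource my-stateless-clone --meta \
        --set-parameter requires --parameter-value quorum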
> > Disclaimer: to a certain extent I'm playing devil's advocate here to stimulate a closer (re-)examination of the axiom we've grown so used to over the years that if we don't know what a node is doing, we should fence it. I'm not necessarily arguing that fencing is wrong here, but I think it's healthy to occasionally go back to first principles and re-question why we are doing things a certain way, to make sure that the original assumptions still hold true. I'm familiar with the pain that our customers experience when nodes are fenced for less than very compelling reasons, so I think it's worth looking for opportunities to reduce fencing to when it's really needed.
>
> The fundamental purpose of a high-availability cluster is to keep the desired service functioning, above all other priorities (including, unfortunately, making sysadmins' lives easier).
>
> If a service requires an HA cluster, it's a safe bet it will have problems in a split-brain situation (otherwise, why bother with the overhead?). Even something as simple as an IP address will render a service useless if it's brought up on two machines on the same network.
>
> Fencing is really the only hammer we have in that situation. At that point, we have zero information about what the node is doing. If it's powered off (or cut off from disk/network), we know it's not doing anything.
>
> Fencing may not always help the situation, but it's all we've got.
>
> We give the user a good bit of control over fencing policies: corosync tuning, stonith-enabled, startup-fencing, no-quorum-policy, requires, on-fail, and the choice of fence agent. It can be a challenge for a new user to know all the knobs to turn, but HA is kind of unavoidably complex.
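For anyone collecting those knobs in one place: the cluster-wide ones are ordinary cluster properties, so they are all set the same way. The values below are only examples, not recommendations:

    # cluster-wide fencing policy lives in the crm_config section
    crm_attribute --type crm_config --name stonith-enabled --update true
    crm_attribute --type crm_config --name no-quorum-policy --update stop
    crm_attribute --type crm_config --name startup-fencing --update true

"requires" and "on-fail" are per-resource and per-operation settings rather than cluster properties, and corosync tuning happens in corosync.conf.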
> > > > Now let's imagine the same scenario, except rather than a clean full cluster shutdown, all nodes were affected by a power cut, but also this time the whole cluster is configured to *only* run stateless clones, so there is no risk of conflict between two nodes accidentally running the same resource. On startup, the 4 nodes in the quorum have no way of knowing that the 5th node was also affected by the power cut, so in theory from their perspective it could still be running a stateless clone. Again, is there anything to be gained from fencing the 5th node once it exceeds the dc-deadtime threshold for joining, other than the chance that a reboot might fix whatever was preventing it from joining, and get the cluster back to full strength?
> > >
> > > If a cluster runs only services that have no potential to conflict, then you don't need a cluster. :-)
> >
> > True :-) Again as devil's advocate, this scenario could be extended to include remote nodes which *do* run resources which could conflict (such as compute nodes), and in that case running stateless clones (only) on the core cluster could be justified simply on the grounds that we need Pacemaker for the remotes anyway, so we might as well use it for the stateless clones rather than introducing keepalived as yet another component ... but this is starting to get hypothetical, so it's perhaps not worth spending energy discussing that tangent ;-)
> >
> > > Unique clones require communication even if they're stateless (think IPaddr2).
> >
> > Well yeah, IPaddr2 is arguably stateful since there are changing ARP tables involved :-)
> >
> > > I'm pretty sure even some anonymous stateless clones require communication to avoid issues.
> >
> > Fair enough.
> >
> > > > Also, when exactly does the dc-deadtime timer start ticking? Is it reset to zero after a node is fenced, so that potentially that node could go into a reboot loop if dc-deadtime is set too low?
> > >
> > > A node's crmd starts the timer at start-up and whenever a new election starts, and the timer is stopped when the DC makes the node a join offer.
> >
> > That's surprising -- I would have expected it to be the other way around, i.e. that the timer doesn't run on the node which is joining, but on one of the nodes already in the cluster (e.g. the DC). Otherwise how can fencing of that node be triggered if the node takes too long to join?
> >
> > > I don't think it ever reboots though, I think it just starts a new election.
> >
> > Maybe we're talking at cross-purposes? By "reboot loop", I was asking if the node which fails to join could end up getting endlessly fenced: join timeout -> fenced -> reboots -> join timeout -> fenced -> ... etc.
>
> startup-fencing and dc-deadtime don't have anything to do with each other.

I guess that's not quite accurate -- the first DC election at start-up won't complete until dc-deadtime expires, so the DC won't be able to check for start-up fencing until after then. But a fence loop is not possible, because once fencing is done, the node has a known status. startup-fencing doesn't require that a node be functional, only that its status is known.

> There are two separate joins: the node joins at the corosync layer, and then its crmd joins to the other crmds at the pacemaker layer. One of the crmds is then elected DC.
>
> startup-fencing kicks in if the cluster has quorum and the DC sees no node status in the CIB for a node. Node status will be recorded in the CIB once the node joins at the corosync layer. So, all nodes have until quorum is reached, a DC is elected, and the DC invokes the policy engine, to join at the cluster layer, or else they will be shot. (And at that time, their status is known and recorded as dead.) This only happens when the cluster first starts, and is the only way to handle split-brain at start-up.
>
> dc-deadtime is for the DC election. When a node joins an existing cluster, it expects the existing DC to make it a membership offer (at the pacemaker layer). If that doesn't happen within dc-deadtime, the node asks for a new DC election. The idea is that the DC may be having trouble that hasn't been detected yet. Similarly, whenever a new election is called, all of the nodes expect a join offer from whichever node is elected DC, and again they call a new election if that doesn't happen within dc-deadtime.
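So if slow-booting nodes keep triggering re-elections, the usual fix is to raise dc-deadtime rather than to disable anything. It's an ordinary cluster property (the default is 20s; the 60s below is just an example, not a recommendation):

    crm_attribute --type crm_config --name dc-deadtime --update 60s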
> > > So, you can get into an election loop, but I think network conditions would have to be pretty severe.
> >
> > Yeah, that sounds like a different type of loop to the one I was imagining.
> >
> > > > The same questions apply if this troublesome node was actually a remote node running pacemaker_remoted, rather than the 5th node in the cluster.
> > >
> > > Remote nodes don't join at the crmd level as cluster nodes do, so they don't "start up" in the same sense
> >
> > Sure, they establish a TCP connection via pacemaker_remoted when the remote resource is starting.
> >
> > > and start-up fencing doesn't apply to them. Instead, the cluster initiates the connection when called for (I don't remember for sure whether it fences the remote node if the connection fails, but that would make sense).
> >
> > Hrm, that's not what Yan said, and it's not what my L3 colleagues are reporting either ;-) I've been told (but not yet verified myself) that if a remote resource's start operation times out (e.g. due to the remote node not being up yet), the remote node will get fenced.
> >
> > But I see Yan has already replied with additional details on this.
>
> Yep, I remembered wrong :)
>
> > > > I have an uncomfortable feeling that I'm missing something obvious, probably due to the documentation's warning that "Not using the default [for startup-fencing] is very unsafe!" Or is it only unsafe when the node which exceeded dc-deadtime on startup could potentially be running a stateful resource which the cluster now wants to restart elsewhere? If that's the case, would it be possible to optionally limit startup fencing to when it's really needed?
> > > >
> > > > Thanks for any light you can shed!
> > >
> > > There's no automatic mechanism to know that, but if you know before a particular start that certain nodes are really down and are staying that way, you can disable start-up fencing in the configuration on disk, before starting the other nodes, then re-enable it once everything is back to normal.
> >
> > Ahah! That's the kind of tip I was looking for, thanks :-) So you mean by editing the CIB XML directly? Would disabling startup-fencing manually this way require a concurrent update of the epoch?
>
> You can edit the CIB on disk when the cluster is down, but you have to go about it carefully.
>
> Rather than edit it directly, you can use CIB_file=/var/lib/pacemaker/cib/cib.xml when invoking cibadmin (or your favorite higher-level tool). cibadmin will update the hash that pacemaker uses to verify the CIB's integrity. Alternatively, you can remove *everything* in /var/lib/pacemaker/cib except cib.xml, then edit it directly.
>
> Updating the admin epoch is a good idea if you want to be sure your edited CIB wins, although starting that node first is also good enough.
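To make that concrete, the whole offline procedure would look roughly like this. This is an untested sketch -- the path is the usual default, and any tool that goes through the CIB library should honor CIB_file the same way cibadmin does:

    # with the cluster fully stopped, disable start-up fencing on disk
    CIB_file=/var/lib/pacemaker/cib/cib.xml \
        crm_attribute --type crm_config --name startup-fencing --update false

    # optionally bump the admin epoch so this copy is sure to win
    CIB_file=/var/lib/pacemaker/cib/cib.xml \
        cibadmin --modify --xml-text '<cib admin_epoch="10"/>'

    # once everything is back to normal, re-enable it on the live cluster
    crm_attribute --type crm_config --name startup-fencing --update true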
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org