On Thu, 2018-06-28 at 09:13 +0200, Ulrich Windl wrote:
> Ken Gaillot <kgail...@redhat.com> wrote on 27.06.2018 at 16:32 in
> message <1530109926.6452.3.ca...@redhat.com>:
> > On Wed, 2018-06-27 at 09:18 -0500, Ken Gaillot wrote:
> > > On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <kgail...@redhat.com> wrote on 26.06.2018 at 18:22
> > > > in message <1530030128.5202.5.ca...@redhat.com>:
> > > > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
> > > > > > 26.06.2018 09:14, Ulrich Windl wrote:
> > > > > > > Hi!
> > > > > > >
> > > > > > > We just observed a strange effect we cannot explain in
> > > > > > > SLES 11 SP4 (pacemaker 1.1.12-f47ea56): we run about a
> > > > > > > dozen Xen PVMs on a three-node cluster (plus some
> > > > > > > infrastructure and monitoring stuff). It has all worked
> > > > > > > well so far, and there was no significant change recently.
> > > > > > > However, when a colleague stopped one VM for maintenance
> > > > > > > via a cluster command, the cluster did not notice when the
> > > > > > > PVM was actually running again (it had been started
> > > > > > > without using the cluster -- a bad idea, I know).
> > > > > >
> > > > > > To be on the safe side in such cases you'd probably want to
> > > > > > enable an additional monitor for the "Stopped" role; the
> > > > > > default one covers only the "Started" role. It's the same as
> > > > > > for multistate resources, where you need several monitor
> > > > > > ops, for the "Started/Slave" and "Master" roles. But this
> > > > > > will increase the load.
> > > > > > Also, I believe the cluster should reprobe a resource on all
> > > > > > nodes once you change target-role back to "Started".
> > > > >
> > > > > Which raises the question: how did you stop the VM initially?
> > > >
> > > > I thought "(...) stopped one VM for maintenance via cluster
> > > > command" was obvious. It was something like "crm resource stop
> > > > ...".
> > > >
> > > > > If you stopped it by setting target-role to Stopped, the
> > > > > cluster likely still thinks it's stopped, and you need to set
> > > > > it to Started again. If instead you set maintenance mode or
> > > > > unmanaged the resource and then stopped the VM manually, then
> > > > > most likely it's still in that mode and needs to be taken out
> > > > > of it.
> > > >
> > > > The point was that when the command to start the resource was
> > > > given, the cluster completely ignored the fact that it was
> > > > already running and began starting the VM on a second node
> > > > (which may be disastrous). But that's leading away from the
> > > > main question...
> > >
> > > Ah, this is expected behavior when you start a resource manually
> > > and there are no monitors with target-role=Stopped. If the node
> > > where you manually started the VM isn't the same node the cluster
> > > happens to choose, then you can get multiple active instances.
> > >
> > > By default, the cluster assumes that where a probe found a
> > > resource to be not running, that resource will stay not running
> > > unless started by the cluster. (It will re-probe if the node goes
> > > away and comes back.)
> > >
> > > If you wish to guard against resources being started outside
> > > cluster control, configure a recurring monitor with
> > > target-role=Stopped, and the cluster will run it on all nodes
> > > where it thinks the resource is not supposed to be running. Of
> > > course, since it has to poll at intervals, it can take up to that
> > > much time to detect a manually started instance.
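For illustration, such a configuration might look like the following in
crm shell syntax. The resource name "vm1", the agent parameters, and the
intervals and timeouts are made-up placeholders; the Stopped-role monitor
gets a different interval so the two operations stay distinct, and the
operation attribute itself is written as role="Stopped":

    # Hypothetical Xen guest "vm1"; agent, params and timings are placeholders.
    primitive vm1 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm1" \
        op monitor interval="10min" timeout="60s" \
        op monitor interval="11min" timeout="60s" role="Stopped"

With something like this in place, the extra monitor runs on every node
where the resource is supposed to be stopped, at the cost of the
additional load mentioned above.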
> > Alternatively, if you don't want the overhead of a recurring
> > monitor but want to be able to address known manual starts
> > yourself, you can force a full reprobe of the resource with
> > "crm_resource -r <resource-id> --refresh".
> >
> > If you do it before starting the resource via crm, the cluster will
> > stop the manually started instance, and then you can start it via
> > crm; if you do it after starting the resource via crm, there will
> > most likely still be two active instances, and the cluster will
> > stop both and start one again.
> >
> > A way around that would be to unmanage the resource, start the
> > resource via crm (which won't actually start anything, since it is
> > unmanaged, but will tell the cluster it's supposed to be started),
> > force a reprobe, then manage the resource again -- that should
> > prevent multiple active instances. However, if the cluster prefers
> > a different node, it may still stop the resource and start it in
> > its preferred location. (Stickiness could get around that.)
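As a rough sketch of those two sequences with the crm shell, again using
the hypothetical resource name "vm1" (the comments restate the behavior
described above, assuming the VM was started by hand on some node):

    # Sequence 1: reprobe first, then start via the cluster; the cluster
    # stops the manually started instance before starting one itself.
    crm_resource -r vm1 --refresh
    crm resource start vm1

    # Sequence 2: record the intent to start before the reprobe, so the
    # manually started instance can stay where it already runs.
    crm resource unmanage vm1       # cluster stops managing vm1
    crm resource start vm1          # only sets target-role=Started (unmanaged)
    crm_resource -r vm1 --refresh   # reprobe: cluster learns where vm1 runs
    crm resource manage vm1         # manage again; stickiness helps keep it put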
> Hi!
>
> Thanks again for that. There's one question that comes to my mind:
> what is the purpose of the cluster recheck interval? I thought it was
> exactly that: finding resources that are not in the state they should
> be in.

I can see how the name would suggest that, but nope, it's just a
recalculation of whether any actions need to be taken. It comes in
handy for two purposes.

First, rules and some options (such as failure-timeout) that depend on
time values are not guaranteed to be evaluated more often than the
recheck interval. So if you have a rule setting maintenance mode
between 10:30pm and 11pm, and the recheck interval is 15 minutes,
maintenance mode could be entered anytime between 10:30 and 10:45, and
exited anytime between 11:00 and 11:15. (For illustration, a sketch of
such a rule is appended at the end of this message.)

Second, it's a fail-safe for bugs that cause the cluster to miss an
event. If the cluster fails to react to a recorded event, it should
notice when the next recheck interval expires.

> Regards,
> Ulrich
>
> > > > > > > Examining the logs, it seems that the recheck timer popped
> > > > > > > periodically, but no monitor action was run for the VM
> > > > > > > (the action is configured to run every 10 minutes).
>
> > > Recurring monitors are only recorded in the log if their return
> > > value changed. If there are 10 successful monitors in a row and
> > > then a failure, only the first success and the failure are logged.
>
> > > > > > > Actually the only monitor operations found were:
> > > > > > > May 23 08:04:13
> > > > > > > Jun 13 08:13:03
> > > > > > > Jun 25 09:29:04
> > > > > > > Then a manual "reprobe" was done, and several monitor
> > > > > > > operations were run. After that I again see no more
> > > > > > > monitor actions in syslog.
> > > > > > >
> > > > > > > What could be the reasons for this? Too many operations
> > > > > > > defined?
> > > > > > >
> > > > > > > The other message I don't understand is something like
> > > > > > > "<other-resource>: Rolling back scores from
> > > > > > > <vm-resource>".
> > > > > > >
> > > > > > > Could it be a new bug introduced in pacemaker, or could it
> > > > > > > be some configuration problem (the status is completely
> > > > > > > clean, however)?
> > > > > > >
> > > > > > > According to the package changelog, there has been no
> > > > > > > change since Nov 2016...
> > > > > > >
> > > > > > > Regards,
> > > > > > > Ulrich

--
Ken Gaillot <kgail...@redhat.com>
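To make the point about time-based rules a bit more concrete, here is a
minimal sketch, in CIB XML, of a rule that enables maintenance mode for
roughly the 10:30pm-11pm window described above. The ids, the date_spec
fields, and the 15-minute recheck value are illustrative assumptions, not
a tested configuration:

    <!-- Added to the crm_config section of the CIB; all ids are made up. -->
    <cluster_property_set id="maintenance-window">
      <rule id="maintenance-window-rule" score="INFINITY">
        <date_expression id="maintenance-window-time" operation="date_spec">
          <!-- matches 22:30:00 through 22:59:59 every day -->
          <date_spec id="maintenance-window-spec" hours="22" minutes="30-59"/>
        </date_expression>
      </rule>
      <nvpair id="maintenance-window-on" name="maintenance-mode" value="true"/>
    </cluster_property_set>

    <!-- With cluster-recheck-interval set to 15min as an ordinary cluster
         property, entering and leaving this window can each be detected up
         to about 15 minutes late, as described above. -->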