Re: [ClusterLabs] Behavior of corosync kill
Thanks Ken. Let me check the resource-stickiness property at my end.

Regards,
Rohit

On Tue, Aug 25, 2020 at 8:07 PM Ken Gaillot wrote:
> On Tue, 2020-08-25 at 12:28 +0530, Rohit Saini wrote:
> > Hi All,
> > I am seeing the following behavior. Can someone clarify whether this is
> > intended behavior, and if so, why? Please let me know if logs are needed
> > for better clarity.
> >
> > 1. Without Stonith:
> > Continuously killing corosync on the master causes a switchover and makes
> > another node the master. But as soon as corosync recovers, it becomes
> > master again. Shouldn't it become the slave now?
>
> Where resources are active, or take on the master role, depends on the
> cluster configuration, not on past node issues.
>
> You may be interested in the resource-stickiness property:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_resource_meta_attributes
>
> > 2. With Stonith:
> > Sometimes, on a corosync kill, that node gets shot by stonith, but
> > sometimes not. I am not able to understand this fluctuating behavior.
> > Does it have anything to do with the faster recovery of corosync, which
> > stonith fails to detect?
>
> It's not failing to detect it, but recovering satisfactorily without
> fencing.
>
> At any given time, one of the cluster nodes is elected the designated
> controller (DC). When new events occur, such as a node leaving the
> corosync ring unexpectedly, the DC runs pacemaker's scheduler to see what
> needs to be done about it. In the case of a lost node, it will also erase
> the node's resource history, to indicate that the state of resources on
> the node is no longer accurately known.
>
> If no further events happened during that time, the scheduler would
> schedule fencing, and the cluster would carry it out.
>
> However, systemd monitors corosync and will restart it if it dies. If
> systemd respawns corosync fast enough (it is often sub-second), the node
> will rejoin the cluster before the scheduler completes its calculations
> and fencing is initiated. Rejoining the cluster includes re-syncing its
> resource history with the other nodes.
>
> The node join is considered new information, so the former scheduler run
> is cancelled (the "transition" is "aborted") and a new one is started.
> Since the node is now happily part of the cluster, and the resource
> history tells us the state of all resources on the node, no fencing is
> needed.
>
> > I am using
> > corosync-2.4.5-4.el7.x86_64
> > pacemaker-1.1.19-8.el7.x86_64
> > centos 7.6.1810
> >
> > Thanks,
> > Rohit
>
> --
> Ken Gaillot
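In case it helps anyone following the thread, a minimal sketch of setting
stickiness with pcs (the value 100 and the resource name "my-master" are
placeholders, not from the original post; pcs 0.9.x on CentOS 7 assumed):

    # Cluster-wide default: resources prefer to stay where they are
    pcs resource defaults resource-stickiness=100

    # Or per resource, as a meta attribute ("my-master" is an example name)
    pcs resource meta my-master resource-stickiness=100

A non-zero stickiness makes the node that currently runs (or masters) the
resource score higher than an otherwise equal node, so the resource does not
move back just because the original node rejoins.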
Re: [ClusterLabs] Behavior of corosync kill
On Tue, 2020-08-25 at 12:28 +0530, Rohit Saini wrote:
> Hi All,
> I am seeing the following behavior. Can someone clarify whether this is
> intended behavior, and if so, why? Please let me know if logs are needed
> for better clarity.
>
> 1. Without Stonith:
> Continuously killing corosync on the master causes a switchover and makes
> another node the master. But as soon as corosync recovers, it becomes
> master again. Shouldn't it become the slave now?

Where resources are active, or take on the master role, depends on the
cluster configuration, not on past node issues.

You may be interested in the resource-stickiness property:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_resource_meta_attributes

> 2. With Stonith:
> Sometimes, on a corosync kill, that node gets shot by stonith, but
> sometimes not. I am not able to understand this fluctuating behavior.
> Does it have anything to do with the faster recovery of corosync, which
> stonith fails to detect?

It's not failing to detect it, but recovering satisfactorily without
fencing.

At any given time, one of the cluster nodes is elected the designated
controller (DC). When new events occur, such as a node leaving the
corosync ring unexpectedly, the DC runs pacemaker's scheduler to see what
needs to be done about it. In the case of a lost node, it will also erase
the node's resource history, to indicate that the state of resources on
the node is no longer accurately known.

If no further events happened during that time, the scheduler would
schedule fencing, and the cluster would carry it out.

However, systemd monitors corosync and will restart it if it dies. If
systemd respawns corosync fast enough (it is often sub-second), the node
will rejoin the cluster before the scheduler completes its calculations
and fencing is initiated. Rejoining the cluster includes re-syncing its
resource history with the other nodes.

The node join is considered new information, so the former scheduler run
is cancelled (the "transition" is "aborted") and a new one is started.
Since the node is now happily part of the cluster, and the resource
history tells us the state of all resources on the node, no fencing is
needed.

> I am using
> corosync-2.4.5-4.el7.x86_64
> pacemaker-1.1.19-8.el7.x86_64
> centos 7.6.1810
>
> Thanks,
> Rohit

--
Ken Gaillot
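As an aside, one way to see whether systemd will respawn corosync on a node,
and to watch whether the respawn beats the fencing decision, is a sketch like
the following (an EL7-style corosync.service unit is assumed; the exact
properties and values depend on your distribution's unit file):

    # Query the restart policy and restart delay of the corosync unit
    systemctl show corosync.service -p Restart -p RestartUSec

    # Follow corosync and pacemaker logs while reproducing the kill
    journalctl -f -u corosync -u pacemaker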
Re: [ClusterLabs] Antw: [EXT] Re: Format of '--lifetime' in 'pcs resource move'
Hi all,

The lifetime value is indeed expected to be an ISO 8601 duration. I updated
the pcs documentation to clarify that:
https://github.com/ClusterLabs/pcs/commit/1e9650a8fd5b8a0a22911ddca1010de582684971

Please note that constraints are not removed from the CIB when their lifetime
expires. They are rendered ineffective but are still preserved in the CIB. See
the following bugzilla for more details:
https://bugzilla.redhat.com/show_bug.cgi?id=1442116

Regards,
Tomas

On 21. 08. 20 at 7:56, Ulrich Windl wrote:
> Strahil Nikolov wrote on 20.08.2020 at 18:25 in message
> <329b5d02-2bcb-4a2c-bc2b-ca3030e6a...@yahoo.com>:
>> Have you tried the ISO 8601 format? For example: 'PT20M'
>> And watch out not to mix up Minutes with Months ;-)
>>
>> The ISO format is described at:
>> https://manpages.debian.org/testing/crmsh/crm.8.en.html
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On 20 August 2020 at 13:40:16 GMT+03:00, Digimer wrote:
>>> Hi all,
>>>
>>> Reading the pcs man page for the 'move' action, it talks about a
>>> '--lifetime' switch that appears to control when the location constraint
>>> is removed:
>>>
>>>   move <resource id> [destination node] [--master] [lifetime=<lifetime>] [--wait[=n]]
>>>       Move the resource off the node it is currently running on by
>>>       creating a -INFINITY location constraint to ban the node. If
>>>       destination node is specified the resource will be moved to that
>>>       node by creating an INFINITY location constraint to prefer the
>>>       destination node. If --master is used the scope of the command is
>>>       limited to the master role and you must use the promotable clone
>>>       id (instead of the resource id). If lifetime is specified then the
>>>       constraint will expire after that time, otherwise it defaults to
>>>       infinity and the constraint can be cleared manually with 'pcs
>>>       resource clear' or 'pcs constraint delete'. If --wait is
>>>       specified, pcs will wait up to 'n' seconds for the resource to
>>>       move and then return 0 on success or 1 on error. If 'n' is not
>>>       specified it defaults to 60 minutes. If you want the resource to
>>>       preferably avoid running on some nodes but be able to failover to
>>>       them use 'pcs constraint location avoids'.
>>>
>>> I think I want to use this, as we move resources manually for various
>>> reasons where the old host is still able to host the resource should a
>>> node failure occur. So we'd love to immediately remove the location
>>> constraint as soon as the move completes.
>>>
>>> I tried using '--lifetime=60' as a test, assuming the format was
>>> 'seconds', but that was invalid. How is this switch meant to be used?
>>>
>>> Cheers
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.com/w/
>>> "I am, somehow, less interested in the weight and convolutions of
>>> Einstein’s brain than in the near certainty that people of equal talent
>>> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
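To make that concrete, a sketch of how the duration is passed on the command
line (the resource and node names below are placeholders, not from the
original post):

    # Move the resource away and let the constraint expire after 10 minutes
    pcs resource move my_resource node2 lifetime=PT10M

    # The expired constraint stays in the CIB; inspect and clear it manually
    pcs constraint location --full
    pcs resource clear my_resource

Note that lifetime= is passed as an option of the move command itself, not as
a '--lifetime' flag, and the value is an ISO 8601 duration (PT10M = 10
minutes, P10M would be 10 months).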
Re: [ClusterLabs] Behavior of corosync kill
On Tue, Aug 25, 2020 at 10:00 AM Rohit Saini wrote:
>
> Hi All,
> I am seeing the following behavior. Can someone clarify whether this is
> intended behavior, and if so, why? Please let me know if logs are needed
> for better clarity.
>
> 1. Without Stonith:
> Continuously killing corosync on the master causes a switchover and makes
> another node the master. But as soon as corosync recovers, it becomes
> master again. Shouldn't it become the slave now?

It is rather unclear what you are asking. Nodes cannot be master or slave.
Do you mean a specific master/slave resource in the pacemaker configuration?

>
> 2. With Stonith:
> Sometimes, on a corosync kill, that node gets shot by stonith, but
> sometimes not. I am not able to understand this fluctuating behavior.
> Does it have anything to do with the faster recovery of corosync, which
> stonith fails to detect?

This could be, but logs in both cases may give more hints.

>
> I am using
> corosync-2.4.5-4.el7.x86_64
> pacemaker-1.1.19-8.el7.x86_64
> centos 7.6.1810
>
> Thanks,
> Rohit
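For gathering those logs, one option is to run crm_report on any node; the
time window and output path below are only placeholders:

    # Collect logs and configuration from all nodes around the time of the kill
    crm_report --from "2020-08-25 09:30:00" --to "2020-08-25 10:30:00" /tmp/corosync-kill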
[ClusterLabs] Behavior of corosync kill
Hi All,

I am seeing the following behavior. Can someone clarify whether this is
intended behavior, and if so, why? Please let me know if logs are needed for
better clarity.

1. Without Stonith:
Continuously killing corosync on the master causes a switchover and makes
another node the master. But as soon as corosync recovers, it becomes master
again. Shouldn't it become the slave now?

2. With Stonith:
Sometimes, on a corosync kill, that node gets shot by stonith, but sometimes
not. I am not able to understand this fluctuating behavior. Does it have
anything to do with the faster recovery of corosync, which stonith fails to
detect?

I am using:
corosync-2.4.5-4.el7.x86_64
pacemaker-1.1.19-8.el7.x86_64
centos 7.6.1810

Thanks,
Rohit
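For reference, the scenario is roughly the following; the commands are only
illustrative of how the kill and the recovery are being observed, not an
exact transcript of the test:

    # On the current master node: kill corosync abruptly
    pkill -9 corosync

    # On a surviving node: watch membership and resource state during recovery
    crm_mon -1
    corosync-cmapctl | grep members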