Re: [ClusterLabs] Behavior of corosync kill

2020-08-25 Thread Rohit Saini
Thanks, Ken. Let me check the resource-stickiness property at my end.

Regards,
Rohit



Re: [ClusterLabs] Behavior of corosync kill

2020-08-25 Thread Ken Gaillot
On Tue, 2020-08-25 at 12:28 +0530, Rohit Saini wrote:
> Hi All,
> I am seeing the following behavior. Can someone clarify whether this is
> intended behavior? If yes, why so? Please let me know if logs are
> needed for better clarity.
> 
> 1. Without Stonith:
> Continuously killing corosync on the master causes a switchover and makes
> another node the master. But as soon as corosync recovers, the original node
> becomes master again. Shouldn't it become the slave now?

Where resources are active, and which node takes on the master role,
depend on the cluster configuration, not on past node issues.

You may be interested in the resource-stickiness property:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_resource_meta_attributes
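
As a rough sketch (the resource name below is a placeholder and 100 is
just an example value), stickiness can be set cluster-wide or on the
master/slave resource itself with pcs:

   # Default stickiness for all resources:
   pcs resource defaults resource-stickiness=100

   # Or only on the promotable (master/slave) resource:
   pcs resource meta my-ms-resource resource-stickiness=100

With enough stickiness, the promoted instance tends to stay where it is
after a failover rather than move back when the original node recovers,
unless a stronger location preference or master score pulls it back.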


> 2. With Stonith:
> Sometimes, when corosync is killed, that node gets shot by stonith, but
> sometimes not. I am not able to understand this fluctuating behavior. Does
> it have anything to do with faster recovery of corosync, which stonith
> fails to detect?

It is not that stonith fails to detect it; the cluster recovers
satisfactorily without fencing.

At any given time, one of the cluster nodes is elected the designated
controller (DC). When new events occur, such as a node leaving the
corosync ring unexpectedly, the DC runs pacemaker's scheduler to see
what needs to be done about it. In the case of a lost node, it will
also erase the node's resource history, to indicate that the state of
resources on the node is no longer accurately known.
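
As a quick check, the one-shot status output names the current DC:

   # One-shot cluster status; look for the "Current DC:" line
   crm_mon -1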

If no further events happened during that time, the scheduler would
schedule fencing, and the cluster would carry it out.

However, systemd monitors corosync and will restart it if it dies. If
systemd respawns corosync fast enough (often in under a second), the
node will rejoin the cluster before the scheduler completes its
calculations and fencing is initiated. Rejoining the cluster includes
re-syncing its resource history with the other nodes.

The node join is considered new information, so the former scheduler
run is cancelled (the "transition" is "aborted") and a new one is
started. Since the node is now happily part of the cluster, and the
resource history tells us the state of all resources on the node, no
fencing is needed.
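
If you want to confirm which way a given kill went, a couple of commands
are enough (a sketch, assuming the stock CentOS 7 unit names):

   # Is systemd configured to respawn corosync, and how fast?
   systemctl show corosync.service -p Restart -p RestartSec

   # What actually happened around the kill:
   journalctl -u corosync -u pacemaker --since "10 minutes ago"

Whether a kill ends in a quick rejoin or in fencing comes down to how
that respawn time compares with how long the DC takes to get through its
scheduling run.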


> I am using
> corosync-2.4.5-4.el7.x86_64
> pacemaker-1.1.19-8.el7.x86_64
> centos 7.6.1810
> 
> Thanks,
> Rohit
-- 
Ken Gaillot 



Re: [ClusterLabs] Antw: [EXT] Re: Format of '--lifetime' in 'pcs resource move'

2020-08-25 Thread Tomas Jelinek

Hi all,

The lifetime value is indeed expected to be an ISO 8601 duration. I have
updated the pcs documentation to clarify that:

https://github.com/ClusterLabs/pcs/commit/1e9650a8fd5b8a0a22911ddca1010de582684971

Please note that constraints are not removed from the CIB when their
lifetime expires. They are rendered ineffective but still preserved in
the CIB. See the following bugzilla for more details:

https://bugzilla.redhat.com/show_bug.cgi?id=1442116
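
A rough example of the intended usage (the resource and node names are
made up; the syntax follows the man page text quoted further down):

   # Move the resource to node2 for the next 20 minutes (ISO 8601 duration PT20M).
   # Careful: P20M without the T would mean 20 months.
   pcs resource move dummy node2 lifetime=PT20M

   # The constraint stays in the CIB after it expires; list it, or clear it by hand:
   pcs constraint --full
   pcs resource clear dummy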

Regards,
Tomas


On 21. 08. 20 at 7:56, Ulrich Windl wrote:

Strahil Nikolov wrote on 20.08.2020 at 18:25 in message
<329b5d02-2bcb-4a2c-bc2b-ca3030e6a...@yahoo.com>:

Have you tried the ISO 8601 format?
For example: 'PT20M'


And watch out not to mix up minutes with months ;-)



The ISO format is described at:
https://manpages.debian.org/testing/crmsh/crm.8.en.html

Best Regards,
Strahil Nikolov

On 20 August 2020 at 13:40:16 GMT+03:00, Digimer wrote:

Hi all,

  Reading the pcs man page for the 'move' action, I see it talks about a
'--lifetime' switch that appears to control when the location constraint
is removed:


   move [destination node] [--master] [lifetime=] [--wait[=n]]
          Move the resource off the node it is currently running on
          by creating a -INFINITY location constraint to ban the
          node. If destination node is specified the resource will be
          moved to that node by creating an INFINITY location
          constraint to prefer the destination node. If --master is
          used the scope of the command is limited to the master role
          and you must use the promotable clone id (instead of the
          resource id). If lifetime is specified then the constraint
          will expire after that time, otherwise it defaults to
          infinity and the constraint can be cleared manually with
          'pcs resource clear' or 'pcs constraint delete'. If --wait
          is specified, pcs will wait up to 'n' seconds for the
          resource to move and then return 0 on success or 1 on
          error. If 'n' is not specified it defaults to 60 minutes.
          If you want the resource to preferably avoid running on
          some nodes but be able to failover to them use 'pcs
          constraint location avoids'.


I think I want to use this, as we move resources manually for various
reasons where the old host is still able to host the resource should a
node failure occur. So we'd love to remove the location constraint
immediately, as soon as the move completes.

I tried using '--lifetime=60' as a test, assuming the format was
seconds, but that was invalid. How is this switch meant to be used?

Cheers

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay
Gould


Re: [ClusterLabs] Behavior of corosync kill

2020-08-25 Thread Andrei Borzenkov
On Tue, Aug 25, 2020 at 10:00 AM Rohit Saini
 wrote:
>
> Hi All,
> I am seeing the following behavior. Can someone clarify whether this is intended
> behavior? If yes, why so? Please let me know if logs are needed for
> better clarity.
>
> 1. Without Stonith:
> Continuously killing corosync on the master causes a switchover and makes another
> node the master. But as soon as corosync recovers, the original node becomes master
> again. Shouldn't it become the slave now?


It is rather unclear what you are asking. Nodes cannot be master or
slave. Do you mean a specific master/slave resource in the pacemaker
configuration?

>
> 2. With Stonith:
> Sometimes, when corosync is killed, that node gets shot by stonith, but sometimes
> not. I am not able to understand this fluctuating behavior. Does it have anything
> to do with faster recovery of corosync, which stonith fails to detect?

It could be, but logs from both cases would give more hints.
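
Something like the following would capture the relevant window on each
node (just a sketch; the timestamps and the report name are placeholders):

   # Cluster-stack messages around the kill:
   journalctl -u corosync -u pacemaker --since "2020-08-25 09:00" --until "2020-08-25 10:00"

   # Or bundle logs, configuration and status into one archive with pacemaker's crm_report:
   crm_report --from "2020-08-25 09:00:00" /tmp/corosync-kill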

>
> I am using
> corosync-2.4.5-4.el7.x86_64
> pacemaker-1.1.19-8.el7.x86_64
> centos 7.6.1810
>
> Thanks,
> Rohit


[ClusterLabs] Behavior of corosync kill

2020-08-25 Thread Rohit Saini
Hi All,
I am seeing the following behavior. Can someone clarify whether this is
intended behavior? If yes, why so? Please let me know if logs are
needed for better clarity.

1. Without Stonith:
Continuously killing corosync on the master causes a switchover and makes another
node the master. But as soon as corosync recovers, the original node becomes master
again. Shouldn't it become the slave now?

2. With Stonith:
Sometimes, when corosync is killed, that node gets shot by stonith, but
sometimes not. I am not able to understand this fluctuating behavior. Does it
have anything to do with faster recovery of corosync, which stonith fails
to detect?

I am using
corosync-2.4.5-4.el7.x86_64
pacemaker-1.1.19-8.el7.x86_64
centos 7.6.1810

Thanks,
Rohit