[ClusterLabs] Stopping all nodes causes servers to migrate

2021-01-24 Thread Digimer
Hi all,

  Right off the bat: I'm using a custom RA, so this behaviour might be a
bug in my agent.

 I had a test server (srv01-test) running on node 1 (el8-a01n01), and on
node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.

  It appears that Pacemaker asked the VM to migrate to node 2 instead of
stopping it. Once the server was on node 2, I couldn't use 'pcs resource
disable' on it, as pcs reported that the resource was unmanaged, and the
cluster shutdown hung. When I stopped the VM directly and then ran
'pcs resource cleanup', the cluster shutdown completed.
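
  (Roughly, the recovery was: stop the guest outside of pacemaker, then clean
up its history. 'virsh destroy' below just stands in for "directly stopped the
VM" - substitute whatever applies to your setup:

virsh destroy srv01-test          # force-stop the guest behind pacemaker's back
pcs resource cleanup srv01-test   # clear the stale operation so shutdown can finish
)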

  In my agent, I noted that these environment variables had been set:

OCF_RESKEY_name=srv01-test
OCF_RESKEY_CRM_meta_migrate_source=el8-a01n01
OCF_RESKEY_CRM_meta_migrate_target=el8-a01n02
OCF_RESKEY_CRM_meta_on_node=el8-a01n01

  So, as best I can tell, Pacemaker really did ask for a migration. Is
this the case? If not, what environment variables should have been set
in this scenario?
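
  For reference, this is the kind of dispatch I mean - a trimmed-down sketch,
not my real agent, with echo lines as placeholders:

case "$1" in
    migrate_to)
        # As far as I understand, these are only set when a migration is
        # actually being requested:
        #   OCF_RESKEY_CRM_meta_migrate_source - node the server is leaving
        #   OCF_RESKEY_CRM_meta_migrate_target - node the server should land on
        echo "migrate ${OCF_RESKEY_name} -> ${OCF_RESKEY_CRM_meta_migrate_target}"
        ;;
    stop)
        # A plain stop request does not set the migrate_* variables.
        echo "stop ${OCF_RESKEY_name}"
        ;;
esac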

Thanks for any insight!

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] CCIB migration from Pacemaker 1.x to 2.x

2021-01-24 Thread Strahil Nikolov
> How to handle it?
You need to:
- Set up and TEST stonith (rough example below)
- Add a 3rd node (even if it doesn't host any resources) or set up a node for
kronosnet
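
For example (agent name, address and credentials below are only placeholders -
use whatever matches your fence hardware):

pcs stonith create fence_node1 fence_ipmilan \
    ip=10.10.10.1 username=admin password=secret \
    pcmk_host_list=node1 lanplus=1
pcs stonith fence node1    # and really TEST it - the node should get killed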

Best Regards,
Strahil Nikolov



Re: [ClusterLabs] CCIB migration from Pacemaker 1.x to 2.x

2021-01-24 Thread Sharma, Jaikumar

>You need to:
>- Set up and TEST stonith
>- Add a 3rd node (even if it doesn't host any resources) or set up a
>node for kronosnet

Thank you Strahil, looking into it.

Regards
Jaikumar



[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] DRBD ms resource keeps getting demoted

2021-01-24 Thread Ulrich Windl
>>> Stuart Massey wrote on 22.01.2021 at 14:08 in message:
> Hi Ulrich,
> Thank you for your response.
> It makes sense that this would be happening on the failing, secondary/slave
> node, in which case we might expect drbd to be restarted (the service
> entirely, since it is already demoted) on the slave. I don't understand how
> it would affect the master, unless the failing secondary is causing some
> issue with drbd on the primary that causes the monitor on the master to
> time out for some reason. This does not (so far) seem to be the case, as
> the failing node has now been in maintenance mode for a couple of days with
> drbd still running as secondary, so if drbd failures on the secondary were
> causing the monitor on the Master/Primary to time out, we should still be
> seeing that; we are not. The master has yet to demote the drbd resource
> since we put the failing node in maintenance.

When you are in maintenance mode, monitor operations won't run AFAIK.
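
(For reference, the per-node form in pcs - syntax may vary a bit by version,
and, as far as I know, this is also what pauses the recurring monitors there:

pcs node maintenance node1      # resources on node1 become unmanaged, monitors stop
pcs node unmaintenance node1    # hand the node back to the cluster
)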

> We will watch for a bit longer.
> Thanks again
> 
> On Thu, Jan 21, 2021 at 2:23 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> >>> Stuart Massey wrote on 20.01.2021 at 03:41 in message:
>> > Strahil,
>> > That is very kind of you, thanks.
>> > I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with some
>> > meta_attributes and operations having to do with promotion, while in our
>> > (feature set 3.0.14) cib, drbd is in a <master> which does not have those
>> > (maybe since promotion is implicit).
>> > Our cluster has been working quite well for some time, too. I wonder what
>> > would happen if you could hang the os in one of your nodes? If a VM, maybe
>>
>> Unless some other fencing mechanism (like a watchdog timeout) kicks in, the
>> monitor operation is the only thing that can detect a problem (from the
>> cluster's view): the monitor operation would time out. Then the cluster would
>> try to restart the resource (stop, then start). If stop also times out, the
>> node will be fenced.
>>
>> > the constrained secondary could be starved by setting disk IOPs to
>> > something really low. Of course, you are using different versions of just
>> > about everything, as we're on centos7.
>> > Regards,
>> > Stuart
>> >
>> > On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov wrote:
>> >
>> >> I have just built a test cluster (CentOS 8.3) for testing DRBD and it
>> >> works quite fine.
>> >> Actually I followed my notes from
>> >> https://forums.centos.org/viewtopic.php?t=65539 with the exception of
>> >> point 8 due to the "promotable" stuff.
>> >>
>> >> I'm attaching the output of 'pcs cluster cib file' and I hope it helps you
>> >> fix your issue.
>> >>
>> >> Best Regards,
>> >> Strahil Nikolov
>> >>
>> >>
>> >> At 09:32 -0500 on 19.01.2021 (Tue), Stuart Massey wrote:
>> >>
>> >> Ulrich,
>> >> Thank you for that observation. We share that concern.
>> >> We have 4 ea 1G nics active, bonded in pairs. One bonded pair serves the
>> >> "public" (to the intranet) IPs, and the other bonded pair is private to the
>> >> cluster, used for drbd replication. HA will, I hope, be using the "public"
>> >> IP, since that is the route to the IP addresses resolved for the host
>> >> names; that will certainly be the only route to the quorum device. I can
>> >> say that this cluster has run reasonably well for quite some time with this
>> >> configuration prior to the recently developed hardware issues on one of the
>> >> nodes.
>> >> Regards,
>> >> Stuart
>> >>
>> >> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>> >>
>> >> >>> Stuart Massey wrote on 19.01.2021 at 04:46 in message:
>> >> > So, we have a 2-node cluster with a quorum device. One of the nodes (node1)
>> >> > is having some trouble, so we have added constraints to prevent any
>> >> > resources migrating to it, but have not put it in standby, so that drbd in
>> >> > secondary on that node stays in sync. The problems it is having lead to OS
>> >> > lockups that eventually resolve themselves - but that causes it to be
>> >> > temporarily dropped from the cluster by the current master (node2).
>> >> > Sometimes when node1 rejoins, then node2 will demote the drbd ms resource.
>> >> > That causes all resources that depend on it to be stopped, leading to a
>> >> > service outage. They are then restarted on node2, since they can't run on
>> >> > node1 (due to constraints).
>> >> > We are having a hard time understanding why this happens. It seems like
>> >> > there may be some sort of DC contention happening. Does anyone have any
>> >> > idea how we might prevent this from happening?
>> >>
>> >> I think if you are routing high-volume DRBD traffic through "the same
>> >> pipe" as the cluster communication, cluster communication may fail if the
>> >> pipe is saturated.
>> >> I'm not happy with that, but it seems to be that wa