[ClusterLabs] Antw: [EXT] DRBD ms resource keeps getting demoted

2021-01-18 Thread Ulrich Windl
>>> Stuart Massey wrote on 19.01.2021 at 04:46:
> So, we have a 2-node cluster with a quorum device. One of the nodes (node1)
> is having some trouble, so we have added constraints to prevent any
> resources migrating to it, but have not put it in standby, so that drbd in
> secondary on that node stays in sync. The problems it is having lead to OS
> lockups that eventually resolve themselves - but that causes it to be
> temporarily dropped from the cluster by the current master (node2).
> Sometimes when node1 rejoins, then node2 will demote the drbd ms resource.
> That causes all resources that depend on it to be stopped, leading to a
> service outage. They are then restarted on node2, since they can't run on
> node1 (due to constraints).
> We are having a hard time understanding why this happens. It seems like
> there may be some sort of DC contention happening. Does anyone have any
> idea how we might prevent this from happening?
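Constraints of the kind described above are typically plain location bans created with pcs; a minimal sketch, where the resource group and node name are placeholders rather than values from this cluster:

  # keep the dependent application resources off the troubled node,
  # while leaving the DRBD secondary running there so it stays in sync
  pcs constraint location ourApp_group avoids node01.example.com

  # list the constraints currently in effect
  pcs constraint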

I think if you are routing high-volume DRBD traffic through "the same pipe" as
the cluster communication, cluster communication may fail if the pipe is
saturated.
I'm not happy with that, but it seems to be that way.

Maybe running a combination of iftop and iotop could help you understand what's 
going on...
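For example, something along these lines, run while the cluster is busy, would show whether replication traffic and cluster traffic end up on the same link (the interface names are placeholders, not taken from this cluster):

  # bandwidth per interface, one terminal each
  iftop -i eth0    # link carrying corosync/pacemaker traffic
  iftop -i eth1    # link carrying DRBD replication

  # processes currently generating disk I/O
  iotop -o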

Regards,
Ulrich

> Selected messages (de-identified) from pacemaker.log that illustrate our
> suspicion of DC confusion are below. The update_dc and the
> abort_transition_graph (re deletion of lrm) entries seem to always precede the
> demotion, and a demotion seems to always follow (when not already demoted).
> 
> Jan 18 16:52:17 [21938] node02.example.com       crmd:     info:
> do_dc_takeover:        Taking over DC status for this partition
> Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: update_dc:
>     Set DC to node02.example.com (3.0.14)
> Jan 18 16:52:17 [21938] node02.example.com       crmd:     info:
> abort_transition_graph:        Transition aborted by deletion of
> lrm[@id='1']: Resource state removal | cib=0.89.327
> source=abort_unless_down:357
> path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
> Jan 18 16:52:19 [21937] node02.example.com    pengine:     info:
> master_color:  ms_drbd_ourApp: Promoted 0 instances of a possible 1 to
> master
> Jan 18 16:52:19 [21937] node02.example.com    pengine:   notice: LogAction:
>      * Demote     drbd_ourApp:1     (            Master -> Slave
> node02.example.com )




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] DRBD ms resource keeps getting demoted

2021-01-19 Thread Stuart Massey
Ulrich,
Thank you for that observation. We share that concern.
We have four 1G NICs active, bonded in pairs. One bonded pair serves the
"public" (to the intranet) IPs, and the other bonded pair is private to the
cluster, used for DRBD replication. HA will, I hope, be using the "public"
IP, since that is the route to the IP addresses resolved for the host
names; that will certainly be the only route to the quorum device. I can
say that this cluster has run reasonably well for quite some time with this
configuration prior to the recently developed hardware issues on one of the
nodes.
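One rough way to confirm that assumption (interface and bond names below are placeholders) is to check which address corosync has bound to and what state the bonds and the DRBD link are in:

  # address and status of the corosync ring(s)
  corosync-cfgtool -s

  # state of each bond and its slave NICs
  cat /proc/net/bonding/bond0
  cat /proc/net/bonding/bond1

  # on DRBD 8.x, connection and sync state of the replication link
  cat /proc/drbd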
Regards,
Stuart

On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> I think if you are routing high-volume DRBD traffic through "the same
> pipe" as the cluster communication, cluster communication may fail if the
> pipe is saturated.
> I'm not happy with that, but it seems to be that way.
>
> Maybe running a combination of iftop and iotop could help you understand
> what's going on...
>
> Regards,
> Ulrich
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] DRBD ms resource keeps getting demoted

2021-01-19 Thread Strahil Nikolov
I have just built a test cluster (CentOS 8.3) for testing DRBD and it
works quite fine. Actually I followed my notes from
https://forums.centos.org/viewtopic.php?t=65539 with the exception of
point 8 due to the "promotable" stuff.
I'm attaching the output of 'pcs cluster cib file' and I hope it helps
you fix your issue.
Best Regards,
Strahil Nikolov
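Compared with older recipes, the "promotable" difference mentioned above boils down to roughly this on pcs 0.10 / CentOS 8 (resource names and the monitor interval are illustrative, not taken from the attached CIB):

  # define the DRBD primitive
  pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 \
      op monitor interval=31s

  # wrap it in a promotable clone instead of the old "ms" resource
  pcs resource promotable drbd_r0 \
      promoted-max=1 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true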

On 19.01.2021 at 09:32 -0500 (Tue), Stuart Massey wrote:
> Ulrich,
> Thank you for that observation. We share that concern.
> We have four 1G NICs active, bonded in pairs. One bonded pair serves
> the "public" (to the intranet) IPs, and the other bonded pair is
> private to the cluster, used for DRBD replication. HA will, I hope,
> be using the "public" IP, since that is the route to the IP addresses
> resolved for the host names; that will certainly be the only route to
> the quorum device. I can say that this cluster has run reasonably
> well for quite some time with this configuration prior to the
> recently developed hardware issues on one of the nodes.
> Regards,
> Stuart


drbd_cib_el83.xml
Description: XML document
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] DRBD ms resource keeps getting demoted

2021-01-20 Thread Stuart Massey
Strahil,
That is very kind of you, thanks.
I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with some
meta_attributes and operations having to do with promotion, while in our
(feature set 3.0.14) cib, drbd is in a <master> which does not have those
(maybe since promotion is implicit).
Our cluster has been working quite well for some time, too. I wonder what
would happen if you could hang the OS on one of your nodes? If a VM, maybe
the constrained secondary could be starved by setting disk IOPs to
something really low. Of course, you are using different versions of just
about everything, as we're on CentOS 7.
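If those test nodes are KVM/libvirt guests, one way to approximate that kind of starvation would be to throttle the virtual disk (the domain and device names below are placeholders):

  # cap the secondary's disk at a handful of IOPS
  virsh blkdeviotune node01-vm vda --total-iops-sec 10

  # remove the limit again afterwards (0 = unlimited)
  virsh blkdeviotune node01-vm vda --total-iops-sec 0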
Regards,
Stuart

On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov wrote:

> I have just built a test cluster (CentOS 8.3) for testing DRBD and it
> works quite fine.
> Actually I followed my notes from
> https://forums.centos.org/viewtopic.php?t=65539 with the exception of
> point 8 due to the "promotable" stuff.
>
> I'm attaching the output of 'pcs cluster cib file' and I hope it helps you
> fix your issue.
>
> Best Regards,
> Strahil Nikolov
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] DRBD ms resource keeps getting demoted

2021-01-20 Thread hunter86_bg
I guess I missed the OS version, otherwise I would have powered up my 3-node CentOS 7 test cluster. I will check later the settings on my CentOS 7 cluster. Last time I checked it, the drbd was running fine.
Best Regards,
Strahil Nikolov

On Jan 20, 2021 04:41, Stuart Massey wrote:
> Strahil,
> That is very kind of you, thanks.
> I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with some
> meta_attributes and operations having to do with promotion, while in our
> (feature set 3.0.14) cib, drbd is in a <master> which does not have those
> (maybe since promotion is implicit).
> Our cluster has been working quite well for some time, too. I wonder what
> would happen if you could hang the OS on one of your nodes? If a VM, maybe
> the constrained secondary could be starved by setting disk IOPs to
> something really low. Of course, you are using different versions of just
> about everything, as we're on CentOS 7.
> Regards,
> Stuart
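For reference, on CentOS 7 (pcs 0.9, Pacemaker 1.1, feature set 3.0.x) the counterpart of the promotable clone is still the master/slave wrapper; roughly, with placeholder names:

  # define the DRBD primitive
  pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 \
      op monitor interval=31s

  # wrap it in an ms (master/slave) resource
  pcs resource master ms_drbd_r0 drbd_r0 \
      master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true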
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/