Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread Andrei Borzenkov
On 07.06.2021 22:49, Eric Robinson wrote:
> 
> Which is what I don't want to happen. I only want the cluster to failover if 
> one of the lower dependencies fails (drbd or filesystem). If one of the MySQL 
> instances fails, I do not want the cluster to move everything for the sake of 
> that one resource.

So set migration-threshold to INFINITY for this resource.
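A minimal sketch of that with pcs, assuming pcs is in use and the resource is
named mysql_02 as in the thread (adjust to the real resource name):

    # raise the migration threshold so failures of this one resource
    # never push the whole colocated stack to the other node
    pcs resource meta mysql_02 migration-threshold=INFINITY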


> That's like a teacher relocating all the students in the classroom to a new 
> classroom because one of them lost his pencil.
> 

You have already been told that this problem was acknowledged and
support for this scenario has been added. What do you expect now - to jump
ten years back and add this feature from the very beginning so that it
magically appears in the version you are using?

Open service request with your distribution and ask to backport this
feature. Or use newer version where this feature is present.


Re: [ClusterLabs] Colocating a Virtual IP address with multiple resources

2021-06-07 Thread kgaillot
On Mon, 2021-06-07 at 20:37 +, Abithan Kumarasamy wrote:
> Hello Team,
>  
> We have been recently experimenting with some resource model options
> to fulfil the following scenario. We would like to collocate a
> virtual IP resource with multiple db resources. When the virtual IP
> fails over to another node, all the dbs associated should also fail
> over to the new node. We were able to accomplish this with resource
> sets as defined in Example 6.17 in this documentation page: 
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#s-resource-sets-colocation
> . However, whenever a single db fails over to the other node, the
> virtual IP address and the other dbs are not following and failing
> over to the other node. Are there any configurations that may be
> causing this undesired behaviour? We have already tried resource
> sets, colocation constraints, and ordering constraints. Are there any
> other models that we should consider to achieve this solution? Our
> current constraint model looks like this in a simplified manner.
>  

With the above configuration, the resources should fail over all
together. However, the database colocations are limited to the promoted
role; any unpromoted instances can fail over without restrictions.

If you want the dbs to depend only on the IP, and not each other, add
sequential="false" to db-set.
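A minimal sketch of that change with pcs, using hypothetical resource names
db1/db2 for the databases and vip for the address (the existing set constraint
would be removed first and re-added with the extra option):

    # dependents first, common peer last; sequential=false means the dbs
    # depend only on the vip, not on each other
    pcs constraint colocation set db1 db2 role=Master sequential=false \
        set vip setoptions score=INFINITY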

With the exact above configuration, is the promoted role of one of the
databases failing over to a node that's not running the IP?

>  
> Thanks,
> Abithan
-- 
Ken Gaillot 



[ClusterLabs] Colocating a Virtual IP address with multiple resources

2021-06-07 Thread Abithan Kumarasamy
Hello Team,
 
We have been recently experimenting with some resource model options to fulfil the following scenario. We would like to collocate a virtual IP resource with multiple db resources. When the virtual IP fails over to another node, all the dbs associated should also fail over to the new node. We were able to accomplish this with resource sets as defined in Example 6.17 in this documentation page: https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#s-resource-sets-colocation. However, whenever a single db fails over to the other node, the virtual IP address and the other dbs are not following and failing over to the other node. Are there any configurations that may be causing this undesired behaviour? We have already tried resource sets, colocation constraints, and ordering constraints. Are there any other models that we should consider to achieve this solution? Our current constraint model looks like this in a simplified manner.
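The constraint XML itself did not come through in plain text here; a minimal
sketch of this kind of set-based colocation with pcs, using hypothetical names
vip, db1, and db2 (the real resource names and roles will differ):

    # the db set is listed first, the vip it depends on last,
    # mirroring the common-peer layout of Example 6.17
    pcs constraint colocation set db1 db2 set vip setoptions score=INFINITY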
 
Thanks,
Abithan



Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread Antony Stone
On Monday 07 June 2021 at 21:49:45, Eric Robinson wrote:

> > -Original Message-
> > From: kgail...@redhat.com 
> > Sent: Monday, June 7, 2021 2:39 PM
> > To: Strahil Nikolov ; Cluster Labs - All topics
> > related to open-source clustering welcomed ; Eric
> > Robinson 
> > Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
> > 
> > By default, dependent resources in a colocation will affect the placement
> > of the resources they depend on.
> > 
> > In this case, if one of the mysql instances fails and meets its migration
> > threshold, all of the resources will move to another node, to maximize
> > the chance of all of them being able to run.
> 
> Which is what I don't want to happen. I only want the cluster to failover
> if one of the lower dependencies fails (drbd or filesystem). If one of the
> MySQL instances fails, I do not want the cluster to move everything for
> the sake of that one resource. That's like a teacher relocating all the
> students in the classroom to a new classroom because one of them lost his
> pencil.

Okay, so let's focus on what you *do* want to happen.

One MySQL instance fails.  Nothing else does.

What do you want next?

 - Cluster continues with a failed MySQL resource?

 - MySQL resource moves to another node but no other resources move?

 - something else I can't really imagine right now?


I'm sure that if you can define what you want the cluster to do in this 
situation (MySQL fails, all else continues okay), someone here can help you 
explain that to pacemaker.


Antony.

-- 
This email was created using 100% recycled electrons.

   Please reply to the list;
 please *don't* CC me.


Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread Eric Robinson
> -Original Message-
> From: kgail...@redhat.com 
> Sent: Monday, June 7, 2021 2:39 PM
> To: Strahil Nikolov ; Cluster Labs - All topics
> related to open-source clustering welcomed ; Eric
> Robinson 
> Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
>
> On Sun, 2021-06-06 at 08:26 +, Strahil Nikolov wrote:
> > Based on the constraint rules you have mentioned , failure of mysql
> > should not cause a failover to another node. For better insight, you
> > have to be able to reproduce the issue and share the logs with the
> > community.
>
> By default, dependent resources in a colocation will affect the placement of
> the resources they depend on.
>
> In this case, if one of the mysql instances fails and meets its migration
> threshold, all of the resources will move to another node, to maximize the
> chance of all of them being able to run.
>

Which is what I don't want to happen. I only want the cluster to failover if 
one of the lower dependencies fails (drbd or filesystem). If one of the MySQL 
instances fails, I do not want the cluster to move everything for the sake of 
that one resource. That's like a teacher relocating all the students in the 
classroom to a new classroom because one of them lost his pencil.


> >
> > Best Regards,
> > Strahil Nikolov
> >
> > > On Sat, Jun 5, 2021 at 23:33, Eric Robinson
> > >  wrote:
> > > > -Original Message-
> > > > From: Users  On Behalf Of
> > > > kgail...@redhat.com
> > > > Sent: Friday, June 4, 2021 4:49 PM
> > > > To: Cluster Labs - All topics related to open-source clustering
> > > welcomed
> > > > 
> > > > Subject: Re: [ClusterLabs] One Failed Resource = Failover the
> > > Cluster?
> > > >
> > > > On Fri, 2021-06-04 at 19:10 +, Eric Robinson wrote:
> > > > > Sometimes it seems like Pacemaker fails over an entire cluster when
> > > > > only one resource has failed, even though no other resources are
> > > > > dependent on it. Is that expected behavior?
> > > > >
> > > > > For example, suppose I have the following colocation constraints…
> > > > >
> > > > > filesystem with drbd master
> > > > > vip with filesystem
> > > > > mysql_01 with filesystem
> > > > > mysql_02 with filesystem
> > > > > mysql_03 with filesystem
> > > >
> > > > By default, a resource that is colocated with another resource will
> > > > influence that resource's location. This ensures that as many
> > > > resources are active as possible.
> > > >
> > > > So, if any one of the above resources fails and meets its
> > > > migration-threshold, all of the resources will move to another node
> > > > so a recovery attempt can be made for the failed resource.
> > > >
> > > > No resource will be *stopped* due to the failed resource unless it
> > > > depends on it.
> > > >
> > >
> > > Thanks, but I'm confused by your previous two paragraphs. On one
> > > hand, "if any one of the above resources fails and meets its
> > > migration-threshold, all of the resources will move to another
> > > node." Obviously moving resources requires stopping them. But then,
> > > "No resource will be *stopped* due to the failed resource unless it
> > > depends on it." Those two statements seem contradictory to me. Not
> > > trying to be argumentative. Just trying to understand.
> > >
> > > > As of the forthcoming 2.1.0 release, the new "influence" option for
> > > > colocation constraints (and "critical" resource meta-attribute)
> > > > controls whether this effect occurs. If influence is turned off (or
> > > > the resource made non-critical), then the failed resource will just
> > > > stop, and the other resources won't move to try to save it.
> > > >
> > >
> > > That sounds like the feature I'm waiting for. In the example
> > > configuration I provided, I would not want the failure of any mysql
> > > instance to cause cluster failover. I would only want the cluster to
> > > failover if the filesystem or drbd resources failed. Basically, if a
> > > resource breaks or fails to stop, I don't want the whole cluster to
> > > failover if nothing depends on that resource. Just let it stay down
> > > until someone can manually intervene. But if an underlying resource
> > > fails that everything else is dependent on (drbd or filesystem) then
> > > go ahead and failover the cluster.
> > >
> > > > >
> > > > > …and the following order constraints…
> > > > >
> > > > > promote drbd, then start filesystem
> > > > > start filesystem, then start vip
> > > > > start filesystem, then start mysql_01
> > > > > start filesystem, then start mysql_02
> > > > > start filesystem, then start mysql_03
> > > > >
> > > > > Now, if something goes wrong with mysql_02, will Pacemaker try to
> > > > > fail over the whole cluster? And if mysql_02 can’t be run on either
> > > > > cluster, then does Pacemaker refuse to run any resources?
> > > > >
> > > > > I’m asking because I’ve seen some odd behavior like that over the
> > > > > years. Could be my own configuration mistakes, of course.

Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread kgaillot
On Sun, 2021-06-06 at 08:26 +, Strahil Nikolov wrote:
> Based on the constraint rules you have mentioned , failure of mysql
> should not cause a failover to another node. For better insight, you
> have to be able to reproduce the issue and share the logs with the
> community.

By default, dependent resources in a colocation will affect the
placement of the resources they depend on.

In this case, if one of the mysql instances fails and meets its
migration threshold, all of the resources will move to another node, to
maximize the chance of all of them being able to run.
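A minimal sketch of checking and clearing that failure state with pcs,
assuming a resource named mysql_02 as in the thread:

    # failures counted against the resource so far
    pcs resource failcount show mysql_02
    # clear its failure history (and any resulting migration-threshold ban)
    pcs resource cleanup mysql_02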

> 
> Best Regards,
> Strahil Nikolov
> 
> > On Sat, Jun 5, 2021 at 23:33, Eric Robinson
> >  wrote:
> > > -Original Message-
> > > From: Users  On Behalf Of
> > > kgail...@redhat.com
> > > Sent: Friday, June 4, 2021 4:49 PM
> > > To: Cluster Labs - All topics related to open-source clustering
> > welcomed
> > > 
> > > Subject: Re: [ClusterLabs] One Failed Resource = Failover the
> > Cluster?
> > >
> > > On Fri, 2021-06-04 at 19:10 +, Eric Robinson wrote:
> > > > Sometimes it seems like Pacemaker fails over an entire cluster when
> > > > only one resource has failed, even though no other resources are
> > > > dependent on it. Is that expected behavior?
> > > >
> > > > For example, suppose I have the following colocation constraints…
> > > >
> > > > filesystem with drbd master
> > > > vip with filesystem
> > > > mysql_01 with filesystem
> > > > mysql_02 with filesystem
> > > > mysql_03 with filesystem
> > >
> > > By default, a resource that is colocated with another resource will
> > > influence that resource's location. This ensures that as many resources
> > > are active as possible.
> > >
> > > So, if any one of the above resources fails and meets its
> > > migration-threshold, all of the resources will move to another node so
> > > a recovery attempt can be made for the failed resource.
> > >
> > > No resource will be *stopped* due to the failed resource unless it
> > > depends on it.
> > >
> > 
> > Thanks, but I'm confused by your previous two paragraphs. On one
> > hand, "if any one of the above resources fails and meets its
> > migration-threshold, all of the resources will move to another
> > node." Obviously moving resources requires stopping them. But then,
> > "No resource will be *stopped* due to the failed resource unless it
> > depends on it." Those two statements seem contradictory to me. Not
> > trying to be argumentative. Just trying to understand.
> > 
> > > As of the forthcoming 2.1.0 release, the new "influence" option for
> > > colocation constraints (and "critical" resource meta-attribute) controls
> > > whether this effect occurs. If influence is turned off (or the resource
> > > made non-critical), then the failed resource will just stop, and the
> > > other resources won't move to try to save it.
> > >
> > 
> > That sounds like the feature I'm waiting for. In the example
> > configuration I provided, I would not want the failure of any mysql
> > instance to cause cluster failover. I would only want the cluster
> > to failover if the filesystem or drbd resources failed. Basically,
> > if a resource breaks or fails to stop, I don't want the whole
> > cluster to failover if nothing depends on that resource. Just let
> > it stay down until someone can manually intervene. But if an
> > underlying resource fails that everything else is dependent on
> > (drbd or filesystem) then go ahead and failover the cluster.
> > 
> > > >
> > > > …and the following order constraints…
> > > >
> > > > promote drbd, then start filesystem
> > > > start filesystem, then start vip
> > > > start filesystem, then start mysql_01
> > > > start filesystem, then start mysql_02
> > > > start filesystem, then start mysql_03
> > > >
> > > > Now, if something goes wrong with mysql_02, will Pacemaker try to
> > > > fail over the whole cluster? And if mysql_02 can’t be run on either
> > > > cluster, then does Pacemaker refuse to run any resources?
> > > >
> > > > I’m asking because I’ve seen some odd behavior like that over the
> > > > years. Could be my own configuration mistakes, of course.
> > > >
> > > > -Eric
> > > --
> > > Ken Gaillot 
> > >

Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread kgaillot
On Sat, 2021-06-05 at 20:33 +, Eric Robinson wrote:
> > -Original Message-
> > From: Users  On Behalf Of
> > kgail...@redhat.com
> > Sent: Friday, June 4, 2021 4:49 PM
> > To: Cluster Labs - All topics related to open-source clustering
> > welcomed
> > 
> > Subject: Re: [ClusterLabs] One Failed Resource = Failover the
> > Cluster?
> > 
> > On Fri, 2021-06-04 at 19:10 +, Eric Robinson wrote:
> > > Sometimes it seems like Pacemaker fails over an entire cluster
> > > when
> > > only one resource has failed, even though no other resources are
> > > dependent on it. Is that expected behavior?
> > > 
> > > For example, suppose I have the following colocation constraints…
> > > 
> > > filesystem with drbd master
> > > vip with filesystem
> > > mysql_01 with filesystem
> > > mysql_02 with filesystem
> > > mysql_03 with filesystem
> > 
> > By default, a resource that is colocated with another resource will
> > influence
> > that resource's location. This ensures that as many resources are
> > active as
> > possible.
> > 
> > So, if any one of the above resources fails and meets its
> > migration-threshold,
> > all of the resources will move to another node so a recovery
> > attempt can be
> > made for the failed resource.
> > 
> > No resource will be *stopped* due to the failed resource unless it
> > depends
> > on it.
> > 
> 
> Thanks, but I'm confused by your previous two paragraphs. On one
> hand, "if any one of the above resources fails and meets its
> migration-threshold, all of the resources will move to another
> node." Obviously moving resources requires stopping them. But then,
> "No resource will be *stopped* due to the failed resource unless it
> depends on it." Those two statements seem contradictory to me. Not
> trying to be argumentative. Just trying to understand.

Right, I should have said "will be left stopped". I.e., the other
resources might stop and start as part of a move, but they're not going
to stop and stay stopped because something that depends on them failed.
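For the "influence"/"critical" behavior quoted below, a minimal sketch once on
2.1.0 or later, assuming pcs and the resource names from the thread:

    # mark one MySQL instance as non-critical: if it fails, it is left
    # stopped instead of dragging the colocated stack to the other node
    pcs resource meta mysql_02 critical=false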

> 
> > As of the forthcoming 2.1.0 release, the new "influence" option for
> > colocation constraints (and "critical" resource meta-attribute)
> > controls
> > whether this effect occurs. If influence is turned off (or the
> > resource made
> > non-critical), then the failed resource will just stop, and the
> > other resources
> > won't move to try to save it.
> > 
> 
> That sounds like the feature I'm waiting for. In the example
> configuration I provided, I would not want the failure of any mysql
> instance to cause cluster failover. I would only want the cluster to
> failover if the filesystem or drbd resources failed. Basically, if a
> resource breaks or fails to stop, I don't want the whole cluster to
> failover if nothing depends on that resource. Just let it stay down
> until someone can manually intervene. But if an underlying resource
> fails that everything else is dependent on (drbd or filesystem) then
> go ahead and failover the cluster.
> 
> > > 
> > > …and the following order constraints…
> > > 
> > > promote drbd, then start filesystem
> > > start filesystem, then start vip
> > > start filesystem, then start mysql_01
> > > start filesystem, then start mysql_02
> > > start filesystem, then start mysql_03
> > > 
> > > Now, if something goes wrong with mysql_02, will Pacemaker try to
> > > fail
> > > over the whole cluster? And if mysql_02 can’t be run on either
> > > cluster, then does Pacemaker refuse to run any resources?
> > > 
> > > I’m asking because I’ve seen some odd behavior like that over the
> > > years. Could be my own configuration mistakes, of course.
> > > 
> > > -Eric
> > 
> > --
> > Ken Gaillot 
> > 
> 
> Disclaimer : This email and any files transmitted with it are
> confidential and intended solely for intended recipients. If you are
> not the named addressee you should not disseminate, distribute, copy
> or alter this email. Any views or opinions presented in this email
> are solely those of the author and might not represent those of
> Physician Select Management. Warning: Although Physician Select
> Management has taken reasonable precautions to ensure no viruses are
> present in this email, the company cannot accept responsibility for
> any loss or damage arising from the use of this email or attachments.
-- 
Ken Gaillot 



Re: [ClusterLabs] DRBD or XtraDB?

2021-06-07 Thread Bob Marcan
On Mon, 7 Jun 2021 13:21:45 +
Eric Robinson  wrote:

> Looking for opinions here.
> 

https://mariadb.com/kb/en/why-does-mariadb-102-use-innodb-instead-of-xtradb/




[ClusterLabs] DRBD or XtraDB?

2021-06-07 Thread Eric Robinson
Looking for opinions here.

We've been using DRBD for 15 years successfully, but always on clusters with 
about 50 instances of MySQL running and 1TB of storage. Soon, we will refresh 
the environment and deploy much bigger servers with 100+ instances of MySQL and 
15TB+ volumes. With DRBD, I'm getting more concerned about the filesystem 
itself being a SPOF and I'm looking for possible alternatives. What advantages 
or disadvantages would application layer replication (XtraDB) have versus 
replication at the block layer (DRBD)? Obviously, XtraDB avoids the problem of 
the filesystem getting corrupted across all DRBD volumes, but there may also be 
things that make it less than desirable in a Linux HA setup.

Thoughts, opinions, flames?

-Eric




Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.


Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread Eric Robinson
Not even if a mysql resource fails to stop?


From: Strahil Nikolov 
Sent: Sunday, June 6, 2021 3:27 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

Based on the constraint rules you have mentioned , failure of mysql should not 
cause a failover to another node. For better insight, you have to be able to 
reproduce the issue and share the logs with the community.

Best Regards,
Strahil Nikolov
On Sat, Jun 5, 2021 at 23:33, Eric Robinson <eric.robin...@psmnv.com> wrote:
> -Original Message-
> From: Users <users-boun...@clusterlabs.org> On Behalf Of
> kgail...@redhat.com
> Sent: Friday, June 4, 2021 4:49 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
>
> On Fri, 2021-06-04 at 19:10 +, Eric Robinson wrote:
> > Sometimes it seems like Pacemaker fails over an entire cluster when
> > only one resource has failed, even though no other resources are
> > dependent on it. Is that expected behavior?
> >
> > For example, suppose I have the following colocation constraints…
> >
> > filesystem with drbd master
> > vip with filesystem
> > mysql_01 with filesystem
> > mysql_02 with filesystem
> > mysql_03 with filesystem
>
> By default, a resource that is colocated with another resource will influence
> that resource's location. This ensures that as many resources are active as
> possible.
>
> So, if any one of the above resources fails and meets its migration-threshold,
> all of the resources will move to another node so a recovery attempt can be
> made for the failed resource.
>
> No resource will be *stopped* due to the failed resource unless it depends
> on it.
>

Thanks, but I'm confused by your previous two paragraphs. On one hand, "if any 
one of the above resources fails and meets its migration-threshold, all of the 
resources will move to another node." Obviously moving resources requires 
stopping them. But then, "No resource will be *stopped* due to the failed 
resource unless it depends on it." Those two statements seem contradictory to 
me. Not trying to be argumentative. Just trying to understand.

> As of the forthcoming 2.1.0 release, the new "influence" option for
> colocation constraints (and "critical" resource meta-attribute) controls
> whether this effect occurs. If influence is turned off (or the resource made
> non-critical), then the failed resource will just stop, and the other 
> resources
> won't move to try to save it.
>

That sounds like the feature I'm waiting for. In the example configuration I 
provided, I would not want the failure of any mysql instance to cause cluster 
failover. I would only want the cluster to failover if the filesystem or drbd 
resources failed. Basically, if a resource breaks or fails to stop, I don't 
want the whole cluster to failover if nothing depends on that resource. Just 
let it stay down until someone can manually intervene. But if an underlying 
resource fails that everything else is dependent on (drbd or filesystem) then 
go ahead and failover the cluster.

> >
> > …and the following order constraints…
> >
> > promote drbd, then start filesystem
> > start filesystem, then start vip
> > start filesystem, then start mysql_01
> > start filesystem, then start mysql_02
> > start filesystem, then start mysql_03
> >
> > Now, if something goes wrong with mysql_02, will Pacemaker try to fail
> > over the whole cluster? And if mysql_02 can’t be run on either
> > cluster, then does Pacemaker refuse to run any resources?
> >
> > I’m asking because I’ve seen some odd behavior like that over the
> > years. Could be my own configuration mistakes, of course.
> >
> > -Eric
> --
> Ken Gaillot <kgail...@redhat.com>
>
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.


Re: [ClusterLabs] Antw: [EXT] no-quorum-policy=stop never executed, pacemaker stuck in election/integration, corosync running in "new membership" cycles with itself

2021-06-07 Thread Klaus Wenninger

On 6/2/21 10:47 AM, Lars Ellenberg wrote:

lge> > I would have expected corosync to come back with a "stable
lge> > non-quorate membership" of just itself within a very short
lge> > period of time, and pacemaker winning the
lge> > "election"/"integration" with just itself, and then trying
lge> > to call "stop" on everything it knows about.
ken>
ken> That's what I'd expect, too. I'm guessing the corosync cycling is
ken> what's causing the pacemaker cycling, so I'd focus on corosync first.

Any Corosync folks around with some input?
What may cause corosync on an isolated (with iptables DROP rules)
node to keep creating "new membership" with only itself?

Is it a problem with the test setup maybe?
Does an isolated corosync node need to be able
to send the token to itself?
Do the "iptables DROP" rules on the outgoing interfaces prevent that?

iirc dropping outgoing corosync traffic would delay
membership forming. A fact that worries me thinking
of sbd btw. ... ;-)


On Tue, Jun 01, 2021 at 10:31:21AM -0500, kgail...@redhat.com wrote:

On Tue, 2021-06-01 at 13:18 +0200, Ulrich Windl wrote:

Hi!

I can't answer, but I doubt the usefulness of
"no-quorum-policy=stop": If nodes loose quorum, they try to
stop all resources, but "remain" in the cluster (will respond
to network queries (if any arrive).  If one of those "stop"s
fails, the other part of the cluster never knows.  So what can
be done? Should the "other(left)" part of the cluster start
resources, assuming the "other(right)" part of the cluster had
stopped resources successfully?

no-quorum-policy only affects what the non-quorate partition will do.
The quorate partition will still fence the non-quorate part if it is
able, regardless of no-quorum-policy, and won't recover resources until
fencing succeeds.

The context in this case is: "fencing by storage".
DRBD 9 has a "drbd quorum" feature, where you can ask it
to throw IO errors (or freeze) if DRBD quorum is lost,
so data integrity on network partition is protected,
even without fencing on the pacemaker level.

It is rather a "convenience" that the non-quorate
pacemaker on the isolated node should stop everything
that still "survived", especially since the umount is necessary
for DRBD on that node to become secondary again,
which is necessary to be able to re-integrate later
when connectivity is restored.

Yes, fencing on the node level is still necessary for other
scenarios.  But with certain scenarios, avoiding a node level
fence while still being able to also avoid "trouble" once
connectivity is restored would be nice.

And would work nicely here, if the corosync membership
of the isolated node would be stable enough for pacemaker
to finalize "integration" with itself and then (try to) stop
everything, so we have a truly "idle" node when connectivity is
restored.

"trouble":
spurious restart of services ("resource too active ..."),
problems with re-connecting DRBD ("two primaries not allowed")


pcmk 2.0.5, corosync 3.1.0, knet, rhel8
I know fencing "solves" this just fine.

what I'd like to understand though is: what exactly is
corosync or pacemaker waiting for here, why does it not
manage to get to the stage where it would even attempt to
"stop" stuff?

two "rings" aka knet interfaces.
node isolation test with iptables,
INPUT/OUTPUT -j DROP on one interface,
shortly after on the second as well.
node loses quorum (obviously).
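For reference, a minimal sketch of that isolation test, assuming the two knet
interfaces are named eth1 and eth2 (hypothetical names):

    # cut the first ring, then shortly afterwards the second one
    iptables -A INPUT  -i eth1 -j DROP
    iptables -A OUTPUT -o eth1 -j DROP
    sleep 10
    iptables -A INPUT  -i eth2 -j DROP
    iptables -A OUTPUT -o eth2 -j DROP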

pacemaker is expected to no-quorum-policy=stop,
but is "stuck" in Election -> Integration,
while corosync "cycles" between "new membership" (with only
itself, obviously) and "token has not been received in ...",
"sync members ...", "new membership has formed ..."



Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-06-07 Thread Klaus Wenninger

On 6/2/21 10:39 PM, Eric Robinson wrote:

-Original Message-
From: Users  On Behalf Of Andrei
Borzenkov
Sent: Tuesday, June 1, 2021 12:52 PM
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?

On 01.06.2021 19:21, Eric Robinson wrote:

-Original Message-
From: Users  On Behalf Of Klaus
Wenninger
Sent: Monday, May 31, 2021 12:54 AM
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?

On 5/29/21 12:21 AM, Strahil Nikolov wrote:

I agree -> fencing is mandatory.

Agreed that with a proper fencing setup the cluster wouldn't have run
into that state.
But still it might be interesting to find out what has happened.

Thank you for looking past the fencing issue to the real question.

Regardless of whether or not fencing was enabled, there should still be some
indication of what actions the cluster took and why, but it appears that
cluster services just terminated silently.

Not seeing anything in the log snippet either.

Me neither.


Assuming you are running something systemd-based. Centos 7.

Yes. CentOS Linux release 7.5.1804.


Did you check the journal for pacemaker to see what systemd is thinking?
With the standard unit-file systemd should observe pacemakerd and
restart it if it goes away ungracefully.

The only log entry showing Pacemaker startup that I found in any of the
messages files (current and several days of history) was the one when I
started the cluster manually (see below).

Whether systemd's detection of a stopped service gets logged to a file is,
I guess, configuration dependent.
Was more thinking of 'systemctl status pacemaker' or
'journalctl -u pacemaker'.
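A minimal sketch of that, assuming journald still holds the window in question
(the --since/--until values are placeholders):

    systemctl status pacemaker corosync
    # what systemd and the cluster daemons logged around the outage
    journalctl -u pacemaker -u corosync --since "2021-05-28" --until "2021-05-30"
    # core dumps, if systemd-coredump is in use on this host
    coredumpctl list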

If cluster processes stopped or crashed you obviously won't see any logs
from them until they are restarted. You need to look at other system logs -
maybe they record something unusual around this time? Any crash dumps?

The messages log shows continued entries for various pacemaker components, as 
mentioned in a previous email. Could not find any crash dumps.



Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/