Re: [ClusterLabs] DRBD and SQL Server

2022-09-27 Thread Eric Robinson
Hi Madi,

It sounds like you’ve had a lot of good experience. I’m trying to decide 
between paying a premium price for MSSQL Enterprise with Always-On Replication 
or just setting up an Active/Standby scenario with the Standard Edition of 
MSSQL running on DRBD. We have tons of experience with MySQL on DRBD, but not 
with MSSQL. When running MSSQL on DRBD, what’s the cluster stack? How does 
failover work? When using MySQL, the service only runs on one server at a time. 
In a failover, the writable data volume transitions to the standby server and 
then the MySQL service is started on it. Does it work the same way with MSSQL?
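
For reference, our MySQL-on-DRBD stacks follow roughly this pattern in Pacemaker 
(a pcs-style sketch only; the resource names, DRBD resource, device, and mount 
point below are made up for illustration):

# DRBD as a promotable clone; only one node holds the writable (Primary) role
pcs resource create p_drbd0 ocf:linbit:drbd drbd_resource=r0 \
    promotable promoted-max=1 notify=true

# Filesystem, floating IP, and the MySQL service stacked on top of it
pcs resource create p_fs0 ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/var/lib/mysql fstype=ext4
pcs resource create p_vip0 ocf:heartbeat:IPaddr2 ip=10.0.0.10 cidr_netmask=24
pcs resource create p_mysql0 ocf:heartbeat:mysql

# Keep everything with the DRBD Primary and start it in order
pcs constraint colocation add p_fs0 with master p_drbd0-clone INFINITY
pcs constraint order promote p_drbd0-clone then start p_fs0
pcs constraint colocation add p_vip0 with p_fs0
pcs constraint colocation add p_mysql0 with p_fs0
pcs constraint order p_fs0 then p_vip0
pcs constraint order p_fs0 then p_mysql0

On failover, Pacemaker promotes DRBD on the standby, mounts the filesystem, 
moves the VIP, and starts the service there. I assume MSSQL would slot into the 
same layering as a systemd:mssql-server resource, which is really what I'm 
asking.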

-Eric


From: Madison Kelly 
Sent: Monday, September 26, 2022 7:55 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] DRBD and SQL Server

On 2022-09-25 23:49, Eric Robinson wrote:
Hey list,

Anybody have experience running SQL Server on DRBD? I’d ask this in the DRBD 
list but that one is like a ghost town. This list is the next best option.

-Eric

Extensively, yes. Albeit in VMs whose storage was backed by DRBD, though for 
all practical purposes there's no real difference. We've had clients running 
various DB servers for over ten years spanning DRBD 8.3 through to the latest 
9.1.

What's your question?

Madi

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/



Re: [ClusterLabs] DRBD and SQL Server

2022-09-26 Thread Eric Robinson
> -Original Message-
> From: Reid Wahl 
> Sent: Monday, September 26, 2022 3:22 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Eric Robinson 
> Subject: Re: [ClusterLabs] DRBD and SQL Server
>
> On Sun, Sep 25, 2022 at 8:49 PM Eric Robinson 
> wrote:
> >
> > Hey list,
> >
> >
> >
> > Anybody have experience running SQL Server on DRBD? I’d ask this in the
> DRBD list but that one is like a ghost town. This list is the next best 
> option.
>
> I think the DRBD Slack channel is marginally more active than their list, for
> what it's worth.
> >
> >

That's a great suggestion. Thanks!

-Eric




[ClusterLabs] DRBD and SQL Server

2022-09-25 Thread Eric Robinson
Hey list,

Anybody have experience running SQL Server on DRBD? I'd ask this in the DRBD 
list but that one is like a ghost town. This list is the next best option.

-Eric









Re: [ClusterLabs] OT: Linstor/DRBD Problem

2022-04-28 Thread Eric Robinson
We do use DRBD with Linstor, but the question is resolved now. It turns out 
that Linstor does not update the storage pool stats until you create a 
resource. They show correctly now.
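
For the record, the sequence that finally made the numbers refresh for us looked 
roughly like this (resource, node, and pool names below are placeholders, not 
our real ones):

# Capacity was already visible to LVM
vgs

# Define and place a small test resource in the pool...
linstor resource-definition create rd_test
linstor volume-definition create rd_test 1G
linstor resource create node1 rd_test --storage-pool pool_nvme

# ...after which the storage pool stats showed the new free space
linstor storage-pool list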

From: Strahil Nikolov 
Sent: Thursday, April 28, 2022 12:37 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] OT: Linstor/DRBD Problem

Why do you use Linstor and not DRBD ?
As far as I know Linstor is more suitable for Kubernetes/Openshift .

Best Regards,
Strahil Nikolov
On Thu, Apr 28, 2022 at 8:19, Eric Robinson <eric.robin...@psmnv.com> wrote:

This is probably off-topic but I’ll try anyway. Do we have any Linstor gurus 
around here? I’ve read through the Linstor User Guide and all the help screens, 
but I don’t see an answer to this question. We added a new physical drive to 
each of our cluster nodes and extended the LVM volume groups. The VGs now show 
the correct size as expected. However, in Linstor, the storage pools still look 
the same and do not reflect the additional storage space. Any ideas what I 
should check next?



If this is too off topic, I’ll understand.



-Eric






[ClusterLabs] OT: Linstor/DRBD Problem

2022-04-27 Thread Eric Robinson
This is probably off-topic but I'll try anyway. Do we have any Linstor gurus 
around here? I've read through the Linstor User Guide and all the help screens, 
but I don't see an answer to this question. We added a new physical drive to 
each of our cluster nodes and extended the LVM volume groups. The VGs now show 
the correct size as expected. However, in Linstor, the storage pools still look 
the same and do not reflect the additional storage space. Any ideas what I 
should check next?
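
For context, what we did on each node was roughly the following (device and VG 
names are placeholders):

# Add the new disk to LVM and grow the volume group backing the Linstor pool
pvcreate /dev/nvme2n1
vgextend vg_drbdpool /dev/nvme2n1

# The VG reports the new size...
vgs vg_drbdpool

# ...but the Linstor storage pool still shows the old capacity
linstor storage-pool list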

If this is too off topic, I'll understand.

-Eric




Re: [ClusterLabs] If You Were Building a DRBD-Based Cluster Today...

2021-08-19 Thread Eric Robinson
Multi-tenant MySQL clusters.

From: Digimer 
Sent: Wednesday, August 18, 2021 10:23 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] If You Were Building a DRBD-Based Cluster Today...

On 2021-08-18 3:43 p.m., Eric Robinson wrote:
If you were building a DRBD-based cluster today on servers with internal 
storage, what would you use for a filesystem? To be more specific, the servers 
have 6 x 3.2 TB NVME drives and no RAID controller. Would you build an mdraid 
array as your drbd backing device? Maybe a ZFS RAIDZ? How would you feel about 
using ZFS as the filesystem over DRBD to take advantage of filesystem 
compression? Or would you do something else entirely?

What will the use-case be?



[ClusterLabs] If You Were Building a DRBD-Based Cluster Today...

2021-08-18 Thread Eric Robinson
If you were building a DRBD-based cluster today on servers with internal 
storage, what would you use for a filesystem? To be more specific, the servers 
have 6 x 3.2 TB NVME drives and no RAID controller. Would you build an mdraid 
array as your drbd backing device? Maybe a ZFS RAIDZ? How would you feel about 
using ZFS as the filesystem over DRBD to take advantage of filesystem 
compression? Or would you do something else entirely?
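
To make the mdraid option concrete, I'm picturing something along these lines 
(device names and RAID level are just illustrative):

# Software RAID-10 across the six NVMe drives as the DRBD backing device
mdadm --create /dev/md0 --level=10 --raid-devices=6 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 \
    /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1

# ...and the DRBD resource would then simply use it as its backing disk, e.g.
#   disk /dev/md0;   (in the resource's .res file)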



Re: [ClusterLabs] ZFS Opinions?

2021-07-09 Thread Eric Robinson
Lol, I suspect this is one of those, "Go do a bunch of Googling," scenarios. 
I'm doing that, too, of course. That's what led me here.

-Eric


From: Users  On Behalf Of Eric Robinson
Sent: Friday, July 9, 2021 11:39 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: [ClusterLabs] ZFS Opinions?

I've been a diehard ext4 user for a long time. Just when I was starting to get 
excited about the possibilities of using zfs instead, I ran into this 
thread<https://centosfaq.org/centos/understanding-vdo-vs-zfs/> where a guy 
talks about having multiple catastrophic failures with zfs. What do you guys 
think? Good choice for Linux HA with DRBD, or no? Caveats and pitfalls?

-Eric







[ClusterLabs] ZFS Opinions?

2021-07-09 Thread Eric Robinson
I've been a diehard ext4 user for a long time. Just when I was starting to get 
excited about the possibilities of using zfs instead, I ran into this 
thread <https://centosfaq.org/centos/understanding-vdo-vs-zfs/> where a guy 
talks about having multiple catastrophic failures with zfs. What do you guys 
think? Good choice for Linux HA with DRBD, or no? Caveats and pitfalls?

-Eric






Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-08 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Tuesday, June 8, 2021 12:20 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
>
> On 07.06.2021 22:49, Eric Robinson wrote:
> >
> > Which is what I don't want to happen. I only want the cluster to failover if
> one of the lower dependencies fails (drbd or filesystem). If one of the
> MySQL instances fails, I do not want the cluster to move everything for the
> sake of that one resource.
>
> So set migration threshold to infinity for this resource
>

Good suggestion, thanks.
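
For anyone searching the archives later, I believe that translates to something 
like this (resource name taken from my earlier example):

# Never ban a node because of this one resource's failures
pcs resource meta mysql_02 migration-threshold=INFINITY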

>
> > That's like a teacher relocating all the students in the classroom to a new
> classroom because one of them lost his pencil.
> >
>
> You have already been told that this problem was acknowledged and support
> for this scenario was added. What do you expect now - jump ten years back
> and add this feature from the very beginning so it magically appears in the
> version you are using?
>

That's a great idea. I just ordered a time machine from Amazon and went back 
ten years and fixed this issue. (In fact, I can prove it. Remember that guy you 
met in the coffee house ten years ago wearing the red ball cap? That was me 
dropping by to say hi.)

> Open service request with your distribution and ask to backport this feature.
> Or use newer version where this feature is present.
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread Eric Robinson
> -Original Message-
> From: kgail...@redhat.com 
> Sent: Monday, June 7, 2021 2:39 PM
> To: Strahil Nikolov ; Cluster Labs - All topics
> related to open-source clustering welcomed ; Eric
> Robinson 
> Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
>
> On Sun, 2021-06-06 at 08:26 +, Strahil Nikolov wrote:
> > Based on the constraint rules you have mentioned , failure of mysql
> > should not cause a failover to another node. For better insight, you
> > have to be able to reproduce the issue and share the logs with the
> > community.
>
> By default, dependent resources in a colocation will affect the placement of
> the resources they depend on.
>
> In this case, if one of the mysql instances fails and meets its migration
> threshold, all of the resources will move to another node, to maximize the
> chance of all of them being able to run.
>

Which is what I don't want to happen. I only want the cluster to failover if 
one of the lower dependencies fails (drbd or filesystem). If one of the MySQL 
instances fails, I do not want the cluster to move everything for the sake of 
that one resource. That's like a teacher relocating all the students in the 
classroom to a new classroom because one of them lost his pencil.


> >
> > Best Regards,
> > Strahil Nikolov
> >
> > > On Sat, Jun 5, 2021 at 23:33, Eric Robinson
> > >  wrote:
> > > > -Original Message-
> > > > From: Users  On Behalf Of
> > > > kgail...@redhat.com
> > > > Sent: Friday, June 4, 2021 4:49 PM
> > > > To: Cluster Labs - All topics related to open-source clustering
> > > welcomed
> > > > 
> > > > Subject: Re: [ClusterLabs] One Failed Resource = Failover the
> > > Cluster?
> > > >
> > > > On Fri, 2021-06-04 at 19:10 +, Eric Robinson wrote:
> > > > > Sometimes it seems like Pacemaker fails over an entire cluster
> > > when
> > > > > only one resource has failed, even though no other resources
> > > are
> > > > > dependent on it. Is that expected behavior?
> > > > >
> > > > > For example, suppose I have the following colocation
> > > constraints…
> > > > >
> > > > > filesystem with drbd master
> > > > > vip with filesystem
> > > > > mysql_01 with filesystem
> > > > > mysql_02 with filesystem
> > > > > mysql_03 with filesystem
> > > >
> > > > By default, a resource that is colocated with another resource
> > > will influence
> > > > that resource's location. This ensures that as many resources are
> > > active as
> > > > possible.
> > > >
> > > > So, if any one of the above resources fails and meets its
> > > migration- threshold,
> > > > all of the resources will move to another node so a recovery
> > > attempt can be
> > > > made for the failed resource.
> > > >
> > > > No resource will be *stopped* due to the failed resource unless
> > > it depends
> > > > on it.
> > > >
> > >
> > > Thanks, but I'm confused by your previous two paragraphs. On one
> > > hand, "if any one of the above resources fails and meets its
> > > migration- threshold, all of the resources will move to another
> > > node." Obviously moving resources requires stopping them. But then,
> > > "No resource will be *stopped* due to the failed resource unless it
> > > depends on it." Those two statements seem contradictory to me. Not
> > > trying to be argumentative. Just trying to understand.
> > >
> > > > As of the forthcoming 2.1.0 release, the new "influence" option
> > > for
> > > > colocation constraints (and "critical" resource meta-attribute)
> > > controls
> > > > whether this effect occurs. If influence is turned off (or the
> > > resource made
> > > > non-critical), then the failed resource will just stop, and the
> > > other resources
> > > > won't move to try to save it.
> > > >
> > >
> > > That sounds like the feature I'm waiting for. In the example
> > > configuration I provided, I would not want the failure of any mysql
> > > instance to cause cluster failover. I would only want the cluster to
> > > failover if the filesystem or drbd resources failed. Basically, if a
> > > resource breaks or fails to stop, I don't want the whole 

[ClusterLabs] DRBD or XtraDB?

2021-06-07 Thread Eric Robinson
Looking for opinions here.

We've been using DRBD for 15 years successfully, but always on clusters with 
about 50 instances of MySQL running and 1TB of storage. Soon, we will refresh 
the environment and deploy much bigger servers with 100+ instances of MySQL and 
15TB+ volumes. With DRBD, I'm getting more concerned about the filesystem 
itself being a SPOF and I'm looking for possible alternatives. What advantages 
or disadvantages would application layer replication (XtraDB) have versus 
replication at the block layer (DRBD)? Obviously, XtraDB avoids the problem of 
the filesystem getting corrupted across all DRBD volumes, but there may also be 
things that make it less than desirable in a Linux HA setup.

Thoughts, opinions, flames?

-Eric






Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-07 Thread Eric Robinson
Not even if a mysql resource fails to stop?


From: Strahil Nikolov 
Sent: Sunday, June 6, 2021 3:27 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

Based on the constraint rules you have mentioned , failure of mysql should not 
cause a failover to another node. For better insight, you have to be able to 
reproduce the issue and share the logs with the community.

Best Regards,
Strahil Nikolov
On Sat, Jun 5, 2021 at 23:33, Eric Robinson <eric.robin...@psmnv.com> wrote:
> -Original Message-
> From: Users <users-boun...@clusterlabs.org> On Behalf Of kgail...@redhat.com
> Sent: Friday, June 4, 2021 4:49 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
>
> On Fri, 2021-06-04 at 19:10 +, Eric Robinson wrote:
> > Sometimes it seems like Pacemaker fails over an entire cluster when
> > only one resource has failed, even though no other resources are
> > dependent on it. Is that expected behavior?
> >
> > For example, suppose I have the following colocation constraints…
> >
> > filesystem with drbd master
> > vip with filesystem
> > mysql_01 with filesystem
> > mysql_02 with filesystem
> > mysql_03 with filesystem
>
> By default, a resource that is colocated with another resource will influence
> that resource's location. This ensures that as many resources are active as
> possible.
>
> So, if any one of the above resources fails and meets its migration- 
> threshold,
> all of the resources will move to another node so a recovery attempt can be
> made for the failed resource.
>
> No resource will be *stopped* due to the failed resource unless it depends
> on it.
>

Thanks, but I'm confused by your previous two paragraphs. On one hand, "if any 
one of the above resources fails and meets its migration- threshold, all of the 
resources will move to another node." Obviously moving resources requires 
stopping them. But then, "No resource will be *stopped* due to the failed 
resource unless it depends on it." Those two statements seem contradictory to 
me. Not trying to be argumentative. Just trying to understand.

> As of the forthcoming 2.1.0 release, the new "influence" option for
> colocation constraints (and "critical" resource meta-attribute) controls
> whether this effect occurs. If influence is turned off (or the resource made
> non-critical), then the failed resource will just stop, and the other 
> resources
> won't move to try to save it.
>

That sounds like the feature I'm waiting for. In the example configuration I 
provided, I would not want the failure of any mysql instance to cause cluster 
failover. I would only want the cluster to failover if the filesystem or drbd 
resources failed. Basically, if a resource breaks or fails to stop, I don't 
want the whole cluster to failover if nothing depends on that resource. Just 
let it stay down until someone can manually intervene. But if an underlying 
resource fails that everything else is dependent on (drbd or filesystem) then 
go ahead and failover the cluster.

> >
> > …and the following order constraints…
> >
> > promote drbd, then start filesystem
> > start filesystem, then start vip
> > start filesystem, then start mysql_01
> > start filesystem, then start mysql_02
> > start filesystem, then start mysql_03
> >
> > Now, if something goes wrong with mysql_02, will Pacemaker try to fail
> > over the whole cluster? And if mysql_02 can’t be run on either
> > cluster, then does Pacemaker refuse to run any resources?
> >
> > I’m asking because I’ve seen some odd behavior like that over the
> > years. Could be my own configuration mistakes, of course.
> >
> > -Eric
> --
> Ken Gaillot mailto:kgail...@redhat.com>>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-05 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of
> kgail...@redhat.com
> Sent: Friday, June 4, 2021 4:49 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
>
> On Fri, 2021-06-04 at 19:10 +, Eric Robinson wrote:
> > Sometimes it seems like Pacemaker fails over an entire cluster when
> > only one resource has failed, even though no other resources are
> > dependent on it. Is that expected behavior?
> >
> > For example, suppose I have the following colocation constraints…
> >
> > filesystem with drbd master
> > vip with filesystem
> > mysql_01 with filesystem
> > mysql_02 with filesystem
> > mysql_03 with filesystem
>
> By default, a resource that is colocated with another resource will influence
> that resource's location. This ensures that as many resources are active as
> possible.
>
> So, if any one of the above resources fails and meets its migration- 
> threshold,
> all of the resources will move to another node so a recovery attempt can be
> made for the failed resource.
>
> No resource will be *stopped* due to the failed resource unless it depends
> on it.
>

Thanks, but I'm confused by your previous two paragraphs. On one hand, "if any 
one of the above resources fails and meets its migration- threshold, all of the 
resources will move to another node." Obviously moving resources requires 
stopping them. But then, "No resource will be *stopped* due to the failed 
resource unless it depends on it." Those two statements seem contradictory to 
me. Not trying to be argumentative. Just trying to understand.

> As of the forthcoming 2.1.0 release, the new "influence" option for
> colocation constraints (and "critical" resource meta-attribute) controls
> whether this effect occurs. If influence is turned off (or the resource made
> non-critical), then the failed resource will just stop, and the other 
> resources
> won't move to try to save it.
>

That sounds like the feature I'm waiting for. In the example configuration I 
provided, I would not want the failure of any mysql instance to cause cluster 
failover. I would only want the cluster to failover if the filesystem or drbd 
resources failed. Basically, if a resource breaks or fails to stop, I don't 
want the whole cluster to failover if nothing depends on that resource. Just 
let it stay down until someone can manually intervene. But if an underlying 
resource fails that everything else is dependent on (drbd or filesystem) then 
go ahead and failover the cluster.
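
If I'm reading that right, once we're on 2.1.x it would be something like this 
for each app-layer resource (a guess at the spelling; I haven't tried it yet):

# Mark the instance as non-critical so its failure doesn't drag the
# whole stack to the other node (Pacemaker 2.1+ "critical" meta-attribute)
pcs resource meta mysql_02 critical=false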

> >
> > …and the following order constraints…
> >
> > promote drbd, then start filesystem
> > start filesystem, then start vip
> > start filesystem, then start mysql_01
> > start filesystem, then start mysql_02
> > start filesystem, then start mysql_03
> >
> > Now, if something goes wrong with mysql_02, will Pacemaker try to fail
> > over the whole cluster? And if mysql_02 can’t be run on either
> > cluster, then does Pacemaker refuse to run any resources?
> >
> > I’m asking because I’ve seen some odd behavior like that over the
> > years. Could be my own configuration mistakes, of course.
> >
> > -Eric
> --
> Ken Gaillot 
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-04 Thread Eric Robinson
Sometimes it seems like Pacemaker fails over an entire cluster when only one 
resource has failed, even though no other resources are dependent on it. Is 
that expected behavior?

For example, suppose I have the following colocation constraints...

filesystem with drbd master
vip with filesystem
mysql_01 with filesystem
mysql_02 with filesystem
mysql_03 with filesystem

...and the following order constraints...

promote drbd, then start filesystem
start filesystem, then start vip
start filesystem, then start mysql_01
start filesystem, then start mysql_02
start filesystem, then start mysql_03
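
In pcs terms, that's roughly (resource ids as in this example; the DRBD clone 
name is assumed):

pcs constraint colocation add filesystem with master drbd-clone INFINITY
pcs constraint colocation add mysql_01 with filesystem   # likewise vip, mysql_02, mysql_03
pcs constraint order promote drbd-clone then start filesystem
pcs constraint order filesystem then mysql_01            # likewise vip, mysql_02, mysql_03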

Now, if something goes wrong with mysql_02, will Pacemaker try to fail over the 
whole cluster? And if mysql_02 can't be run on either cluster, then does 
Pacemaker refuse to run any resources?

I'm asking because I've seen some odd behavior like that over the years. Could 
be my own configuration mistakes, of course.

-Eric







Re: [ClusterLabs] What Does the Monitor Action of the IPaddr2 RA Actually Do?

2021-06-03 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Thursday, June 3, 2021 12:23 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] What Does the Monitor Action of the IPaddr2 RA
> Actually Do?
>
> On 02.06.2021 23:38, Eric Robinson wrote:
> >> -Original Message-
> >> From: Users  On Behalf Of Andrei
> >> Borzenkov
> >> Sent: Tuesday, June 1, 2021 1:14 PM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] What Does the Monitor Action of the
> >> IPaddr2 RA Actually Do?
> >>
> >> On 01.06.2021 19:50, Eric Robinson wrote:
> >>> This is related to another question I currently have ongoing.
> >>>
> >>> I see in the logs that monitoring failed for a VIP resource, and
> >>> that may
> >> have been responsible for node failover. I read the code for the
> >> IPaddr2 RA but it is not clear to me exactly what the monitor action
> >> looks for to determine resource health.
> >>>
> >>> May 27 09:55:31 001store01a crmd[92171]:  notice: State transition
> >>> S_IDLE - S_POLICY_ENGINE May 27 09:55:31 001store01a
> pengine[92170]: warning:
> >> Processing failed op monitor for p_vip_ftpclust01 on 001store01a:
> >> unknown error (1)
> >>>
> >>
> >> enable trace for monitor operation. See e.g.
> >> https://www.suse.com/support/kb/doc/?id=19138.
> >>
> >
> > How does that differ from just doing debug-start, debug-stop, or debug-
> monitor?
> >
>
> I do not know what these do.

# pcs resource debug-<op>

...where <op> is one of start, stop, or monitor, shows the stdout from the 
resource agent.
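
For example, to exercise the monitor for the VIP from my other thread and see 
the agent's output:

pcs resource debug-monitor p_vip_ftpclust01 --full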

> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-06-02 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Tuesday, June 1, 2021 12:52 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>
> On 01.06.2021 19:21, Eric Robinson wrote:
> >
> >> -Original Message-
> >> From: Users  On Behalf Of Klaus
> >> Wenninger
> >> Sent: Monday, May 31, 2021 12:54 AM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
> >>
> >> On 5/29/21 12:21 AM, Strahil Nikolov wrote:
> >>> I agree -> fencing is mandatory.
> >> Agreed that with proper fencing setup the cluster wouldn'thave run
> >> into that state.
> >> But still it might be interesting to find out what has happened.
> >
> > Thank you for looking past the fencing issue to the real question.
> Regardless of whether or not fencing was enabled, there should still be some
> indication of what actions the cluster took and why, but it appears that
> cluster services just terminated silently.
> >
> >> Not seeing anything in the log snippet either.
> >
> > Me neither.
> >
> >> Assuming you are running something systemd-based. Centos 7.
> >
> > Yes. CentOS Linux release 7.5.1804.
> >
> >> Did you check the journal for pacemaker to see what systemd is thinking?
> >> With the standard unit-file systemd should observe pacemakerd and
> >> restart it if it goes away ungracefully.
> >
> > The only log entry showing Pacemaker startup that I found in any of the
> messages files (current and several days of history) was the one when I
> started the cluster manually (see below).
> >
>
> If cluster processes stopped or crashed you obviously won't see any logs
> from them until they are restarted. You need to look at other system logs -
> may be they record something unusual around this time? Any crash dumps?

The messages log shows continued entries for various pacemaker components, as 
mentioned in a previous email. Could not find any crash dumps.

> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] What Does the Monitor Action of the IPaddr2 RA Actually Do?

2021-06-02 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Tuesday, June 1, 2021 1:14 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] What Does the Monitor Action of the IPaddr2 RA
> Actually Do?
>
> On 01.06.2021 19:50, Eric Robinson wrote:
> > This is related to another question I currently have ongoing.
> >
> > I see in the logs that monitoring failed for a VIP resource, and that may
> have been responsible for node failover. I read the code for the IPaddr2 RA
> but it is not clear to me exactly what the monitor action looks for to
> determine resource health.
> >
> > May 27 09:55:31 001store01a crmd[92171]:  notice: State transition S_IDLE -
> > S_POLICY_ENGINE May 27 09:55:31 001store01a pengine[92170]: warning:
> Processing failed op monitor for p_vip_ftpclust01 on 001store01a: unknown
> error (1)
> >
>
> enable trace for monitor operation. See e.g.
> https://www.suse.com/support/kb/doc/?id=19138.
>

How does that differ from just doing debug-start, debug-stop, or debug-monitor?

> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] What Does the Monitor Action of the IPaddr2 RA Actually Do?

2021-06-01 Thread Eric Robinson
This is related to another question I currently have ongoing.

I see in the logs that monitoring failed for a VIP resource, and that may have 
been responsible for node failover. I read the code for the IPaddr2 RA but it 
is not clear to me exactly what the monitor action looks for to determine 
resource health.

May 27 09:55:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE May 27 09:55:31 001store01a pengine[92170]: warning: Processing 
failed op monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
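
From reading the agent, my understanding is that the monitor action essentially 
just checks whether the address is still configured on an interface, i.e. 
roughly the equivalent of (simplified sketch; the address is a placeholder):

# Succeed if the VIP is still present on an interface, otherwise report not running
VIP=10.0.0.10
if ip -o -f inet addr show | grep -qF "inet ${VIP}/"; then
    echo "OCF_SUCCESS"
else
    echo "OCF_NOT_RUNNING"
fi

...but I'd like to confirm what else it looks at.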

-Eric





Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-06-01 Thread Eric Robinson
store01a crmd[92171]:  notice: Transition 91470 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 10:25:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE
May 27 10:29:08 001store01a systemd: Starting "Sophos Linux Security update"...
May 27 10:29:24 001store01a savd: update.check: Successfully updated Sophos 
Anti-Virus from sdds:SOPHOS
May 27 10:29:24 001store01a systemd: Started "Sophos Linux Security update".
May 27 10:30:01 001store01a systemd: Started Session 203004 of user root.
May 27 10:30:01 001store01a systemd: Starting Session 203004 of user root.
May 27 10:40:01 001store01a systemd: Started Session 203005 of user root.
May 27 10:40:01 001store01a systemd: Starting Session 203005 of user root.
May 27 10:40:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
May 27 10:40:31 001store01a pengine[92170]: warning: Processing failed op 
monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
May 27 10:40:31 001store01a pengine[92170]:  notice: Calculated transition 
91471, saving inputs in /var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 10:40:31 001store01a crmd[92171]:  notice: Transition 91471 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 10:40:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE
May 27 10:49:11 001store01a rsyslogd: imjournal: journal reloaded... [v8.24.0 
try http://www.rsyslog.com/e/0 ]
May 27 10:50:01 001store01a systemd: Started Session 203006 of user root.
May 27 10:50:01 001store01a systemd: Starting Session 203006 of user root.
May 27 10:51:16 001store01a python: 2021-05-27T10:51:16.375599Z INFO ExtHandler 
ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.2.54.2 is running as the goal state 
agent [DEBUG HeartbeatCounter: 2654;HeartbeatId: 
9010FCD6-7B1C-44D7-9DAD-DF324A613FD1;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpd
ate: 1]
May 27 10:55:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
May 27 10:55:31 001store01a pengine[92170]: warning: Processing failed op 
monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
May 27 10:55:31 001store01a pengine[92170]:  notice: Calculated transition 
91472, saving inputs in /var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 10:55:31 001store01a crmd[92171]:  notice: Transition 91472 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 10:55:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE

> Did you find any core-dumps?

No.

>
> Regards,
> Klaus
> >
> > You can enable the debug logs by editing corosync.conf or
> > /etc/sysconfig/pacemaker.
> >
> > In case simple reload doesn't work, you can set the cluster in global
> > maintenance, stop and then start the stack.
> >
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Fri, May 28, 2021 at 22:13, Digimer
> >  wrote:
> > On 2021-05-28 3:08 p.m., Eric Robinson wrote:
> > >
> > >> -----Original Message-----
> > >> From: Digimer <li...@alteeve.ca>
> > >> Sent: Friday, May 28, 2021 12:43 PM
> > >> To: Cluster Labs - All topics related to open-source clustering
> > >> welcomed <users@clusterlabs.org>; Eric Robinson <eric.robin...@psmnv.com>;
> > >> Strahil Nikolov <hunter86...@yahoo.com>
> > >> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
> > >>
> > >> Shared storage is not what triggers the need for fencing.
> > Coordinating actions
> > >> is what triggers the need. Specifically; If you can run
> > resource on both/all
> > >> nodes at the same time, you don't need HA. If you can't, you
> > need fencing.
> > >>
> > >> Digimer
> > >
> > > Thanks. That said, there is no fencing, so any thoughts on why
> > the node behaved the way it did?
> >
> > Without fencing, when a communication or membership issues arises,
> > it's
> > hard to predict what will happen.
> >
> > I don't see anything in the short log snippet to indicate what
> > happened.
> > What's on the other node during the event? When did the node
> disappear
> > and when was it rejoined, to help find relevant log entries?
> >
> > Going forward, if you want predictable and reliable operation,

Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-28 Thread Eric Robinson

> -Original Message-
> From: Digimer 
> Sent: Friday, May 28, 2021 12:43 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Eric Robinson ; Strahil
> Nikolov 
> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>
> Shared storage is not what triggers the need for fencing. Coordinating actions
> is what triggers the need. Specifically; If you can run resource on both/all
> nodes at the same time, you don't need HA. If you can't, you need fencing.
>
> Digimer

Thanks. That said, there is no fencing, so any thoughts on why the node behaved 
the way it did?

>
> On 2021-05-28 1:19 p.m., Eric Robinson wrote:
> > There is no fencing agent on this cluster and no shared storage.
> >
> > -Eric
> >
> > *From:* Strahil Nikolov 
> > *Sent:* Friday, May 28, 2021 10:08 AM
> > *To:* Cluster Labs - All topics related to open-source clustering
> > welcomed ; Eric Robinson
> > 
> > *Subject:* Re: [ClusterLabs] Cluster Stopped, No Messages?
> >
> > what is your fencing agent ?
> >
> > Best Regards,
> >
> > Strahil Nikolov
> >
> > On Thu, May 27, 2021 at 20:52, Eric Robinson
> > <eric.robin...@psmnv.com> wrote:
> >
> > We found one of our cluster nodes down this morning. The server was
> > up but cluster services were not running. Upon examination of the
> > logs, we found that the cluster just stopped around 9:40:31 and then
> > I started it up manually (pcs cluster start) at 11:49:48. I can’t
> > imagine that Pacemaker just randomly terminates. Any thoughts why it
> > would behave this way?
> >
> >
> >
> >
> >
> > May 27 09:25:31 [92170] 001store01apengine:   notice:
> > process_pe_message:   Calculated transition 91482, saving inputs in
> > /var/lib/pacemaker/pengine/pe-input-756.bz2
> >
> > May 27 09:25:31 [92171] 001store01a   crmd: info:
> > do_state_transition:  State transition S_POLICY_ENGINE ->
> > S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> > origin=handle_response
> >
> > May 27 09:25:31 [92171] 001store01a   crmd: info:
> > do_te_invoke: Processing graph 91482
> > (ref=pe_calc-dc-1622121931-124396) derived from
> > /var/lib/pacemaker/pengine/pe-input-756.bz2
> >
> > May 27 09:25:31 [92171] 001store01a   crmd:   notice:
> > run_graph:Transition 91482 (Complete=0, Pending=0, Fired=0,
> > Skipped=0, Incomplete=0,
> > Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
> >
> > May 27 09:25:31 [92171] 001store01a   crmd: info:
> > do_log:   Input I_TE_SUCCESS received in state
> > S_TRANSITION_ENGINE from notify_crmd
> >
> > May 27 09:25:31 [92171] 001store01a   crmd:   notice:
> > do_state_transition:  State transition S_TRANSITION_ENGINE -> S_IDLE
> > | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
> >
> > May 27 09:40:31 [92171] 001store01a   crmd: info:
> > crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped
> > (90ms)
> >
> > May 27 09:40:31 [92171] 001store01a   crmd:   notice:
> > do_state_transition:  State transition S_IDLE -> S_POLICY_ENGINE |
> > input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
> >
> > May 27 09:40:31 [92171] 001store01a   crmd: info:
> > do_state_transition:  Progressed to state S_POLICY_ENGINE after
> > C_TIMER_POPPED
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > process_pe_message:   Input has not changed since last time, not
> > saving to disk
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > determine_online_status:  Node 001store01a is online
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > determine_op_status:  Operation monitor found resource
> > p_pure-ftpd-itls active on 001store01a
> >
> > May 27 09:40:31 [92170] 001store01apengine:  warning:
> > unpack_rsc_op_failure:Processing failed op monitor for
> > p_vip_ftpclust01 on 001store01a: unknown error (1)
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > determine_op_status:  Operation monitor found resource
> > p_pure-ftpd-etls active on 001store01a
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > u

Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-28 Thread Eric Robinson
There is no fencing agent on this cluster and no shared storage.

-Eric


From: Strahil Nikolov 
Sent: Friday, May 28, 2021 10:08 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?

what is your fencing agent ?

Best Regards,
Strahil Nikolov
On Thu, May 27, 2021 at 20:52, Eric Robinson <eric.robin...@psmnv.com> wrote:

We found one of our cluster nodes down this morning. The server was up but 
cluster services were not running. Upon examination of the logs, we found that 
the cluster just stopped around 9:40:31 and then I started it up manually (pcs 
cluster start) at 11:49:48. I can’t imagine that Pacemaker just randomly 
terminates. Any thoughts why it would behave this way?





May 27 09:25:31 [92170] 001store01apengine:   notice: process_pe_message:   
Calculated transition 91482, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2

May 27 09:25:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response

May 27 09:25:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91482 (ref=pe_calc-dc-1622121931-124396) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2

May 27 09:25:31 [92171] 001store01a   crmd:   notice: run_graph:
Transition 91482 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete

May 27 09:25:31 [92171] 001store01a   crmd: info: do_log:   Input 
I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd

May 27 09:25:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd

May 27 09:40:31 [92171] 001store01a   crmd: info: crm_timer_popped: 
PEngine Recheck Timer (I_PE_CALC) just popped (90ms)

May 27 09:40:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_TIMER_POPPED origin=crm_timer_popped

May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED

May 27 09:40:31 [92170] 001store01apengine: info: process_pe_message:   
Input has not changed since last time, not saving to disk

May 27 09:40:31 [92170] 001store01apengine: info: 
determine_online_status:  Node 001store01a is online

May 27 09:40:31 [92170] 001store01apengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-itls active on 001store01a

May 27 09:40:31 [92170] 001store01apengine:  warning: 
unpack_rsc_op_failure:Processing failed op monitor for p_vip_ftpclust01 
on 001store01a: unknown error (1)

May 27 09:40:31 [92170] 001store01apengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-etls active on 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: unpack_node_loop: 
Node 1 is already processed

May 27 09:40:31 [92170] 001store01apengine: info: unpack_node_loop: 
Node 1 is already processed

May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_vip_ftpclust01(ocf::heartbeat:IPaddr2):   Started 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_replicator(systemd:pure-replicator):  Started 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_pure-ftpd-etls(systemd:pure-ftpd-etls):   Started 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_pure-ftpd-itls(systemd:pure-ftpd-itls):   Started 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_vip_ftpclust01(Started 001store01a)

May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_replicator(Started 001store01a)

May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_pure-ftpd-etls(Started 001store01a)

May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_pure-ftpd-itls(Started 001store01a)

May 27 09:40:31 [92170] 001store01apengine:   notice: process_pe_message:   
Calculated transition 91483, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2

May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response

May 27 09:40:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91483 (ref=pe_calc-dc-1622122831-124397) derived from 
/var/lib/pacemaker/pe

[ClusterLabs] Cluster Stopped, No Messages?

2021-05-27 Thread Eric Robinson
We found one of our cluster nodes down this morning. The server was up but 
cluster services were not running. Upon examination of the logs, we found that 
the cluster just stopped around 9:40:31 and then I started it up manually (pcs 
cluster start) at 11:49:48. I can't imagine that Pacemaker just randomly 
terminates. Any thoughts why it would behave this way?


May 27 09:25:31 [92170] 001store01apengine:   notice: process_pe_message:   
Calculated transition 91482, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:25:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
May 27 09:25:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91482 (ref=pe_calc-dc-1622121931-124396) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:25:31 [92171] 001store01a   crmd:   notice: run_graph:
Transition 91482 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 09:25:31 [92171] 001store01a   crmd: info: do_log:   Input 
I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
May 27 09:25:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd
May 27 09:40:31 [92171] 001store01a   crmd: info: crm_timer_popped: 
PEngine Recheck Timer (I_PE_CALC) just popped (90ms)
May 27 09:40:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_TIMER_POPPED origin=crm_timer_popped
May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
May 27 09:40:31 [92170] 001store01apengine: info: process_pe_message:   
Input has not changed since last time, not saving to disk
May 27 09:40:31 [92170] 001store01apengine: info: 
determine_online_status:  Node 001store01a is online
May 27 09:40:31 [92170] 001store01apengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-itls active on 001store01a
May 27 09:40:31 [92170] 001store01apengine:  warning: 
unpack_rsc_op_failure:Processing failed op monitor for p_vip_ftpclust01 
on 001store01a: unknown error (1)
May 27 09:40:31 [92170] 001store01apengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-etls active on 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: unpack_node_loop: 
Node 1 is already processed
May 27 09:40:31 [92170] 001store01apengine: info: unpack_node_loop: 
Node 1 is already processed
May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_vip_ftpclust01(ocf::heartbeat:IPaddr2):   Started 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_replicator(systemd:pure-replicator):  Started 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_pure-ftpd-etls(systemd:pure-ftpd-etls):   Started 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_pure-ftpd-itls(systemd:pure-ftpd-itls):   Started 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_vip_ftpclust01(Started 001store01a)
May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_replicator(Started 001store01a)
May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_pure-ftpd-etls(Started 001store01a)
May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_pure-ftpd-itls(Started 001store01a)
May 27 09:40:31 [92170] 001store01apengine:   notice: process_pe_message:   
Calculated transition 91483, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
May 27 09:40:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91483 (ref=pe_calc-dc-1622122831-124397) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:40:31 [92171] 001store01a   crmd:   notice: run_graph:
Transition 91483 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 09:40:31 [92171] 001store01a   crmd: info: do_log:   Input 
I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
May 27 09:40:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State 

Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-18 Thread Eric Robinson
> > Here are the constraints...
> >
> > [root@ha09a ~]# pcs constraint --full
> > Location Constraints:
> > Ordering Constraints:
> >   promote p_drbd0-clone then start p_vdo0 (kind:Mandatory) (id:order-
> p_drbd0-clone-p_vdo0-mandatory)
> >   promote p_drbd1-clone then start p_vdo1 (kind:Mandatory) (id:order-
> p_drbd1-clone-p_vdo1-mandatory)
> >   start p_vdo0 then start p_fs_clust08 (kind:Mandatory) (id:order-p_vdo0-
> p_fs_clust08-mandatory)
> >   start p_vdo1 then start p_fs_clust09 (kind:Mandatory) (id:order-p_vdo1-
> p_fs_clust09-mandatory)
> > Colocation Constraints:
> >   p_vdo0 with p_drbd0-clone (score:INFINITY) (id:colocation-p_vdo0-
> p_drbd0-clone-INFINITY)
> >   p_vdo1 with p_drbd1-clone (score:INFINITY) (id:colocation-p_vdo1-
> p_drbd1-clone-INFINITY)
>
> This is wrong. It says vdo can be active on every node where a clone
> instance is active. You need colocation with master.
>

I thought the Master constraint was implied by the word "promote." In other 
words, "promote the clone, then start the vdo device." I removed the vdo 
constraints and re-added them with with-rsc-role=Master, and that problem 
appears to be fixed! I am still getting occasional resource failures due to RA 
timeouts, so I will look into that next. Thanks for looking into this.
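For the archive, the corrected constraints look roughly like this (resource names as 
above; exact pcs syntax may differ slightly between pcs versions):

pcs constraint remove colocation-p_vdo0-p_drbd0-clone-INFINITY
pcs constraint remove colocation-p_vdo1-p_drbd1-clone-INFINITY
pcs constraint colocation add p_vdo0 with master p_drbd0-clone INFINITY
pcs constraint colocation add p_vdo1 with master p_drbd1-clone INFINITY

After that, 'pcs constraint --full' should show the colocations with 
(with-rsc-role:Master), so each vdo device can only run where its DRBD clone is promoted.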

-Eric


Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-18 Thread Eric Robinson
> Now, that's becoming ridiculous. Nobody said "use whatever upstream
> ships" because upstream probably never intended it to be used in a cluster
> environment. But if you are going to write your own LSB/OCF script anyway,
> you can just as well write your own service. Which is much simpler at this
> point
>

It was late and I was imprecise in what I said. I didn't accuse you of saying 
"use whatever upstream ships." You missed where I said in an earlier email that 
I tried writing my own custom systemd service, and that had the same problems.

> > Also, the assumption that the resource is active does not seem to be safe
> to make. I've been doing a lot of additional testing, and I think the reason
> why systemd, ocf, and lsb scripts have all failed me is because Pacemaker is
> not honoring the order constraints. It keeps trying to start the vdo device
> before promoting drbd on the target node. I will check it some more and
> confirm.
>
> Pacemaker is correctly honoring your constraints. Your constraints do not
> forbid starting VDO on demoted instances. See another e-mail.

I saw your other email and will look into that.

> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: DRBD + VDO HowTo?

2021-05-18 Thread Eric Robinson

> -Original Message-
> From: Users  On Behalf Of Ulrich Windl
> Sent: Tuesday, May 18, 2021 3:03 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] Re: DRBD + VDO HowTo?
>
> >>> Eric Robinson  schrieb am 18.05.2021 um
> >>> 08:58 in
> Nachricht
>  3.prod.outlook.com>
>
> >>  -Original Message-
> >> From: Users  On Behalf Of Ulrich Windl
> >> Sent: Tuesday, May 18, 2021 12:51 AM
> >> To: users@clusterlabs.org
> >> Subject: [ClusterLabs] Antw: [EXT] Re: DRBD + VDO HowTo?
> >>
> >> >>> Eric Robinson  schrieb am 17.05.2021 um
> >> 20:28 in
> >> Nachricht
> >>
>  >> 03.prod.outlook.com>
> > R=$(/usr/bin/vdo status -n $VOL|grep Activate|awk
> '{$1=$1};1'|cut
> >>
> >> I just wonder: What is "awk '{$1=$1};1'" supposed to do?
> >>
> >
> > It replaces all whitespace, including tabs, with single spaces, which
> > makes
>
> > it easy to parse with cut.
> >
> >> I also believe that "grep Activate|awk '{$1=$1};1'|cut -d" " -f2)"
> >> can be
> done
> >> in one awk command.
> >
> > I'm sure it could, if I had ever taken time to learn awk.
>
> I'm not using VDO, but my guess would be:
> awk '/Activate/ { print $2 }'
>
> Probably the "/Activate/" part could be improved making the match more
> specific, but I would have to see the command output...
>

I tested it and you're 100% correct. I learned something!
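
For anyone reading this later, the two forms really are equivalent; a quick 
illustration against a made-up fragment of 'vdo status' output:

echo "  Activate: enabled" | grep Activate | awk '{$1=$1};1' | cut -d" " -f2
echo "  Activate: enabled" | awk '/Activate/ { print $2 }'
# both print: enabled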

> Regards,
> Ulrich
>
> >
> >>
> >> Regards,
> >> Ulrich
> >>
> >>
> >> ___
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> ClusterLabs home: https://www.clusterlabs.org/
> > Disclaimer : This email and any files transmitted with it are
> > confidential and intended solely for intended recipients. If you are
> > not the named addressee you should not disseminate, distribute, copy or
> alter this email.
>
> > Any views or opinions presented in this email are solely those of the
> > author
>
> > and might not represent those of Physician Select Management. Warning:
> > Although Physician Select Management has taken reasonable precautions
> > to ensure no viruses are present in this email, the company cannot
> > accept responsibility for any loss or damage arising from the use of
> > this email or
>
> > attachments.
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-18 Thread Eric Robinson




> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Tuesday, May 18, 2021 3:19 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
>
> On Tue, May 18, 2021 at 10:41 AM Eric Robinson
>  wrote:
> >
> > > - check that device mapper target exists - otherwise no VDO is
> > > possible at all
> > > - check that backing store device is visible - otherwise no VDO is
> > > possible
> > > - and only then possibly call vdo tool to check actual status
> > >
> >
> > Sorry, it is not clear to me what you mean by device mapper target. How
> would I check for the existence of the device mapper target for vdo0?
> >
>
> dmsetup targets
>
> But really, the most simple would be to use systemd service. Then you do
> not really need to monitor anything. Resource is assumed to be active when
> service is started. That is enough to quickly get it going.
>

That was the first thing I tried. The systemd service does not work because it 
wants to stop and start all vdo devices, but mine are on different nodes. Also, 
the assumption that the resource is active does not seem to be safe to make. 
I've been doing a lot of additional testing, and I think the reason why 
systemd, ocf, and lsb scripts have all failed me is because Pacemaker is not 
honoring the order constraints. It keeps trying to start the vdo device before 
promoting drbd on the target node. I will check it some more and confirm.
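
(For later readers: a rough sketch of the pre-checks Andrei describes, assuming a 
volume named vdo0 on top of /dev/drbd0 with DRBD resource ha01_mysql; illustrative 
only, not the script used in this thread.)

# is the vdo device-mapper target registered with the kernel at all?
dmsetup targets | grep -qw vdo || { echo "vdo dm target not loaded"; exit 5; }
# is the backing DRBD device present and Primary on this node?
[ -b /dev/drbd0 ] || { echo "/dev/drbd0 not present"; exit 3; }
drbdadm role ha01_mysql 2>/dev/null | grep -q '^Primary' || { echo "drbd not Primary here"; exit 3; }
# only now ask the vdo manager about this particular volume
/usr/bin/vdo status -n vdo0 >/dev/null 2>&1 || exit 3
exit 0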


> >
> > > -Original Message-
> > > From: Users  On Behalf Of Andrei
> > > Borzenkov
> > > Sent: Tuesday, May 18, 2021 12:22 AM
> > > To: users@clusterlabs.org
> > > Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
> > >
> > > On 17.05.2021 21:28, Eric Robinson wrote:
> > > > Andrei --
> > > >
> > > > Sorry for the novels. Sometimes it is hard to tell whether people
> > > > want all
> > > the configs, logs, and scripts first, or if they want a description
> > > of the problem and what one is trying to accomplish first. I'll send
> > > whatever you want. I am very eager to get to the bottom of this.
> > > >
> > > > I'll start with my custom LSB RA. I can send the Pacemaker config a bit
> later.
> > > >
> > > > [root@ha09a init.d]# ll|grep vdo
> > > > lrwxrwxrwx. 1 root root 9 May 16 10:28 vdo0 -> vdo_multi
> > > > lrwxrwxrwx. 1 root root 9 May 16 10:28 vdo1 -> vdo_multi
> > > > -rwx--. 1 root root  3623 May 16 13:21 vdo_multi
> > > >
> > > > [root@ha09a init.d]#  cat vdo_multi #!/bin/bash
> > > >
> > > > #--custom script for managing vdo volumes
> > > >
> > > > #--functions
> > > > function isActivated() {
> > > > R=$(/usr/bin/vdo status -n $VOL 2>&1)
> > > > if [ $? -ne 0 ]; then
> > > > #--error occurred checking vdo status
> > > > echo "$VOL: an error occurred checking activation
> > > > status on
> > > $MY_HOSTNAME"
> > > > return 1
> > > > fi
> > > > R=$(/usr/bin/vdo status -n $VOL|grep Activate|awk 
> > > > '{$1=$1};1'|cut
> -d"
> > > " -f2)
> > > > echo "$R"
> > > > return 0
> > > > }
> > > >
> > > > function isOnline() {
> > > > R=$(/usr/bin/vdo status -n $VOL 2>&1)
> > > > if [ $? -ne 0 ]; then
> > > > #--error occurred checking vdo status
> > > > echo "$VOL: an error occurred checking activation
> > > > status on
> > > $MY_HOSTNAME"
> > > > return 1
> > > > fi
> > > > R=$(/usr/bin/vdo status -n $VOL|grep "Index status"|awk
> > > '{$1=$1};1'|cut -d" " -f3)
> > > > echo "$R"
> > > > return 0
> > > > }
> > > >
> > > > #--vars
> > > > MY_HOSTNAME=$(hostname -s)
> > > >
> > > > #--get the volume name
> > > > VOL=$(basename $0)
> > > >
> > > > #--get the action
> > > > ACTION=$1
> > > >
> > > > #--take the requested action
> > > > case $ACTION in
> > > >
> > > >  

Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-18 Thread Eric Robinson
> - check that device mapper target exists - otherwise no VDO is possible at all
> - check that backing store device is visible - otherwise no VDO is possible
> - and only then possibly call vdo tool to check actual status
>

Sorry, it is not clear to me what you mean by device mapper target. How would I 
check for the existence of the device mapper target for vdo0?


> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Tuesday, May 18, 2021 12:22 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
>
> On 17.05.2021 21:28, Eric Robinson wrote:
> > Andrei --
> >
> > Sorry for the novels. Sometimes it is hard to tell whether people want all
> the configs, logs, and scripts first, or if they want a description of the 
> problem
> and what one is trying to accomplish first. I'll send whatever you want. I am
> very eager to get to the bottom of this.
> >
> > I'll start with my custom LSB RA. I can send the Pacemaker config a bit 
> > later.
> >
> > [root@ha09a init.d]# ll|grep vdo
> > lrwxrwxrwx. 1 root root 9 May 16 10:28 vdo0 -> vdo_multi
> > lrwxrwxrwx. 1 root root 9 May 16 10:28 vdo1 -> vdo_multi
> > -rwx--. 1 root root  3623 May 16 13:21 vdo_multi
> >
> > [root@ha09a init.d]#  cat vdo_multi
> > #!/bin/bash
> >
> > #--custom script for managing vdo volumes
> >
> > #--functions
> > function isActivated() {
> > R=$(/usr/bin/vdo status -n $VOL 2>&1)
> > if [ $? -ne 0 ]; then
> > #--error occurred checking vdo status
> > echo "$VOL: an error occurred checking activation status on
> $MY_HOSTNAME"
> > return 1
> > fi
> > R=$(/usr/bin/vdo status -n $VOL|grep Activate|awk '{$1=$1};1'|cut 
> > -d"
> " -f2)
> > echo "$R"
> > return 0
> > }
> >
> > function isOnline() {
> > R=$(/usr/bin/vdo status -n $VOL 2>&1)
> > if [ $? -ne 0 ]; then
> > #--error occurred checking vdo status
> > echo "$VOL: an error occurred checking activation status on
> $MY_HOSTNAME"
> > return 1
> > fi
> > R=$(/usr/bin/vdo status -n $VOL|grep "Index status"|awk
> '{$1=$1};1'|cut -d" " -f3)
> > echo "$R"
> > return 0
> > }
> >
> > #--vars
> > MY_HOSTNAME=$(hostname -s)
> >
> > #--get the volume name
> > VOL=$(basename $0)
> >
> > #--get the action
> > ACTION=$1
> >
> > #--take the requested action
> > case $ACTION in
> >
> > start)
> >
> > #--check current status
> > R=$(isOnline "$VOL")
> > if [ $? -ne 0 ]; then
> > echo "error occurred checking $VOL status on
> $MY_HOSTNAME"
> > exit 0
> > fi
> > if [ "$R"  == "online" ]; then
> > echo "running on $MY_HOSTNAME"
> > exit 0 #--lsb: success
> > fi
> >
> > #--enter activation loop
> > ACTIVATED=no
> > TIMER=15
> > while [ $TIMER -ge 0 ]; do
> > R=$(isActivated "$VOL")
> > if [ "$R" == "enabled" ]; then
> > ACTIVATED=yes
> > break
> > fi
> > sleep 1
> > TIMER=$(( TIMER-1 ))
> > done
> > if [ "$ACTIVATED" == "no" ]; then
> > echo "$VOL: not activated on $MY_HOSTNAME"
> > exit 5 #--lsb: not running
> > fi
> >
> > #--enter start loop
> > /usr/bin/vdo start -n $VOL
> > ONLINE=no
> > TIMER=15
> > while [ $TIMER -ge 0 ]; do
> > R=$(isOnline "$VOL")
> > if [ "$R" == "online" ]; then
> > ONLINE=yes
> > break
> > fi
> > sleep 1

Re: [ClusterLabs] Antw: [EXT] Re: DRBD + VDO HowTo?

2021-05-18 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Ulrich Windl
> Sent: Tuesday, May 18, 2021 12:51 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: [EXT] Re: DRBD + VDO HowTo?
>
> >>> Eric Robinson  schrieb am 17.05.2021 um
> 20:28 in
> Nachricht
>  03.prod.outlook.com>
> > R=$(/usr/bin/vdo status -n $VOL|grep Activate|awk '{$1=$1};1'|cut
>
> I just wonder: What is "awk '{$1=$1};1'" supposed to do?
>

It replaces all whitespace, including tabs, with single spaces, which makes it 
easy to parse with cut.

> I also believe that "grep Activate|awk '{$1=$1};1'|cut -d" " -f2)" can be done
> in one awk command.

I'm sure it could, if I had ever taken time to learn awk.

>
> Regards,
> Ulrich
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-17 Thread Eric Robinson
bd0 on ha09b: ok
May 17 22:16:54 ha09b pacemaker-controld[2710]: notice: Result of notify 
operation for p_drbd0 on ha09b: ok
May 17 22:16:54 ha09b kernel: drbd ha01_mysql: Preparing cluster-wide state 
change 610633182 (0->-1 3/1)
May 17 22:16:54 ha09b kernel: drbd ha01_mysql: State change 610633182: 
primary_nodes=1, weak_nodes=FFFC
May 17 22:16:54 ha09b kernel: drbd ha01_mysql: Committing cluster-wide state 
change 610633182 (1ms)
May 17 22:16:54 ha09b kernel: drbd ha01_mysql: role( Secondary -> Primary )
May 17 22:16:54 ha09b pacemaker-controld[2710]: notice: Result of promote 
operation for p_drbd0 on ha09b: ok
May 17 22:16:54 ha09b pacemaker-controld[2710]: notice: Result of notify 
operation for p_drbd0 on ha09b: ok
May 17 22:16:55 ha09b kernel: uds: modprobe: loaded version 8.0.1.6
May 17 22:16:55 ha09b kernel: kvdo: modprobe: loaded version 6.2.3.114
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: underlying device, REQ_FLUSH: 
supported, REQ_FUA: supported
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: Using write policy async 
automatically.
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: loading device 'vdo0'
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: zones: 1 logical, 1 physical, 1 
hash; base threads: 5
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: starting device 'vdo0'
May 17 22:16:55 ha09b kernel: kvdo0:journalQ: VDO commencing normal operation
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: Setting UDS index target state to 
online
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: device 'vdo0' started
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: resuming device 'vdo0'
May 17 22:16:55 ha09b kernel: kvdo0:dmsetup: device 'vdo0' resumed
May 17 22:16:55 ha09b kernel: uds: kvdo0:dedupeQ: loading or rebuilding index: 
dev=/dev/drbd0 offset=4096 size=2781704192
May 17 22:16:55 ha09b kernel: uds: kvdo0:dedupeQ: Using 6 indexing zones for 
concurrency.
May 17 22:16:55 ha09b kernel: kvdo0:packerQ: compression is enabled
May 17 22:16:55 ha09b systemd[1]: Started Device-mapper event daemon.
May 17 22:16:55 ha09b dmeventd[3931]: dmeventd ready for processing.
May 17 22:16:55 ha09b UDS/vdodmeventd[3930]: INFO   (vdodmeventd/3930) VDO 
device vdo0 is now registered with dmeventd for monitoring
May 17 22:16:55 ha09b lvm[3931]: Monitoring VDO pool vdo0.
May 17 22:16:56 ha09b kernel: uds: kvdo0:dedupeQ: loaded index from chapter 0 
through chapter 85
May 17 22:16:56 ha09b pacemaker-controld[2710]: notice: Result of start 
operation for p_vdo0 on ha09b: ok
May 17 22:16:57 ha09b pacemaker-controld[2710]: notice: Result of monitor 
operation for p_vdo0 on ha09b: ok



> -----Original Message-
> From: Users  On Behalf Of Eric Robinson
> Sent: Monday, May 17, 2021 9:49 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
>
> Notice that 'pcs status' shows errors for resource p_vdo0 on node ha09b,
> even after doing 'pcs resource cleanup p_vdo0'.
>
> [root@ha09a ~]# pcs status
> Cluster name: ha09ab
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: ha09a (version 2.0.4-6.el8_3.2-2deceaa3ae) - partition with
> quorum
>   * Last updated: Mon May 17 19:45:41 2021
>   * Last change:  Mon May 17 19:45:37 2021 by hacluster via crmd on ha09b
>   * 2 nodes configured
>   * 6 resource instances configured
>
> Node List:
>   * Online: [ ha09a ha09b ]
>
> Full List of Resources:
>   * Clone Set: p_drbd0-clone [p_drbd0] (promotable):
> * Masters: [ ha09a ]
> * Slaves: [ ha09b ]
>   * Clone Set: p_drbd1-clone [p_drbd1] (promotable):
> * Masters: [ ha09b ]
> * Slaves: [ ha09a ]
>   * p_vdo0  (lsb:vdo0):  Starting ha09a
>   * p_vdo1  (lsb:vdo1):  Started ha09b
>
> Failed Resource Actions:
>   * p_vdo0_monitor_0 on ha09b 'error' (1): call=83, status='complete',
> exitreason='', last-rc-change='2021-05-17 19:45:38 -07:00', queued=0ms,
> exec=175ms
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>
>
> If I debug the monitor action on ha09b, it reports 'not installed,' which 
> makes
> sense because the drbd disk is in standby.
>
> [root@ha09b drbd.d]# pcs resource debug-monitor p_vdo0 Operation
> monitor for p_vdo0 (lsb::vdo0) returned: 'not installed' (5)  >  stdout: error
> occurred checking vdo0 status on ha09b
>
> Should it report something else?
>
> > -Original Message-
> > From: Users  On Behalf Of Eric Robinson
> > Sent: Monday, May 17, 2021 1:37 PM
> > To: Cluster Labs - All topics related to open-source clustering
> > welcomed 
> > Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
> >
> > Andrei --
> >
> > To follow up, here is the Pacemaker config. Let's not talk about
> > fencing or quorum right now.

Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-17 Thread Eric Robinson
Notice that 'pcs status' shows errors for resource p_vdo0 on node ha09b, 
even after doing 'pcs resource cleanup p_vdo0'.

[root@ha09a ~]# pcs status
Cluster name: ha09ab
Cluster Summary:
  * Stack: corosync
  * Current DC: ha09a (version 2.0.4-6.el8_3.2-2deceaa3ae) - partition with 
quorum
  * Last updated: Mon May 17 19:45:41 2021
  * Last change:  Mon May 17 19:45:37 2021 by hacluster via crmd on ha09b
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ ha09a ha09b ]

Full List of Resources:
  * Clone Set: p_drbd0-clone [p_drbd0] (promotable):
* Masters: [ ha09a ]
* Slaves: [ ha09b ]
  * Clone Set: p_drbd1-clone [p_drbd1] (promotable):
* Masters: [ ha09b ]
* Slaves: [ ha09a ]
  * p_vdo0  (lsb:vdo0):  Starting ha09a
  * p_vdo1  (lsb:vdo1):  Started ha09b

Failed Resource Actions:
  * p_vdo0_monitor_0 on ha09b 'error' (1): call=83, status='complete', 
exitreason='', last-rc-change='2021-05-17 19:45:38 -07:00', queued=0ms, 
exec=175ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled


If I debug the monitor action on ha09b, it reports 'not installed,' which makes 
sense because the drbd disk is in standby.

[root@ha09b drbd.d]# pcs resource debug-monitor p_vdo0
Operation monitor for p_vdo0 (lsb::vdo0) returned: 'not installed' (5)
 >  stdout: error occurred checking vdo0 status on ha09b

Should it report something else?
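
(Side note for the archive: for an lsb: resource, the probe Pacemaker runs on every 
node is the script's 'status' action, and the only answer it treats as "fine, just not 
running here" is exit code 3. Anything else, including the 5 / 'not installed' above, 
gets recorded as a failed action on that node. So the status branch probably wants a 
guard along these lines; a sketch, not the script actually in use:)

status)
        if ! /usr/bin/vdo status -n $VOL >/dev/null 2>&1; then
                # vdo cannot see the volume here, e.g. because DRBD is Secondary
                # on this node, so answer "not running" rather than an error
                echo "$VOL not running on $MY_HOSTNAME"
                exit 3 #--lsb: not running
        fi
        R=$(isOnline "$VOL")
        if [ "$R" == "online" ]; then
                echo "$VOL started on $MY_HOSTNAME"
                exit 0 #--lsb: success
        else
                echo "$VOL not started on $MY_HOSTNAME"
                exit 3 #--lsb: not running
        fi
        ;;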

> -Original Message-
> From: Users  On Behalf Of Eric Robinson
> Sent: Monday, May 17, 2021 1:37 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
>
> Andrei --
>
> To follow up, here is the Pacemaker config. Let's not talk about fencing or
> quorum right now. I want to focus on the vdo issue at hand.
>
> [root@ha09a ~]# pcs config
> Cluster Name: ha09ab
> Corosync Nodes:
>  ha09a ha09b
> Pacemaker Nodes:
>  ha09a ha09b
>
> Resources:
>  Clone: p_drbd0-clone
>   Meta Attrs: clone-max=2 clone-node-max=1 notify=true promotable=true
> promoted-max=1 promoted-node-max=1
>   Resource: p_drbd0 (class=ocf provider=linbit type=drbd)
>Attributes: drbd_resource=ha01_mysql
>Operations: demote interval=0s timeout=90 (p_drbd0-demote-interval-0s)
>monitor interval=60s (p_drbd0-monitor-interval-60s)
>notify interval=0s timeout=90 (p_drbd0-notify-interval-0s)
>promote interval=0s timeout=90 (p_drbd0-promote-interval-0s)
>reload interval=0s timeout=30 (p_drbd0-reload-interval-0s)
>start interval=0s timeout=240 (p_drbd0-start-interval-0s)
>stop interval=0s timeout=100 (p_drbd0-stop-interval-0s)
>  Clone: p_drbd1-clone
>   Meta Attrs: clone-max=2 clone-node-max=1 notify=true promotable=true
> promoted-max=1 promoted-node-max=1
>   Resource: p_drbd1 (class=ocf provider=linbit type=drbd)
>Attributes: drbd_resource=ha02_mysql
>Operations: demote interval=0s timeout=90 (p_drbd1-demote-interval-0s)
>monitor interval=60s (p_drbd1-monitor-interval-60s)
>notify interval=0s timeout=90 (p_drbd1-notify-interval-0s)
>promote interval=0s timeout=90 (p_drbd1-promote-interval-0s)
>reload interval=0s timeout=30 (p_drbd1-reload-interval-0s)
>start interval=0s timeout=240 (p_drbd1-start-interval-0s)
>stop interval=0s timeout=100 (p_drbd1-stop-interval-0s)
>  Resource: p_vdo0 (class=lsb type=vdo0)
>   Operations: force-reload interval=0s timeout=15 (p_vdo0-force-reload-
> interval-0s)
>   monitor interval=15 timeout=15 (p_vdo0-monitor-interval-15)
>   restart interval=0s timeout=15 (p_vdo0-restart-interval-0s)
>   start interval=0s timeout=15 (p_vdo0-start-interval-0s)
>   stop interval=0s timeout=15 (p_vdo0-stop-interval-0s)
>  Resource: p_vdo1 (class=lsb type=vdo1)
>   Operations: force-reload interval=0s timeout=15 (p_vdo1-force-reload-
> interval-0s)
>   monitor interval=15 timeout=15 (p_vdo1-monitor-interval-15)
>   restart interval=0s timeout=15 (p_vdo1-restart-interval-0s)
>   start interval=0s timeout=15 (p_vdo1-start-interval-0s)
>   stop interval=0s timeout=15 (p_vdo1-stop-interval-0s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
>   promote p_drbd0-clone then start p_vdo0 (kind:Mandatory) (id:order-
> p_drbd0-clone-p_vdo0-mandatory)
>   promote p_drbd1-clone then start p_vdo1 (kind:Mandatory) (id:order-
> p_drbd1-clone-p_vdo1-mandatory)
> Colocation Constraints:
>   p_vdo0 with p_drbd0-clone (score:INFINITY) (id:colocation-p_vdo0-
> p_drbd0-clone-INFINITY)
>

Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-17 Thread Eric Robinson
='', last-rc-change='2021-05-17 11:34:25 -07:00', queued=0ms, 
exec=182ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

The vdo devices are available...

[root@ha09a ~]# vdo list
vdo0
vdo1


> -Original Message-
> From: Users  On Behalf Of Eric Robinson
> Sent: Monday, May 17, 2021 1:28 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
>
> Andrei --
>
> Sorry for the novels. Sometimes it is hard to tell whether people want all the
> configs, logs, and scripts first, or if they want a description of the problem
> and what one is trying to accomplish first. I'll send whatever you want. I am
> very eager to get to the bottom of this.
>
> I'll start with my custom LSB RA. I can send the Pacemaker config a bit later.
>
> [root@ha09a init.d]# ll|grep vdo
> lrwxrwxrwx. 1 root root 9 May 16 10:28 vdo0 -> vdo_multi
> lrwxrwxrwx. 1 root root 9 May 16 10:28 vdo1 -> vdo_multi
> -rwx--. 1 root root  3623 May 16 13:21 vdo_multi
>
> [root@ha09a init.d]#  cat vdo_multi
> #!/bin/bash
>
> #--custom script for managing vdo volumes
>
> #--functions
> function isActivated() {
> R=$(/usr/bin/vdo status -n $VOL 2>&1)
> if [ $? -ne 0 ]; then
> #--error occurred checking vdo status
> echo "$VOL: an error occurred checking activation status on
> $MY_HOSTNAME"
> return 1
> fi
> R=$(/usr/bin/vdo status -n $VOL|grep Activate|awk '{$1=$1};1'|cut -d" 
> "
> -f2)
> echo "$R"
> return 0
> }
>
> function isOnline() {
> R=$(/usr/bin/vdo status -n $VOL 2>&1)
> if [ $? -ne 0 ]; then
> #--error occurred checking vdo status
> echo "$VOL: an error occurred checking activation status on
> $MY_HOSTNAME"
> return 1
> fi
> R=$(/usr/bin/vdo status -n $VOL|grep "Index status"|awk
> '{$1=$1};1'|cut -d" " -f3)
> echo "$R"
> return 0
> }
>
> #--vars
> MY_HOSTNAME=$(hostname -s)
>
> #--get the volume name
> VOL=$(basename $0)
>
> #--get the action
> ACTION=$1
>
> #--take the requested action
> case $ACTION in
>
> start)
>
> #--check current status
> R=$(isOnline "$VOL")
> if [ $? -ne 0 ]; then
> echo "error occurred checking $VOL status on 
> $MY_HOSTNAME"
> exit 0
> fi
> if [ "$R"  == "online" ]; then
> echo "running on $MY_HOSTNAME"
> exit 0 #--lsb: success
> fi
>
> #--enter activation loop
> ACTIVATED=no
> TIMER=15
> while [ $TIMER -ge 0 ]; do
> R=$(isActivated "$VOL")
> if [ "$R" == "enabled" ]; then
> ACTIVATED=yes
> break
> fi
> sleep 1
> TIMER=$(( TIMER-1 ))
> done
> if [ "$ACTIVATED" == "no" ]; then
> echo "$VOL: not activated on $MY_HOSTNAME"
> exit 5 #--lsb: not running
> fi
>
> #--enter start loop
> /usr/bin/vdo start -n $VOL
> ONLINE=no
> TIMER=15
> while [ $TIMER -ge 0 ]; do
> R=$(isOnline "$VOL")
> if [ "$R" == "online" ]; then
> ONLINE=yes
> break
> fi
> sleep 1
> TIMER=$(( TIMER-1 ))
> done
> if [ "$ONLINE" == "yes" ]; then
> echo "$VOL: started on $MY_HOSTNAME"
> exit 0 #--lsb: success
> else
> echo "$VOL: not started on $MY_HOSTNAME (unknown
> problem)"
> exit 0 #--lsb: unknown problem
> fi
> ;;
> stop)
>
> #--check current status
> R

Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-17 Thread Eric Robinson
exit 0 #--lsb:success
else
echo "$VOL: failed to stop on $MY_HOSTNAME (unknown 
problem)"
exit 0
fi
;;
status)
R=$(isOnline "$VOL")
if [ $? -ne 0 ]; then
echo "error occurred checking $VOL status on 
$MY_HOSTNAME"
exit 5
fi
if [ "$R"  == "online" ]; then
echo "$VOL started on $MY_HOSTNAME"
    exit 0 #--lsb: success
else
echo "$VOL not started on $MY_HOSTNAME"
exit 3 #--lsb: not running
fi
;;

esac



> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Monday, May 17, 2021 12:49 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
>
> On 17.05.2021 18:18, Eric Robinson wrote:
> > To Strahil and Klaus –
> >
> > I created the vdo devices using default parameters, so ‘auto’ mode was
> selected by default. vdostatus shows that the current mode is async. The
> underlying drbd devices are running protocol C, so I assume that vdo should
> be changed to sync mode?
> >
> > The VDO service is disabled and is solely under the control of Pacemaker,
> but I have been unable to get a resource agent to work reliably. I have two
> nodes. Under normal operation, Node A is primary for disk drbd0, and device
> vdo0 rides on top of that. Node B is primary for disk drbd1 and device vdo1
> rides on top of that. In the event of a node failure, the vdo device and the
> underlying drbd disk should migrate to the other node, and then that node
> will be primary for both drbd disks and both vdo devices.
> >
> The default systemd vdo service does not work because it uses the --all flag
> and starts/stops all vdo devices. I noticed that there is also a vdo-start-by-
> dev.service, but there is no documentation on how to use it. I wrote my own
> vdo-by-dev system service, but that did not work reliably either. Then I
> noticed that there is already an OCF resource agent named vdo-vol, but that
> did not work either. I finally tried writing my own OCF-compliant RA, and
> then I tried writing an LSB-compliant script, but none of those worked very
> well.
> >
>
> You continue to write novels instead of simply showing your resource agent,
> your configuration and logs.
>
> > My big problem is that I don’t understand how Pacemaker uses the
> monitor action. Pacemaker would often fail vdo resources because the
> monitor action received an error when it ran on the standby node. For
> example, when Node A is primary for disk drbd1 and device vdo1, Pacemaker
> would fail device vdo1 because when it ran the monitor action on Node B,
> the RA reported an error. But OF COURSE it would report an error, because
> disk drbd1 is secondary on that node, and is therefore inaccessible to the vdo
> driver. I DON’T UNDERSTAND.
> >
>
> May be your definition of "error" does not match pacemaker definition of
> "error". It is hard to comment without seeing code.
>
> > -Eric
> >
> >
> >
> > From: Strahil Nikolov 
> > Sent: Monday, May 17, 2021 5:09 AM
> > To: kwenn...@redhat.com; Klaus Wenninger ;
> > Cluster Labs - All topics related to open-source clustering welcomed
> > ; Eric Robinson 
> > Subject: Re: [ClusterLabs] DRBD + VDO HowTo?
> >
> > Have you tried to set VDO in async mode ?
> >
> > Best Regards,
> > Strahil Nikolov
> > On Mon, May 17, 2021 at 8:57, Klaus Wenninger
> > mailto:kwenn...@redhat.com>> wrote:
> > Did you try VDO in sync-mode for the case the flush-fua stuff isn't
> > working through the layers?
> > Did you check that VDO-service is disabled and solely under
> > pacemaker-control and that the dependencies are set correctly?
> >
> > Klaus
> >
> > On 5/17/21 6:17 AM, Eric Robinson wrote:
> >
> > Yes, DRBD is working fine.
> >
> >
> >
> > From: Strahil Nikolov
> > <mailto:hunter86...@yahoo.com>
> > Sent: Sunday, May 16, 2021 6:06 PM
> > To: Eric Robinson
> > <mailto:eric.robin...@psmnv.com>; Cluster
> > Labs - All topics related to open-source clustering welcomed
> > <mailto:users@clusterlabs.org>
> > Subject: RE: [ClusterLabs] DRBD + VDO HowTo?
> >
> >
> >
> > Are you sure that the DRBD is working properly ?
> >
> >
> >
> > Best Regards,
> >
> > Strahil Nikolov
> >
> > On 

Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-17 Thread Eric Robinson
To Strahil and Klaus –

I created the vdo devices using default parameters, so ‘auto’ mode was selected 
by default. vdostatus shows that the current mode is async. The underlying drbd 
devices are running protocol C, so I assume that vdo should be changed to sync 
mode?
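
(In case it helps someone later: the policy can be inspected and switched per volume 
with the vdo manager, roughly like this, assuming a volume named vdo0; whether sync is 
really the right choice on top of protocol C is the open question above:)

vdo status -n vdo0 | grep -i "write policy"
vdo changeWritePolicy --name=vdo0 --writePolicy=sync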

The VDO service is disabled and is solely under the control of Pacemaker, but I 
have been unable to get a resource agent to work reliably. I have two nodes. 
Under normal operation, Node A is primary for disk drbd0, and device vdo0 rides 
on top of that. Node B is primary for disk drbd1 and device vdo1 rides on top 
of that. In the event of a node failure, the vdo device and the underlying drbd 
disk should migrate to the other node, and then that node will be primary for 
both drbd disks and both vdo devices.

The default systemd vdo service does not work because it uses the --all flag and 
starts/stops all vdo devices. I noticed that there is also a 
vdo-start-by-dev.service, but there is no documentation on how to use it. I 
wrote my own vdo-by-dev system service, but that did not work reliably either. 
Then I noticed that there is already an OCF resource agent named vdo-vol, but 
that did not work either. I finally tried writing my own OCF-compliant RA, and 
then I tried writing an LSB-compliant script, but none of those worked very 
well.
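
(For reference, the per-volume systemd approach can be sketched roughly like this; 
vdo-vol0.service is a made-up name and this is illustrative only, not the unit that 
was actually tested here:)

cat > /etc/systemd/system/vdo-vol0.service <<'EOF'
[Unit]
Description=VDO volume vdo0 (single volume, cluster-controlled)
# deliberately no [Install] section: the unit stays disabled and is only started by Pacemaker

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/vdo start -n vdo0
ExecStop=/usr/bin/vdo stop -n vdo0
EOF
systemctl daemon-reload
# then hand it to the cluster instead of the stock all-volumes vdo service, e.g.:
# pcs resource create p_vdo0 systemd:vdo-vol0 op monitor interval=30s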

My big problem is that I don’t understand how Pacemaker uses the monitor 
action. Pacemaker would often fail vdo resources because the monitor action 
received an error when it ran on the standby node. For example, when Node A is 
primary for disk drbd1 and device vdo1, Pacemaker would fail device vdo1 
because when it ran the monitor action on Node B, the RA reported an error. But 
OF COURSE it would report an error, because disk drbd1 is secondary on that 
node, and is therefore inaccessible to the vdo driver. I DON’T UNDERSTAND.

-Eric



From: Strahil Nikolov 
Sent: Monday, May 17, 2021 5:09 AM
To: kwenn...@redhat.com; Klaus Wenninger ; Cluster Labs - 
All topics related to open-source clustering welcomed ; 
Eric Robinson 
Subject: Re: [ClusterLabs] DRBD + VDO HowTo?

Have you tried to set VDO in async mode ?

Best Regards,
Strahil Nikolov
On Mon, May 17, 2021 at 8:57, Klaus Wenninger
mailto:kwenn...@redhat.com>> wrote:
Did you try VDO in sync-mode for the case the flush-fua
stuff isn't working through the layers?
Did you check that VDO-service is disabled and solely under
pacemaker-control and that the dependencies are set correctly?

Klaus

On 5/17/21 6:17 AM, Eric Robinson wrote:

Yes, DRBD is working fine.



From: Strahil Nikolov <mailto:hunter86...@yahoo.com>
Sent: Sunday, May 16, 2021 6:06 PM
To: Eric Robinson <mailto:eric.robin...@psmnv.com>; 
Cluster Labs - All topics related to open-source clustering welcomed 
<mailto:users@clusterlabs.org>
Subject: RE: [ClusterLabs] DRBD + VDO HowTo?



Are you sure that the DRBD is working properly ?



Best Regards,

Strahil Nikolov

On Mon, May 17, 2021 at 0:32, Eric Robinson

mailto:eric.robin...@psmnv.com>> wrote:

Okay, it turns out I was wrong. I thought I had it working, but I keep running 
into problems. Sometimes when I demote a DRBD resource on Node A and promote it 
on Node B, and I try to mount the filesystem, the system complains that it 
cannot read the superblock. But when I move the DRBD primary back to Node A, 
the file system is mountable again. Also, I have problems with filesystems not 
mounting because the vdo devices are not present. All kinds of issues.





From: Users 
mailto:users-boun...@clusterlabs.org>> On Behalf 
Of Eric Robinson
Sent: Friday, May 14, 2021 3:55 PM
To: Strahil Nikolov mailto:hunter86...@yahoo.com>>; 
Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>
Subject: Re: [ClusterLabs] DRBD + VDO HowTo?





Okay, I have it working now. The default systemd service definitions did not 
work, so I created my own.





From: Strahil Nikolov mailto:hunter86...@yahoo.com>>
Sent: Friday, May 14, 2021 3:41 AM
To: Eric Robinson mailto:eric.robin...@psmnv.com>>; 
Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>
Subject: RE: [ClusterLabs] DRBD + VDO HowTo?



There is no VDO RA according to my knowledge, but you can use systemd service 
as a resource.



Yet, the VDO service that comes with the OS is a generic one and controls all 
VDOs - so you need to create your own vdo service.



Best Regards,

Strahil Nikolov

On Fri, May 14, 2021 at 6:55, Eric Robinson

mailto:eric.robin...@psmnv.com>> wrote:

I created the VDO volumes fine on the drbd devices, formatted them as xfs 
filesystems, created cluster filesystem resources, and the cluster is using 
them. But the cluster won’t fail over. Is there a VDO cluster RA out there 
somewhere already?





From: Strahil Nikolov mailto:hunter86...@yahoo.com>>
Sent: Thursday, May 13, 2021 10:

Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-16 Thread Eric Robinson
Yes, DRBD is working fine.

From: Strahil Nikolov 
Sent: Sunday, May 16, 2021 6:06 PM
To: Eric Robinson ; Cluster Labs - All topics related 
to open-source clustering welcomed 
Subject: RE: [ClusterLabs] DRBD + VDO HowTo?

Are you sure that the DRBD is working properly ?

Best Regards,
Strahil Nikolov
On Mon, May 17, 2021 at 0:32, Eric Robinson
mailto:eric.robin...@psmnv.com>> wrote:

Okay, it turns out I was wrong. I thought I had it working, but I keep running 
into problems. Sometimes when I demote a DRBD resource on Node A and promote it 
on Node B, and I try to mount the filesystem, the system complains that it 
cannot read the superblock. But when I move the DRBD primary back to Node A, 
the file system is mountable again. Also, I have problems with filesystems not 
mounting because the vdo devices are not present. All kinds of issues.





From: Users 
mailto:users-boun...@clusterlabs.org>> On Behalf 
Of Eric Robinson
Sent: Friday, May 14, 2021 3:55 PM
To: Strahil Nikolov mailto:hunter86...@yahoo.com>>; 
Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>
Subject: Re: [ClusterLabs] DRBD + VDO HowTo?





Okay, I have it working now. The default systemd service definitions did not 
work, so I created my own.





From: Strahil Nikolov mailto:hunter86...@yahoo.com>>
Sent: Friday, May 14, 2021 3:41 AM
To: Eric Robinson mailto:eric.robin...@psmnv.com>>; 
Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>
Subject: RE: [ClusterLabs] DRBD + VDO HowTo?



There is no VDO RA according to my knowledge, but you can use systemd service 
as a resource.



Yet, the VDO service that comes with the OS is a generic one and controls all 
VDOs - so you need to create your own vdo service.



Best Regards,

Strahil Nikolov

On Fri, May 14, 2021 at 6:55, Eric Robinson

mailto:eric.robin...@psmnv.com>> wrote:

I created the VDO volumes fine on the drbd devices, formatted them as xfs 
filesystems, created cluster filesystem resources, and the cluster is using 
them. But the cluster won’t fail over. Is there a VDO cluster RA out there 
somewhere already?





From: Strahil Nikolov mailto:hunter86...@yahoo.com>>
Sent: Thursday, May 13, 2021 10:07 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>; Eric Robinson 
mailto:eric.robin...@psmnv.com>>
Subject: Re: [ClusterLabs] DRBD + VDO HowTo?



For DRBD there is enough info, so let's focus on VDO.

There is a systemd service that starts all VDOs on the system. You can create 
the VDO once drbd is open for writes and then you can create your own systemd 
'.service' file which can be used as a cluster resource.


Best Regards,

Strahil Nikolov



On Fri, May 14, 2021 at 2:33, Eric Robinson

mailto:eric.robin...@psmnv.com>> wrote:

Can anyone point to a document on how to use VDO de-duplication with DRBD? 
Linbit has a blog page about it, but it was last updated 6 years ago and the 
embedded links are dead.



https://linbit.com/blog/albireo-virtual-data-optimizer-vdo-on-drbd/



-Eric









Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-16 Thread Eric Robinson
Okay, it turns out I was wrong. I thought I had it working, but I keep running 
into problems. Sometimes when I demote a DRBD resource on Node A and promote it 
on Node B, and I try to mount the filesystem, the system complains that it 
cannot read the superblock. But when I move the DRBD primary back to Node A, 
the file system is mountable again. Also, I have problems with filesystems not 
mounting because the vdo devices are not present. All kinds of issues.
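
(A few checks worth running on the new Primary before attempting the mount, as a 
sketch; the resource and device names are assumed from earlier in the thread:)

drbdadm status ha01_mysql     # expect role:Primary and disk:UpToDate locally
vdo list                      # is vdo0 actually started on this node?
ls -l /dev/mapper/vdo0        # does the mapped device exist yet?
blkid /dev/mapper/vdo0        # can the xfs signature be read through VDO?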


From: Users  On Behalf Of Eric Robinson
Sent: Friday, May 14, 2021 3:55 PM
To: Strahil Nikolov ; Cluster Labs - All topics related 
to open-source clustering welcomed 
Subject: Re: [ClusterLabs] DRBD + VDO HowTo?


Okay, I have it working now. The default systemd service definitions did not 
work, so I created my own.


From: Strahil Nikolov mailto:hunter86...@yahoo.com>>
Sent: Friday, May 14, 2021 3:41 AM
To: Eric Robinson mailto:eric.robin...@psmnv.com>>; 
Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>
Subject: RE: [ClusterLabs] DRBD + VDO HowTo?

There is no VDO RA according to my knowledge, but you can use systemd service 
as a resource.

Yet, the VDO service that comes with the OS is a generic one and controls all 
VDOs - so you need to create your own vdo service.

Best Regards,
Strahil Nikolov
On Fri, May 14, 2021 at 6:55, Eric Robinson
mailto:eric.robin...@psmnv.com>> wrote:

I created the VDO volumes fine on the drbd devices, formatted them as xfs 
filesystems, created cluster filesystem resources, and the cluster is using 
them. But the cluster won’t fail over. Is there a VDO cluster RA out there 
somewhere already?





From: Strahil Nikolov mailto:hunter86...@yahoo.com>>
Sent: Thursday, May 13, 2021 10:07 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>; Eric Robinson 
mailto:eric.robin...@psmnv.com>>
Subject: Re: [ClusterLabs] DRBD + VDO HowTo?



For DRBD there is enough info, so let's focus on VDO.

There is a systemd service that starts all VDOs on the system. You can create 
the VDO once drbd is open for writes and then you can create your own systemd 
'.service' file which can be used as a cluster resource.


Best Regards,

Strahil Nikolov



On Fri, May 14, 2021 at 2:33, Eric Robinson

mailto:eric.robin...@psmnv.com>> wrote:

Can anyone point to a document on how to use VDO de-duplication with DRBD? 
Linbit has a blog page about it, but it was last updated 6 years ago and the 
embedded links are dead.



https://linbit.com/blog/albireo-virtual-data-optimizer-vdo-on-drbd/



-Eric









Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-14 Thread Eric Robinson

Okay, I have it working now. The default systemd service definitions did not 
work, so I created my own.


From: Strahil Nikolov 
Sent: Friday, May 14, 2021 3:41 AM
To: Eric Robinson ; Cluster Labs - All topics related 
to open-source clustering welcomed 
Subject: RE: [ClusterLabs] DRBD + VDO HowTo?

There is no VDO RA according to my knowledge, but you can use systemd service 
as a resource.

Yet, the VDO service that comes with the OS is a generic one and controls all 
VDOs - so you need to create your own vdo service.

Best Regards,
Strahil Nikolov
On Fri, May 14, 2021 at 6:55, Eric Robinson
mailto:eric.robin...@psmnv.com>> wrote:

I created the VDO volumes fine on the drbd devices, formatted them as xfs 
filesystems, created cluster filesystem resources, and the cluster is using 
them. But the cluster won’t fail over. Is there a VDO cluster RA out there 
somewhere already?





From: Strahil Nikolov mailto:hunter86...@yahoo.com>>
Sent: Thursday, May 13, 2021 10:07 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>; Eric Robinson 
mailto:eric.robin...@psmnv.com>>
Subject: Re: [ClusterLabs] DRBD + VDO HowTo?



For DRBD there is enough info, so let's focus on VDO.

There is a systemd service that starts all VDOs on the system. You can create 
the VDO once drbd is open for writes and then you can create your own systemd 
'.service' file which can be used as a cluster resource.


Best Regards,

Strahil Nikolov



On Fri, May 14, 2021 at 2:33, Eric Robinson

mailto:eric.robin...@psmnv.com>> wrote:

Can anyone point to a document on how to use VDO de-duplication with DRBD? 
Linbit has a blog page about it, but it was last updated 6 years ago and the 
embedded links are dead.



https://linbit.com/blog/albireo-virtual-data-optimizer-vdo-on-drbd/



-Eric









Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DRBD + VDO HowTo?

2021-05-13 Thread Eric Robinson
I created the VDO volumes fine on the drbd devices, formatted them as xfs 
filesystems, created cluster filesystem resources, and the cluster is using 
them. But the cluster won’t fail over. Is there a VDO cluster RA out there 
somewhere already?


From: Strahil Nikolov 
Sent: Thursday, May 13, 2021 10:07 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] DRBD + VDO HowTo?

For DRBD there is enough info, so let's focus on VDO.
There is a systemd service that starts all VDOs on the system. You can create 
the VDO once drbd is open for writes and then you can create your own systemd 
'.service' file which can be used as a cluster resource.


Best Regards,
Strahil Nikolov

On Fri, May 14, 2021 at 2:33, Eric Robinson
mailto:eric.robin...@psmnv.com>> wrote:

Can anyone point to a document on how to use VDO de-duplication with DRBD? 
Linbit has a blog page about it, but it was last updated 6 years ago and the 
embedded links are dead.



https://linbit.com/blog/albireo-virtual-data-optimizer-vdo-on-drbd/



-Eric










[ClusterLabs] DRBD + VDO HowTo?

2021-05-13 Thread Eric Robinson
Can anyone point to a document on how to use VDO de-duplication with DRBD? 
Linbit has a blog page about it, but it was last updated 6 years ago and the 
embedded links are dead.

https://linbit.com/blog/albireo-virtual-data-optimizer-vdo-on-drbd/

-Eric






Re: [ClusterLabs] Location of High Availability Repositories?

2021-05-12 Thread Eric Robinson
Never mind, please forgive the list noise. The command...

yum config-manager --set-enabled ha


...is all I needed, except for policycoreutils-python, which still does not 
become available.
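
(For anyone else landing here: on the EL8 rebuilds the old policycoreutils-python 
package was renamed, so the Clusters from Scratch line probably translates to 
something like the following -- repo id "ha" as above, package name being the 
EL8-era replacement:)

# dnf config-manager --set-enabled ha
# dnf install -y pacemaker pcs psmisc policycoreutils-python-utils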

From: Users  On Behalf Of Eric Robinson
Sent: Wednesday, May 12, 2021 6:07 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: [ClusterLabs] Location of High Availability Repositories?

Like many people, when CentOS 8 died I transitioned to AlmaLinux. The only 
problem is, yum and dnf do not find or install pacemaker or its related 
components. The command

# yum install -y pacemaker pcs psmisc policycoreutils-python

...specified in the Clusters from Scratch document is a fountain of 
disappointment. It does nothing good.

Where are the cluster packages, pacemaker, corosync, etc?

-Eric





[ClusterLabs] Location of High Availability Repositories?

2021-05-12 Thread Eric Robinson
Like many people, when CentOS 8 died I transitioned to AlmaLinux. The only 
problem is, yum and dnf do not find or install pacemaker or its related 
components. The command

# yum install -y pacemaker pcs psmisc policycoreutils-python

...specified in the Clusters from Scratch document is a fountain of 
disappointment. It does nothing good.

Where are the cluster packages, pacemaker, corosync, etc?

-Eric





Re: [ClusterLabs] Antw: [EXT] Custom RA for Multi-Tenant MySQL?

2021-04-12 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Ulrich Windl
> Sent: Monday, April 12, 2021 2:37 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: [EXT] Custom RA for Multi-Tenant MySQL?
>
> >>> Eric Robinson  schrieb am 11.04.2021 um
> >>> 19:07 in
> Nachricht
>  3.prod.outlook.com>
>
> > We're writing a custom RA for a multi‑tenant MySQL cluster that runs
> > in active/standby mode. I've read the RA documentation about what exit
> > codes should be returned for various outcomes, but something is still
> > unclear to me.
> >
> > We run multiple instances of MySQL from one filesystem, like this:
> >
> > /app_root
> > /mysql1
> > /mysql2
> > /mysql3
> > ...etc.
> >
> > The /app_root filesystem lives on a DRBD volume, which is only mounted
> > on the active node.
> >
> > When the RA performs a "start," "stop," or "monitor" action on the
> > standby node, the filesystem is not mounted so the mysql instances are
> not present.
>
> > What should the return  codes for those actions be? Fail? Not installed?
> > Unknown error?
>
> If the DB needs the FS, and the FS is not there, the DB cannot be running, so
> monitor would report "stopped", stop would report success, and start would
> report failure.
>

Under what conditions would you want to report "Unknown Error?"
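
(For reference, the exit codes in play here are fixed by the OCF resource agent 
API; the relevant ones, as defined in ocf-shellfuncs, are:)

OCF_SUCCESS=0         # running (for monitor), or the action succeeded
OCF_ERR_GENERIC=1     # the generic "unknown error" -- a soft failure, recovered/retried
OCF_ERR_INSTALLED=5   # hard error: required software missing -- bans the resource from the node
OCF_NOT_RUNNING=7     # resource is cleanly not running on this node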



Re: [ClusterLabs] Custom RA for Multi-Tenant MySQL?

2021-04-11 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Sunday, April 11, 2021 1:20 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Custom RA for Multi-Tenant MySQL?
>
> On 11.04.2021 20:07, Eric Robinson wrote:
> > We're writing a custom RA for a multi-tenant MySQL cluster that runs in
> active/standby mode. I've read the RA documentation about what exit codes
> should be returned for various outcomes, but something is still unclear to
> me.
> >
> > We run multiple instances of MySQL from one filesystem, like this:
> >
> > /app_root
> > /mysql1
> > /mysql2
> > /mysql3
> > ...etc.
> >
> > The /app_root filesystem lives on a DRBD volume, which is only mounted
> on the active node.
> >
> > When the RA performs a "start," "stop," or "monitor" action on the standby
> node, the filesystem is not mounted so the mysql instances are not present.
>
> You are not supposed to do it in the first place. You are supposed to have
> ordering constraint that starts MySQL instances after filesystem is available.
>

That is what we have. The colocation constraints require mysql -> filesystem -> 
drbd master. The ordering constraints promote drbd, then start the filesystem, 
then start mysql.
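
(In pcs terms that arrangement is roughly the following, with placeholder 
resource names rather than the real ones from this cluster:)

# pcs constraint colocation add p_fs_approot with master ms_drbd0 INFINITY
# pcs constraint colocation add p_mysql_001 with p_fs_approot INFINITY
# pcs constraint order promote ms_drbd0 then start p_fs_approot
# pcs constraint order start p_fs_approot then start p_mysql_001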

> > What should the return  codes for those actions be? Fail? Not installed?
> Unknown error?
> >
>
> I believe that "not installed" is considered hard error and bans resource from
> this node. As missing filesystem is probably transient it does not look
> appropriate. There is no "fail" return code.
>
> In any case return code depends on action. For monitor you obviously are
> expected to return "not running" in this case. "stop" should probably return
> success (after all, instance is not running, right?) And "start"
> should return error indication, but I am not sure what is better - generic
> error or not running.
>

That's a big part of my question. I'm just trying to avoid a condition where 
the mysql resource is running on node A, and Pacemaker thinks there is a 
"problem" with it on Node B.

-Eric




[ClusterLabs] Custom RA for Multi-Tenant MySQL?

2021-04-11 Thread Eric Robinson
We're writing a custom RA for a multi-tenant MySQL cluster that runs in 
active/standby mode. I've read the RA documentation about what exit codes 
should be returned for various outcomes, but something is still unclear to me.

We run multiple instances of MySQL from one filesystem, like this:

/app_root
/mysql1
/mysql2
/mysql3
...etc.

The /app_root filesystem lives on a DRBD volume, which is only mounted on the 
active node.

When the RA performs a "start," "stop," or "monitor" action on the standby 
node, the filesystem is not mounted so the mysql instances are not present. 
What should the return  codes for those actions be? Fail? Not installed? 
Unknown error?




Re: [ClusterLabs] Antw: RE: Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway

2021-03-03 Thread Eric Robinson

> -Original Message-
> From: Users  On Behalf Of Ulrich Windl
> Sent: Wednesday, March 3, 2021 12:57 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: RE: Antw: [EXT] Re: "Error: unable to fence
> '001db02a'" but It got fenced anyway
>
> >>> Eric Robinson  schrieb am 02.03.2021 um
> >>> 19:26 in
> Nachricht
>  3.prod.outlook.com>
>
> >>  -Original Message-
> >> From: Users  On Behalf Of Digimer
> >> Sent: Monday, March 1, 2021 11:02 AM
> >> To: Cluster Labs - All topics related to open-source clustering
> >> welcomed ; Ulrich Windl
> >> 
> >> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence
> > '001db02a'"
> ...
> >> >> Cloud fencing usually requires a higher timeout (20s reported here).
> >> >>
> >> >> Microsoft seems to suggest the following setup:
> >> >>
> >> >> # pcs property set stonith‑timeout=900
> >> >
> >> > But doesn't that mean the other node waits 15 minutes after stonith
> >> > until it performs the first post-stonith action?
> >>
> >> No, it means that if there is no reply by then, the fence has failed.
> >> If
> the
> >> fence happens sooner, and the caller is told this, recovery begins
> >> very
> > shortly
> >> after.
>
> How would the fencing be confirmed? I don't know.
>
>
> >>
> >
> > Interesting. Since users often report application failure within 1-3
> > minutes
>
> > and many engineers begin investigating immediately, a technician could
> > end up
>
> > connecting to a cluster node after the stonith command was called, and
> > could
>
> > conceivably bring a failed node back up manually, only to have Azure
> > finally get around to shooting it in the head. I don't suppose there's
> > a way to abort/cancel a STONITH operation that is in progress?
>
> I think you have to decide: Let the cluster handle the problem, or let the
> admin handle the problem, but preferrably not both.
> I also think you cannot cancel a STONITH; you can only confirm it.
>
> Regards,
> Ulrich
>

Standing by and letting the cluster handle the problem is a hard pill to 
swallow when a technician could resolve things and bring services back up 
sooner, but I get your point.
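
(For what it's worth, if a technician has verified out-of-band that the node 
really is powered off, the pending fence can be acknowledged manually instead 
of waited out, with something like:

# pcs stonith confirm 001db02a

-- but only when you are certain the node is down, since it tells the cluster 
the fence succeeded.)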



Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

2021-03-02 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Jan Friesse
> Sent: Monday, March 1, 2021 3:27 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Andrei Borzenkov 
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
>
> > On 27.02.2021 22:12, Andrei Borzenkov wrote:
> >> On 27.02.2021 17:08, Eric Robinson wrote:
> >>>
> >>> I agree, one node is expected to go out of quorum. Still the question is,
> why didn't 001db01b take over the services? I just remembered that
> 001db01b has services running on it, and those services did not stop, so it
> seems that 001db01b did not lose quorum. So why didn't it take over the
> services that were running on 001db01a?
> >>
> >> That I cannot answer. I cannot reproduce it using similar configuration.
> >
> > Hmm ... actually I can.
> >
> > Two nodes ha1 and ha2 + qdevice. I blocked all communication *from*
> > ha1 (to be precise - all packets with ha1 source MAC are dropped).
> > This happened around 10:43:45. Now look:
> >
> > ha1 immediately stops all services:
> >
> > Feb 28 10:43:44 ha1 corosync[3692]:   [TOTEM ] A processor failed,
> > forming new configuration.
> > Feb 28 10:43:47 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device
> > Qdevice poll (but maximum for 3 ms)
> > Feb 28 10:43:47 ha1 corosync[3692]:   [TOTEM ] A new membership
> > (192.168.1.1:2944) was formed. Members left: 2
> > Feb 28 10:43:47 ha1 corosync[3692]:   [TOTEM ] Failed to receive the
> > leave message. failed: 2
> > Feb 28 10:43:47 ha1 corosync[3692]:   [CPG   ] downlist left_list: 1
> > received
> > Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Node ha2 state is
> > now lost Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Removing
> > all ha2 attributes for peer loss Feb 28 10:43:47 ha1
> > pacemaker-attrd[3703]:  notice: Purged 1 peer with
> > id=2 and/or uname=ha2 from the membership cache Feb 28 10:43:47 ha1
> > pacemaker-based[3700]:  notice: Node ha2 state is now lost Feb 28
> > 10:43:47 ha1 pacemaker-based[3700]:  notice: Purged 1 peer with
> > id=2 and/or uname=ha2 from the membership cache Feb 28 10:43:47 ha1
> > pacemaker-controld[3705]:  warning: Stonith/shutdown of node ha2 was
> > not expected Feb 28 10:43:47 ha1 pacemaker-controld[3705]:  notice:
> > State transition S_IDLE -> S_POLICY_ENGINE Feb 28 10:43:47 ha1
> > pacemaker-fenced[3701]:  notice: Node ha2 state is now lost Feb 28
> > 10:43:47 ha1 pacemaker-fenced[3701]:  notice: Purged 1 peer with
> > id=2 and/or uname=ha2 from the membership cache
> > Feb 28 10:43:48 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device
> > Qdevice poll (but maximum for 3 ms)
> > Feb 28 10:43:48 ha1 corosync[3692]:   [TOTEM ] A new membership
> > (192.168.1.1:2948) was formed. Members
> > Feb 28 10:43:48 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0
> > received
> > Feb 28 10:43:50 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device
> > Qdevice poll (but maximum for 3 ms)
> > Feb 28 10:43:50 ha1 corosync[3692]:   [TOTEM ] A new membership
> > (192.168.1.1:2952) was formed. Members
> > Feb 28 10:43:50 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0
> > received
> > Feb 28 10:43:51 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device
> > Qdevice poll (but maximum for 3 ms)
> > Feb 28 10:43:51 ha1 corosync[3692]:   [TOTEM ] A new membership
> > (192.168.1.1:2956) was formed. Members
> > Feb 28 10:43:51 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0
> > received
> > Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Server didn't send echo
> > reply message on time Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Feb
> > 28 10:43:56 error Server didn't send echo reply message on time
> > Feb 28 10:43:56 ha1 corosync[3692]:   [QUORUM] This node is within the
> > non-primary component and will NOT provide any services.
> > Feb 28 10:43:56 ha1 corosync[3692]:   [QUORUM] Members[1]: 1
> > Feb 28 10:43:56 ha1 corosync[3692]:   [MAIN  ] Completed service
> > synchronization, ready to provide service.
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  warning: Quorum lost
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  notice: Node ha2 state
> > is now lost Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  warning:
> > Stonith/shutdown of node ha2 was not expected Feb 28 10:43:56 ha1
> > pacemaker-controld[3705]:  notice: Updating quorum status to false
> > (call=274) Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  warning:
> > Fencin

Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway

2021-03-02 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Digimer
> Sent: Monday, March 1, 2021 11:02 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Ulrich Windl 
> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'"
> but It got fenced anyway
>
> On 2021-03-01 2:50 a.m., Ulrich Windl wrote:
> >>>> Valentin Vidic  schrieb am
> >>>> 28.02.2021 um
> > 16:59
> > in Nachricht <20210228155921.gm29...@valentin-vidic.from.hr>:
> >> On Sun, Feb 28, 2021 at 03:34:20PM +, Eric Robinson wrote:
> >>> 001db02b rebooted. After it came back up, I tried it in the other
> > direction.
> >>>
> >>> On node 001db02b, the command...
> >>>
> >>> # pcs stonith fence 001db02a
> >>>
> >>> ...produced output...
> >>>
> >>> Error: unable to fence '001db02a'.
> >>>
> >>> However, node 001db02a did get restarted!
> >>>
> >>> We also saw this error...
> >>>
> >>> Failed Actions:
> >>> * stonith‑001db02ab_start_0 on 001db02a 'unknown error' (1):
> >>> call=70,
> >> status=Timed Out, exitreason='',
> >>> last‑rc‑change='Sun Feb 28 10:11:10 2021', queued=0ms,
> >>> exec=20014ms
> >>>
> >>> When that happens, does Pacemaker take over the other node's
> >>> resources, or
> >
> >> not?
> >>
> >> Cloud fencing usually requires a higher timeout (20s reported here).
> >>
> >> Microsoft seems to suggest the following setup:
> >>
> >> # pcs property set stonith‑timeout=900
> >
> > But doesn't that mean the other node waits 15 minutes after stonith
> > until it performs the first post-stonith action?
>
> No, it means that if there is no reply by then, the fence has failed. If the
> fence happens sooner, and the caller is told this, recovery begins very 
> shortly
> after.
>

Interesting. Since users often report application failure within 1-3 minutes 
and many engineers begin investigating immediately, a technician could end up 
connecting to a cluster node after the stonith command was called, and could 
conceivably bring a failed node back up manually, only to have Azure finally get 
around to shooting it in the head. I don't suppose there's a way to 
abort/cancel a STONITH operation that is in progress?

> >> # pcs stonith create rsc_st_azure fence_azure_arm username="login ID"
> >>   password="password" resourceGroup="resource group"
> tenantId="tenant ID"
> >>   subscriptionId="subscription id"
> >>
> >
> pcmk_host_map="prod‑cl1‑0:prod‑cl1‑0‑vm‑name;prod‑cl1‑1:prod‑cl1‑1‑vm‑
> name"
> >>   power_timeout=240 pcmk_reboot_timeout=900
> pcmk_monitor_timeout=120
> >>   pcmk_monitor_retries=4 pcmk_action_limit=3
> >>   op monitor interval=3600
> >>
> >>
> > https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/high-availability-guide-rhel-pacemaker


Re: [ClusterLabs] "Error: unable to fence '001db02a'" but It got fenced anyway

2021-02-28 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Sunday, February 28, 2021 9:59 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] "Error: unable to fence '001db02a'" but It got
> fenced anyway
>
> On Sun, Feb 28, 2021 at 03:34:20PM +, Eric Robinson wrote:
> > 001db02b rebooted. After it came back up, I tried it in the other direction.
> >
> > On node 001db02b, the command...
> >
> > # pcs stonith fence 001db02a
> >
> > ...produced output...
> >
> > Error: unable to fence '001db02a'.
> >
> > However, node 001db02a did get restarted!
> >
> > We also saw this error...
> >
> > Failed Actions:
> > * stonith-001db02ab_start_0 on 001db02a 'unknown error' (1): call=70,
> status=Timed Out, exitreason='',
> > last-rc-change='Sun Feb 28 10:11:10 2021', queued=0ms, exec=20014ms
> >
> > When that happens, does Pacemaker take over the other node's
> resources, or not?
>
> Cloud fencing usually requires a higher timeout (20s reported here).
>
> Microsoft seems to suggest the following setup:
>
> # pcs property set stonith-timeout=900
> # pcs stonith create rsc_st_azure fence_azure_arm username="login ID"
>   password="password" resourceGroup="resource group" tenantId="tenant
> ID"
>   subscriptionId="subscription id"
>   pcmk_host_map="prod-cl1-0:prod-cl1-0-vm-name;prod-cl1-1:prod-cl1-1-
> vm-name"
>   power_timeout=240 pcmk_reboot_timeout=900
> pcmk_monitor_timeout=120
>   pcmk_monitor_retries=4 pcmk_action_limit=3
>   op monitor interval=3600
>

I made the changes and tried again. Fencing took about 3.5 minutes and did not 
throw an error, which raises the question: what happens if fencing takes more 
than 900 seconds? Will Pacemaker on the survivor node refuse to start services 
if it cannot confirm that the other node has been shot?

-Eric


[ClusterLabs] "Error: unable to fence '001db02a'" but It got fenced anyway

2021-02-28 Thread Eric Robinson
I just configured STONITH in Azure for the first time. My initial test went 
fine.

On node 001db02a, the command...

# pcs stonith fence 001db02b

...produced output...

001db02b fenced.

001db02b rebooted. After it came back up, I tried it in the other direction.

On node 001db02b, the command...

# pcs stonith fence 001db02a

...produced output...

Error: unable to fence '001db02a'.

However, node 001db02a did get restarted!

We also saw this error...

Failed Actions:
* stonith-001db02ab_start_0 on 001db02a 'unknown error' (1): call=70, 
status=Timed Out, exitreason='',
last-rc-change='Sun Feb 28 10:11:10 2021', queued=0ms, exec=20014ms

When that happens, does Pacemaker take over the other node's resources, or not?




Re: [ClusterLabs] Filesystem Resource Move Fails Because Underlying DRBD Resource Won't Move

2021-02-28 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Sunday, February 28, 2021 8:02 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Filesystem Resource Move Fails Because
> Underlying DRBD Resource Won't Move
>
> On Sun, Feb 28, 2021 at 12:45:55PM +, Eric Robinson wrote:
> > Colocation Constraints:
> >   p_fs_clust03 with ms_drbd0 (score:INFINITY) (id:colocation-p_fs_clust03-
> ms_drbd0-INFINITY)
> >   p_fs_clust04 with ms_drbd1 (score:INFINITY)
> > (id:colocation-p_fs_clust04-ms_drbd1-INFINITY)
>
> This coloaction probably needs to be with role=Master, otherwise drbd Slave
> can satisfy it, but the mount will fail.
>

Valentin,

You're 100% on target. That fixed it. Thanks for coming to my aid!
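
For anyone else who hits this, the change amounts to roughly the following 
(constraint IDs taken from the config shown earlier; exact syntax varies a 
little between pcs versions):

# pcs constraint remove colocation-p_fs_clust03-ms_drbd0-INFINITY
# pcs constraint remove colocation-p_fs_clust04-ms_drbd1-INFINITY
# pcs constraint colocation add p_fs_clust03 with master ms_drbd0 INFINITY
# pcs constraint colocation add p_fs_clust04 with master ms_drbd1 INFINITY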

-Eric


Re: [ClusterLabs] Filesystem Resource Move Fails Because Underlying DRBD Resource Won't Move

2021-02-28 Thread Eric Robinson
We see in the log on 001db02a...

Feb 28 07:33:50 [61707] 001db02a.ccnva.local    pengine:     info: master_color:   ms_drbd1: Promoted 1 instances of a possible 1 to master

...and then...

Feb 28 07:33:50 [61707] 001db02a.ccnva.local    pengine:   notice: LogAction:    * Move       p_fs_clust04   ( 001db02b -> 001db02a )

...and then...

Feb 28 07:34:03 [61708] 001db02a.ccnva.local       crmd:   notice: te_rsc_command:  Initiating stop operation p_fs_clust04_stop_0 on 001db02b | action 69

...and then...

Feb 28 07:34:04 [61708] 001db02a.ccnva.local       crmd:     info: match_graph_event:   Action p_fs_clust04_stop_0 (69) confirmed on 001db02b (rc=0)

Feb 28 07:34:04 [61708] 001db02a.ccnva.local       crmd:   notice: te_rsc_command:  Initiating start operation p_fs_clust04_start_0 locally on 001db02a | action 70

...and finally...

Feb 28 07:34:04  Filesystem(p_fs_clust04)[15357]:    INFO: Running start for /dev/drbd1 on /ha02_mysql

Feb 28 07:34:09  Filesystem(p_fs_clust04)[15357]:    ERROR: Couldn't mount filesystem /dev/drbd1 on /ha02_mysql

Resource ms_drbd1 is not becoming master, so the filesystem won't mount.

Am I reading that right?

-Eric


From: Users  On Behalf Of Eric Robinson
Sent: Sunday, February 28, 2021 6:56 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Filesystem Resource Move Fails Because Underlying 
DRBD Resource Won't Move

Oops, sorry, here are links to the text logs.

Node 001db02a: https://www.dropbox.com/s/ymbatz91x3y84wp/001db02a_log.txt?dl=0

Node 001db02b: https://www.dropbox.com/s/etq6mn460imdega/001db02b_log.txt?dl=0

-Eric


From: Users <users-boun...@clusterlabs.org> On Behalf Of Eric Robinson
Sent: Sunday, February 28, 2021 6:46 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: [ClusterLabs] Filesystem Resource Move Fails Because Underlying DRBD 
Resource Won't Move


Beginning with this cluster status...

Cluster name: 001db02ab
Stack: corosync
Current DC: 001db02a (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with 
quorum
Last updated: Sun Feb 28 07:24:31 2021
Last change: Sun Feb 28 07:19:51 2021 by hacluster via crmd on 001db02a

2 nodes configured
14 resources configured

Online: [ 001db02a 001db02b ]

Full list of resources:

Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db02a ]
 Slaves: [ 001db02b ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db02b ]
 Slaves: [ 001db02a ]
p_fs_clust03   (ocf::heartbeat:Filesystem):Started 001db02a
p_fs_clust04   (ocf::heartbeat:Filesystem):Started 001db02b
p_mysql_009(lsb:mysql_009):Started 001db02a
p_mysql_010(lsb:mysql_010):Started 001db02a
p_mysql_011(lsb:mysql_011):Started 001db02a
p_mysql_012(lsb:mysql_012):Started 001db02a
p_mysql_014(lsb:mysql_014):Started 001db02b
p_mysql_015(lsb:mysql_015):Started 001db02b
p_mysql_016(lsb:mysql_016):Started 001db02b
stonith-001db02ab  (stonith:fence_azure_arm):  Started 001db02a

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

...and with these constraints...

Location Constraints:
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust03 (kind:Mandatory) 
(id:order-ms_drbd0-p_fs_clust03-mandatory)
  promote ms_drbd1 then start p_fs_clust04 (kind:Mandatory) 
(id:order-ms_drbd1-p_fs_clust04-mandatory)
  start p_fs_clust03 then start p_mysql_009 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_009-mandatory)
  start p_fs_clust03 then start p_mysql_010 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_010-mandatory)
  start p_fs_clust03 then start p_mysql_011 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_011-mandatory)
  start p_fs_clust03 then start p_mysql_012 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_012-mandatory)
  start p_fs_clust04 then start p_mysql_014 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_014-mandatory)
  start p_fs_clust04 then start p_mysql_015 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_015-mandatory)
  start p_fs_clust04 then start p_mysql_016 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_016-mandatory)
Colocation Constraints:
  p_fs_clust03 with ms_drbd0 (score:INFINITY) 
(id:colocation-p_fs_clust03-ms_drbd0-INFINITY)
  p_fs_clust04 with ms_drbd1 (score:INFINITY) 
(id:colocation-p_fs_clust04-ms_drbd1-INFINITY)
  p_mysql_009 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_009-p_fs_clust03-INFINITY)
  p_mysql_010 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_010-p_fs_clust03-INFINITY)
  p_mysql_011 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_011-p_fs_clust03-INFINITY)
  p_mysql_012 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_012-p_fs_clust03-INFINITY)
  p_mysql_014 with p_fs_clust04 (score:INFINITY) 
(id:colocation-p_mysql_0

Re: [ClusterLabs] Filesystem Resource Move Fails Because Underlying DRBD Resource Won't Move

2021-02-28 Thread Eric Robinson
Oops, sorry, here are links to the text logs.

Node 001db02a: https://www.dropbox.com/s/ymbatz91x3y84wp/001db02a_log.txt?dl=0

Node 001db02b: https://www.dropbox.com/s/etq6mn460imdega/001db02b_log.txt?dl=0

-Eric


From: Users  On Behalf Of Eric Robinson
Sent: Sunday, February 28, 2021 6:46 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: [ClusterLabs] Filesystem Resource Move Fails Because Underlying DRBD 
Resource Won't Move


Beginning with this cluster status...

Cluster name: 001db02ab
Stack: corosync
Current DC: 001db02a (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with 
quorum
Last updated: Sun Feb 28 07:24:31 2021
Last change: Sun Feb 28 07:19:51 2021 by hacluster via crmd on 001db02a

2 nodes configured
14 resources configured

Online: [ 001db02a 001db02b ]

Full list of resources:

Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db02a ]
 Slaves: [ 001db02b ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db02b ]
 Slaves: [ 001db02a ]
p_fs_clust03   (ocf::heartbeat:Filesystem):Started 001db02a
p_fs_clust04   (ocf::heartbeat:Filesystem):Started 001db02b
p_mysql_009(lsb:mysql_009):Started 001db02a
p_mysql_010(lsb:mysql_010):Started 001db02a
p_mysql_011(lsb:mysql_011):Started 001db02a
p_mysql_012(lsb:mysql_012):Started 001db02a
p_mysql_014(lsb:mysql_014):Started 001db02b
p_mysql_015(lsb:mysql_015):Started 001db02b
p_mysql_016(lsb:mysql_016):Started 001db02b
stonith-001db02ab  (stonith:fence_azure_arm):  Started 001db02a

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

...and with these constraints...

Location Constraints:
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust03 (kind:Mandatory) 
(id:order-ms_drbd0-p_fs_clust03-mandatory)
  promote ms_drbd1 then start p_fs_clust04 (kind:Mandatory) 
(id:order-ms_drbd1-p_fs_clust04-mandatory)
  start p_fs_clust03 then start p_mysql_009 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_009-mandatory)
  start p_fs_clust03 then start p_mysql_010 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_010-mandatory)
  start p_fs_clust03 then start p_mysql_011 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_011-mandatory)
  start p_fs_clust03 then start p_mysql_012 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_012-mandatory)
  start p_fs_clust04 then start p_mysql_014 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_014-mandatory)
  start p_fs_clust04 then start p_mysql_015 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_015-mandatory)
  start p_fs_clust04 then start p_mysql_016 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_016-mandatory)
Colocation Constraints:
  p_fs_clust03 with ms_drbd0 (score:INFINITY) 
(id:colocation-p_fs_clust03-ms_drbd0-INFINITY)
  p_fs_clust04 with ms_drbd1 (score:INFINITY) 
(id:colocation-p_fs_clust04-ms_drbd1-INFINITY)
  p_mysql_009 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_009-p_fs_clust03-INFINITY)
  p_mysql_010 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_010-p_fs_clust03-INFINITY)
  p_mysql_011 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_011-p_fs_clust03-INFINITY)
  p_mysql_012 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_012-p_fs_clust03-INFINITY)
  p_mysql_014 with p_fs_clust04 (score:INFINITY) 
(id:colocation-p_mysql_014-p_fs_clust04-INFINITY)
  p_mysql_015 with p_fs_clust04 (score:INFINITY) 
(id:colocation-p_mysql_015-p_fs_clust04-INFINITY)
  p_mysql_016 with p_fs_clust04 (score:INFINITY) 
(id:colocation-p_mysql_016-p_fs_clust04-INFINITY)

...and this drbd status on node 001db02a...

ha01_mysql role:Primary
  disk:UpToDate
  001db02b role:Secondary
peer-disk:UpToDate

ha02_mysql role:Secondary
  disk:UpToDate
  001db02b role:Primary
peer-disk:UpToDate

...we issue the command...

# pcs resource move p_fs_clust04

...we get result...

Full list of resources:

Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db02a ]
 Slaves: [ 001db02b ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db02b ]
 Slaves: [ 001db02a ]
p_fs_clust03   (ocf::heartbeat:Filesystem):Started 001db02a
p_fs_clust04   (ocf::heartbeat:Filesystem):Stopped
p_mysql_009(lsb:mysql_009):Started 001db02a
p_mysql_010(lsb:mysql_010):Started 001db02a
p_mysql_011(lsb:mysql_011):Started 001db02a
p_mysql_012(lsb:mysql_012):Started 001db02a
p_mysql_014(lsb:mysql_014):Stopped
p_mysql_015(lsb:mysql_015):Stopped
p_mysql_016(lsb:mysql_016):Stopped
stonith-001db02ab  (stonith:fence_azure_arm):  Started 001db02a

Failed Actions:
* p_fs_clust04_start_0 on 001db02a 'unknown error' (1): call=126, 
status=complete, exitreason='Couldn't mount filesystem /dev/drbd1 on 
/ha02_mysql',
last-rc-change='Sun Feb 28 07:34:04 2021', queued=0ms, exec=5251ms

Here is the log from

Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

2021-02-28 Thread Eric Robinson


> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Sunday, February 28, 2021 4:37 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
> On Sun, Feb 28, 2021 at 07:45:27AM +, Strahil Nikolov wrote:
> > As this is in Azure and they support shared disks, I think that a simple 
> > SBD
> could solve the stonith case.
>
> Also fence_azure_arm: Azure Resource Manager :)
>
> --
> Valentin
> _

Azure does not support SBD for Red Hat, but I implemented fence_azure_arm and 
it seems to be working. The command...

# pcs stonith fence 001db02b

...successfully fenced the node.

-Eric


[ClusterLabs] Filesystem Resource Move Fails Because Underlying DRBD Resource Won't Move

2021-02-28 Thread Eric Robinson

Beginning with this cluster status...

Cluster name: 001db02ab
Stack: corosync
Current DC: 001db02a (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with 
quorum
Last updated: Sun Feb 28 07:24:31 2021
Last change: Sun Feb 28 07:19:51 2021 by hacluster via crmd on 001db02a

2 nodes configured
14 resources configured

Online: [ 001db02a 001db02b ]

Full list of resources:

Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db02a ]
 Slaves: [ 001db02b ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db02b ]
 Slaves: [ 001db02a ]
p_fs_clust03   (ocf::heartbeat:Filesystem):Started 001db02a
p_fs_clust04   (ocf::heartbeat:Filesystem):Started 001db02b
p_mysql_009(lsb:mysql_009):Started 001db02a
p_mysql_010(lsb:mysql_010):Started 001db02a
p_mysql_011(lsb:mysql_011):Started 001db02a
p_mysql_012(lsb:mysql_012):Started 001db02a
p_mysql_014(lsb:mysql_014):Started 001db02b
p_mysql_015(lsb:mysql_015):Started 001db02b
p_mysql_016(lsb:mysql_016):Started 001db02b
stonith-001db02ab  (stonith:fence_azure_arm):  Started 001db02a

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

...and with these constraints...

Location Constraints:
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust03 (kind:Mandatory) 
(id:order-ms_drbd0-p_fs_clust03-mandatory)
  promote ms_drbd1 then start p_fs_clust04 (kind:Mandatory) 
(id:order-ms_drbd1-p_fs_clust04-mandatory)
  start p_fs_clust03 then start p_mysql_009 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_009-mandatory)
  start p_fs_clust03 then start p_mysql_010 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_010-mandatory)
  start p_fs_clust03 then start p_mysql_011 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_011-mandatory)
  start p_fs_clust03 then start p_mysql_012 (kind:Mandatory) 
(id:order-p_fs_clust03-p_mysql_012-mandatory)
  start p_fs_clust04 then start p_mysql_014 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_014-mandatory)
  start p_fs_clust04 then start p_mysql_015 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_015-mandatory)
  start p_fs_clust04 then start p_mysql_016 (kind:Mandatory) 
(id:order-p_fs_clust04-p_mysql_016-mandatory)
Colocation Constraints:
  p_fs_clust03 with ms_drbd0 (score:INFINITY) 
(id:colocation-p_fs_clust03-ms_drbd0-INFINITY)
  p_fs_clust04 with ms_drbd1 (score:INFINITY) 
(id:colocation-p_fs_clust04-ms_drbd1-INFINITY)
  p_mysql_009 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_009-p_fs_clust03-INFINITY)
  p_mysql_010 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_010-p_fs_clust03-INFINITY)
  p_mysql_011 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_011-p_fs_clust03-INFINITY)
  p_mysql_012 with p_fs_clust03 (score:INFINITY) 
(id:colocation-p_mysql_012-p_fs_clust03-INFINITY)
  p_mysql_014 with p_fs_clust04 (score:INFINITY) 
(id:colocation-p_mysql_014-p_fs_clust04-INFINITY)
  p_mysql_015 with p_fs_clust04 (score:INFINITY) 
(id:colocation-p_mysql_015-p_fs_clust04-INFINITY)
  p_mysql_016 with p_fs_clust04 (score:INFINITY) 
(id:colocation-p_mysql_016-p_fs_clust04-INFINITY)

...and this drbd status on node 001db02a...

ha01_mysql role:Primary
  disk:UpToDate
  001db02b role:Secondary
peer-disk:UpToDate

ha02_mysql role:Secondary
  disk:UpToDate
  001db02b role:Primary
peer-disk:UpToDate

...we issue the command...

# pcs resource move p_fs_clust04

...we get result...

Full list of resources:

Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db02a ]
 Slaves: [ 001db02b ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db02b ]
 Slaves: [ 001db02a ]
p_fs_clust03   (ocf::heartbeat:Filesystem):Started 001db02a
p_fs_clust04   (ocf::heartbeat:Filesystem):Stopped
p_mysql_009(lsb:mysql_009):Started 001db02a
p_mysql_010(lsb:mysql_010):Started 001db02a
p_mysql_011(lsb:mysql_011):Started 001db02a
p_mysql_012(lsb:mysql_012):Started 001db02a
p_mysql_014(lsb:mysql_014):Stopped
p_mysql_015(lsb:mysql_015):Stopped
p_mysql_016(lsb:mysql_016):Stopped
stonith-001db02ab  (stonith:fence_azure_arm):  Started 001db02a

Failed Actions:
* p_fs_clust04_start_0 on 001db02a 'unknown error' (1): call=126, 
status=complete, exitreason='Couldn't mount filesystem /dev/drbd1 on 
/ha02_mysql',
last-rc-change='Sun Feb 28 07:34:04 2021', queued=0ms, exec=5251ms

Here is the log from node 001db02a: 
https://www.dropbox.com/s/vq3ytcsuvvmqwe5/001db02a_log?dl=0

Here is the log from node 001db02b: 
https://www.dropbox.com/s/g0el6ft0jmvzqsi/001db02b_log?dl=0

From reading the logs, it seems that the filesystem p_fs_clust04 is getting 
successfully unmounted on node 001db02b, but the drbd resource never stops. On 
node 001db02a, it tries to mount the filesystem but fails because the drbd 
volume is not master.

Why isn't drbd transitioning?




Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

2021-02-27 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Saturday, February 27, 2021 12:55 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
> On 27.02.2021 09:05, Eric Robinson wrote:
> >> -Original Message-
> >> From: Users  On Behalf Of Andrei
> >> Borzenkov
> >> Sent: Friday, February 26, 2021 1:25 PM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice
> >> Went Down Anyway?
> >>
> >> On 26.02.2021 21:58, Eric Robinson wrote:
> >>>> -Original Message-
> >>>> From: Users  On Behalf Of Andrei
> >>>> Borzenkov
> >>>> Sent: Friday, February 26, 2021 11:27 AM
> >>>> To: users@clusterlabs.org
> >>>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate
> >>>> Qdevice Went Down Anyway?
> >>>>
> >>>> 26.02.2021 19:19, Eric Robinson пишет:
> >>>>> At 5:16 am Pacific time Monday, one of our cluster nodes failed
> >>>>> and its
> >>>> mysql services went down. The cluster did not automatically recover.
> >>>>>
> >>>>> We're trying to figure out:
> >>>>>
> >>>>>
> >>>>>   1.  Why did it fail?
> >>>>
> >>>> Pacemaker only registered loss of connection between two nodes. You
> >>>> need to investigate why it happened.
> >>>>
> >>>>>   2.  Why did it not automatically recover?
> >>>>>
> >>>>> The cluster did not recover until we manually executed...
> >>>>>
> >>>>
> >>>> *Cluster* never failed in the first place. Specific resource may.
> >>>> Do not confuse things more than is necessary.
> >>>>
> >>>>> # pcs resource cleanup p_mysql_622
> >>>>>
> >>>>
> >>>> Because this resource failed to stop and this is fatal.
> >>>>
> >>>>> Feb 22 05:16:30 [91682] 001db01apengine:   notice: LogAction:   
> >>>>>  *
> >> Stop
> >>>> p_mysql_622  ( 001db01a )   due to no quorum
> >>>>
> >>>> Remaining node lost quorum and decided to stop resources
> >>>>
> >>>
> >>> I consider this a cluster failure, exacerbated by a resource
> >>> failure. We can
> >> investigate why resource p_mysql_622 failed to stop, but it seems the
> >> underlying problem is the loss of quorum.
> >>
> >> This problem is outside of pacemaker scope. You are shooting the
> >> messenger here.
> >>
> >
> > I appreciate your patience here. Here is my confusion. We have three
> devices--two database servers and a qdevice. Unless two devices lost
> connection with the network at the same time, the cluster should not have
> lost quorum.
>
> No, you misunderstand how qdevice works. qdevice is not passive witness
> - when cluster is split in multiple partitions, qdevice decides which 
> partition
> should remain active and provides votes to this partition so it remains
> quorate. All other partitions will go out of quorum.
>
> So even if only connection between two nodes was lost, but both nodes
> retained connection to qnetd server, one node is expected to go out of
> quorum.

I must have not explained myself well earlier, because what you wrote is 
exactly how I understand it. Qdevice is an active participant that provides a 
vote to ensure that at least one partition remains quorate.
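
(For reference, that wiring is typically set up with something along these 
lines -- qnetd-host being a placeholder for the third machine running 
corosync-qnetd:)

[qnetd host]#    pcs qdevice setup model net --enable --start
[cluster node]#  pcs quorum device add model net host=qnetd-host algorithm=ffsplit
[cluster node]#  pcs quorum status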

>
> > If node 001db01a lost all connectivity (and therefore quorum), then I
> understand that the default Pacemaker action would be to stop its services.
> However, that does not explain why node 001db01b did not take over and
> start the services, as it would still have had quorum.
>
> You really need to show your corosync and pacemaker configuration.
>

You can see the cluster config here: 
https://www.dropbox.com/s/9t5ecl2pjf9yu2o/cluster_config.txt?dl=0


> >
> >>> That should not have happened with the qdevice in the mix, should it?
> >>>
> >>
> >> Huh? It is up to you to provide infrastructure where qdevice
> >> connection never fails. Again this is outside of pacemaker scope.
> >>
> >
> > Does something in the logs indicate that BOTH database nodes lost
> quorum?
>
> Not

Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

2021-02-26 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Friday, February 26, 2021 1:25 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
> On 26.02.2021 21:58, Eric Robinson wrote:
> >> -Original Message-
> >> From: Users  On Behalf Of Andrei
> >> Borzenkov
> >> Sent: Friday, February 26, 2021 11:27 AM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice
> >> Went Down Anyway?
> >>
> >> 26.02.2021 19:19, Eric Robinson пишет:
> >>> At 5:16 am Pacific time Monday, one of our cluster nodes failed and
> >>> its
> >> mysql services went down. The cluster did not automatically recover.
> >>>
> >>> We're trying to figure out:
> >>>
> >>>
> >>>   1.  Why did it fail?
> >>
> >> Pacemaker only registered loss of connection between two nodes. You
> >> need to investigate why it happened.
> >>
> >>>   2.  Why did it not automatically recover?
> >>>
> >>> The cluster did not recover until we manually executed...
> >>>
> >>
> >> *Cluster* never failed in the first place. Specific resource may. Do
> >> not confuse things more than is necessary.
> >>
> >>> # pcs resource cleanup p_mysql_622
> >>>
> >>
> >> Because this resource failed to stop and this is fatal.
> >>
> >>> Feb 22 05:16:30 [91682] 001db01apengine:   notice: LogAction:*
> Stop
> >> p_mysql_622  ( 001db01a )   due to no quorum
> >>
> >> Remaining node lost quorum and decided to stop resources
> >>
> >
> > I consider this a cluster failure, exacerbated by a resource failure. We can
> investigate why resource p_mysql_622 failed to stop, but it seems the
> underlying problem is the loss of quorum.
>
> This problem is outside of pacemaker scope. You are shooting the messenger
> here.
>

I appreciate your patience here. Here is my confusion. We have three 
devices--two database servers and a qdevice. Unless two devices lost connection 
with the network at the same time, the cluster should not have lost quorum. If 
node 001db01a lost all connectivity (and therefore quorum), then I understand 
that the default Pacemaker action would be to stop its services. However, that 
does not explain why node 001db01b did not take over and start the services, as 
it would still have had quorum.

> > That should not have happened with the qdevice in the mix, should it?
> >
>
> Huh? It is up to you to provide infrastructure where qdevice connection
> never fails. Again this is outside of pacemaker scope.
>

Does something in the logs indicate that BOTH database nodes lost quorum? Are 
you suggesting that Azure's network went down and all the devices lost 
communication with each other, and that's why quorum was lost?

> > I'm confused about what is supposed to happen here. If the root cause is
> that node 001db01a briefly lost all communication with the network (just
> guessing), then it should have taken no action, including STONITH, since
> there would be no quorum.
>
> Read pacemaker documentation. Default action when node goes out of
> quorum is to stop all resources.
>
> > (There is no physical STONITH device anyway, as both nodes are in Azure.)
> Meanwhile, node 001db01b would still have had quorum (itself plus the
> qdevice), and should have assumed ownership of the resources and started
> them, or no?
>
> I commented on this in another mail. pacemaker documentation does not
> really describe what happens, and blindly restarting all resources locally
> would easily lead to data corruption.
>
> Having STONITH would solve your problem. 001db01b would have fenced
> 001db01a and restarted all resources.
>
> Without STONITH it is not possible in general to handle split brain and
> resource stop failures. You do not know what is left active and what not so it
> is not safe to attempt to restart resources elsewhere.

The nodes are using DRBD. Since that has its own split-brain detection, I don't 
think there is a concern about data corruption as there would be with shared 
storage. In a scenario where node 001db01a loses connectivity, 001db01b still 
has quorum because of the vote from the qdevice. It should promote DRBD and 
start the mysql services. If 001db01a subsequently comes back online, then both 
DRBD devices go into standalone and the services go back down, but there's no 
corruption. You then do a manual split-brain recovery (discard
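
(For reference, that manual recovery is typically along these lines, with <res> 
standing in for the DRBD resource name -- the discard is run on the node whose 
changes are to be thrown away:)

victim#    drbdadm disconnect <res>
victim#    drbdadm secondary <res>
victim#    drbdadm connect --discard-my-data <res>
survivor#  drbdadm connect <res>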

Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

2021-02-26 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Friday, February 26, 2021 11:27 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
> 26.02.2021 19:19, Eric Robinson пишет:
> > At 5:16 am Pacific time Monday, one of our cluster nodes failed and its
> mysql services went down. The cluster did not automatically recover.
> >
> > We're trying to figure out:
> >
> >
> >   1.  Why did it fail?
>
> Pacemaker only registered loss of connection between two nodes. You need
> to investigate why it happened.
>
> >   2.  Why did it not automatically recover?
> >
> > The cluster did not recover until we manually executed...
> >
>
> *Cluster* never failed in the first place. Specific resource may. Do not
> confuse things more than is necessary.
>
> > # pcs resource cleanup p_mysql_622
> >
>
> Because this resource failed to stop and this is fatal.
>
> > Feb 22 05:16:30 [91682] 001db01apengine:   notice: LogAction:* 
> > Stop
> p_mysql_622  ( 001db01a )   due to no quorum
>
> Remaining node lost quorum and decided to stop resources
>

I consider this a cluster failure, exacerbated by a resource failure. We can 
investigate why resource p_mysql_622 failed to stop, but it seems the 
underlying problem is the loss of quorum. That should not have happened with 
the qdevice in the mix, should it?

I'm confused about what is supposed to happen here. If the root cause is that 
node 001db01a briefly lost all communication with the network (just guessing), 
then it should have taken no action, including STONITH, since there would be no 
quorum. (There is no physical STONITH device anyway, as both nodes are in 
Azure.) Meanwhile, node 001db01b would still have had quorum (itself plus the 
qdevice), and should have assumed ownership of the resources and started them, 
or no?

> > Feb 22 05:16:30 [91683] 001db01a   crmd:   notice: te_rsc_command:
> Initiating stop operation p_mysql_622_stop_0 locally on 001db01a | action 76
> ...
> > Feb 22 05:16:30 [91680] 001db01a   lrmd: info: log_execute: 
> > executing
> - rsc:p_mysql_622 action:stop call_id:308
> ...
> > Feb 22 05:16:45 [91680] 001db01a   lrmd:  warning:
> child_timeout_callback:  p_mysql_622_stop_0 process (PID 19225) timed out
> > Feb 22 05:16:45 [91680] 001db01a   lrmd:  warning: operation_finished:
> p_mysql_622_stop_0:19225 - timed out after 15000ms
> > Feb 22 05:16:45 [91680] 001db01a   lrmd: info: log_finished:
> > finished -
> rsc:p_mysql_622 action:stop call_id:308 pid:19225 exit-code:1 exec-
> time:15002ms queue-time:0ms
> > Feb 22 05:16:45 [91683] 001db01a   crmd:error: process_lrm_event:
> Result of stop operation for p_mysql_622 on 001db01a: Timed Out | call=308
> key=p_mysql_622_stop_0 timeout=15000ms
> ...
> > Feb 22 05:16:38 [112948] 001db01bpengine: info: LogActions: 
> > Leave
> p_mysql_622 (Started unmanaged)
>
> At this point pacemaker stops managing this resource because its status is
> unknown. Normal reaction to stop failure is to fence node and fail resource
> over, but apparently you also do not have working stonith.
>
> Loss of quorum may be related to network issue so that nodes both lost
> connection to each other and to quorum device.
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

2021-02-26 Thread Eric Robinson
> -Original Message-
> From: Digimer 
> Sent: Friday, February 26, 2021 10:35 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Eric Robinson 
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
> On 2021-02-26 11:19 a.m., Eric Robinson wrote:
> > At 5:16 am Pacific time Monday, one of our cluster nodes failed and
> > its mysql services went down. The cluster did not automatically recover.
> >
> > We're trying to figure out:
> >
> >  1. Why did it fail?
> >  2. Why did it not automatically recover?
> >
> > The cluster did not recover until we manually executed...
> >
> > # pcs resource cleanup p_mysql_622
> >
> > OS: CentOS Linux release 7.5.1804 (Core)
> >
> > Cluster version:
> >
> > corosync.x86_64  2.4.5-4.el7 @base
> > corosync-qdevice.x86_64  2.4.5-4.el7 @base
> > pacemaker.x86_64 1.1.21-4.el7@base
> >
> > Two nodes: 001db01a, 001db01b
> >
> > The following log snippet is from node 001db01a:
> >
> > [root@001db01a cluster]# grep "Feb 22 05:1[67]" corosync.log-20210223
>
> 
>
> > Feb 22 05:16:30 [91682] 001db01apengine:  warning: cluster_status:
> Fencing and resource management disabled due to lack of quorum
>
> Seems like there was no quorum from this node's perspective, so it won't do
> anything. What does the other node's logs say?
>

The logs from the other node are at the bottom of the original email.

> What is the cluster configuration? Do you have stonith (fencing) configured?

2-node with a separate qdevice. No fencing.

> Quorum is a useful tool when things are working properly, but it doesn't help
> when things enter an undefined / unexpected state.
> When that happens, stonith saves you. So said another way, you must have
> stonith for a stable cluster, quorum is optional.
>

In this case, if fencing was enabled, which node would have fenced the other? 
Would they have gotten into a STONITH war?

More importantly, why did the failure of resource p_mysql_622 keep the whole 
cluster from recovering? As soon as I did 'pcs resource cleanup p_mysql_622' 
all the other resources recovered, but none of them are dependent on that 
resource.
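
For reference, fencing for Azure VMs is normally built on the fence_azure_arm 
agent rather than a physical STONITH device, and a random fence delay is the 
usual answer to the "STONITH war" concern. The following is only a sketch; the 
angle-bracket placeholders and the exact agent parameters depend on the 
environment and agent version, so check them first:

# pcs stonith describe fence_azure_arm
# pcs stonith create fence-azure fence_azure_arm \
    resourceGroup=<resource-group> subscriptionId=<subscription-id> \
    tenantId=<tenant-id> login=<app-id> passwd=<app-secret> \
    pcmk_host_map="001db01a:<vm-name-a>;001db01b:<vm-name-b>" \
    pcmk_delay_max=15
# pcs property set stonith-enabled=true

With pcmk_delay_max set, the two nodes never fire at exactly the same instant 
even if both decide to fence, which is the usual way a fence loop is avoided in 
two-node setups.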

> --
> Digimer
> Papers and Projects: https://alteeve.com/w/ "I am, somehow, less
> interested in the weight and convolutions of Einstein's brain than in the near
> certainty that people of equal talent have lived and died in cotton fields and
> sweatshops." - Stephen Jay Gould
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Observed Difference Between ldirectord and keepalived

2021-01-03 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Timo Schöler
> Sent: Saturday, January 2, 2021 4:27 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Observed Difference Between ldirectord and
> keepalived
>
> On 1/2/21 10:11 AM, Eric Robinson wrote:
>
> > We recently switched from ldirectord to keepalived. We noticed that,
> > after the switch, LVS behaves a bit differently with respect to "down"
> > services.
> >
> > On ldirectord, a virtual service with 2 realservers displays "Masq" with
> > weight 0 when one of them is down.
> >
> > TCP  192.168.5.100:3002 wlc persistent 50
> >   -> 192.168.8.53:3002            Masq    1      0          0
> >   -> 192.168.8.55:3002            Masq    0      0          4
> >
> > On keepalived, it does not show the down server at all...
> >
> > TCP  192.168.5.100:3002 wlc persistent 50
> >   -> 192.168.8.53:3002            Masq    1      0          0
> >
> > Why is that? It makes it impossible to see when services are down.
>
> Hi Eric,
>
> I think you're showing the output of ``ipvsadm -Ln'' here, right?
>
> If so, what OS are you using? I observed similar behaviour when moving from
> CentOS 6 to CentOS 7 (i.e. Kernel 2.6.x to Kernel 3.10.x), so there may have
> changed default values of displaying stuff or APIs.
>
> Timo

Hi Timo,

That is exactly the case! The old load balancer is on CentOS 6 and the new one 
is on CentOS 7. (I'm aware of the recent announcement regarding CentOS, but 7 
will not be end-of-life for another 4 years.)

Did you ever come up with a solution? We have scripts that rely on the old 
behavior to alert admins of down services.
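
(One option that looks promising is keepalived's inhibit_on_failure, which is 
supposed to keep a failed realserver in the IPVS table with weight 0 -- like 
ldirectord used to -- instead of removing it. A sketch only, to be confirmed 
against the keepalived.conf man page for the installed version, inside the 
relevant virtual_server block:

    real_server 192.168.8.55 3002 {
        weight 1
        inhibit_on_failure
        TCP_CHECK {
            connect_timeout 3
        }
    }
)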

-Eric




Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Observed Difference Between ldirectord and keepalived

2021-01-03 Thread Eric Robinson
Reid –

My apologies, even though the list is specific to ClusterLabs, I thought it was 
a venue where general cluster-related discussion often takes place. Thanks for 
the clarification.

--Eric


From: Users  On Behalf Of Reid Wahl
Sent: Saturday, January 2, 2021 4:14 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Observed Difference Between ldirectord and keepalived

Hi, Eric.

To the best of my knowledge, neither ldirectord nor keepalived is part of the 
ClusterLabs project. It looks like the keepalived user group is here: 
https://www.keepalived.org/listes.html

Is there anything we can help you with regarding the ClusterLabs software?



On Sat, Jan 2, 2021 at 1:12 AM Eric Robinson <eric.robin...@psmnv.com> wrote:
We recently switched from ldirectord to keepalived. We noticed that, after the 
switch, LVS behaves a bit differently with respect to “down” services.

On ldirectord, a virtual service with 2 realservers displays "Masq" with 
weight 0 when one of them is down.

TCP  192.168.5.100:3002 wlc persistent 50
  -> 192.168.8.53:3002            Masq    1      0          0
  -> 192.168.8.55:3002            Masq    0      0          4

On keepalived, it does not show the down server at all…

TCP  192.168.5.100:3002 wlc persistent 50
  -> 192.168.8.53:3002            Masq    1      0          0

Why is that? It makes it impossible to see when services are down.



Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


--
Regards,
Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Open Source Linux Load Balancer with HA and Split Brain Prevention?

2020-10-06 Thread Eric Robinson
After research, that appears to be the case. Thanks.
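
A minimal sketch of that arrangement (assuming keepalived's usual systemd unit, 
a keepalived.conf that contains only virtual_server blocks, and an existing VIP 
resource here called vip -- the names are placeholders):

# systemctl disable keepalived
# pcs resource create lvs systemd:keepalived op monitor interval=30s
# pcs constraint colocation add lvs with vip INFINITY

That way pacemaker starts and stops the director, while keepalived only does 
the IPVS and health-check work.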


> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Monday, October 5, 2020 5:46 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Open Source Linux Load Balancer with HA and Split
> Brain Prevention?
>
> On Sun, Oct 04, 2020 at 11:34:52PM +, Eric Robinson wrote:
> > I've been experimenting with keepalived. It relies on VRRP, but VRRP
> > does not have split-brain prevention.
>
> Perhaps keepalived can be configured to only setup IPVS (no VRRP) and than
> added to pacemaker as a systemd service.
>
> --
> Valentin
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Open Source Linux Load Balancer with HA and Split Brain Prevention?

2020-10-04 Thread Eric Robinson



Valentin --

I've been experimenting with keepalived. It relies on VRRP, but VRRP does not 
have split-brain prevention.

> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Sunday, October 4, 2020 5:19 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Open Source Linux Load Balancer with HA and Split
> Brain Prevention?
>
> On Sun, Oct 04, 2020 at 09:28:40PM +, Eric Robinson wrote:
> > I don't want to proxy the services. I just want NAT redirection at
> > Layer 4, using LVS. Basically, all I need is a good health-checker
> > that works with LVS, like ldirectord does (except newer technology).
>
> keepalived is one possible alternative to ldirectord.
>
> --
> Valentin
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Open Source Linux Load Balancer with HA and Split Brain Prevention?

2020-10-04 Thread Eric Robinson
Hi Strahil --

I don't want to proxy the services. I just want NAT redirection at Layer 4, 
using LVS. Basically, all I need is a good health-checker that works with LVS, 
like ldirectord does (except newer technology).
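
For what it's worth, this is essentially what keepalived's checker process does 
even with no vrrp_instance configured at all; the LVS/NAT side is described 
with virtual_server blocks, roughly like the following (a sketch with 
illustrative addresses -- exact options depend on the keepalived version):

virtual_server 192.168.5.100 3002 {
    delay_loop 6
    lb_algo wlc
    lb_kind NAT
    persistence_timeout 50
    protocol TCP

    real_server 192.168.8.53 3002 {
        weight 1
        TCP_CHECK {
            connect_timeout 3
        }
    }
}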


> -Original Message-
> From: Users  On Behalf Of Strahil Nikolov
> Sent: Sunday, October 4, 2020 2:54 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Open Source Linux Load Balancer with HA and Split
> Brain Prevention?
>
> Actually, everything that works on the OS is OK for pacemaker.
> You got 2 options :
> - use systemd.service for your loadbalancer (for example HAProxy)
> - create your own script which just requires 'start', 'stop' and 'monitoring'
> methods so pacemaker can control it
>
> Based on my very fast search in the web , haproxy has a ready-to-go
> resource agent 'ocf:heartbeat:haproxy' , so you can give it a try.
>
> Best Regards,
> Strahil Nikolov
>
>
>
>
>
>
> On Sunday, October 4, 2020 at 22:41:59 GMT+3, Eric Robinson
> wrote:
>
>
>
>
>
>
>
>
> Greetings!
>
>
>
> We are looking for an open-source Linux load-balancing solution that
> supports high availability on 2 nodes with split-brain prevention. We’ve been
> using corosync+pacemaker+ldirectord+quorumdevice, and that works fine,
> but ldirectord isn’t available for CentOS 7 or 8, and we need to move along to
> something that’s still in active development. Any suggestions?
>
>
>
> -Eric
>
>
>
> Disclaimer : This email and any files transmitted with it are confidential and
> intended solely for intended recipients. If you are not the named addressee
> you should not disseminate, distribute, copy or alter this email. Any views or
> opinions presented in this email are solely those of the author and might not
> represent those of Physician Select Management. Warning: Although
> Physician Select Management has taken reasonable precautions to ensure
> no viruses are present in this email, the company cannot accept responsibility
> for any loss or damage arising from the use of this email or attachments.
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Open Source Linux Load Balancer with HA and Split Brain Prevention?

2020-10-04 Thread Eric Robinson
Greetings!

We are looking for an open-source Linux load-balancing solution that supports 
high availability on 2 nodes with split-brain prevention. We've been using 
corosync+pacemaker+ldirectord+quorumdevice, and that works fine, but ldirectord 
isn't available for CentOS 7 or 8, and we need to move along to something 
that's still in active development. Any suggestions?

-Eric

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Removing DRBD w/out Data Loss?

2020-09-08 Thread Eric Robinson
I checked the DRBD manual for this, but didn't see an answer. We need to 
convert a DRBD cluster node into a standalone server and remove DRBD without 
losing the data. Is that possible? I asked on the DRBD list but it didn't get 
much of a response.

Given:

The backing device is logical volume: /dev/vg1/lv1

The drbd volume is: drbd0

The filesystem is ext4 on /dev/drbd0

Since the filesystem is built on /dev/drbd0, not on /dev/vg1/lv1, if we remove 
drbd from the stack, how do we get access to the data?
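
(My tentative understanding is that with internal metadata the filesystem 
starts at the beginning of the backing device and the DRBD metadata sits at the 
end, so something like the following should expose the data directly from the 
LV -- a sketch only, assuming internal metadata and a placeholder resource name 
of r0:

# drbdadm down r0
# mount /dev/vg1/lv1 /mnt

and then, once we're sure DRBD is never coming back, "drbdadm wipe-md r0" to 
clear the metadata area. Does that sound right?)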

--Eric



Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] DRBD sync stalled at 100% ?

2020-06-28 Thread Eric Robinson
I could if linbit had per-incident pricing. Unfortunately, they only offer 
yearly contracts, which is way more than I need.

Get Outlook for Android<https://aka.ms/ghei36>


From: Strahil Nikolov 
Sent: Sunday, June 28, 2020 4:11:47 AM
To: Eric Robinson ; Cluster Labs - All topics related 
to open-source clustering welcomed 
Subject: RE: [ClusterLabs] DRBD sync stalled at 100% ?

I guess you  can open an issue to linbit, as you still have the logs.

Best Regards,
Strahil Nikolov

On June 28, 2020 at 8:19:59 GMT+03:00, Eric Robinson 
wrote:
>I fixed it with a drbd down/up.
>
>From: Users  On Behalf Of Eric Robinson
>Sent: Saturday, June 27, 2020 4:32 PM
>To: Cluster Labs - All topics related to open-source clustering
>welcomed ; Strahil Nikolov
>
>Subject: Re: [ClusterLabs] DRBD sync stalled at 100% ?
>
>Thanks for the feedback. I was hoping for a non-downtime solution. No
>way to do that?
>Get Outlook for Android<https://aka.ms/ghei36>
>
>
>From: Strahil Nikolov
><hunter86...@yahoo.com>
>Sent: Saturday, June 27, 2020 2:40:38 PM
>To: Cluster Labs - All topics related to open-source clustering
>welcomed <users@clusterlabs.org>; Eric
>Robinson <eric.robin...@psmnv.com>
>Subject: Re: [ClusterLabs] DRBD sync stalled at 100% ?
>
>I've  seen this  on a  test setup  after multiple  network disruptions.
>I managed  to fix it by stopping drbd on all  nodes  and starting it
>back.
>
>I guess  you can get downtime  and try that approach.
>
>
>Best Regards,
>Strahil Nikolov
>
>
>
>On June 27, 2020 at 16:36:10 GMT+03:00, Eric Robinson
><eric.robin...@psmnv.com> wrote:
>>I'm not seeing anything on Google about this. Two DRBD nodes lost
>>communication with each other, and then reconnected and started sync.
>>But then it got to 100% and is just stalled there.
>>
>>The nodes are 001db03a, 001db03b.
>>
>>On 001db03a:
>>
>>[root@001db03a ~]# drbdadm status
>>ha01_mysql role:Primary
>>  disk:UpToDate
>>  001db03b role:Secondary
>>replication:SyncSource peer-disk:Inconsistent done:100.00
>>
>>ha02_mysql role:Secondary
>>  disk:UpToDate
>>  001db03b role:Primary
>>peer-disk:UpToDate
>>
>>On 001drbd03b:
>>
>>[root@001db03b ~]# drbdadm status
>>ha01_mysql role:Secondary
>>  disk:Inconsistent
>>  001db03a role:Primary
>>replication:SyncTarget peer-disk:UpToDate done:100.00
>>
>>ha02_mysql role:Primary
>>  disk:UpToDate
>>  001db03a role:Secondary
>>peer-disk:UpToDate
>>
>>
>>On 001db03a, here are the DRBD messages from the onset of the problem
>>until now.
>>
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: PingAck did
>>not arrive in time.
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>>Connected -> NetworkFailure ) peer( Primary -> Unknown )
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk(
>>UpToDate -> Consistent )
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b:
>>pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b:
>ack_receiver
>>terminated
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Terminating
>>ack_recv thread
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Preparing
>>cluster-wide state change 2946943372 (1->-1 0/0)
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Committing
>>cluster-wide state change 2946943372 (6ms)
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk(
>>Consistent -> UpToDate )
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Connection
>>closed
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>>NetworkFailure -> Unconnected )
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Restarting
>>receiver thread
>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>>Unconnected -> Connecting )
>>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: PingAck did
>>not arrive in time.
>>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn(
>>Connected -> NetworkFailure ) peer( Secondary -> Unknown )
>>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0 001db03b:
>>pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
>>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b:
>ack_receiver
>>terminated
>>Jun 26 22:34:30 001db03a kernel: d

Re: [ClusterLabs] DRBD sync stalled at 100% ?

2020-06-27 Thread Eric Robinson
I fixed it with a drbd down/up.
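
(I.e., roughly the following on the node with the stalled sync, assuming the 
affected resource was ha01_mysql:)

# drbdadm down ha01_mysql
# drbdadm up ha01_mysql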

From: Users  On Behalf Of Eric Robinson
Sent: Saturday, June 27, 2020 4:32 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Strahil Nikolov 
Subject: Re: [ClusterLabs] DRBD sync stalled at 100% ?

Thanks for the feedback. I was hoping for a non-downtime solution. No way to do 
that?
Get Outlook for Android<https://aka.ms/ghei36>


From: Strahil Nikolov <hunter86...@yahoo.com>
Sent: Saturday, June 27, 2020 2:40:38 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>; Eric Robinson 
<eric.robin...@psmnv.com>
Subject: Re: [ClusterLabs] DRBD sync stalled at 100% ?

I've  seen this  on a  test setup  after multiple  network disruptions.
I managed  to fix it by stopping drbd on all  nodes  and starting it back.

I guess  you can get downtime  and try that approach.


Best Regards,
Strahil Nikolov



On June 27, 2020 at 16:36:10 GMT+03:00, Eric Robinson 
<eric.robin...@psmnv.com> wrote:
>I'm not seeing anything on Google about this. Two DRBD nodes lost
>communication with each other, and then reconnected and started sync.
>But then it got to 100% and is just stalled there.
>
>The nodes are 001db03a, 001db03b.
>
>On 001db03a:
>
>[root@001db03a ~]# drbdadm status
>ha01_mysql role:Primary
>  disk:UpToDate
>  001db03b role:Secondary
>replication:SyncSource peer-disk:Inconsistent done:100.00
>
>ha02_mysql role:Secondary
>  disk:UpToDate
>  001db03b role:Primary
>peer-disk:UpToDate
>
>On 001drbd03b:
>
>[root@001db03b ~]# drbdadm status
>ha01_mysql role:Secondary
>  disk:Inconsistent
>  001db03a role:Primary
>replication:SyncTarget peer-disk:UpToDate done:100.00
>
>ha02_mysql role:Primary
>  disk:UpToDate
>  001db03a role:Secondary
>peer-disk:UpToDate
>
>
>On 001db03a, here are the DRBD messages from the onset of the problem
>until now.
>
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: PingAck did
>not arrive in time.
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>Connected -> NetworkFailure ) peer( Primary -> Unknown )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk(
>UpToDate -> Consistent )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b:
>pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: ack_receiver
>terminated
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Terminating
>ack_recv thread
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Preparing
>cluster-wide state change 2946943372 (1->-1 0/0)
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Committing
>cluster-wide state change 2946943372 (6ms)
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk(
>Consistent -> UpToDate )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Connection
>closed
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>NetworkFailure -> Unconnected )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Restarting
>receiver thread
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>Unconnected -> Connecting )
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: PingAck did
>not arrive in time.
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn(
>Connected -> NetworkFailure ) peer( Secondary -> Unknown )
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0 001db03b:
>pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: ack_receiver
>terminated
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Terminating
>ack_recv thread
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0: new current
>UUID: D07A3D4B2F99832D weak: FFFD
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Connection
>closed
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn(
>NetworkFailure -> Unconnected )
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Restarting
>receiver thread
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn(
>Unconnected -> Connecting )
>Jun 26 22:34:33 001db03a pengine[1474]:  notice:  * Start
>p_drbd0:1( 001db03b )
>Jun 26 22:34:33 001db03a crmd[1475]:  notice: Initiating notify
>operation p_drbd0_pre_notify_start_0 locally on 001db03a
>Jun 26 22:34:33 001db03a crmd[1475]:  notice: Result of notify
>operation for p_drbd0 on 001db03a: 0 (ok)
>Jun 26 22:34:33 001db03a crmd[1475]:  notice: Initiating start
>operation p_drbd0_start_0 on 

Re: [ClusterLabs] DRBD sync stalled at 100% ?

2020-06-27 Thread Eric Robinson
Thanks for the feedback. I was hoping for a non-downtime solution. No way to do 
that?

Get Outlook for Android<https://aka.ms/ghei36>


From: Strahil Nikolov 
Sent: Saturday, June 27, 2020 2:40:38 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] DRBD sync stalled at 100% ?

I've  seen this  on a  test setup  after multiple  network disruptions.
I managed  to fix it by stopping drbd on all  nodes  and starting it back.

I guess  you can get downtime  and try that approach.


Best Regards,
Strahil Nikolov



On June 27, 2020 at 16:36:10 GMT+03:00, Eric Robinson 
wrote:
>I'm not seeing anything on Google about this. Two DRBD nodes lost
>communication with each other, and then reconnected and started sync.
>But then it got to 100% and is just stalled there.
>
>The nodes are 001db03a, 001db03b.
>
>On 001db03a:
>
>[root@001db03a ~]# drbdadm status
>ha01_mysql role:Primary
>  disk:UpToDate
>  001db03b role:Secondary
>replication:SyncSource peer-disk:Inconsistent done:100.00
>
>ha02_mysql role:Secondary
>  disk:UpToDate
>  001db03b role:Primary
>peer-disk:UpToDate
>
>On 001drbd03b:
>
>[root@001db03b ~]# drbdadm status
>ha01_mysql role:Secondary
>  disk:Inconsistent
>  001db03a role:Primary
>replication:SyncTarget peer-disk:UpToDate done:100.00
>
>ha02_mysql role:Primary
>  disk:UpToDate
>  001db03a role:Secondary
>peer-disk:UpToDate
>
>
>On 001db03a, here are the DRBD messages from the onset of the problem
>until now.
>
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: PingAck did
>not arrive in time.
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>Connected -> NetworkFailure ) peer( Primary -> Unknown )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk(
>UpToDate -> Consistent )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b:
>pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: ack_receiver
>terminated
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Terminating
>ack_recv thread
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Preparing
>cluster-wide state change 2946943372 (1->-1 0/0)
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Committing
>cluster-wide state change 2946943372 (6ms)
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk(
>Consistent -> UpToDate )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Connection
>closed
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>NetworkFailure -> Unconnected )
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Restarting
>receiver thread
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn(
>Unconnected -> Connecting )
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: PingAck did
>not arrive in time.
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn(
>Connected -> NetworkFailure ) peer( Secondary -> Unknown )
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0 001db03b:
>pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: ack_receiver
>terminated
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Terminating
>ack_recv thread
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0: new current
>UUID: D07A3D4B2F99832D weak: FFFD
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Connection
>closed
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn(
>NetworkFailure -> Unconnected )
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Restarting
>receiver thread
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn(
>Unconnected -> Connecting )
>Jun 26 22:34:33 001db03a pengine[1474]:  notice:  * Start
>p_drbd0:1( 001db03b )
>Jun 26 22:34:33 001db03a crmd[1475]:  notice: Initiating notify
>operation p_drbd0_pre_notify_start_0 locally on 001db03a
>Jun 26 22:34:33 001db03a crmd[1475]:  notice: Result of notify
>operation for p_drbd0 on 001db03a: 0 (ok)
>Jun 26 22:34:33 001db03a crmd[1475]:  notice: Initiating start
>operation p_drbd0_start_0 on 001db03b
>Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Handshake to
>peer 0 successful: Agreed network protocol version 113
>Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Feature
>flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
>WRITE_ZEROES.
>Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Starting
>ack_recv thread (from drbd_r_ha02

[ClusterLabs] DRBD sync stalled at 100% ?

2020-06-27 Thread Eric Robinson
I'm not seeing anything on Google about this. Two DRBD nodes lost communication 
with each other, and then reconnected and started sync. But then it got to 100% 
and is just stalled there.

The nodes are 001db03a, 001db03b.

On 001db03a:

[root@001db03a ~]# drbdadm status
ha01_mysql role:Primary
  disk:UpToDate
  001db03b role:Secondary
replication:SyncSource peer-disk:Inconsistent done:100.00

ha02_mysql role:Secondary
  disk:UpToDate
  001db03b role:Primary
peer-disk:UpToDate

On 001drbd03b:

[root@001db03b ~]# drbdadm status
ha01_mysql role:Secondary
  disk:Inconsistent
  001db03a role:Primary
replication:SyncTarget peer-disk:UpToDate done:100.00

ha02_mysql role:Primary
  disk:UpToDate
  001db03a role:Secondary
peer-disk:UpToDate


On 001db03a, here are the DRBD messages from the onset of the problem until now.

Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: PingAck did not 
arrive in time.
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn( Connected -> 
NetworkFailure ) peer( Primary -> Unknown )
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk( UpToDate -> 
Consistent )
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b: pdsk( 
UpToDate -> DUnknown ) repl( Established -> Off )
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: ack_receiver 
terminated
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Terminating ack_recv 
thread
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Preparing cluster-wide state 
change 2946943372 (1->-1 0/0)
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Committing cluster-wide state 
change 2946943372 (6ms)
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk( Consistent -> 
UpToDate )
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Connection closed
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn( NetworkFailure 
-> Unconnected )
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Restarting receiver 
thread
Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn( Unconnected -> 
Connecting )
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: PingAck did not 
arrive in time.
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn( Connected -> 
NetworkFailure ) peer( Secondary -> Unknown )
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0 001db03b: pdsk( 
UpToDate -> DUnknown ) repl( Established -> Off )
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: ack_receiver 
terminated
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Terminating ack_recv 
thread
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0: new current UUID: 
D07A3D4B2F99832D weak: FFFD
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Connection closed
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn( NetworkFailure 
-> Unconnected )
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Restarting receiver 
thread
Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn( Unconnected -> 
Connecting )
Jun 26 22:34:33 001db03a pengine[1474]:  notice:  * Start  p_drbd0:1
( 001db03b )
Jun 26 22:34:33 001db03a crmd[1475]:  notice: Initiating notify operation 
p_drbd0_pre_notify_start_0 locally on 001db03a
Jun 26 22:34:33 001db03a crmd[1475]:  notice: Result of notify operation for 
p_drbd0 on 001db03a: 0 (ok)
Jun 26 22:34:33 001db03a crmd[1475]:  notice: Initiating start operation 
p_drbd0_start_0 on 001db03b
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Handshake to peer 0 
successful: Agreed network protocol version 113
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Feature flags 
enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Starting ack_recv 
thread (from drbd_r_ha02_mys [2116])
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Preparing remote 
state change 3920461435
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Committing remote 
state change 3920461435 (primary_nodes=1)
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: conn( Connecting -> 
Connected ) peer( Unknown -> Primary )
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql/0 drbd1: disk( UpToDate -> 
Outdated )
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b: 
drbd_sync_handshake:
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b: self 
492F8D33A72A8E08::659DC04F5C85B6E4:8254EEA2EC50AD7C bits:0 
flags:120
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b: peer 
5A6B1EBE80500C39:492F8D33A72A8E09:659DC04F5C85B6E4:51A00A23ED88187A bits:1 
flags:120
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b: 
uuid_compare()=-2 by rule 50
Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b: pdsk( 
DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
Jun 26 22:34:34 

Re: [ClusterLabs] qdevice up and running -- but questions

2020-04-14 Thread Eric Robinson
> -Original Message-
> From: Jan Friesse 
> Sent: Tuesday, April 14, 2020 2:43 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Sherrard Burton ;
> Eric Robinson 
> Subject: Re: [ClusterLabs] qdevice up and running -- but questions
>
> >
> >
> > On 4/11/20 6:52 PM, Eric Robinson wrote:
> >>  1. What command can I execute on the qdevice node which tells me
> >> which
> >> client nodes are connected and alive?
> >>
> >
> > i use
> > corosync-qnetd-tool -v -l
> >
> >
> >>  2. In the output of the pcs qdevice status command, what is the
> >> meaning of...
> >>
> >>  Vote:   ACK (ACK)
>
> This is documented in the corosync-qnetd-tool man page.
>
> >>
> >>  3. In the output of the  pcs quorum status Command, what is the
> >> meaning of...
> >>
> >> Membership information
> >>
> >> --
> >>
> >>  Nodeid  VotesQdevice Name
> >>
> >>   1  1A,V,NMW 001db03a
> >>
> >>   2  1A,V,NMW 001db03b (local)
>
> Pcs just displays output of corosync-quorumtool. Nodeid and votes columns
> are probably obvious, name is resolved reverse dns name of the node ip
> addr (-i suppress this behavior) and qdevice column really needs to be
> documented, so:
> A/NA - Qdevice is alive / not alive. This is rather internal flag, but it can 
> be
> seen as a heartbeat between qdevice and corosync. Should be always alive.
> V/NV - Qdevice has cast voted / not cast voted. Take is a an ACK/NACK.
> MW/NMW - Master wins / not master wins. This is really internal flag. By
> default corosync-qdevice never asks for master wins so it is going to be
> nmw. For more information what it means see
> votequorum_qdevice_master_wins (3)
>
> Honza

[Eric] This is super helpful information, thank you.
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] qdevice up and running -- but questions

2020-04-14 Thread Eric Robinson
> -Original Message-
> From: Sherrard Burton 
> Sent: Sunday, April 12, 2020 9:09 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Eric Robinson 
> Subject: Re: [ClusterLabs] qdevice up and running -- but questions
>
>
>
> On 4/11/20 6:52 PM, Eric Robinson wrote:
> >  1. What command can I execute on the qdevice node which tells me which
> > client nodes are connected and alive?
> >
>
> i use
> corosync-qnetd-tool -v -l
>
>

[Eric] Thanks much!

> >  2. In the output of the pcs qdevice status command, what is the meaning
> of...
> >
> >  Vote:   ACK (ACK)
> >
> >  3. In the output of the  pcs quorum status Command, what is the meaning
> of...
> >
> > Membership information
> >
> > --
> >
> >  Nodeid  VotesQdevice Name
> >
> >   1  1A,V,NMW 001db03a
> >
> >   2  1A,V,NMW 001db03b (local)
> >
> > --Eric
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Verifying DRBD Run-Time Configuration

2020-04-14 Thread Eric Robinson
> -Original Message-
> From: Strahil Nikolov 
> Sent: Sunday, April 12, 2020 6:00 AM
> To: Eric Robinson ; Cluster Labs - All topics
> related to open-source clustering welcomed 
> Subject: RE: [ClusterLabs] Verifying DRBD Run-Time Configuration
>
> On April 12, 2020 10:58:39 AM GMT+03:00, Eric Robinson
>  wrote:
> >> -Original Message-
> >> From: Strahil Nikolov 
> >> Sent: Sunday, April 12, 2020 2:54 AM
> >> To: Cluster Labs - All topics related to open-source clustering
> >welcomed
> >> ; Eric Robinson 
> >> Subject: Re: [ClusterLabs] Verifying DRBD Run-Time Configuration
> >>
> >> On April 11, 2020 6:17:14 PM GMT+03:00, Eric Robinson
> >>  wrote:
> >> >If I want to know the current DRBD runtime settings such as timeout,
> >> >ping-int, or connect-int, how do I check that? I'm assuming they may
> >> >not be the same as what shows in the config file.
> >> >
> >> >--Eric
> >> >
> >> >
> >> >
> >> >
> >> >Disclaimer : This email and any files transmitted with it are
> >> >confidential and intended solely for intended recipients. If you are
> >> >not the named addressee you should not disseminate, distribute, copy
> >or
> >> >alter this email. Any views or opinions presented in this email are
> >> >solely those of the author and might not represent those of
> >Physician
> >> >Select Management. Warning: Although Physician Select Management
> has
> >> >taken reasonable precautions to ensure no viruses are present in
> >this
> >> >email, the company cannot accept responsibility for any loss or
> >damage
> >> >arising from the use of this email or attachments.
> >>
> >> You can get everything the cluster knows via 'cibadmin -Q > /tmp/cluster_conf.xml'
> >>
> >> Then you can examine it.
> >>
> >
> >As usual, I guess there's more than one way to get things. Someone
> >suggested 'drbdsetup show <resource> --show-defaults' and that works great.
> >
> >
> >Disclaimer : This email and any files transmitted with it are
> >confidential and intended solely for intended recipients. If you are
> >not the named addressee you should not disseminate, distribute, copy or
> >alter this email. Any views or opinions presented in this email are
> >solely those of the author and might not represent those of Physician
> >Select Management. Warning: Although Physician Select Management has
> >taken reasonable precautions to ensure no viruses are present in this
> >email, the company cannot accept responsibility for any loss or damage
> >arising from the use of this email or attachments.
>
> You left the impression that you need the cluster data (this is the clusterlabs
> mailing list after all) :)

True, but the DRBD list is somewhat quiet these days. 

>
> Best Regards,
> Strahil Nikolov
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Verifying DRBD Run-Time Configuration

2020-04-12 Thread Eric Robinson
> -Original Message-
> From: Strahil Nikolov 
> Sent: Sunday, April 12, 2020 2:54 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Eric Robinson 
> Subject: Re: [ClusterLabs] Verifying DRBD Run-Time Configuration
>
> On April 11, 2020 6:17:14 PM GMT+03:00, Eric Robinson
>  wrote:
> >If I want to know the current DRBD runtime settings such as timeout,
> >ping-int, or connect-int, how do I check that? I'm assuming they may
> >not be the same as what shows in the config file.
> >
> >--Eric
> >
> >
> >
> >
> >Disclaimer : This email and any files transmitted with it are
> >confidential and intended solely for intended recipients. If you are
> >not the named addressee you should not disseminate, distribute, copy or
> >alter this email. Any views or opinions presented in this email are
> >solely those of the author and might not represent those of Physician
> >Select Management. Warning: Although Physician Select Management has
> >taken reasonable precautions to ensure no viruses are present in this
> >email, the company cannot accept responsibility for any loss or damage
> >arising from the use of this email or attachments.
>
> You can get everything the cluster knows via 'cibadmin -Q > /tmp/cluster_conf.xml'
>
> Then you can examine it.
>

As usual, I guess there's more than one way to get things. Someone suggested 
'drbdsetup show <resource> --show-defaults' and that works great.
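
For example, to pull out just the values in question (assuming a resource named 
ha01_mysql):

# drbdsetup show ha01_mysql --show-defaults | grep -E 'timeout|ping-int|connect-int'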


Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-04-12 Thread Eric Robinson
> -Original Message-
> From: Strahil Nikolov 
> Sent: Sunday, April 12, 2020 1:32 AM
> To: Eric Robinson ; Cluster Labs - All topics
> related to open-source clustering welcomed ;
> Andrei Borzenkov 
> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On April 11, 2020 5:01:37 PM GMT+03:00, Eric Robinson
>  wrote:
> >
> >Hi Strahil --
> >
> >I hope you won't mind if I revive this old question. In your comments
> >below, you suggested using a 1s  token with a 1.2s consensus. I
> >currently have 2-node clusters (will soon install a qdevice). I was
> >reading in the corosync.conf man page where it says...
> >
> >"For  two  node  clusters,  a  consensus larger than the join timeout
> >but less than token is safe.  For three node or larger clusters,
> >consensus should be larger than token."
> >
> >Do you still think the consensus should be 1.2 * token in a 2-node
> >cluster? Why is a smaller consensus considered safe for 2-node
> >clusters? Should I use a larger consensus anyway?
> >
> >--Eric
> >
> >
> >> -Original Message-
> >> From: Strahil Nikolov 
> >> Sent: Thursday, February 6, 2020 1:07 PM
> >> To: Eric Robinson ; Cluster Labs - All
> >topics
> >> related to open-source clustering welcomed ;
> >> Andrei Borzenkov 
> >> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
> >>
> >> On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson
> >>  wrote:
> >> >Hi Nikolov --
> >> >
> >> >> Defaults are 1s  token,  1.2s  consensus which is too small.
> >> >> In Suse, token is 10s, while consensus  is 1.2 * token -> 12s.
> >> >> With these settings, cluster  will not react   for 22s.
> >> >>
> >> >> I think it's a good start for your cluster .
> >> >> Don't forget to put  the cluster  in maintenance (pcs property set
> >> >> maintenance-mode=true) before restarting the stack ,  or  even
> >better
> >> >- get
> >> >> some downtime.
> >> >>
> >> >> You can use the following article to run a simulation before
> >removing
> >> >the
> >> >> maintenance:
> >> >> https://www.suse.com/support/kb/doc/?id=7022764
> >> >>
> >> >
> >> >
> >> >Thanks for the suggestions. Any thoughts on timeouts for DRBD?
> >> >
> >> >--Eric
> >> >
> >> >Disclaimer : This email and any files transmitted with it are
> >> >confidential and intended solely for intended recipients. If you are
> >> >not the named addressee you should not disseminate, distribute, copy
> >or
> >> >alter this email. Any views or opinions presented in this email are
> >> >solely those of the author and might not represent those of
> >Physician
> >> >Select Management. Warning: Although Physician Select Management
> has
> >> >taken reasonable precautions to ensure no viruses are present in
> >this
> >> >email, the company cannot accept responsibility for any loss or
> >damage
> >> >arising from the use of this email or attachments.
> >>
> >> Hi Eric,
> >>
> >> The timeouts can be treated as 'how much time to wait before  taking
> >any
> >> action'. The workload is not very important (HANA  is something
> >different).
> >>
> >> You can try with 10s (token) , 12s (consensus) and if needed  you can
> >adjust.
> >>
> >> Warning: Use a 3 node cluster or at least 2 drbd nodes + qdisk. The 2
> >node
> >> cluster is vulnerable to split brain, especially when one of the
> >nodes  is
> >> syncing  (for example after a patching) and the source is
> >> fenced/lost/disconnected. It's very hard to extract data from a
> >semi-synced
> >> drbd.
> >>
> >> Also, if you need guidance for the SELINUX, I can point you to my
> >guide in the
> >> centos forum.
> >>
> >> Best Regards,
> >> Strahil Nikolov
> >Disclaimer : This email and any files transmitted with it are
> >confidential and intended solely for intended recipients. If you are
> >not the named addressee you should not disseminate, distribute, copy or
> >alter this email. Any views or opinions presented in this email are
> >solely those of the author and might not represent those of Physician
> >Select Management. Warning: Although Physician

[ClusterLabs] qdevice up and running -- but questions

2020-04-11 Thread Eric Robinson
  1.  What command can I execute on the qdevice node which tells me which 
client nodes are connected and alive?


  2.  In the output of the pcs qdevice status command, what is the meaning of...


Vote:   ACK (ACK)


  3.  In the output of the pcs quorum status command, what is the meaning of...

Membership information
--
Nodeid  VotesQdevice Name
 1  1A,V,NMW 001db03a
 2  1A,V,NMW 001db03b (local)


--Eric

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Verifying DRBD Run-Time Configuration

2020-04-11 Thread Eric Robinson
If I want to know the current DRBD runtime settings such as timeout, ping-int, 
or connect-int, how do I check that? I'm assuming they may not be the same as 
what shows in the config file.

--Eric




Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-04-11 Thread Eric Robinson


Hi Strahil --

I hope you won't mind if I revive this old question. In your comments below, 
you suggested using a 1s  token with a 1.2s consensus. I currently have 2-node 
clusters (will soon install a qdevice). I was reading in the corosync.conf man 
page where it says...

"For  two  node  clusters,  a  consensus larger than the join timeout but less 
than token is safe.  For three node or larger clusters, consensus should be 
larger than token."

Do you still think the consensus should be 1.2 * token in a 2-node cluster? Why 
is a smaller consensus considered safe for 2-node clusters? Should I use a 
larger consensus anyway?
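
For context, both values live in the totem section of corosync.conf and are 
expressed in milliseconds, so the 10s/12s example discussed above would look 
roughly like this (a sketch; the rest of the totem section stays as it is, and 
the file has to be synced to both nodes and corosync restarted for the change 
to take effect):

totem {
    token: 10000
    consensus: 12000
}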

--Eric


> -Original Message-
> From: Strahil Nikolov 
> Sent: Thursday, February 6, 2020 1:07 PM
> To: Eric Robinson ; Cluster Labs - All topics
> related to open-source clustering welcomed ;
> Andrei Borzenkov 
> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson
>  wrote:
> >Hi Nikolov --
> >
> >> Defaults are 1s  token,  1.2s  consensus which is too small.
> >> In Suse, token is 10s, while consensus  is 1.2 * token -> 12s.
> >> With these settings, cluster  will not react   for 22s.
> >>
> >> I think it's a good start for your cluster .
> >> Don't forget to put  the cluster  in maintenance (pcs property set
> >> maintenance-mode=true) before restarting the stack ,  or  even better
> >- get
> >> some downtime.
> >>
> >> You can use the following article to run a simulation before removing
> >the
> >> maintenance:
> >> https://www.suse.com/support/kb/doc/?id=7022764
> >>
> >
> >
> >Thanks for the suggestions. Any thoughts on timeouts for DRBD?
> >
> >--Eric
> >
> >Disclaimer : This email and any files transmitted with it are
> >confidential and intended solely for intended recipients. If you are
> >not the named addressee you should not disseminate, distribute, copy or
> >alter this email. Any views or opinions presented in this email are
> >solely those of the author and might not represent those of Physician
> >Select Management. Warning: Although Physician Select Management has
> >taken reasonable precautions to ensure no viruses are present in this
> >email, the company cannot accept responsibility for any loss or damage
> >arising from the use of this email or attachments.
>
> Hi Eric,
>
> The timeouts can be treated as 'how much time to wait before  taking any
> action'. The workload is not very important (HANA  is something different).
>
> You can try with 10s (token) , 12s (consensus) and if needed  you can adjust.
>
> Warning: Use a 3 node cluster or at least 2 drbd nodes + qdisk. The 2 node
> cluster is vulnerable to split brain, especially when one of the nodes  is
> syncing  (for example after a patching) and the source is
> fenced/lost/disconnected. It's very hard to extract data from a semi-synced
> drbd.
>
> Also, if you need guidance for the SELINUX, I can point you to my guide in the
> centos forum.
>
> Best Regards,
> Strahil Nikolov
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] DRBD / LVM Global Filter Question

2020-04-03 Thread Eric Robinson
Greetings,

I've read the relevant sections of the DRBD User Guide, visited various web 
sites, and spoken to people in the lists, but I'm still looking for a simple 
rule of thumb.

Basically, here is what I've come to understand:


  1.  If DRBD only lives *above* an LVM volume, then the default LVM filtering 
settings are fine; no changes are required to lvm.conf. The LVs will remain 
active on all DRBD nodes, and an LVM resource agent is not required.



  2.  If DRBD lives *below* or *between* LVM volumes, then:



 *   set global_filter to reject the DRBD backing devices
 *   set write_cache_state = 0
 *   set use_lvmetad = 0
 *   set volume_list to include block devices required to boot
 *   remove /etc/lvm/cache/.cache.
 *   run lvscan
 *   regenerate initrd
 *   reboot
 *   Use a cluster resource agent to activate/de-activate LVs as required 
by cluster operation
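
A minimal sketch of the lvm.conf side of the checklist above, assuming /dev/sdb1 is 
the DRBD backing device, /dev/sda holds the boot volumes, and the root VG is called 
vg_root (all placeholders, not a tested config):

# /etc/lvm/lvm.conf (fragment)
devices {
    # hide the DRBD backing device; let LVM see the DRBD device and the boot disk
    global_filter = [ "r|^/dev/sdb1$|", "a|^/dev/drbd.*|", "a|^/dev/sda.*|", "r|.*|" ]
    write_cache_state = 0
}
global {
    use_lvmetad = 0
}
activation {
    # only auto-activate what is needed to boot; the cluster activates the rest
    volume_list = [ "vg_root" ]
}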

Please feel free to correct any mistakes.

--Eric



Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-06 Thread Eric Robinson
Hi Nikolov --

> Defaults are 1s  token,  1.2s  consensus which is too small.
> In Suse, token is 10s, while consensus  is 1.2 * token -> 12s.
> With these settings, cluster  will not react   for 22s.
>
> I think it's a good start for your cluster .
> Don't forget to put  the cluster  in maintenance (pcs property set
> maintenance-mode=true) before restarting the stack ,  or  even better - get
> some downtime.
>
> You can use the following article to run a simulation before removing the
> maintenance:
> https://www.suse.com/support/kb/doc/?id=7022764
>


Thanks for the suggestions. Any thoughts on timeouts for DRBD?

--Eric
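
For reference, the DRBD knobs usually discussed alongside corosync token tuning live 
in the net section of the resource; a hedged fragment with illustrative values 
(r0 is a placeholder):

resource r0 {
    net {
        timeout     60;   # tenths of a second; 60 = 6s before the peer is declared dead
        ping-int    10;   # seconds between DRBD keep-alive pings
        connect-int 10;   # seconds between reconnection attempts
    }
    # device/disk/address stanzas unchanged and omitted here
}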

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Why Do Nodes Leave the Cluster?

2020-02-06 Thread Eric Robinson
> >
> > I've done that with all my other clusters, but these two servers are
> > in Azure, so the network is out of our control.
>
> Is a normal cluster supported to use corosync over Internet? I'm not sure
> (because of the delays and possible packet losses).
>
>

As with most things, the main concern is latency and loss. The latency between 
these two nodes is < 1ms, and loss is always 0%.

--Eric


Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
Hi Strahil –

I think you may be right about the token timeouts being too short. I’ve also 
noticed that periods of high load can cause drbd to disconnect. What would you 
recommend for changes to the timeouts?

I’m running Red Hat’s Corosync Cluster Engine, version 2.4.3. The config is 
relatively simple.

Corosync config looks like this…

totem {
    version: 2
    cluster_name: 001db01ab
    secauth: off
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 001db01a
        nodeid: 1
    }

    node {
        ring0_addr: 001db01b
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
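
If the timeouts do get raised, my understanding is that only the totem section changes, 
something along these lines (10000/12000 ms match the 10s/12s values suggested elsewhere 
in this thread; a sketch, not a tested config):

totem {
    version: 2
    cluster_name: 001db01ab
    secauth: off
    transport: udpu
    token: 10000        # ms
    consensus: 12000    # ms, i.e. 1.2 * token
}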


From: Users  On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 6:39 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Andrei Borzenkov 
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Andrei,

don't trust Azure so much :D . I've seen stuff that was way more unbelievable.
Can you check whether other systems in the same subnet reported any issues? That said, pcs 
most probably won't report any short-term issues. I have noticed that RHEL7 
defaults for token and consensus are quite small and any short-term disruption 
could cause an issue.
Actually when I tested live migration on oVirt - the other hosts fenced the 
node that was migrated.
What is your corosync config and OS version ?

Best Regards,
Strahil Nikolov

On Thursday, February 6, 2020 at 01:44:55 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:



Hi Strahil –



I can’t prove there was no network loss, but:



  1.  There were no dmesg indications of ethernet link loss.
  2.  Other than corosync, there are no other log messages about connectivity 
issues.
  3.  Wouldn’t pcsd say something about connectivity loss?
  4.  Both servers are in Azure.
  5.  There are many other servers in the same Azure subscription, including 
other corosync clusters, none of which had issues.



So I guess it’s possible, but it seems unlikely.



--Eric



From: Users 
mailto:users-boun...@clusterlabs.org>> On Behalf 
Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>; Andrei Borzenkov 
mailto:arvidj...@gmail.com>>
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?



Hi Erik,



what has led you to think that there was no network loss ?



Best Regards,

Strahil Nikolov



On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:





> -Original Message-
> From: Users 
> mailto:users-boun...@clusterlabs.org>> On 
> Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov mailto:arvidj...@gmail.com>>; 
> users@clusterlabs.org<mailto:users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> mailto:arvidj...@gmail.com>> wrote:
> >05.02.2020 20:55, Eric Robinson пишет:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expe

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
Hi Strahil –

I can’t prove there was no network loss, but:


  1.  There were no dmesg indications of ethernet link loss.
  2.  Other than corosync, there are no other log messages about connectivity 
issues.
  3.  Wouldn’t pcsd say something about connectivity loss?
  4.  Both servers are in Azure.
  5.  There are many other servers in the same Azure subscription, including 
other corosync clusters, none of which had issues.

So I guess it’s possible, but it seems unlikely.

--Eric

From: Users  On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Andrei Borzenkov 
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Erik,

what has led you to think that there was no network loss ?

Best Regards,
Strahil Nikolov

On Wednesday, February 5, 2020 at 22:59:56 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:



> -Original Message-
> From: Users 
> mailto:users-boun...@clusterlabs.org>> On 
> Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov mailto:arvidj...@gmail.com>>; 
> users@clusterlabs.org<mailto:users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> mailto:arvidj...@gmail.com>> wrote:
> >05.02.2020 20:55, Eric Robinson пишет:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb
> >> 5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer
> >with id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> >-> S_POLICY_ENGINE
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> >Quorum: Ignore
> >>
> >> From 001db01b:
> >>
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> >(10.51.14.34:960) was formed. Members left: 1
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC
> >(001db01a) is dead
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive
> >the leave message. fai

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson

> -Original Message-
> From: Users  On Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov ; users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
>  wrote:
> >05.02.2020 20:55, Eric Robinson пишет:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb
> >> 5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer
> >with id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> >-> S_POLICY_ENGINE
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> >Quorum: Ignore
> >>
> >> From 001db01b:
> >>
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> >(10.51.14.34:960) was formed. Members left: 1
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC
> >(001db01a) is dead
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive
> >the leave message. failed: 1
> >> Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2 Feb
> >> 5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer
> >with id=1 and/or uname=001db01a from the membership cache
> >> Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> >S_NOT_DC -> S_ELECTION
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is
> >now lost
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is
> >now lost
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer
> >001db01a
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with
> >id=1 and/or uname=00

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson




> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Wednesday, February 5, 2020 12:14 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> 05.02.2020 20:55, Eric Robinson пишет:
> > The two servers 001db01a and 001db01b were up and responsive. Neither
> had been rebooted and neither were under heavy load. There's no indication
> in the logs of loss of network connectivity. Any ideas on why both nodes
> seem to think the other one is at fault?
>
> The very fact that nodes lost connection to each other *is* indication of
> network problems. Your logs start too late, after any problem already
> happened.
>

All the log messages before those are just normal repetitive stuff that always 
gets logged, even during normal production. The snippet I provided shows the 
first indication of anything unusual. Also, there is no other indication of 
network connectivity loss, and both servers are in Azure.

> >
> > (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an
> > option at this time.)
> >
> > Log from 001db01a:
> >
> > Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> forming new configuration.
> > Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> > (10.51.14.33:960) was formed. Members left: 2 Feb  5 08:01:03 001db01a
> > corosync[1306]: [TOTEM ] Failed to receive the leave message. failed:
> > 2 Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state
> > is now lost Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing
> > all 001db01b attributes for peer loss Feb  5 08:01:03 001db01a
> > cib[1522]:  notice: Node 001db01b state is now lost Feb  5 08:01:03
> > 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or
> > uname=001db01b from the membership cache Feb  5 08:01:03 001db01a
> > attrd[1525]:  notice: Purged 1 peer with id=2 and/or uname=001db01b
> > from the membership cache Feb  5 08:01:03 001db01a crmd[1527]:
> > warning: No reason to expect node 2 to be down Feb  5 08:01:03 001db01a
> stonith-ng[1523]:  notice: Node 001db01b state is now lost Feb  5 08:01:03
> 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not matched
> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb  5
> 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> > Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with
> > id=2 and/or uname=001db01b from the membership cache Feb  5 08:01:03
> > 001db01a pacemakerd[1491]:  notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> > -> S_POLICY_ENGINE Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node
> > 001db01b state is now lost Feb  5 08:01:03 001db01a crmd[1527]:
> > warning: No reason to expect node 2 to be down Feb  5 08:01:03
> > 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not matched
> > Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> > Quorum: Ignore
> >
> > From 001db01b:
> >
> > Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> > (10.51.14.34:960) was formed. Members left: 1 Feb  5 08:01:03 001db01b
> > crmd[1693]:  notice: Our peer on the DC (001db01a) is dead Feb  5
> > 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is
> > now lost Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to
> > receive the leave message. failed: 1 Feb  5 08:01:03 001db01b
> corosync[1455]: [QUORUM] Members[1]: 2 Feb  5 08:01:03 001db01b
> corosync[1455]: [MAIN  ] Completed service synchronization, ready to
> provide service.
> > Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with
> > id=1 and/or uname=001db01a from the membership cache Feb  5 08:01:03
> > 001db01b pacemakerd[1678]:  notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> > S_NOT_DC -> S_ELECTION Feb  5 08:01:03 001db01b crmd[1693]:  notice:
> > Node 001db01a state is now lost Feb  5 08:01:03 001db01b attrd[1691]:
> > notice: Node 001db01a state is now lost Feb  5 08:01:03 001db01b
> > attrd[1691]:  notice: Removing all 001db01a attributes for peer loss
> > Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer
> > 001db01a Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer
> > with id=1 and/or uname=001db01a from the membership cache Feb  5
> > 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION ->
> > S_INTEGRATION Feb  5 08:01:03 001db01b cib[1688

[ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
The two servers 001db01a and 001db01b were up and responsive. Neither had been 
rebooted and neither were under heavy load. There's no indication in the logs 
of loss of network connectivity. Any ideas on why both nodes seem to think the 
other one is at fault?

(Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option 
at this time.)

Log from 001db01a:

Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming 
new configuration.
Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership 
(10.51.14.33:960) was formed. Members left: 2
Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave 
message. failed: 2
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b attributes 
for peer loss
Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or 
uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with id=2 and/or 
uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be 
down
Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b state is now 
lost
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not 
matched
Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service 
synchronization, ready to provide service.
Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with id=2 
and/or uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b state is now 
lost
Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be 
down
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not 
matched
Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM Quorum: Ignore

From 001db01b:

Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership 
(10.51.14.34:960) was formed. Members left: 1
Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC (001db01a) is 
dead
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is now 
lost
Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave 
message. failed: 1
Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
Feb  5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service 
synchronization, ready to provide service.
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with id=1 
and/or uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a state is now 
lost
Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_NOT_DC -> 
S_ELECTION
Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a attributes 
for peer loss
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer 001db01a
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with id=1 and/or 
uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION -> 
S_INTEGRATION
Feb  5 08:01:03 001db01b cib[1688]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b cib[1688]:  notice: Purged 1 peer with id=1 and/or 
uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: [cib_diff_notify] Patch 
aborted: Application of an update diff failed (-206)
Feb  5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in 
state S_INTEGRATION from do_election_check
Feb  5 08:01:03 001db01b pengine[1692]:  notice: On loss of CCM Quorum: Ignore


-Eric



Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Stupid DRBD/LVM Global Filter Question

2019-10-30 Thread Eric Robinson
Roger --

Thank you, sir. That does help.

-Original Message-
From: Roger Zhou 
Sent: Wednesday, October 30, 2019 2:56 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] Stupid DRBD/LVM Global Filter Question


On 10/30/19 6:17 AM, Eric Robinson wrote:
> If I have an LV as a backing device for a DRBD disk, can someone
> explain why I need an LVM filter? It seems to me that we would want
> the LV to be always active under both the primary and secondary DRBD
> devices, and there should be no need or desire to have the LV
> activated or deactivated by Pacemaker. What am I missing?

Your understanding is correct. No need to use LVM resource agent from Pacemaker 
in your case.

--Roger
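
For anyone following along, a minimal sketch of the layout being discussed (DRBD 
using an LV as its backing disk); VG/LV names, hostnames and addresses are placeholders:

resource r0 {
    device    /dev/drbd0;
    disk      /dev/vg_data/lv_mysql;   # the LV simply stays active on both nodes
    meta-disk internal;
    on node-a {
        address 10.0.0.1:7788;
    }
    on node-b {
        address 10.0.0.2:7788;
    }
}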

>
> --Eric
>
> Disclaimer : This email and any files transmitted with it are
> confidential and intended solely for intended recipients. If you are
> not the named addressee you should not disseminate, distribute, copy
> or alter this email. Any views or opinions presented in this email are
> solely those of the author and might not represent those of Physician
> Select Management. Warning: Although Physician Select Management has
> taken reasonable precautions to ensure no viruses are present in this
> email, the company cannot accept responsibility for any loss or damage
> arising from the use of this email or attachments.
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Stupid DRBD/LVM Global Filter Question

2019-10-29 Thread Eric Robinson
If I have an LV as a backing device for a DRBD disk, can someone explain why I 
need an LVM filter? It seems to me that we would want the LV to be always 
active under both the primary and secondary DRBD devices, and there should be 
no need or desire to have the LV activated or deactivated by Pacemaker. What am 
I missing?

--Eric




Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Simulate Failure Behavior

2019-02-22 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Ken Gaillot
> Sent: Friday, February 22, 2019 5:06 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Simulate Failure Behavior
> 
> On Sat, 2019-02-23 at 00:28 +, Eric Robinson wrote:
> > I want to mess around with different on-fail options and see how the
> > cluster responds. I’m looking through the documentation, but I don’t
> > see a way to simulate resource failure and observe behavior without
> > actually failing over the mode. Isn’t there a way to have the cluster
> > MODEL failure and simply report what it WOULD do?
> >
> > --Eric
> 
> Yes, appropriately enough it is called crm_simulate :)

Thanks. I knew about crm_simulate, but I thought that was really old stuff and 
might not apply in the pcs world. 
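
A hedged example of that kind of dry run against the live CIB (resource name, interval 
and node are placeholders, rc 7 means "not running", and flag spellings can vary a 
little between pacemaker versions; see crm_simulate --help):

# show what the cluster WOULD do if p_mysql_002's monitor reported a failure,
# without touching the live cluster
crm_simulate --live-check --simulate --show-scores \
    --op-inject=p_mysql_002_monitor_10000@001db01a=7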

> 
> The documentation is not exactly great, but you see:
> 
> https://wiki.clusterlabs.org/wiki/Using_crm_simulate
> 
> along with the man page and:
> 
> http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-
> single/Pacemaker_Administration/index.html#s-config-testing-changes
> 
> --
> Ken Gaillot 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Simulate Failure Behavior

2019-02-22 Thread Eric Robinson
I want to mess around with different on-fail options and see how the cluster 
responds. I'm looking through the documentation, but I don't see a way to 
simulate resource failure and observe behavior without actually failing over 
the node. Isn't there a way to have the cluster MODEL failure and simply report 
what it WOULD do?

--Eric



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Eric Robinson




> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Wednesday, February 20, 2019 8:51 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
> 
> 20.02.2019 21:51, Eric Robinson пишет:
> >
> > The following should show OK in a fixed font like Consolas, but the
> following setup is supposed to be possible, and is even referenced in the
> ClusterLabs documentation.
> >
> >
> >
> >
> >
> > +--+
> > |   mysql001   +--+
> > +--+  |
> > +--+  |
> > |   mysql002   +--+
> > +--+  |
> > +--+  |   +-+   ++   +--+
> > |   mysql003   +->+ floating ip +-->+ filesystem +-->+ blockdev |
> > +--+  |   +-+   ++   +--+
> > +--+  |
> > |   mysql004   +--+
> > +--+  |
> > +--+  |
> > |   mysql005   +--+
> > +--+
> >
> >
> >
> > In the layout above, the MySQL instances are dependent on the same
> underlying service stack, but they are not dependent on each other.
> Therefore, as I understand it, the failure of one MySQL instance should not
> cause the failure of other MySQL instances if on-fail=ignore on-fail=stop. At
> least, that’s the way it seems to me, but based on the thread, I guess it does
> not behave that way.
> >
> 
> This works this way for monitor operation if you set on-fail=block.
> Failed resource is left "as is". The only case when it does not work seems to
> be stop operation; even with explicit on-fail=block it still attempts to 
> initiate
> follow up actions. I still consider this a bug.
> 
> If this is not a bug, this needs clear explanation in documentation.
> 
> But please understand that assuming on-fail=block works you effectively
> reduce your cluster to controlled start of resources during boot. As we have

Or failover, correct?

> seen, stopping of resource IP is blocked, meaning pacemaker also cannot
> perform resource level recovery at all. And for mysql resources you explicitly
> ignore any result of monitoring or failure to stop it.
> And not having stonith also prevents pacemaker from handling node failure.
> What leaves is at most restart of resources on another node during graceful
> shutdown.
> 
> It begs a question - what do you need such "cluster" for at all?

Mainly to manage the other relevant resources: drbd, filesystem, and floating 
IP. I'm content to forego resource level recovery for MySQL services and 
monitor their health from outside the cluster and remediate them manually if 
necessary. I don't see an option if I want to avoid the sort of deadlock 
situation we talked about earlier. 

> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Eric Robinson








> -Original Message-

> From: Users  On Behalf Of Ulrich Windl

> Sent: Tuesday, February 19, 2019 11:35 PM

> To: users@clusterlabs.org

> Subject: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When

> Just One Fails?

>

> >>> Eric Robinson mailto:eric.robin...@psmnv.com>> 
> >>> schrieb am 19.02.2019 um

> >>> 21:06 in

> Nachricht


>

> >>  -Original Message-

> >> From: Users 
> >> mailto:users-boun...@clusterlabs.org>> On 
> >> Behalf Of Ken Gaillot

> >> Sent: Tuesday, February 19, 2019 10:31 AM

> >> To: Cluster Labs - All topics related to open-source clustering

> >> welcomed mailto:users@clusterlabs.org>>

> >> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just

> >> One Fails?

> >>

> >> On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote:

> >> > > -Original Message-

> >> > > From: Users 
> >> > > mailto:users-boun...@clusterlabs.org>> 
> >> > > On Behalf Of Andrei

> >> > > Borzenkov

> >> > > Sent: Sunday, February 17, 2019 11:56 AM

> >> > > To: users@clusterlabs.org<mailto:users@clusterlabs.org>

> >> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When

> >> > > Just One Fails?

> >> > >

> >> > > 17.02.2019 0:44, Eric Robinson пишет:

> >> > > > Thanks for the feedback, Andrei.

> >> > > >

> >> > > > I only want cluster failover to occur if the filesystem or drbd

> >> > > > resources fail,

> >> > >

> >> > > or if the cluster messaging layer detects a complete node failure.

> >> > > Is there a

> >> > > way to tell PaceMaker not to trigger a cluster failover if any of

> >> > > the p_mysql resources fail?

> >> > > >

> >> > >

> >> > > Let's look at this differently. If all these applications depend

> >> > > on each other, you should not be able to stop individual resource

> >> > > in the first place - you need to group them or define dependency

> >> > > so that stopping any resource would stop everything.

> >> > >

> >> > > If these applications are independent, they should not share

> >> > > resources.

> >> > > Each MySQL application should have own IP and own FS and own

> >> > > block device for this FS so that they can be moved between

> >> > > cluster nodes independently.

> >> > >

> >> > > Anything else will lead to troubles as you already observed.

> >> >

> >> > FYI, the MySQL services do not depend on each other. All of them

> >> > depend on the floating IP, which depends on the filesystem, which

> >> > depends on DRBD, but they do not depend on each other. Ideally, the

> >> > failure of p_mysql_002 should not cause failure of other mysql

> >> > resources, but now I understand why it happened. Pacemaker wanted

> >> > to start it on the other node, so it needed to move the floating

> >> > IP, filesystem, and DRBD primary, which had the cascade effect of

> >> > stopping the other MySQL resources.

> >> >

> >> > I think I also understand why the p_vip_clust01 resource blocked.

> >> >

> >> > FWIW, we've been using Linux HA since 2006, originally Heartbeat,

> >> > but then Corosync+Pacemaker. The past 12 years have been relatively

> >> > problem free. This symptom is new for us, only within the past year.

> >> > Our cluster nodes have many separate instances of MySQL running, so

> >> > it is not practical to have that many filesystems, IPs, etc. We are

> >> > content with the way things are, except for this new troubling

> >> > behavior.

> >> >

> >> > If I understand the thread correctly, op-fail=stop will not work

> >> > because the cluster will still try to stop the resources that are

> >> > implied dependencies.

> >> >

> >> > Bottom line is, how do we configure the cluster in such a way that

> >> > there are no cascading circumsta

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Ken Gaillot
> Sent: Tuesday, February 19, 2019 10:31 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote:
> > > -Original Message-
> > > From: Users  On Behalf Of Andrei
> > > Borzenkov
> > > Sent: Sunday, February 17, 2019 11:56 AM
> > > To: users@clusterlabs.org
> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> > > One Fails?
> > >
> > > 17.02.2019 0:44, Eric Robinson пишет:
> > > > Thanks for the feedback, Andrei.
> > > >
> > > > I only want cluster failover to occur if the filesystem or drbd
> > > > resources fail,
> > >
> > > or if the cluster messaging layer detects a complete node failure.
> > > Is there a
> > > way to tell PaceMaker not to trigger a cluster failover if any of
> > > the p_mysql resources fail?
> > > >
> > >
> > > Let's look at this differently. If all these applications depend on
> > > each other, you should not be able to stop individual resource in
> > > the first place - you need to group them or define dependency so
> > > that stopping any resource would stop everything.
> > >
> > > If these applications are independent, they should not share
> > > resources.
> > > Each MySQL application should have own IP and own FS and own block
> > > device for this FS so that they can be moved between cluster nodes
> > > independently.
> > >
> > > Anything else will lead to troubles as you already observed.
> >
> > FYI, the MySQL services do not depend on each other. All of them
> > depend on the floating IP, which depends on the filesystem, which
> > depends on DRBD, but they do not depend on each other. Ideally, the
> > failure of p_mysql_002 should not cause failure of other mysql
> > resources, but now I understand why it happened. Pacemaker wanted to
> > start it on the other node, so it needed to move the floating IP,
> > filesystem, and DRBD primary, which had the cascade effect of stopping
> > the other MySQL resources.
> >
> > I think I also understand why the p_vip_clust01 resource blocked.
> >
> > FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
> > then Corosync+Pacemaker. The past 12 years have been relatively
> > problem free. This symptom is new for us, only within the past year.
> > Our cluster nodes have many separate instances of MySQL running, so it
> > is not practical to have that many filesystems, IPs, etc. We are
> > content with the way things are, except for this new troubling
> > behavior.
> >
> > If I understand the thread correctly, op-fail=stop will not work
> > because the cluster will still try to stop the resources that are
> > implied dependencies.
> >
> > Bottom line is, how do we configure the cluster in such a way that
> > there are no cascading circumstances when a MySQL resource fails?
> > Basically, if a MySQL resource fails, it fails. We'll deal with that
> > on an ad-hoc basis. I don't want the whole cluster to barf. What about
> > op-fail=ignore? Earlier, you suggested symmetrical=false might also do
> > the trick, but you said it comes with its own can or worms.
> > What are the downsides with op-fail=ignore or asymmetrical=false?
> >
> > --Eric
> 
> Even adding on-fail=ignore to the recurring monitors may not do what you
> want, because I suspect that even an ignored failure will make the node less
> preferable for all the other resources. But it's worth testing.
> 
> Otherwise, your best option is to remove all the recurring monitors from the
> mysql resources, and rely on external monitoring (e.g. nagios, icinga, monit,
> ...) to detect problems.

This is probably a dumb question, but can we remove just the monitor operation 
but leave the resource configured in the cluster? If a node fails over, we do 
want the resources to start automatically on the new primary node.
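
If it helps, a hedged pcs sketch of exactly that (resource name and interval are 
placeholders; check the currently defined operations first, since the interval must 
match the existing monitor op):

# list the resource's operations, then drop only the recurring monitor;
# the resource itself stays configured and still starts/stops with the cluster
pcs resource show p_mysql_001
pcs resource op remove p_mysql_001 monitor interval=10s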

> --
> Ken Gaillot 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Sunday, February 17, 2019 11:56 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> 17.02.2019 0:44, Eric Robinson пишет:
> > Thanks for the feedback, Andrei.
> >
> > I only want cluster failover to occur if the filesystem or drbd resources 
> > fail,
> or if the cluster messaging layer detects a complete node failure. Is there a
> way to tell PaceMaker not to trigger a cluster failover if any of the p_mysql
> resources fail?
> >
> 
> Let's look at this differently. If all these applications depend on each 
> other,
> you should not be able to stop individual resource in the first place - you
> need to group them or define dependency so that stopping any resource
> would stop everything.
> 
> If these applications are independent, they should not share resources.
> Each MySQL application should have own IP and own FS and own block
> device for this FS so that they can be moved between cluster nodes
> independently.
> 
> Anything else will lead to troubles as you already observed.

FYI, the MySQL services do not depend on each other. All of them depend on the 
floating IP, which depends on the filesystem, which depends on DRBD, but they 
do not depend on each other. Ideally, the failure of p_mysql_002 should not 
cause failure of other mysql resources, but now I understand why it happened. 
Pacemaker wanted to start it on the other node, so it needed to move the 
floating IP, filesystem, and DRBD primary, which had the cascade effect of 
stopping the other MySQL resources.

I think I also understand why the p_vip_clust01 resource blocked. 

FWIW, we've been using Linux HA since 2006, originally Heartbeat, but then 
Corosync+Pacemaker. The past 12 years have been relatively problem free. This 
symptom is new for us, only within the past year. Our cluster nodes have many 
separate instances of MySQL running, so it is not practical to have that many 
filesystems, IPs, etc. We are content with the way things are, except for this 
new troubling behavior.

If I understand the thread correctly, on-fail=stop will not work because the 
cluster will still try to stop the resources that are implied dependencies.

Bottom line is, how do we configure the cluster in such a way that there are no 
cascading consequences when a MySQL resource fails? Basically, if a MySQL 
resource fails, it fails. We'll deal with that on an ad-hoc basis. I don't want 
the whole cluster to barf. What about on-fail=ignore? Earlier, you suggested 
symmetrical=false might also do the trick, but you said it comes with its own 
can of worms. What are the downsides of on-fail=ignore or symmetrical=false?

--Eric






> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
I'm looking through the docs but I don't see how to set the on-fail value for a 
resource. 
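
For the record, on-fail is set per operation rather than per resource; a hedged pcs 
example (resource name and interval are placeholders, and the interval should match 
the monitor op already defined on the resource):

pcs resource update p_mysql_002 op monitor interval=30s on-fail=ignore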


> -Original Message-
> From: Users  On Behalf Of Eric Robinson
> Sent: Saturday, February 16, 2019 1:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> > On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> > > I just noticed that. I also noticed that the lsb init script has a
> > > hard-coded stop timeout of 30 seconds. So if the init script waits
> > > longer than the cluster resource timeout of 15s, that would cause
> > > the
> >
> > Yes, you should use higher timeouts in pacemaker (45s for example).
> >
> > > resource to fail. However, I don't want cluster failover to be
> > > triggered by the failure of one of the MySQL resources. I only want
> > > cluster failover to occur if the filesystem or drbd resources fail,
> > > or if the cluster messaging layer detects a complete node failure.
> > > Is there a way to tell PaceMaker not to trigger cluster failover if
> > > any of the p_mysql resources fail?
> >
> > You can try playing with the on-fail option but I'm not sure how
> > reliably this whole setup will work without some form of fencing/stonith.
> >
> > https://clusterlabs.org/pacemaker/doc/en-
> >
> US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
> 
> Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what 
> I'm
> looking for, at least for the MySQL resources.
> 
> >
> > --
> > Valentin
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
> On Sat, Feb 16, 2019 at 09:33:42PM +0000, Eric Robinson wrote:
> > I just noticed that. I also noticed that the lsb init script has a
> > hard-coded stop timeout of 30 seconds. So if the init script waits
> > longer than the cluster resource timeout of 15s, that would cause the
> 
> Yes, you should use higher timeouts in pacemaker (45s for example).
> 
> > resource to fail. However, I don't want cluster failover to be
> > triggered by the failure of one of the MySQL resources. I only want
> > cluster failover to occur if the filesystem or drbd resources fail, or
> > if the cluster messaging layer detects a complete node failure. Is
> > there a way to tell PaceMaker not to trigger cluster failover if any
> > of the p_mysql resources fail?
> 
> You can try playing with the on-fail option but I'm not sure how reliably this
> whole setup will work without some form of fencing/stonith.
> 
> https://clusterlabs.org/pacemaker/doc/en-
> US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html

Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what 
I'm looking for, at least for the MySQL resources. 

> 
> --
> Valentin
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
Thanks for the feedback, Andrei.

I only want cluster failover to occur if the filesystem or drbd resources fail, 
or if the cluster messaging layer detects a complete node failure. Is there a 
way to tell PaceMaker not to trigger a cluster failover if any of the p_mysql 
resources fail?  

> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Saturday, February 16, 2019 1:34 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> 17.02.2019 0:03, Eric Robinson wrote:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and that
> caused a cascading series of service changes. However, I don't understand
> why, since no other resources are dependent on p_mysql_002.
> >
> 
> You have mandatory colocation constraints tying each SQL resource to the VIP.
> That means that to move an SQL resource to another node, Pacemaker must also
> move the VIP to another node, which in turn means it needs to move all the
> other dependent resources as well.
> ...
> > Feb 16 14:06:39 [3912] 001db01a pengine:  warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
> ...
> > Feb 16 14:06:39 [3912] 001db01a pengine:   notice: LogAction: * Stop p_vip_clust01 (   001db01a )   blocked
> ...
> > Feb 16 14:06:39 [3912] 001db01a pengine:   notice: LogAction: * Stop p_mysql_001   (   001db01a )   due to colocation with p_vip_clust01
> 
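
Side note for the archives: the "Forcing p_mysql_002 away ... after 100 failures
(max=100)" line quoted above is the per-resource fail count reaching
migration-threshold. A quick way to inspect and reset it, assuming pcs:

# Show the accumulated failures that are banning the resource from the node:
pcs resource failcount show p_mysql_002

# Clear the failure history so the resource is allowed back on 001db01a:
pcs resource cleanup p_mysql_002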


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Saturday, February 16, 2019 1:28 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> On Sat, Feb 16, 2019 at 09:03:43PM +0000, Eric Robinson wrote:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and
> > that caused a cascading series of service changes. However, I don't
> > understand why, since no other resources are dependent on p_mysql_002.
> 
> The stop failed because of a timeout (15s), so you can try to update that
> value:
> 


I just noticed that. I also noticed that the lsb init script has a hard-coded 
stop timeout of 30 seconds. So if the init script waits longer than the cluster 
resource timeout of 15s, that would cause the resource to fail. However, I 
don't want cluster failover to be triggered by the failure of one of the MySQL 
resources. I only want cluster failover to occur if the filesystem or drbd 
resources fail, or if the cluster messaging layer detects a complete node 
failure. Is there a way to tell Pacemaker not to trigger cluster failover if 
any of the p_mysql resources fail?  
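
For reference, a sketch of raising the stop timeout above the init script's
30-second wait, along the lines of the 45s suggested earlier (loop over
whichever p_mysql_* resources actually exist on your cluster):

for n in 000 001 002 003 004 005 006 007 008 622; do
    pcs resource update p_mysql_$n op stop interval=0s timeout=45s
done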


>   Result of stop operation for p_mysql_002 on 001db01a: Timed Out |
> call=1094 key=p_mysql_002_stop_0 timeout=15000ms
> 
> After the stop failed it should have fenced that node, but you don't have
> fencing configured so it tries to move mysql_002 and all the other resources
> related to it (vip, fs, drbd) to the other node.
> Since other mysql resources depend on the same (vip, fs, drbd) they need to
> be stopped first.
> 
> --
> Valentin
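
On the fencing point above: without stonith, a failed stop cannot be recovered
safely, which is why everything piles up behind it. Purely as an illustration
(the device type, addresses and credentials below are placeholders, not taken
from this thread, and parameter names vary between fence-agents versions), an
IPMI-based setup might look roughly like:

pcs stonith create fence_001db01a fence_ipmilan pcmk_host_list=001db01a \
    ipaddr=192.168.10.11 login=fenceadmin passwd=secret op monitor interval=60s
pcs stonith create fence_001db01b fence_ipmilan pcmk_host_list=001db01b \
    ipaddr=192.168.10.12 login=fenceadmin passwd=secret op monitor interval=60s

# Keep each fence device off the node it is meant to fence:
pcs constraint location fence_001db01a avoids 001db01a
pcs constraint location fence_001db01b avoids 001db01b

pcs property set stonith-enabled=true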


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
Here are the relevant corosync logs.

It appears that the stop action for resource p_mysql_002 failed, and that 
caused a cascading series of service changes. However, I don't understand why, 
since no other resources are dependent on p_mysql_002.
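
For anyone retracing this later, the dependency chain is easiest to see by
dumping the constraints with their IDs and asking the policy engine to show its
scores against the live CIB; both commands below are standard:

# All location/ordering/colocation constraints, with constraint IDs:
pcs constraint --full

# Show placement scores and the resulting transition for the live cluster:
crm_simulate -sL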

[root@001db01a cluster]# cat corosync_filtered.log
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_process_request: Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: Diff: --- 0.345.30 2
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: Diff: +++ 0.346.0 cc0da1b030418ec8b7c72db1115e2af1
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: + /cib: @epoch=346, @num_updates=0
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='p_mysql_002']:
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=001db01a/cibadmin/2, version=0.346.0)
Feb 16 14:06:24 [3913] 001db01a crmd: info: abort_transition_graph: Transition aborted by meta_attributes.p_mysql_002-meta_attributes 'create': Configuration change | cib=0.346.0 source=te_update_diff:456 path=/cib/configuration/resources/primitive[@id='p_mysql_002'] complete=true
Feb 16 14:06:24 [3913] 001db01a crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Feb 16 14:06:24 [3912] 001db01a pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_online_status: Node 001db01b is online
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_online_status: Node 001db01a is online
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_drbd0:0 active in master mode on 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_drbd1:0 active on 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_004 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_005 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_drbd0:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_drbd1:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_001 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: unpack_node_loop: Node 2 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine: info: unpack_node_loop: Node 1 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine: info: unpack_node_loop: Node 2 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine: info: unpack_node_loop: Node 1 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine: info: common_print: p_vip_clust01 (ocf::heartbeat:IPaddr2): Started 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: clone_print: Master/Slave Set: ms_drbd0 [p_drbd0]
Feb 16 14:06:24 [3912] 001db01a pengine: info: short_print: Masters: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01a pengine: info: short_print: Slaves: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01a pengine: info: clone_print: Master/Slave Set: ms_drbd1 [p_drbd1]
Feb 16 14:06:24 [3912] 001db01a pengine: info: short_print: Masters: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01a pengine: info: short_print: Slaves: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01a pengine: info: common_print: p_fs_clust01 (ocf::heartbeat:Filesystem): Started 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: common_print: p_fs_clust02 (ocf::heartbeat:Filesystem): Started 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine:

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
Resource: p_mysql_004 (class=lsb type=mysql_004)
  Operations: force-reload interval=0s timeout=15 (p_mysql_004-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_004-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_004-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_004-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_004-stop-interval-0s)
Resource: p_mysql_005 (class=lsb type=mysql_005)
  Operations: force-reload interval=0s timeout=15 (p_mysql_005-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_005-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_005-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_005-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_005-stop-interval-0s)
Resource: p_mysql_006 (class=lsb type=mysql_006)
  Operations: force-reload interval=0s timeout=15 (p_mysql_006-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_006-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_006-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_006-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_006-stop-interval-0s)
Resource: p_mysql_007 (class=lsb type=mysql_007)
  Operations: force-reload interval=0s timeout=15 (p_mysql_007-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_007-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_007-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_007-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_007-stop-interval-0s)
Resource: p_mysql_008 (class=lsb type=mysql_008)
  Operations: force-reload interval=0s timeout=15 (p_mysql_008-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_008-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_008-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_008-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_008-stop-interval-0s)
Resource: p_mysql_622 (class=lsb type=mysql_622)
  Operations: force-reload interval=0s timeout=15 (p_mysql_622-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_622-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_622-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_622-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_622-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: p_vip_clust02
    Enabled on: 001db01b (score:INFINITY) (role: Started) (id:cli-prefer-p_vip_clust02)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

Alerts:
No alerts defined

Resources Defaults:
resource-stickiness: 100
Operations Defaults:
No defaults set

Cluster Properties:
cluster-infrastructure: corosync
cluster-name: 001db01ab
dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
have-watchdog: false
last-lrm-refresh: 1550347798
maintenance-mode: false
no-quorum-policy: ignore
stonith-enabled: false

--Eric


From: Users  On Behalf Of Eric Robinson
Sent: Saturday, February 16, 2019 12:34 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: [ClusterLabs] Why Do All The Services Go

[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
These are the resources on our cluster.

[root@001db01a ~]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01a (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with 
quorum
Last updated: Sat Feb 16 15:24:55 2019
Last change: Sat Feb 16 15:10:21 2019 by root via cibadmin on 001db01b

2 nodes configured
18 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

p_vip_clust01  (ocf::heartbeat:IPaddr2):   Started 001db01a
Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db01a ]
 Slaves: [ 001db01b ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db01b ]
 Slaves: [ 001db01a ]
p_fs_clust01   (ocf::heartbeat:Filesystem):    Started 001db01a
p_fs_clust02   (ocf::heartbeat:Filesystem):    Started 001db01b
p_vip_clust02  (ocf::heartbeat:IPaddr2):       Started 001db01b
p_mysql_001    (lsb:mysql_001):    Started 001db01a
p_mysql_000    (lsb:mysql_000):    Started 001db01a
p_mysql_002    (lsb:mysql_002):    Started 001db01a
p_mysql_003    (lsb:mysql_003):    Started 001db01a
p_mysql_004    (lsb:mysql_004):    Started 001db01a
p_mysql_005    (lsb:mysql_005):    Started 001db01a
p_mysql_006    (lsb:mysql_006):    Started 001db01b
p_mysql_007    (lsb:mysql_007):    Started 001db01b
p_mysql_008    (lsb:mysql_008):    Started 001db01b
p_mysql_622    (lsb:mysql_622):    Started 001db01a

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Why is it that when one of the resources that start with p_mysql_* goes into a 
FAILED state, all the other MySQL services also stop?

[root@001db01a ~]# pcs constraint
Location Constraints:
  Resource: p_vip_clust02
Enabled on: 001db01b (score:INFINITY) (role: Started)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

--Eric






