Re: [ClusterLabs] After Startup, Pacemaker Gasps and Dies

2016-07-25 Thread Eric Robinson
Thanks for the suggestions. 

I basically uninstalled pacemaker and ripped every single file out of the 
system that had "pacemaker" in its name, then reinstalled and it is now working 
fine. I had already uninstalled and re-installed a few times, but the uninstall 
left some orphaned files. 
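
In case it helps anyone else, the sequence was roughly this (package names and
leftover paths vary by distro and version, so treat it as a sketch rather than
a recipe):

yum remove pacemaker pacemaker-cli pacemaker-libs pacemaker-cluster-libs
find / -xdev -iname '*pacemaker*' 2>/dev/null   # review the list before deleting anything
rm -rf /var/lib/pacemaker                       # old CIB/state files -- only if you are sure
yum install pacemaker pacemaker-cli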

--
Eric Robinson


-Original Message-
From: Ken Gaillot [mailto:kgail...@redhat.com] 
Sent: Monday, July 25, 2016 7:52 AM
To: users@clusterlabs.org
Cc: Eric Robinson <eric.robin...@psmnv.com>
Subject: Re: [ClusterLabs] After Startup, Pacemaker Gasps and Dies

On 07/23/2016 05:30 PM, Eric Robinson wrote:
> I've created 15 or so Corosync+Pacemaker clusters and never had this
> kind of issue.
> 
>  
> 
> These servers are running the following software:
> 
>  
> 
> RHEL 6.3
> 
> pacemaker-libs-1.1.12-8.el6_7.2.x86_64
> 
> pacemaker-1.1.12-8.el6_7.2.x86_64
> 
> corosync-1.4.7-5.el6.x86_64
> 
> pacemaker-cluster-libs-1.1.12-8.el6_7.2.x86_64
> 
> pacemaker-cli-1.1.12-8.el6_7.2.x86_64
> 
> corosynclib-1.4.7-5.el6.x86_64
> 
> crmsh-2.0-1.el6.x86_64
> 
>  
> 
> Corosync starts fine and both nodes join the cluster.
> 
> Pacemaker appears to start fine, but 'crm configure show' produces the
> error...
> 
>  
> 
> [root@ha14b ~]# crm configure show
> 
> ERROR: running cibadmin -Ql: Could not establish cib_rw connection:
> Connection refused (111)
> 
> Signon to CIB failed: Transport endpoint is not connected
> 
> Init failed, could not perform requested operations
> 
> ERROR: configure: Missing requirements

There have been many fixes since RHEL 6.3 -- I'd recommend upgrading to
6.8 if possible. (FYI, RHEL switched from crm to pcs in that time frame,
so the command-line syntax is a little different.)

The logs show that the cluster is using the "legacy plugin" to connect
pacemaker to corosync. On RHEL 6, it's preferred to use CMAN instead of
the plugin, so again, I'd recommend trying that first if possible. It
would involve installing the cman package and reconfiguring corosync.conf.

The immediate cause of the problem is that the CIB daemon is dumping
core. Unfortunately, there's no indication of why in the logs. If you
have the debuginfo versions of the packages available, you could try
looking at the stack trace. However, no one is familiar with the code
base in 6.3 anymore, so that might not be terribly useful.
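
If you do want to try it, the general shape would be something like this (the
exact core file location varies by build -- often somewhere under
/var/lib/pacemaker/cores/ -- so substitute whatever you actually find):

debuginfo-install pacemaker pacemaker-libs pacemaker-cluster-libs
gdb /usr/libexec/pacemaker/cib /var/lib/pacemaker/cores/hacluster/core.<pid>
(gdb) thread apply all bt full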

FYI to subscribe to this list, see
http://clusterlabs.org/mailman/listinfo/users

>  
> 
> After a short while Pacemaker dies...
> 
>  
> 
> [root@ha14b ~]# service pacemaker status
> 
> pacemakerd dead but pid file exists
> 
>  
> 
> The Pacemaker log shows the following...
> 
>  
> 
> [root@ha14a log]# cat pacemaker.log
> 
> Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: crm_ipc_connect:
> Could not establish pacemakerd connection: Connection refused (111)
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: config_find_next:   
> Processing additional service options...
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Found 'pacemaker' for option: name
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Found '1' for option: ver
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_cluster_type:   
> Detected an active 'classic openais (with plugin)' cluster
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: mcp_read_config:
> Reading configure for stack: classic openais (with plugin)
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: config_find_next:   
> Processing additional service options...
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Found 'pacemaker' for option: name
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Found '1' for option: ver
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Defaulting to 'no' for option: use_logd
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Defaulting to 'no' for option: use_mgmtd
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: config_find_next:   
> Processing additional logging options...
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Found 'off' for option: debug
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Found 'yes' for option: to_logfile
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd: info: get_config_opt: 
> Found '/var/log/corosync.log' for option: logfile
> 
> Jul 22 23:29:45 [4616] ha14a pacemakerd:   notice: crm_add_logfile:
>   

[ClusterLabs] subscribe

2016-07-23 Thread Eric Robinson


--
Eric Robinson
Chief Information Officer
Physician Select Management, LLC
775.885.2211 x 112


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Colocations and Orders Syntax Changed?

2017-01-31 Thread Eric Robinson
Indeed. My mistake. 

--
Eric Robinson

-Original Message-
From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] 
Sent: Friday, January 20, 2017 4:25 AM
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: Re: Antw: Colocations and Orders Syntax Changed?

>>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 20.01.2017 um 
>>> 12:56 in
Nachricht
<dm5pr03mb2729d5003219b644b4e0bc7cfa...@dm5pr03mb2729.namprd03.prod.outlook.com>

> Thanks for the input. I usually just do a 'crm config show > 
> myfile.xml.date_time' and then read it back in if I need to.

I guess 'crm configure show xml > myfile.xml.date_time', because here I get 
"ERROR: config: No such command" and no XML... ;-)

Actually I'm using "cibadmin -Q -o configuration", because I think it's 
faster...
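
E.g. (just a sketch; name the dump however you like):

cibadmin -Q -o configuration > cib-config.$(date +%Y%m%d_%H%M%S).xml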

Regards,
Ulrich



___
Users mailing list: Users@clusterlabs.org 
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Colocations and Orders Syntax Changed?

2017-01-20 Thread Eric Robinson
Thanks for the input. I usually just do a 'crm config show > 
myfile.xml.date_time' and then read it back in if I need to. 

--
Eric Robinson
   

> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Thursday, January 19, 2017 12:04 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Colocations and Orders Syntax Changed?
> 
> Hi!
> 
> This might not help, but messing up your cluster is a part of life. I decided 
> to
> track cluster changes automatically (more human-usable than the diffs
> logged in pacemaker's log). So if the current configuration stops working, I
> look at the changes and try to undo them until things work again (hoping it's
> not the current pacemaker patch) ;-)
> 
> So I have a script run by cron that saves the current configuration (readable
> and XML), then looks if it's different from the one saved last. If so, the new
> configuration is saved. For convenience I add a hash to the files, and link a
> timestamp to the hashed files (So if you cycle between configurations, you'll
> save some space ;-)) So I can diff between any of the configurations saved (a
> kind of time machine)...
> 
> Regards,
> Ulrich
> 
> >>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 19.01.2017 um
> >>> 05:08 in
> Nachricht
> <DM5PR03MB2729B7480FDF490F0DA81AC0FA7E0@DM5PR03MB2729.nampr
> d03.prod.outlook.com>
> 
> > Greetings!
> >
> > I have a lot of pacemaker clusters, each running multiple instances of
> > mysql.  I configure it so that the mysql resources are all dependent
> > on an underlying stack of supporting resources which consists of a
> > virtual IP address (p_vip), a filesystem (p_fs), often an LVM resource
> > (p_lvm), and a drbd resource (p_drbd). If any resource in the
> underlying stack moves, then all of them move together and the
> mysql resources follow.
> > However, each of the mysql resources can be stopped and started
> > independently without impacting any other resources. I accomplish that
> > with a configuration such as the following:
> >
> > colocation c_clust10 inf: ( p_mysql_103 p_mysql_150 p_mysql_204
> > p_mysql_206
> > p_mysql_244 p_mysql_247 ) p_vip_clust10 p_fs_clust10 ms_drbd0:Master
> order o_clust10 inf: ms_drbd0:promote p_fs_clust10 p_vip_clust10 (
> p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244
> p_mysql_247 )
> >
> > This has suddenly stopped working. On my newest cluster I have the
> > following. When I try to use the same approach, the configuration gets
> > rearranged on me automatically. The parentheses get moved. Often each
> > of the underlying resources is changed to the same thing with ":Master"
> following.
> > Sometimes the whole colocation stanza gets replaced with raw xml. I
> > have messed around with it, and the following is the best I can come
> > up with, but when I stop a mysql resource everything else stops!
> >
> > colocation c_clust19 inf: ( p_mysql_057 p_mysql_092 p_mysql_187
> > p_mysql_213
> > p_vip_clust19 p_mysql_702 p_mysql_743 p_fs_clust19 p_lv_on_drbd0 ) (
> > ms_drbd0:Master ) order o_clust19 inf: ms_drbd0:promote (
> > p_lv_on_drbd0:start ) ( p_fs_clust19
> > p_vip_clust19 ) ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213
> > p_mysql_702
> > p_mysql_743 )
> >
> > The old cluster is running Pacemaker 1.1.10. The new one is running 1.1.12.
> >
> > What can I do to get it running right again? I want all the underlying
> > resources (vip, fs, lvm, drbd) to move together. I want the mysql
> > instances to be collocated with the underlying resources, but I want
> > them to be independent of each other so they can each be started and
> > stopped without hurting anything.
> >
> > --
> > Eric Robinson
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Colocations and Orders Syntax Changed?

2017-01-18 Thread Eric Robinson
Greetings!

I have a lot of pacemaker clusters, each running multiple instances of mysql.  
I configure it so that the mysql resources are all dependent on an underlying 
stack of supporting resources which consists of a virtual IP address (p_vip), a 
filesystem (p_fs), often an LVM resource (p_lvm), and a drbd resource (p_drbd). 
If any resource in the underlying stack moves, then all of them move 
together and the mysql resources follow. However, each of the mysql resources 
can be stopped and started independently without impacting any other resources. 
I accomplish that with a configuration such as the following:

colocation c_clust10 inf: ( p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 
p_mysql_244 p_mysql_247 ) p_vip_clust10 p_fs_clust10 ms_drbd0:Master
order o_clust10 inf: ms_drbd0:promote p_fs_clust10 p_vip_clust10 (
p_mysql_103 p_mysql_150 p_mysql_204 p_mysql_206 p_mysql_244 p_mysql_247 )

This has suddenly stopped working. On my newest cluster I have the following. 
When I try to use the same approach, the configuration gets rearranged on me 
automatically. The parentheses get moved. Often each of the underlying 
resources is changed to the same thing with ":Master" following. Sometimes the 
whole colocation stanza gets replaced with raw xml. I have messed around with 
it, and the following is the best I can come up with, but when I stop a mysql 
resource everything else stops!

colocation c_clust19 inf: ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 
p_vip_clust19 p_mysql_702 p_mysql_743 p_fs_clust19 p_lv_on_drbd0 ) ( 
ms_drbd0:Master )
order o_clust19 inf: ms_drbd0:promote ( p_lv_on_drbd0:start ) ( p_fs_clust19 
p_vip_clust19 ) ( p_mysql_057 p_mysql_092 p_mysql_187 p_mysql_213 p_mysql_702 
p_mysql_743 )

The old cluster is running Pacemaker 1.1.10. The new one is running 1.1.12.

What can I do to get it running right again? I want all the underlying 
resources (vip, fs, lvm, drbd) to move together. I want the mysql instances to 
be collocated with the underlying resources, but I want them to be independent 
of each other so they can each be started and stopped without hurting anything.

--
Eric Robinson


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.

2016-10-09 Thread Eric Robinson
Digimer, thanks for your thoughts. Booth is one of the solutions I looked at, 
but I don't like it because it is complex and difficult to implement (and 
perhaps costly in terms of AWS services or something similar). As I read 
through your comments, I returned again and again to the feeling that the 
troubles you described do not apply to the DeadDrop scenario. Your observations 
are correct in that you cannot make assumptions about the state of the other 
node when all comms are down. You cannot count on the other node being in a 
predictable state. That is certainly true, and it is the very problem that I 
hope to address with DeadDrop. It provides a last-resort "back channel" for 
comms between the cluster nodes when all other comms are down, removing the 
element of assumption. 

Consider a few scenarios.

1. Data center A is primary, B is secondary. Comms are lost between A and B, but 
both of them can still reach the Internet. Node A notices loss of comms with B, 
but it is already primary so it cares not. Node B sees loss of normal cluster 
communication, and it might normally think of switching to primary, but first 
it checks the DeadDrop and it sees a note from A saying, "I'm fine and serving 
pages for customers." B aborts its plan to become primary. Later, after normal 
links are restored, B rejoins the cluster still as secondary. There is no 
element of assumption here.

2.  Data center A is primary, B is secondary. A loses communication with the 
Internet, but not with B. B can still talk to the Internet. B initiates a 
graceful failover. Again no assumptions.

3. Data center A is primary, B is secondary. Data center A goes completely 
dark. No communication to anything, not to B, and not to the outside world. B 
wants to go primary, but first it checks DeadDrop, and it finds that A is not 
leaving messages there either. It therefore KNOWS that A cannot reach the 
Internet and is not reachable by customers. No assumptions there. B assumes 
primary role and customers are happy. When A comes back online, it detects 
split-brain and refuses to join the cluster, notifying operators. Later, 
operators manually resolve the split brain.

There is no perfect solution, of course, but it seems to me that this simple 
approach provides a level of availability beyond what you would normally get 
with a 2-node cluster. What am I missing?
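
To make the idea concrete, the check that B runs before promoting would look
roughly like this (the URL, file format, and freshness window are invented for
illustration; the real dead drop can be any third-party spot both sites can
reach):

# on node B, before promoting
LAST=$(curl -fsS http://deaddrop.example.com/nodeA.lastseen 2>/dev/null)
NOW=$(date +%s)
if [ -n "$LAST" ] && [ $((NOW - LAST)) -lt 60 ]; then
    echo "A is still leaving notes; do not promote"
    exit 0
fi
echo "no recent note from A; proceed with promotion"

Node A would simply post the output of 'date +%s' to the same location every
few seconds while it is healthy.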

--
Eric Robinson
   

-Original Message-
From: Digimer [mailto:li...@alteeve.ca] 
Sent: Sunday, October 09, 2016 2:05 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: Re: [ClusterLabs] I've been working on a split-brain prevention 
strategy for 2-node clusters.

On 09/10/16 04:33 PM, Eric Robinson wrote:
> I've been working on a script for preventing split-brain in 2-node 
> clusters and I would appreciate comments from everyone. If someone 
> already has a solution like this, let me know!
>  
> Most of my database clusters are 2-nodes, with each node in a 
> geographically separate data center. Our layout looks like the 
> following diagram. Each server node has three physical connections to the 
> world.
> LANs A, B , C, D are all physically separate cable plants and 
> cross-connects between the data centers (using different switches, 
> routers, power, fiber paths, etc.). This is to ensure maximum cluster 
> communication intelligence. LANs A and B (Corosync ring 0) are bonded 
> at the NICs, as are LANs C and D (Corosync ring 1).
>  
> Hopefully this diagram will come through intact...
>  
>  
>  
> [ASCII network diagram garbled in the archive. The legible labels were
> "Third party Web Hosting", "The Interwebs", and two boxes marked "Internet";
> the message is truncated at this point.]

Re: [ClusterLabs] Establishing Timeouts

2016-10-10 Thread Eric Robinson
I'm mostly interested in preventing false-positive cluster failovers that might 
occur during manual network maintenance (for example, testing switch and link 
outages). 


>> Thanks for the clarification. So what's the easiest way to ensure that the 
>> cluster waits a 
>> desired timeout before deciding that a re-convergence is necessary?

>By raising the token (lost) timeout I would say.

>Please correct me (Chrissie), but I see the token (lost) timeout somehow as 
>resilience against static delays + jitter on top, and the 
>token_retransmits_before_loss_const 
>as resilience against packet loss.



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Establishing Timeouts

2016-10-10 Thread Eric Robinson
Basically, when we turn off a switch, I want to keep the cluster from failing 
over before Linux bonding has had a chance to recover. 

I'm mostly interested in preventing false-positive cluster failovers that might 
occur during manual network maintenance (for example, testing switch and link 
outages). 


>> Thanks for the clarification. So what's the easiest way to ensure 
>> that the cluster waits a desired timeout before deciding that a 
>> re-convergence is necessary?

>By raising the token (lost) timeout I would say.

>Please correct me (Chrissie), but I see the token (lost) timeout somehow 
>as resilience against static delays + jitter on top, and the 
>token_retransmits_before_loss_const
>as resilience against packet loss.



___
Users mailing list: Users@clusterlabs.org 
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] I've been working on a split-brain prevention strategy for 2-node clusters.

2016-10-09 Thread Eric Robinson
 role in arbitration.

Thoughts?

--
Eric Robinson



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Establishing Timeouts

2016-10-09 Thread Eric Robinson
I have about a dozen corosync+pacemaker clusters and I am just now getting 
around to understanding timeouts.

Most of my corosync.conf files look something like this:

version:2
token:  5000
token_retransmits_before_loss_const: 10
join:   1000
consensus:  7500
vsftype:none
max_messages:   20
secauth:off
threads:0
clear_node_high_bit: yes
rrp_mode: active

If I understand this correctly, this means the node will wait 50 seconds 
(5000ms x 10) before deciding that a cluster reconfig is necessary (perhaps 
after a link failure). Is that correct?

I'm trying to understand how this works together with my bonded NIC's 
arp_interval settings. I normally set arp_interval=1000. My question is, how 
many arp losses are required before the bonding driver decides to failover to 
the other link? If arp_interval=1000, how many times does the driver send an 
arp and fail to receive a reply before it decides that the link is dead?

I think I need to know this so I can set my corosync.conf settings correctly to 
avoid "false positive" cluster failovers. In other words, if there is a link or 
switch failure, I want to make sure that the cluster allows plenty of time for 
link communication to recover before deciding that a node has actually died. 
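
If the answer turns out to be "raise the corosync timeouts," I assume it means
bumping the totem values to something like this (numbers purely illustrative):

token:  10000
token_retransmits_before_loss_const: 10
consensus:  12000

...keeping consensus larger than token (corosync calculates it as 1.2 x token
if it is not set explicitly).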

--
Eric Robinson


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Easy Linux Bonding Question?

2016-10-10 Thread Eric Robinson
Short version: How many missed arp_intervals does the bonding driver wait 
before removing the PASSIVE slave from the bond?

Long version: I'm confused about this because I know the passive slave watches 
for the active slave's arp broadcast as a way of knowing that the passive 
slave's link is good. However, if the switch to which the active slave is 
connected fails, then NEITHER the active slave nor the passive slave will see a 
packet. (The active slave won't get a reply from the target, and the passive 
slave won't see the active's request.) So how does the bonding driver decide if 
it should deactivate the passive slave too? I'm assuming it sends it goes 
active immediately and begins sending its own arp requests to the target, and 
if it still does not get a response, then it is removed from the bond too, 
resulting in NO active slaves.

This means that if you have two servers that are looking at each other as 
arp_targets, you could end up with race conditions where there are no active 
interfaces at either end. That would be bad. To prevent that, I'm wanting to 
configure different timeouts at each end, but the downdelay parameter only 
works with miimon. How do I control the delay with arp_ip_target?
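
(For what it's worth, the only way I have found to watch what the driver is
actually doing is to stare at the slave states while pulling cables, e.g.:

watch -n1 cat /proc/net/bonding/bond0

...keeping an eye on the "MII Status" line for each slave.)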

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Trying this question again re: arp_interval

2016-10-14 Thread Eric Robinson
Does anyone know how many arp_intervals must pass without a reply before the 
bonding driver downs the primary NIC? Just one?

--
Eric Robinson


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Can Bonding Cause a Broadcast Storm?

2016-11-15 Thread Eric Robinson
If a Linux server with bonded interfaces attached to different switches is 
rebooted, is it possible that a bridge loop could result for a brief period? We 
noticed that one of our 100 Linux servers became unresponsive and appears to 
have rebooted. (The cause has not been determined.) A couple of minutes 
afterwards, we saw a gigantic spike in traffic on all switches in the network 
that lasted for about 7 minutes, causing latency and packet loss on the 
network. Everything was still reachable, but slowly. The condition stopped as 
soon as the Linux server in question became reachable again.  

--
Eric Robinson


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Can Bonding Cause a Broadcast Storm?

2016-11-15 Thread Eric Robinson
mode 1. No special switch configuration. spanning tree not enabled. I have 100+ 
Linux servers, all of which use bonding. The network has been stable for 10 
years. No changes recently. However, this is the second time that we have seen 
high latency and traced it down to the behavior of one particular server. I'm 
wondering if there is something about bonding that could result in a temporary 
bridge loop.


From: Jeremy Voorhis <jvoor...@gmail.com>
Sent: Tuesday, November 15, 2016 2:13:59 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Can Bonding Cause a Broadcast Storm?

What bonding mode are you using? Some modes require additional configuration 
from the switch to avoid flooding. Also, is spanning tree enabled on the 
switches?

On Tue, Nov 15, 2016 at 1:26 PM Eric Robinson 
<eric.robin...@psmnv.com<mailto:eric.robin...@psmnv.com>> wrote:
If a Linux server with bonded interfaces attached to different switches is 
rebooted, is it possible that a bridge loop could result for a brief period? We 
noticed that one of our 100 Linux servers became unresponsive and appears to 
have rebooted. (The cause has not been determined.) A couple of minutes 
afterwards, we saw a gigantic spike in traffic on all switches in the network 
that lasted for about 7 minutes, causing latency and packet loss on the 
network. Everything was still reachable, but slowly. The condition stopped as 
soon as the Linux server in question became reachable again.

--
Eric Robinson


___
Users mailing list: Users@clusterlabs.org<mailto:Users@clusterlabs.org>
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Establishing Timeouts

2016-10-10 Thread Eric Robinson

> AFAIK, if _all_ ARP targets did not respond _once_, the link will be 
> considered down 

It would be great if someone could confirm that.

> after "Down Delay". I guess you want to use multiple (and the correct ones) 
> ARP IP targets... 

Yes, I use multiple targets, and arp_all_targets=any. Down Delay only applies 
to miimon links, I'm afraid, not to arp_ip_target. 
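
For reference, the relevant part of my bonding setup looks roughly like this
(interface name and target IPs are placeholders):

BONDING_OPTS="mode=active-backup primary=eth0 arp_interval=1000 arp_ip_target=192.168.10.1,192.168.20.1 arp_all_targets=any"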

--Eric



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Can't See Why This Cluster Failed Over

2017-04-10 Thread Eric Robinson
> crm configure show xml c_clust19

Here is what I am entering using crmsh (version 2.0-1):


colocation c_clust19 inf: [ p_mysql_057 p_mysql_092 p_mysql_187 ] p_vip_clust19 
p_fs_clust19 p_lv_on_drbd0 ms_drbd0:Master
order o_clust19 inf: ms_drbd0:promote p_lv_on_drbd0 p_fs_clust19 p_vip_clust19 
[ p_mysql_057 p_mysql_092 p_mysql_187 ]


After I save it, I get no errors, but it converts it to this...


colocation c_clust19 inf: [ p_mysql_057 p_mysql_092 p_mysql_187 ] ( 
p_vip_clust19:Master p_fs_clust19:Master p_lv_on_drbd0:Master ) ( 
ms_drbd0:Master )
order o_clust19 inf: ms_drbd0:promote ( p_lv_on_drbd0:start p_fs_clust19:start 
p_vip_clust19:start ) [ p_mysql_057 p_mysql_092 p_mysql_187 ]

This looks incorrect to me.

Here is the xml that it generates:

[XML stripped by the mail archive]

The resources in set c_clust19-1 should start sequentially, starting with 
p_lv_on_drbd0 and ending with p_vip_clust19. I also don't understand why 
p_lv_on_drbd0 and p_vip_clust19 are getting the Master designation. 

--
Eric Robinson
   

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Fraud Detection Check?

2017-04-13 Thread Eric Robinson
> -Original Message-
> From: Dmitri Maziuk [mailto:dmitri.maz...@gmail.com]
> Sent: Thursday, April 13, 2017 8:30 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Fraud Detection Check?
> 
> On 2017-04-13 01:39, Jan Pokorný wrote:
> 
> > After a bit of a search, the best practice at the list server seems to
> > be:
> >
> >> [...] if you change the message (eg, by adding a list signature or by
> >> adding the list name to the Subject field), you *should* DKIM sign.
> 
> This is of course going entirely off-topic for the list, but DKIM's stated
> purpose is to sign mail coming from *.clusterlabs.org with a key from
> clusterlabs.org's dns zone file. DKIM is not mandatory, so you strip all 
> existing
> dkim headers and then either sign or not, it's up to you.
> 
> None of this is new. SourceForge list manager, for example, adds SF footers
> *inside* the PGP-signed MIME part, resulting in the exact same "invalid
> signature" problem.
> 
> Dima
> 

Thanks for all the feedback, guys. Bottom line for me is that I'm only seeing 
it in messages that I send to ClusterLabs list, but you guys are not seeing it 
at all, even in my messages. So if it isn't bothering anybody else, it is also 
not bothering me enough to do anything about it. I believe the problem is on my 
end (or rather, at Office 365) but if it is not being seen in the list then I 
don't care that much. I just want to make sure people in the list are not 
getting alerts that my mails are fraudulent. 

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] 2-Node Cluster Pointless?

2017-04-16 Thread Eric Robinson
> -Original Message-
> From: Digimer [mailto:li...@alteeve.ca]
> Sent: Sunday, April 16, 2017 11:17 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users@clusterlabs.org>; Eric Robinson <eric.robin...@psmnv.com>
> Subject: Re: [ClusterLabs] 2-Node Cluster Pointless?
> 
> On 16/04/17 01:53 PM, Eric Robinson wrote:
> > I was reading in "Clusters from Scratch" where Beekhof states, "Some
> would argue that two-node clusters are always pointless, but that is an
> argument for another time." Is there a page or thread where this argument
> has been fleshed out? Most of my dozen clusters are 2 nodes. I hate to think
> they're pointless.
> >
> > --
> > Eric Robinson
> 
> There is a belief that you can't build a reliable cluster without quorum. I 
> am of
> the mind that you *can* build a very reliable 2-node cluster. In fact, every
> cluster our company has deployed, going back over five years, has been 2-
> node and has had exceptional uptimes.
> 
> The confusion comes from the belief that quorum is required and stonith is
> optional. The reality is the opposite. I'll come back to this in a minute.
> 
> In a two-node cluster, you have two concerns;
> 
> 1. If communication between the nodes fail, but both nodes are alive, how
> do you avoid a split brain?
> 
> 2. If you have a two node cluster and enable cluster startup on boot, how do
> you avoid a fence loop?
> 
> Many answer #1 by saying "you need a quorum node to break the tie". In
> some cases, this works, but only when all nodes are behaving in a predictable
> manner.
> 
> Many answer #2 by saying "well, with three nodes, if a node boots and can't
> talk to either other node, it is inquorate and won't do anything".
> This is a valid mechanism, but it is not the only one.
> 
> So let me answer these from a 2-node perspective;
> 
> 1. You use stonith and the faster node lives, the slower node dies. From the
> moment of comms failure, the cluster blocks (needed with quorum,
> too) and doesn't restore operation until the (slower) peer is in a known
> state; Off. You can bias this by setting a fence delay against your preferred
> node. So say node 1 is the node that normally hosts your services, then you
> add 'delay="15"' to node 1's fence method. This tells node 2 to wait 15
> seconds before fencing node 1. If both nodes are alive, node 2 will be fenced
> before the timer expires.
> 
> 2. In Corosync v2+, there is a 'wait_for_all' option that tells a node to not 
> do
> anything until it is able to talk to the peer node. So in the case of a fence 
> after
> a comms break, the node that reboots will come up, fail to reach the survivor
> node and do nothing more. Perfect.
> 
> Now let me come back to quorum vs. stonith;
> 
> Said simply; Quorum is a tool for when everything is working. Fencing is a 
> tool
> for when things go wrong.
> 
> Let's assume that your cluster is working fine; then, for whatever reason,
> node 1 hangs hard. At the time of the freeze, it was hosting a virtual IP and
> an NFS service. Node 2 declares node 1 lost after a period of time and
> decides it needs to take over;
> 
> In the 3-node scenario, without stonith, node 2 reforms a cluster with node 3
> (quorum node), decides that it is quorate, starts its NFS server and takes
> over the virtual IP. So far, so good... Until node 1 comes out of its hang. At
> that moment, node 1 has no idea time has passed. It has no reason to think
> "am I still quorate? Are my locks still valid?" It just finishes whatever it 
> was in
> the middle of doing and bam, split-brain. At the least, you have two nodes
> claiming the same IP at the same time. At worse, you had uncoordinated
> writes to shared storage and you've corrupted your data.
> 
> In the 2-node scenario, with stonith, node 2 is always quorate, so after
> declaring node 1 lost, it moves to fence node 1. Once node 1 is fenced,
> *then* it starts NFS, takes over the virtual IP and restores services.
> In this case, no split-brain is possible because node 1 has rebooted and
> comes up with a fresh state (or it's on fire and never coming back anyway).
> 
> This is why quorum is optional and stonith/fencing is not.
> 
> Now, with this said, I won't say that 3+ node clusters are bad. They're fine 
> if
> they suit your use-case, but even with 3+ nodes you still must use stonith.
> 
> My *personal* arguments in favour of 2-node clusters over 3+ nodes is this;
> 
> A cluster is not beautiful when there is nothing left to add. It is beautiful
> when there is nothing left to take away.
> 
> In availability clustering, nothing should ev

Re: [ClusterLabs] 2-Node Cluster Pointless?

2017-04-17 Thread Eric Robinson
> In a shared-nothing cluster, "split brain" means whichever MAC address 
> is in the ARP cache of the border router is the one that gets the traffic. 
> How does the existing code figure this one out?

I'm guessing the surviving node broadcasts a gratuitous arp reply. 
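
I assume it is something along these lines under the hood (VIP and interface
are placeholders):

arping -U -c 3 -I eth0 10.0.0.50

...i.e. a few unsolicited ARP replies for the virtual IP so the border router
refreshes its cache.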

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] 2-Node Cluster Pointless?

2017-04-17 Thread Eric Robinson
> This isn't the first time this has come up, so I decided 
> to elaborate on this email by writing an article on the topic.

> It's a first-draft so there are likely spelling/grammar 
> mistakes. However, the body is done.

> https://www.alteeve.com/w/The_2-Node_Myth

It looks like my question was well-timed, as it served as a catalyst for you to 
write the article. Thanks much, I am working through it now and will doubtless 
have some questions and comments. Before I say anything more, I want to do some 
testing in my lab to make sure I have my thoughts collected. 

--
Eric Robinson



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Can't See Why This Cluster Failed Over

2017-04-07 Thread Eric Robinson
Somebody want to look at this log and tell me why the cluster failed over? All 
we did was add a new resource. We've done it many times before without any 
problems.

--

Apr 03 08:50:30 [22762] ha14acib: info: cib_process_request:
Forwarding cib_apply_diff operation for section 'all' to master 
(origin=local/cibadmin/2)
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: Diff: 
--- 0.605.2 2
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: Diff: 
+++ 0.607.0 65654c97e62cd549f22f777a5290fe3a
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: +  
/cib:  @epoch=607, @num_updates=0
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++ 
/cib/configuration/resources:  
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++ 
/cib/configuration/resources:  
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++ 
/cib/configuration/constraints/rsc_colocation[@id='c_clust19']/resource_set[@id='c_clust19-0']:
  
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++ 
/cib/configuration/constraints/rsc_colocation[@id='c_clust19']/resource_set[@id='c_clust19-0']:
  
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++ 
/cib/configuration/constraints/rsc_order[@id='o_clust19']/resource_set[@id='o_clust19-3']:
  
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++ 
/cib/configuration/constraints/rsc_order[@id='o_clust19']/resource_set[@id='o_clust19-3']:
  
Apr 03 08:50:30 [22762] ha14acib: info: cib_process_request:
Completed cib_apply_diff operation for section 'all': OK (rc=0, 
origin=ha14a/cibadmin/2, version=0.607.0)
Apr 03 08:50:30 [22762] ha14acib: info: write_cib_contents: 
Archived previous version as /var/lib/pacemaker/cib/cib-36.raw
Apr 03 08:50:30 [22762] ha14acib: info: write_cib_contents: 
Wrote version 0.607.0 of the CIB to disk (digest: 
1afdb9e480f870a095aa9e39719d29c4)
Apr 03 08:50:30 [22762] ha14acib: info: retrieveCib:Reading 
cluster configuration from: /var/lib/pacemaker/cib/cib.DkIgSs (digest: 
/var/lib/pacemaker/cib/cib.hPwa66)
Apr 03 08:50:30 [22764] ha14a   lrmd: info: process_lrmd_get_rsc_info:  
Resource 'p_mysql_745' not found (17 active resources)
Apr 03 08:50:30 [22764] ha14a   lrmd: info: process_lrmd_rsc_register:  
Added 'p_mysql_745' to the rsc list (18 active resources)
Apr 03 08:50:30 [22767] ha14a   crmd: info: do_lrm_rsc_op:  
Performing key=10:7484:7:91ef4b03-8769-47a1-a364-060569c46e52 
op=p_mysql_745_monitor_0
Apr 03 08:50:30 [22764] ha14a   lrmd: info: process_lrmd_get_rsc_info:  
Resource 'p_mysql_746' not found (18 active resources)
Apr 03 08:50:30 [22764] ha14a   lrmd: info: process_lrmd_rsc_register:  
Added 'p_mysql_746' to the rsc list (19 active resources)
Apr 03 08:50:30 [22767] ha14a   crmd: info: do_lrm_rsc_op:  
Performing key=11:7484:7:91ef4b03-8769-47a1-a364-060569c46e52 
op=p_mysql_746_monitor_0
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: Diff: 
--- 0.607.0 2
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: Diff: 
+++ 0.607.1 (null)
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: +  
/cib:  @num_updates=1
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++ 
/cib/status/node_state[@id='ha14b']/lrm[@id='ha14b']/lrm_resources:  

Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++  
  
Apr 03 08:50:30 [22762] ha14acib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0, 
origin=ha14b/crmd/7665, version=0.607.1)
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: Diff: 
--- 0.607.1 2
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: Diff: 
+++ 0.607.2 (null)
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: +  
/cib:  @num_updates=2
Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++ 
/cib/status/node_state[@id='ha14b']/lrm[@id='ha14b']/lrm_resources:  

Apr 03 08:50:30 [22762] ha14acib: info: cib_perform_op: ++  
  
Apr 03 08:50:30 [22762] ha14acib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0, 
origin=ha14b/crmd/7666, version=0.607.2)
Apr 03 08:50:30 [22767] ha14a   crmd:   notice: process_lrm_event:  
Operation p_mysql_745_monitor_0: not running (node=ha14a, call=142, rc=7, 
cib-update=88, confirmed=true)
Apr 03 08:50:30 [22767] ha14a   crmd:   notice: process_lrm_event:  
ha14a-p_mysql_745_monitor_0:142 [ not started\n ]
Apr 03 08:50:30 [22762] 


Re: [ClusterLabs] Fraud Detection Check?

2017-04-07 Thread Eric Robinson
> I've received your emails without any alteration or flagging as "fraud".
> So I don't think we're doing anything to your emails.

Good to know.

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Fraud Detection Check?

2017-04-07 Thread Eric Robinson
>> You guys got a thing against Office 365?

> doesn't everybody?

Fair enough. 

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Fraud Detection Check?

2017-04-07 Thread Eric Robinson
> On a serious note, I too received your e-mails without any red flags attached.

Thanks for the confirmation. I guess I'm the only one seeing those warnings. 
Maybe Office 365 has a problem with ClusterLabs. ;-)

--
Eric Robinson
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: DRBD and SSD TRIM - Slow!

2017-08-02 Thread Eric Robinson
1) iotop did not show any significant io, just maybe 30k/second of drbd traffic.

2) okay. I've never done that before. I'll give it a shot.

3) I'm not sure what I'm looking at there.

--
Eric Robinson

> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Tuesday, August 01, 2017 11:28 PM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: DRBD and SSD TRIM - Slow!
> 
> Hi!
> 
> I know little about trim operations, but you could try one of these:
> 
> 1) iotop to see whether some I/O is done during trimming (assuming
> trimming itself is not considered to be I/O)
> 
> 2) Try blocktrace on the affected devices to see what's going on. It's hard to
> set up and to extract the info you are looking for, but it provides deep
> insights
> 
> 3) Watch /sys/block/$BDEV/stat for performance statistics. I don't know how
> well DRBD supports these, however (e.g. MDRAID shows no wait times and
> no busy operations, while a multipath map has it all).
> 
> Regards,
> Ulrich
> 
> >>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 02.08.2017 um
> >>> 07:09 in
> Nachricht
> <DM5PR03MB27297014DF96DC01FE849A63FAB00@DM5PR03MB2729.nampr
> d03.prod.outlook.com>
> 
> > Does anyone know why trimming a filesystem mounted on a DRBD volume
> > takes so long? I mean like three days to trim a 1.2TB filesystem.
> >
> > Here are some pertinent details:
> >
> > OS: SLES 12 SP2
> > Kernel: 4.4.74-92.29
> > Drives: 6 x Samsung SSD 840 Pro 512GB
> > RAID: 0 (mdraid)
> > DRBD: 9.0.8
> > Protocol: C
> > Network: Gigabit
> > Utilization: 10%
> > Latency: < 1ms
> > Loss: 0%
> > Iperf test: 900 mbits/sec
> >
> > When I write to a non-DRBD partition, I get 400MB/sec (bypassing caches).
> > When I trim a non-DRBD partition, it completes fast.
> > When I write to a DRBD volume, I get 80MB/sec.
> >
> > When I trim a DRBD volume, it takes bloody ages!
> >
> > --
> > Eric Robinson
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] DRBD and SSD TRIM - Slow!

2017-08-01 Thread Eric Robinson
Does anyone know why trimming a filesystem mounted on a DRBD volume takes so 
long? I mean like three days to trim a 1.2TB filesystem.

Here are some pertinent details:

OS: SLES 12 SP2
Kernel: 4.4.74-92.29
Drives: 6 x Samsung SSD 840 Pro 512GB
RAID: 0 (mdraid)
DRBD: 9.0.8
Protocol: C
Network: Gigabit
Utilization: 10%
Latency: < 1ms
Loss: 0%
Iperf test: 900 mbits/sec

When I write to a non-DRBD partition, I get 400MB/sec (bypassing caches).
When I trim a non-DRBD partition, it completes fast.
When I write to a DRBD volume, I get 80MB/sec.

When I trim a DRBD volume, it takes bloody ages!
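
(Timed with plain fstrim, e.g.:

time fstrim -v /path/to/drbd/mountpoint

...where the path is whatever the DRBD-backed filesystem is mounted on.)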

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: DRBD and SSD TRIM - Slow! -- RESOLVED!

2017-08-03 Thread Eric Robinson
For anyone else who has this problem, we have reduced the time required to trim 
a 1.3TB volume from 3 days to 1.5 minutes.

Initially, we had used mdraid to build a raid0 array with a 32K chunk size. We 
initialized it as a drbd disk, synced it, built an lvm logical volume on it, 
and created an ext4 filesystem on the volume. Creating the filesystem and 
trimming it took 3 days (each time, every time, across multiple tests). 

When running lsblk -D, we noticed that the DISC-MAX value for the array was 
only 32K, compared to 4GB for the SSD drive itself. We also noticed that the 
number matched the chunk size. We deleted the array and built a new one with a 
4MB chunk size. The DISC-MAX value changed to 4MB, which is the max selectable 
chunk size (but still way below the other DISC-MAX values shown in lsblk -D). 
We realized that, when using mdadm, the DISC-MAX value ends up matching the 
array chunk size. We theorized that the small DISC-MAX value was responsible 
for the slow trim rate across the DRBD link.

Instead of using mdadm to build the array, we used LVM to create a striped 
logical volume and made that the backing device for drbd. Then lsblk -D showed 
a DISC-MAX size of 128MB.  Creating an ext4 filesystem on it and trimming only 
took 1.5 minutes (across multiple tests).

Somebody knowledgeable may be able to explain how DISC-MAX affects the trim 
speed, and why the DISC-MAX value is different when creating the array with 
mdadm versus lvm.
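
For anyone wanting to reproduce the working layout, it was essentially this
(device names and stripe size are illustrative; check lsblk -D on the result
before layering DRBD on top):

pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
vgcreate vg_drbd /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
lvcreate -i 6 -I 4M -l 100%FREE -n lv_drbd0 vg_drbd
lsblk -D /dev/vg_drbd/lv_drbd0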

--
Eric Robinson

> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Wednesday, August 02, 2017 11:36 PM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Antw: DRBD and SSD TRIM - Slow!
> 
> >>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 02.08.2017 um
> >>> 23:20 in
> Nachricht
> <DM5PR03MB2729C66CEC1E3B8B9E297185FAB00@DM5PR03MB2729.nampr
> d03.prod.outlook.com>
> 
> > 1) iotop did not show any significant io, just maybe 30k/second of
> > drbd traffic.
> >
> > 2) okay. I've never done that before. I'll give it a shot.
> >
> > 3) I'm not sure what I'm looking at there.
> 
> See /usr/src/linux/Documentation/block/stat.txt ;-) I wrote an NRPE plugin
> to monitor those with performance data and verbose text output, e.g.:
> CFS_VMs-xen: [delta 120s], 1.15086 IO/s read, 60.7789 IO/s write, 0 req/s
> read merges, 0 req/s write merges, 4.53674 sec/s read, 486.231 sec/s write,
> 2.36844 ms/s read wait, 2702.19 ms/s write wait, 0 req in_flight, 115.987 ms/s
> active, 2704.53 ms/s wait
> 
> Regards,
> Ulrich
> 
> >
> > --
> > Eric Robinson
> >
> >> -Original Message-
> >> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> >> Sent: Tuesday, August 01, 2017 11:28 PM
> >> To: users@clusterlabs.org
> >> Subject: [ClusterLabs] Antw: DRBD and SSD TRIM - Slow!
> >>
> >> Hi!
> >>
> >> I know little about trim operations, but you could try one of these:
> >>
> >> 1) iotop to see whether some I/O is done during trimming (assuming
> >> trimming itself is not considered to be I/O)
> >>
> >> 2) Try blocktrace on the affected devices to see what's going on.
> >> It's hard
> > to
> >> set up and to extract the info you are looking for, but it provides
> >> deep insights
> >>
> >> 3) Watch /sys/block/$BDEV/stat for performance statistics. I don't
> >> know how well DRBD supports these, however (e.g. MDRAID shows no
> wait
> >> times and no busy operations, while a multipath map has it all).
> >>
> >> Regards,
> >> Ulrich
> >>
> >> >>> Eric Robinson <eric.robin...@psmnv.com> schrieb am 02.08.2017 um
> >> >>> 07:09 in
> >> Nachricht
> >>
> <DM5PR03MB27297014DF96DC01FE849A63FAB00@DM5PR03MB2729.nampr
> >> d03.prod.outlook.com>
> >>
> >> > Does anyone know why trimming a filesystem mounted on a DRBD
> volume
> >> > takes so long? I mean like three days to trim a 1.2TB filesystem.
> >> >
> >> > Here are some pertinent details:
> >> >
> >> > OS: SLES 12 SP2
> >> > Kernel: 4.4.74-92.29
> >> > Drives: 6 x Samsung SSD 840 Pro 512GB
> >> > RAID: 0 (mdraid)
> >> > DRBD: 9.0.8
> >> > Protocol: C
> >> > Network: Gigabit
> >> > Utilization: 10%
> >> > Latency: < 1ms
> >> > Loss: 0%
> >> > Iperf test: 900 mbits/sec
> >> >
> >> > When I write to a non-DRBD partition, I get 400MB/sec (bypassing
> caches).
> >&g

[ClusterLabs] verify status starts at 100% and stays there?

2017-08-03 Thread Eric Robinson
I have drbd 9.0.8. I started an online verify, and immediately checked status, 
and I see...

ha11a:/ha01_mysql/trimtester # drbdadm status
ha01_mysql role:Primary
  disk:UpToDate
  ha11b role:Secondary
replication:VerifyT peer-disk:UpToDate done:100.00

...which looks like it is finished, but the tail of dmesg says...

[336704.851209] drbd ha01_mysql/0 drbd0 ha11b: repl( Established -> VerifyT )
[336704.851244] drbd ha01_mysql/0 drbd0: Online Verify start sector: 0

...which looks like the verify is still in progress.

So is it done, or is it still in progress? Is this a drbd bug?
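
(For what it's worth, I was checking progress from another terminal with
roughly this:

watch -n5 'drbdadm status ha01_mysql; dmesg | grep -i verify | tail -n 3'
)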

--
Eric Robinson


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: verify status starts at 100% and stays there?

2017-08-04 Thread Eric Robinson
Yeah, UpToDate was not of concern to me. The part that threw me off was 
"done:100.00." It did eventually finish, though, and that was shown in the 
dmesg output. However, 'drbdadm status'  said "done:100.00" the whole time, 
from start to finish, which seems weird.  

--
Eric Robinson
   

> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Thursday, August 03, 2017 11:25 PM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: verify status starts at 100% and stays there?
> 
> >>> Eric Robinson <eric.robin...@psmnv.com> wrote on 04.08.2017 at
> >>> 06:53 in
> message
> <DM5PR03MB2729739B8FC91B96F0CD3BE8FAB60@DM5PR03MB2729.nampr
> d03.prod.outlook.com>
> 
> > I have drbd 9.0.8. I started an online verify, and immediately checked
> > status, and I see...
> >
> > ha11a:/ha01_mysql/trimtester # drbdadm status ha01_mysql role:Primary
> >   disk:UpToDate
> >   ha11b role:Secondary
> > replication:VerifyT peer-disk:UpToDate done:100.00
> >
> > ...which looks like it is finished, but the tail of dmesg says...
> >
> > [336704.851209] drbd ha01_mysql/0 drbd0 ha11b: repl( Established ->
> > VerifyT ) [336704.851244] drbd ha01_mysql/0 drbd0: Online Verify start
> > sector: 0
> >
> > ...which looks like the verify is still in progress.
> >
> > So is it done, or is it still in progress? Is this a drbd bug?
> 
> Not deep into DRBD, but I guess "disk:UpToDate" just indicated that up to
> the present moment DRBD thinks the disks are up to date (unless verify
> would detect otherwise). Maybe there should be an additional status like
> "syncing", "verifying", etc.
> 
> Regards,
> Ulrich
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] ClusterLabs.Org Documentation Problem?

2017-08-22 Thread Eric Robinson
The documentation located here...

http://clusterlabs.org/doc/

...is confusing because it offers two combinations:

Pacemaker 1.0 for Corosync 1.x
Pacemaker 1.1 for Corosync 2.x

According to the documentation, if you use Corosync 1.x you need Pacemaker 1.0, 
but if you use Corosync 2.x then you need Pacemaker 1.1.

However, on my CentOS 6.9 system, when I do 'yum install pacemaker corosync' I 
get the following versions:

pacemaker-1.1.15-5.el6.x86_64
corosync-1.4.7-5.el6.x86_64

What's the correct answer? Does Pacemaker 1.1.15 work with Corosync 1.4.7? If 
so, is the documentation at ClusterLabs misleading?

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ClusterLabs.Org Documentation Problem?

2017-08-22 Thread Eric Robinson
Thanks for the reply. Yes, it's a bit confusing. I did end up using the 
documentation for Corosync 2.X since that seemed newer, but it also assumed 
CentOS/RHEL7 and systemd-based commands. It also incorporates cman, pcsd, 
psmisc, and policycoreutils-python, which are all new to me. If there is 
anything I can do to assist with getting the documentation cleaned up, I'd be 
more than glad to help.

--
Eric Robinson

-Original Message-
From: Ken Gaillot [mailto:kgail...@redhat.com] 
Sent: Tuesday, August 22, 2017 2:08 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: Re: [ClusterLabs] ClusterLabs.Org Documentation Problem?

On Tue, 2017-08-22 at 19:40 +, Eric Robinson wrote:
> The documentation located here…
> 
>  
> 
> http://clusterlabs.org/doc/
> 
>  
> 
> …is confusing because it offers two combinations:
> 
>  
> 
> Pacemaker 1.0 for Corosync 1.x
> 
> Pacemaker 1.1 for Corosync 2.x
> 
>  
> 
> According to the documentation, if you use Corosync 1.x you need 
> Pacemaker 1.0, but if you use Corosync 2.x then you need Pacemaker 
> 1.1.
> 
>  
> 
> However, on my Centos 6.9 system, when I do ‘yum install pacemaker 
> corosync” I get the following versions:
> 
>  
> 
> pacemaker-1.1.15-5.el6.x86_64
> 
> corosync-1.4.7-5.el6.x86_64
> 
>  
> 
> What’s the correct answer? Does Pacemaker 1.1.15 work with Corosync 
> 1.4.7? If so, is the documentation at ClusterLabs misleading?
> 
>  
> 
> --
> Eric Robinson

The page actually offers a third option ... "Pacemaker 1.1 for CMAN or Corosync 
1.x". That's the configuration used by CentOS 6.

However, that's still a bit misleading; the documentation set for "Pacemaker 
1.1 for Corosync 2.x" is the only one that is updated, and it's mostly 
independent of the underlying layer, so you should prefer that set.

I plan to reorganize that page in the coming months, so I'll try to make it 
clearer.

--
Ken Gaillot <kgail...@redhat.com>





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?

2017-06-16 Thread Eric Robinson
> > Out of curiosity, what did I say that indicates that we're not using 
> > fencing?
> >
> 
> Same place you said you were new to HA and needed to learn corosync and
> pacemaker to use OpenBSD.
> 

I must have misspoken. I said I stopped using OpenBSD back around the year 2000 
and switched to Linux (because market pressure). I didn't mean to imply that I 
was new to HA.

--Eric

 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?

2017-06-16 Thread Eric Robinson
> > I must have misspoken.
> 
> No, I had invisible  tags all over my last two messages.

Haha, okay. Thought I was going nuts for a moment.

--Eric
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?

2017-06-16 Thread Eric Robinson
> > Out of curiosity, do the openSUSE Leap repos and packages work with
> SLES?
> 
> I know that there are some base system differences that could cause
> problems, things like Leap using systemd/journald for logging while SLES is
> still logging via syslog-ng (IIRC)... so it's possible that you could get into
> problems if you mix versions. And adding the Leap repositories to SLES will
> probably mess things up since both deliver slightly different versions of the
> base system.
> 

Good information.

> For SLES, there's now the Package Hub which has open source packages
> taken from Leap and confirmed not to conflict with SLES, so you can mix a
> supported base system with unsupported open source packages with less
> risk for breaking anything:
> 
> https://packagehub.suse.com/
> 

That sounds like a possibility. We're such freaking cheapos, we have a bunch of 
RHEL servers, but only 1 of them has a subscription, which we keep active so we 
have access to the RHEL knowledge base.  We never call for support. All the 
other servers are kept up to date using the CentOS repos, which works fine. 

I'm thinking of doing similar with SLES with the PackageHub repos, but maybe 
we'll just use Leap. I haven't installed Leap yet. The SLES installer is 
wonderful compared to Red Hat's. I like it a lot, especially the GUI disk 
partitioner with its device maps and mount maps and whatnot. If the Leap 
installer is like it, I will be tempted to go with it instead of SLES. 

--Eric 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?

2017-06-16 Thread Eric Robinson
> Jokes (?) aside; Red Hat and SUSE both have paid teams that make sure the
> HA software works well. So if you're new to HA, I strongly recommend
> sticking with one of those two, and SUSE is what you mentioned. If you really
> want to go to BSD or something else, I would recommend learning HA on
> SUSE/RHEL and then, after you know what config works for you, migrate to
> the target OS. That way you have only one set of variables at a time.
> 

I don't know how "new" I am. I've been using HA for a decade or so. Started 
with heartbeat V1. Deploying a Corosync+Pacemaker+DRBD cluster is pretty much a 
slam dunk for me these days. However, there's certainly a lot more that I DON'T 
know than I DO know.

--Eric 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?

2017-06-16 Thread Eric Robinson
> > Also, use fencing. Seriously, just do it.
> 
> Yeah. Fencing is the only bit that's missing from this picture.
> 

Out of curiosity, what did I say that indicates that we're not using fencing?

--Eric



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?

2017-06-16 Thread Eric Robinson
> > I can understand how SUSE can charge for support, but not for the
> software itself. Corosync, Pacemaker, and DRBD are all open source.
> 
> So why do not you download open source and compile it yourself?
> 

I've done that before and I could if necessary. Rather go with the easiest 
option available.

--Eric
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Installing on SLES 12 -- Where's the Repos?

2017-06-16 Thread Eric Robinson


> You could test it for free, you just need to register
> to https://scc.suse.com/login
> After that, you have access for 60 days to the SLES repo.


What happens at the end of the trial? Software stops working?

I can understand how SUSE can charge for support, but not for the software 
itself. Corosync, Pacemaker, and DRBD are all open source.

--
Eric Robinson
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Installing on SLES 12 -- Where's the Repos?

2017-06-16 Thread Eric Robinson
We've been a Red Hat/CentOS shop for 10+ years and have installed 
Corosync+Pacemaker+DRBD dozens of times using the repositories, all for free.

We are now trying out our first SLES 12 server, and I'm looking for the repos. 
Where the heck are they? I went looking, and all I can find is the SLES "High 
Availability Extension," which I must pay $700/year for? No freaking way!

This is Linux we're talking about, right? There's got to be an easy way to 
install the cluster without paying for a subscription... right?

Someone talk me off the ledge here.

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Azure Resource Agent

2017-09-18 Thread Eric Robinson
The license would be GPL, I suppose, whatever enthusiasts and community 
contributors usually do. And yes, it would be fun to know I contributed 
something to the repo.

--
Eric Robinson
   

> -Original Message-
> From: Kristoffer Grönlund [mailto:kgronl...@suse.com]
> Sent: Monday, September 18, 2017 3:10 AM
> To: Eric Robinson <eric.robin...@psmnv.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Azure Resource Agent
> 
> Eric Robinson <eric.robin...@psmnv.com> writes:
> 
> > This is obviously beta as it currently only works with a manual failover. I
> need to add some code to handle an actual node crash or power-plug test.
> >
> > Feedback, suggestions, improvements are welcome. If someone who
> knows awk wants to clean up my azure client calls, that would be a good
> place to start.
> 
> Hi,
> 
> Great to see an initial agent for managing IPs on Azure! First of all, I would
> ask: What is your license for the code? Would you be interested in getting an
> agent based on this version included in the upstream resource-agents
> repository?
> 
> Cheers,
> Kristoffer
> 
> >
> > --
> >
> > #!/bin/sh
> > #
> > # OCF parameters are as below
> > # OCF_RESKEY_ip
> >
> >
> ##
> 
> > #
> > # Initialization:
> >
> > : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
> > . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
> > DEBUG_LEVEL=2
> > MY_HOSTNAME=$(hostname -s)
> > SCRIPT_NAME=$(basename $0)
> >
> >
> ##
> 
> > #
> >
> > meta_data() {
> > logIt "debug1: entered: meta_data()"
> > cat < > 
> >   > name="AZaddr2"> 1.0
> >
> > 
> > Resource agent for managing IP configs in Azure.
> > 
> >
> > Short description
> >
> > 
> >
> >  
> The
> > IPv4 (dotted quad notation) example IPv4 "192.168.1.1".
> > 
> > IPv4 address  > default="" /> 
> >
> > 
> > 
> > 
> > 
> > 
> > 
> >   > timeout="20s" />   END
> > logIt "leaving: exiting: meta_data()"
> > return $OCF_SUCCESS
> > }
> >
> > azip_query() {
> >
> > logIt "debug1: entered: azip_query()"
> > logIt "debug1: checking to determine if an Azure ipconfig 
> > named
> '$AZ_IPCONFIG_NAME' exists for the interface"
> > logIt "debug1: executing: az network nic ip-config show 
> > --name
> $AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1"
> > R=$(az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic-
> name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1)
> > logIt "debug2: $R"
> > R2=$(echo "$R"|grep "does not exist")
> > if [ -n "$R2" ]; then
> > logIt "debug1: ipconfig named 
> > '$AZ_IPCONFIG_NAME'
> does not exist"
> > return $OCF_NOT_RUNNING
> > else
> > R2=$(echo "$R"|grep "Succeeded")
> > if [ -n "$R2" ]; then
> > logIt "debug1: ipconfig 
> > '$AZ_IPCONFIG_NAME'
> exists"
> > return $OCF_SUCCESS
> > else
> > logIt "debug1: not sure how 
> > this happens"
> > return $OCF_ERR_GENERIC
> > fi
> > fi
> > logIt "debug1: exiting: azip_query()"
> > }
> >
> > azip_usage() {
> > cat < > usage: $0 {start|stop|status|monitor|validate-all|meta-data}
> >
> > Expects to have a fully populated OCF RA-compliant environment set.
> > END
> > return $OCF_SUCCESS
> > }
> >
> > azip_start() {
> >
> > logIt "debug1: entered: azip_start()"
> >
> > #--if a matching ipconfig alrea

Re: [ClusterLabs] Azure Resource Agent

2017-09-16 Thread Eric Robinson
Forgot to mention that it's called AZaddr and is intended to be dependent on 
IPaddr2 (or vice versa) and live in /usr/lib/ocf/resource.d/heartbeat.

--
Eric Robinson


From: Eric Robinson [mailto:eric.robin...@psmnv.com]
Sent: Friday, September 15, 2017 3:56 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: [ClusterLabs] Azure Resource Agent



Greetings, all --

If anyone's interested, I wrote a resource agent that works with Microsoft 
Azure. I'm no expert at shell scripting, so I'm certain it needs a great deal 
of improvement, but I've done some testing and it works with a 2-node cluster 
in my Azure environment. Offhand, I don't know any reason why it wouldn't work 
with larger clusters, too.

My colocation stack looks like this:

mysql -> azure_ip -> cluster_ip -> filesystem -> drbd

Failover takes up to 4 minutes because it takes that long for the Azure IP 
address de-association and re-association to complete. None of the delay is the 
fault of the cluster itself.

Right now the script burps a bunch of debug output to syslog, which is helpful: 
if you feel like you're waiting forever for the cluster to fail over, you can 
look at /var/log/messages and see that you're waiting for the Azure cloud to 
finish something. To eliminate the debug messages, set DEBUG_LEVEL to 0.

The agent requires the Azure client to be installed and the nodes to have been 
logged into the cloud. It currently only works with one NIC per VM, and two 
ipconfigs per NIC (one of which is the floating cluster IP).

This is obviously beta as it currently only works with a manual failover. I 
need to add some code to handle an actual node crash or power-plug test.

Feedback, suggestions, improvements are welcome. If someone who knows awk wants 
to clean up my azure client calls, that would be a good place to start.

--

#!/bin/sh
#
# OCF parameters are as below
# OCF_RESKEY_ip

###
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
DEBUG_LEVEL=2
MY_HOSTNAME=$(hostname -s)
SCRIPT_NAME=$(basename $0)

###

meta_data() {
logIt "debug1: entered: meta_data()"
cat <


1.0


Resource agent for managing IP configs in Azure.


Short description





The IPv4 (dotted quad notation)
example IPv4 "192.168.1.1".

IPv4 address













END
logIt "leaving: exiting: meta_data()"
return $OCF_SUCCESS
}

azip_query() {

logIt "debug1: entered: azip_query()"
logIt "debug1: checking to determine if an Azure ipconfig named 
'$AZ_IPCONFIG_NAME' exists for the interface"
logIt "debug1: executing: az network nic ip-config show --name 
$AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1"
R=$(az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic-name 
$AZ_NIC_NAME -g $AZ_RG_NAME 2>&1)
logIt "debug2: $R"
R2=$(echo "$R"|grep "does not exist")
if [ -n "$R2" ]; then
logIt "debug1: ipconfig named 
'$AZ_IPCONFIG_NAME' does not exist"
return $OCF_NOT_RUNNING
else
R2=$(echo "$R"|grep "Succeeded")
if [ -n "$R2" ]; then
logIt "debug1: ipconfig 
'$AZ_IPCONFIG_NAME' exists"
return $OCF_SUCCESS
else
logIt "debug1: not sure how 
this happens"
return $OCF_ERR_GENERIC
fi
fi
logIt "debug1: exiting: azip_query()"
}

azip_usage() {
cat <<END
usage: $0 {start|stop|status|monitor|validate-all|meta-data}

Expects to have a fully populated OCF RA-compliant environment set.
END
return $OCF_SUCCESS
}
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Azure Resource Agent

2017-09-15 Thread Eric Robinson
Greetings, all --

If anyone's interested, I wrote a resource agent that works with Microsoft 
Azure. I'm no expert at shell scripting, so I'm certain it needs a great deal 
of improvement, but I've done some testing and it works with a 2-node cluster 
in my Azure environment. Offhand, I don't know any reason why it wouldn't work 
with larger clusters, too.

My colocation stack looks like this:

mysql -> azure_ip -> cluster_ip -> filesystem -> drbd
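For anyone wondering how that stack maps to constraints, a rough sketch in pcs 
syntax would look something like the following (the resource names are just the 
labels from the stack above, not the real ones in my config):

# ordering: bring things up from the bottom of the stack
pcs constraint order filesystem then cluster_ip
pcs constraint order cluster_ip then azure_ip
pcs constraint order azure_ip then mysql
# colocation: keep each layer on the node that has the layer below it
pcs constraint colocation add mysql with azure_ip INFINITY
pcs constraint colocation add azure_ip with cluster_ip INFINITY
pcs constraint colocation add cluster_ip with filesystem INFINITY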

Failover takes up to 4 minutes because it takes that long for the Azure IP 
address de-association and re-association to complete. None of the delay is the 
fault of the cluster itself.

Right now the script burps a bunch of debug output to syslog, which is helpful: 
if you feel like you're waiting forever for the cluster to fail over, you can 
look at /var/log/messages and see that you're waiting for the Azure cloud to 
finish something. To eliminate the debug messages, set DEBUG_LEVEL to 0.

The agent requires the Azure client to be installed and the nodes to have been 
logged into the cloud. It currently only works with one NIC per VM, and two 
ipconfigs per NIC (one of which is the floating cluster IP).

This is obviously beta as it currently only works with a manual failover. I 
need to add some code to handle an actual node crash or power-plug test.

Feedback, suggestions, improvements are welcome. If someone who knows awk wants 
to clean up my azure client calls, that would be a good place to start.

--

#!/bin/sh
#
# OCF parameters are as below
# OCF_RESKEY_ip

###
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
DEBUG_LEVEL=2
MY_HOSTNAME=$(hostname -s)
SCRIPT_NAME=$(basename $0)

###

meta_data() {
logIt "debug1: entered: meta_data()"
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="AZaddr2">
<version>1.0</version>

<longdesc lang="en">
Resource agent for managing IP configs in Azure.
</longdesc>
<shortdesc lang="en">Short description</shortdesc>

<parameters>
<parameter name="ip" unique="1" required="1">
<longdesc lang="en">
The IPv4 (dotted quad notation)
example IPv4 "192.168.1.1".
</longdesc>
<shortdesc lang="en">IPv4 address</shortdesc>
<content type="string" default="" />
</parameter>
</parameters>

<actions>
<!-- typical OCF actions; the exact set and timeouts in the original may differ -->
<action name="start" timeout="20s" />
<action name="stop" timeout="20s" />
<action name="monitor" timeout="20s" interval="10s" />
<action name="validate-all" timeout="20s" />
<action name="meta-data" timeout="20s" />
</actions>
</resource-agent>
END
logIt "leaving: exiting: meta_data()"
return $OCF_SUCCESS
}

azip_query() {

logIt "debug1: entered: azip_query()"
logIt "debug1: checking to determine if an Azure ipconfig named 
'$AZ_IPCONFIG_NAME' exists for the interface"
logIt "debug1: executing: az network nic ip-config show --name 
$AZ_IPCONFIG_NAME --nic-name $AZ_NIC_NAME -g $AZ_RG_NAME 2>&1"
R=$(az network nic ip-config show --name $AZ_IPCONFIG_NAME --nic-name 
$AZ_NIC_NAME -g $AZ_RG_NAME 2>&1)
logIt "debug2: $R"
R2=$(echo "$R"|grep "does not exist")
if [ -n "$R2" ]; then
logIt "debug1: ipconfig named 
'$AZ_IPCONFIG_NAME' does not exist"
return $OCF_NOT_RUNNING
else
R2=$(echo "$R"|grep "Succeeded")
if [ -n "$R2" ]; then
logIt "debug1: ipconfig 
'$AZ_IPCONFIG_NAME' exists"
return $OCF_SUCCESS
else
logIt "debug1: not sure how 
this happens"
return $OCF_ERR_GENERIC
fi
fi
logIt "debug1: exiting: azip_query()"
}

azip_usage() {
cat <<END
usage: $0 {start|stop|status|monitor|validate-all|meta-data}

Expects to have a fully populated OCF RA-compliant environment set.
END
return $OCF_SUCCESS
}
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0

2017-09-25 Thread Eric Robinson
Problem:

Under high write load, DRBD exhibits data corruption. In repeated tests over a 
month-long period, file corruption occurred after 700-900 GB of data had been 
written to the DRBD volume.

Testing Platform:

2 x Dell PowerEdge R610 servers
32GB RAM
6 x Samsung SSD 840 Pro 512GB (latest firmware)
Dell H200 JBOD Controller
SUSE Linux Enterprise Server 12 SP2 (kernel 4.4.74-92.32)
Gigabit network, 900 Mbps throughput, < 1ms latency, 0 packet loss

Initial Setup:

Create 2 RAID-0 software arrays using either mdadm or LVM
On Array 1: sda5 through sdf5, create DRBD replicated volume 
(drbd0) with an ext4 filesystem
On Array 2: sda6 through sdf6, create LVM logical volume with 
an ext4 filesystem

Procedure:

Download and build the TrimTester SSD burn-in and TRIM 
verification tool from Algolia (https://github.com/algolia/trimtester).
Run TrimTester against the filesystem on drbd0, wait for 
corruption to occur
Run TrimTester against the non-drbd backed filesystem, wait for 
corruption to occur

Results:

In multiple tests over a period of a month, TrimTester would report file 
corruption when run against the DRBD volume after 700-900 GB of data had been 
written. The error would usually appear within an hour or two. However, when 
running it against the non-DRBD volume on the same physical drives, no 
corruption would occur. We could let the burn-in run for 15+ hours and write 
20+ TB of data without a problem. Results were the same with DRBD 8.4 and 9.0. 
We also tried disabling the TRIM-testing part of TrimTester and using it as a 
simple burn-in tool, just to make sure that SSD TRIM was not a factor.

Conclusion:

We are aware of some controversy surrounding the Samsung SSD 8XX series drives; 
however, the issues related to that controversy were resolved and no longer 
exist as of kernel 4.2. The 840 Pro drives are confirmed to support RZAT. Also, 
the data corruption would only occur when writing through the DRBD layer. It 
never occurred when bypassing the DRBD layer and writing directly to the 
drives, so we must conclude that DRBD has a data corruption bug under high 
write load. However, we would be more than happy to be proved wrong.

--
Eric Robinson







___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0

2017-09-26 Thread Eric Robinson
> I don't know the tool, but isn't the expectation a bit high that the tool 
> will trim
> the correct blocks throuch drbd->LVM/mdadm->device? Why not use the tool
> on the affected devices directly?
> 

I did, and the corruption did not occur. It only happened when writing through 
the DRBD layer. Also, I disabled the TRIM function of the tool and merely used 
it as a drive burn-in without triggering any trim commands. Same results.

--Eric



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft Azure?

2017-08-23 Thread Eric Robinson
I created two nodes on Microsoft Azure, but I can't get them to join a cluster. 
Any thoughts?

OS: RHEL 6.9
Corosync version: 1.4.7-5.el6.x86_64
Node names: ha001a (172.28.0.4/23), ha001b (172.28.0.5/23)

The nodes are on the same subnet and can ping and ssh to each other just fine 
by either host name or IP address.

I have configured corosync to use unicast.

corosync-cfgtool looks fine...

[root@ha001b corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id  = 172.28.0.5
status  = ring 0 active with no faults

...but corosync-objctl only shows the local node...

[root@ha001b corosync]# corosync-objctl |grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined

...pcs status shows...

Cluster name: ha001
Stack: cman
Current DC: ha001b (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:04:33 2017  Last change: Wed Aug 23 
17:51:07 2017 by root via cibadmin on ha001b

2 nodes and 0 resources configured

Online: [ ha001b ]
OFFLINE: [ ha001a ]

No resources


Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

...it shows the opposite on the other node...

[root@ha001a ~]# corosync-objctl |grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
[root@ha001a ~]# pcs status
Cluster name: ha001
Stack: cman
Current DC: ha001a (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:06:04 2017  Last change: Wed Aug 23 
17:51:03 2017 by root via cibadmin on ha001a

2 nodes and 0 resources configured

Online: [ ha001a ]
OFFLINE: [ ha001b ]

No resources


Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

...here is my corosync.conf...

compatibility: whitetank

totem {
version: 2
secauth: off
interface {
member {
memberaddr: 172.28.0.4
}
member {
memberaddr: 172.28.0.5
}
ringnumber: 0
bindnetaddr: 172.28.0.0
mcastport: 5405
ttl: 1
}
transport: udpu
}

logging {
fileline: off
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

I used tcpdump and I see a lot of traffic between them on port 2224, but 
nothing else.

Is there an issue because the bindnetaddr is 172.28.0.0 but the members have a 
/23 mask?

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker in Azure

2017-08-24 Thread Eric Robinson
> Don't use Azure? ;)

That would be my preference. But since I'm stuck with Azure (management 
decision) I need to come up with something. It appears there is an Azure API to 
make changes on-the-fly from a Linux box. Maybe I'll write a resource agent to 
change Azure and make IPaddr2 dependent on it. That might work?
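If anyone is curious, the kind of Azure CLI call involved looks roughly like 
this (the resource group, NIC, and ipconfig names below are made up for 
illustration):

# move the floating private IP to the ipconfig on the node that should own it
az network nic ip-config update -g my-resource-group --nic-name node1-nic \
    --name cluster-ipconfig --private-ip-address 172.28.0.10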

--
Eric Robinson


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker in Azure

2017-08-24 Thread Eric Robinson
I deployed a couple of cluster nodes in Azure and found out right away that 
floating a virtual IP address between nodes does not work because Azure does 
not honor IP changes made from within the VMs. IP changes must be made to 
virtual NICs in the Azure portal itself. Anybody know of an easy way around 
this limitation?

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker in Azure

2017-08-24 Thread Eric Robinson
I agree completely. Are you offering to make those changes? Because they would 
expand the capability of the resource agent and would be a welcome addition. Also, 
full disclosure, I need to have something in place by the weekend, lol.

From: Ken Gaillot <kgail...@redhat.com>
Sent: Thursday, August 24, 2017 4:45:32 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Pacemaker in Azure

That would definitely be of wider interest.

I could see modifying the IPaddr2 RA to take some new arguments for
AWS/Azure parameters, and if those are configured, it would do the
appropriate API requests.

On Thu, 2017-08-24 at 23:27 +, Eric Robinson wrote:
> Leon -- I will pay you one trillion samolians for that resource agent!
> Any way we can get our hands on a copy?
>
>
>
> --
> Eric Robinson
>
>
>
> From: Leon Steffens [mailto:l...@steffensonline.com]
> Sent: Thursday, August 24, 2017 3:48 PM
> To: Cluster Labs - All topics related to open-source clustering
> welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Pacemaker in Azure
>
>
>
> That's what we did in AWS.  The IPaddr2 resource agent does an arp
> broadcast after changing the local IP but this does not work in AWS
> (probably for the same reasons as Azure).
>
>
>
>
> We created our own OCF resource agent that uses the Amazon APIs to
> move the IP in AWS land and made that dependent on the IPaddr2
> resource, and it worked fine.
>
>
>
>
>
>
>
>
> Leon Steffens
>
>
>
>
> On Fri, Aug 25, 2017 at 8:34 AM, Eric Robinson
> <eric.robin...@psmnv.com> wrote:
>
> > Don't use Azure? ;)
>
> That would be my preference. But since I'm stuck with Azure
> (management decision) I need to come up with something. It
> appears there is an Azure API to make changes on-the-fly from
> a Linux box. Maybe I'll write a resource agent to change Azure
> and make IPaddr2 dependent on it. That might work?
>
> --
> Eric Robinson
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

--
Ken Gaillot <kgail...@redhat.com>





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker in Azure

2017-08-25 Thread Eric Robinson
Oh, okay. I thought you meant some different ones. 

--
Eric Robinson
Chief Information Officer
Physician Select Management, LLC
775.885.2211 x 112

-Original Message-
From: Kristoffer Grönlund [mailto:kgronl...@suse.com] 
Sent: Friday, August 25, 2017 9:56 AM
To: Eric Robinson <eric.robin...@psmnv.com>; Cluster Labs - All topics related 
to open-source clustering welcomed <users@clusterlabs.org>
Subject: RE: [ClusterLabs] Pacemaker in Azure

Eric Robinson <eric.robin...@psmnv.com> writes:

> Hi Kristoffer --
>
> If you would be willing to share your AWS ip control agent(s), I think those 
> would be very helpful to us and the community at large. I'll be happy to 
> share whatever we come up with in terms of an Azure agent when we're all done.

I meant the agents that are in resource-agents already:

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/awsvip
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/awseip
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/aws-vpc-route53

You'll probably also be interested in fencing: There are agents for fencing 
both on AWS and Azure in the fence-agents repository.

Cheers,
Kristoffer

>
> --
> Eric Robinson
>
> -Original Message-
> From: Kristoffer Grönlund [mailto:kgronl...@suse.com]
> Sent: Friday, August 25, 2017 3:16 AM
> To: Eric Robinson <eric.robin...@psmnv.com>; Cluster Labs - All topics 
> related to open-source clustering welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Pacemaker in Azure
>
> Eric Robinson <eric.robin...@psmnv.com> writes:
>
>> I deployed a couple of cluster nodes in Azure and found out right away that 
>> floating a virtual IP address between nodes does not work because Azure does 
>> not honor IP changes made from within the VMs. IP changes must be made to 
>> virtual NICs in the Azure portal itself. Anybody know of an easy way around 
>> this limitation?
>
> You will need a custom IP control agent for Azure. We have a series of agents 
> for controlling IP addresses and domain names in AWS, but there is no agent 
> for Azure IP control yet. (At least as far as I am aware).
>
> Cheers,
> Kristoffer
>
>>
>> --
>> Eric Robinson
>>
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org Getting started: 
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com

--
// Kristoffer Grönlund
// kgronl...@suse.com
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker in Azure

2017-08-25 Thread Eric Robinson
Thanks. Leon sent me the same one earlier but I hadn't mentioned it yet (just 
got it a short while ago). I'll be able to use it as a template to build one 
for Azure. I have already installed the Azure CLI and it is working from my 
Linux cluster nodes, so I'm maybe a third of the way there. 

--
Eric Robinson
   

> -Original Message-
> From: Oyvind Albrigtsen [mailto:oalbr...@redhat.com]
> Sent: Friday, August 25, 2017 12:17 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Pacemaker in Azure
> 
> There's the awsvip agent that can handle secondary private IP addresses this
> way (to be used with order/colocation constraints with IPaddr2).
> 
> https://github.com/ClusterLabs/resource-
> agents/blob/master/heartbeat/awsvip
> 
> There's also the awseip for Elastic IPs that can assign your Elastic IP to 
> hosts or
> secondary private IPs.
> 
> On 25/08/17 10:13 +1000, Leon Steffens wrote:
> >Unfortunately I can't post the full resource agent here.
> >
> >In our search for solutions we did find a resource agent for managing
> >AWS Elastic IPs:
> >https://github.com/moomindani/aws-eip-resource-
> agent/blob/master/eip.
> >This was not what we wanted, but it will give you an idea of how it can
> work.
> >
> >Our script manages secondary private IPs by using:
> >
> >aws ec2 assign-private-ip-addresses
> >aws ec2 unassign-private-ip-addresses
> >aws ec2 describe-network-interfaces
> >
> >
> >There are a few things to consider:
> >* The AWS call to assign IPs to an EC2 instance is asynchronous (or it
> >was the last time I checked), so you have to wait a bit (or poll
> >AWS/Azure until the IP is ready).
> >* The IP change is slower than a normal VIP change on the machine, so
> >expect a slightly longer outage.
> >
> >
> >Leon
> 
> >___
> >Users mailing list: Users@clusterlabs.org
> >http://lists.clusterlabs.org/mailman/listinfo/users
> >
> >Project Home: http://www.clusterlabs.org Getting started:
> >http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >Bugs: http://bugs.clusterlabs.org
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ClusterLabs.Org Documentation Problem?

2017-08-23 Thread Eric Robinson
I have a BIG correction.

If you follow the instructions titled, "Pacemaker 1.1 for Corosync 2.x," and 
NOT the ones entitled, "Pacemaker 1.1 for CMAN or Corosync 1.x," guess what? It 
installs cman anyway, and you spend a couple of days wondering why none of your 
changes to corosync.conf seem to be working.

--
Eric Robinson

-Original Message-
From: Jan Friesse [mailto:jfrie...@redhat.com] 
Sent: Tuesday, August 22, 2017 11:52 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>; kgail...@redhat.com
Subject: Re: [ClusterLabs] ClusterLabs.Org Documentation Problem?

> Thanks for the reply. Yes, it's a bit confusing. I did end up using the 
> documentation for Corosync 2.X since that seemed newer, but it also assumed 
> CentOS/RHEL7 and systemd-based commands. It also incorporates cman, pcsd, 
> psmisc, and policycoreutils-python, which are all new to me. If there is 
> anything I can do to assist with getting the documentation cleaned up, I'd be 
> more than glad to help.

Just a small correction.

Documentation shouldn't incorporate cman. Cman was used with corosync 1.x as a 
configuration layer and (more important) quorum provider. With Corosync 2.x 
quorum provider is already in corosync so no need for cman.



>
> --
> Eric Robinson
>
> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, August 22, 2017 2:08 PM
> To: Cluster Labs - All topics related to open-source clustering 
> welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] ClusterLabs.Org Documentation Problem?
>
> On Tue, 2017-08-22 at 19:40 +, Eric Robinson wrote:
>> The documentation located here…
>>
>>
>>
>> http://clusterlabs.org/doc/
>>
>>
>>
>> …is confusing because it offers two combinations:
>>
>>
>>
>> Pacemaker 1.0 for Corosync 1.x
>>
>> Pacemaker 1.1 for Corosync 2.x
>>
>>
>>
>> According to the documentation, if you use Corosync 1.x you need 
>> Pacemaker 1.0, but if you use Corosync 2.x then you need Pacemaker 
>> 1.1.
>>
>>
>>
>> However, on my Centos 6.9 system, when I do ‘yum install pacemaker 
>> corosync” I get the following versions:
>>
>>
>>
>> pacemaker-1.1.15-5.el6.x86_64
>>
>> corosync-1.4.7-5.el6.x86_64
>>
>>
>>
>> What’s the correct answer? Does Pacemaker 1.1.15 work with Corosync 
>> 1.4.7? If so, is the documentation at ClusterLabs misleading?
>>
>>
>>
>> --
>> Eric Robinson
>
> The page actually offers a third option ... "Pacemaker 1.1 for CMAN or 
> Corosync 1.x". That's the configuration used by CentOS 6.
>
> However, that's still a bit misleading; the documentation set for "Pacemaker 
> 1.1 for Corosync 2.x" is the only one that is updated, and it's mostly 
> independent of the underlying layer, so you should prefer that set.
>
> I plan to reorganize that page in the coming months, so I'll try to make it 
> clearer.
>
> --
> Ken Gaillot <kgail...@redhat.com>
>
>
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> ___
> Users mailing list: Users@clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


___
Users mailing list: Users@clusterlabs.org 
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft Azure?

2017-08-23 Thread Eric Robinson
I figured out the cause. CMAN got installed by yum, and so none of my changes 
to corosync.conf had any effect, including the udpu directive. Now I just have 
to figure out how to enable unicast in cman.
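For anyone following along, the change appears to boil down to telling cman 
itself to use UDP unicast in /etc/cluster/cluster.conf. A minimal, untested 
sketch (using my node names; config_version is a placeholder) looks like:

<cluster name="ha001" config_version="2">
  <cman two_node="1" expected_votes="1" transport="udpu"/>
  <clusternodes>
    <clusternode name="ha001a" nodeid="1"/>
    <clusternode name="ha001b" nodeid="2"/>
  </clusternodes>
</cluster>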

--
Eric Robinson


From: Eric Robinson [mailto:eric.robin...@psmnv.com]
Sent: Wednesday, August 23, 2017 3:16 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: [ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft 
Azure?

I created two nodes on Microsoft Azure, but I can't get them to join a cluster. 
Any thoughts?

OS: RHEL 6.9
Corosync version: 1.4.7-5.el6.x86_64
Node names: ha001a (172.28.0.4/23), ha001b (172.28.0.5/23)

The nodes are on the same subnet and can ping and ssh to each other just fine 
by either host name or IP address.

I have configured corosync to use unicast.

corosync-cfgtool looks fine...

[root@ha001b corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id  = 172.28.0.5
status  = ring 0 active with no faults

...but corosync-objctl only shows the local node...

[root@ha001b corosync]# corosync-objctl |grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined

...pcs status shows...

Cluster name: ha001
Stack: cman
Current DC: ha001b (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:04:33 2017  Last change: Wed Aug 23 
17:51:07 2017 by root via cibadmin on ha001b

2 nodes and 0 resources configured

Online: [ ha001b ]
OFFLINE: [ ha001a ]

No resources


Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

...it shows the opposite on the other node...

[root@ha001a ~]# corosync-objctl |grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
[root@ha001a ~]# pcs status
Cluster name: ha001
Stack: cman
Current DC: ha001a (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:06:04 2017  Last change: Wed Aug 23 
17:51:03 2017 by root via cibadmin on ha001a

2 nodes and 0 resources configured

Online: [ ha001a ]
OFFLINE: [ ha001b ]

No resources


Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

...here is my corosync.conf...

compatibility: whitetank

totem {
version: 2
secauth: off
interface {
member {
memberaddr: 172.28.0.4
}
member {
memberaddr: 172.28.0.5
}
ringnumber: 0
bindnetaddr: 172.28.0.0
mcastport: 5405
ttl: 1
}
transport: udpu
}

logging {
fileline: off
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

I used tcpdump and I see a lot of traffic between them on port 2224, but 
nothing else.

Is there an issue because the bindnetaddr is 172.28.0.0 but the members have a 
/23 mask?

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft Azure?

2017-08-23 Thread Eric Robinson
I got it.


From: Eric Robinson [mailto:eric.robin...@psmnv.com]
Sent: Wednesday, August 23, 2017 6:51 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: Re: [ClusterLabs] Is there a Trick to Making Corosync Work on 
Microsoft Azure?

I figured out the cause. CMAN got installed by yum, and so none of my changes 
to corosync.conf had any effect, including the udpu directive. Now I just have 
to figure out how to enable unicast in cman.

--
Eric Robinson


From: Eric Robinson [mailto:eric.robin...@psmnv.com]
Sent: Wednesday, August 23, 2017 3:16 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org<mailto:users@clusterlabs.org>>
Subject: [ClusterLabs] Is there a Trick to Making Corosync Work on Microsoft 
Azure?

I created two nodes on Microsoft Azure, but I can't get them to join a cluster. 
Any thoughts?

OS: RHEL 6.9
Corosync version: 1.4.7-5.el6.x86_64
Node names: ha001a (172.28.0.4/23), ha001b (172.28.0.5/23)

The nodes are on the same subnet and can ping and ssh to each other just fine 
by either host name or IP address.

I have configured corosync to use unicast.

corosync-cfgtool looks fine...

[root@ha001b corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id  = 172.28.0.5
status  = ring 0 active with no faults

...but corosync-objctl only shows the local node...

[root@ha001b corosync]# corosync-objctl |grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined

...pcs status shows...

Cluster name: ha001
Stack: cman
Current DC: ha001b (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:04:33 2017  Last change: Wed Aug 23 
17:51:07 2017 by root via cibadmin on ha001b

2 nodes and 0 resources configured

Online: [ ha001b ]
OFFLINE: [ ha001a ]

No resources


Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

...it shows the opposite on the other node...

[root@ha001a ~]# corosync-objctl |grep join
totem.join=60
runtime.totem.pg.mrp.srp.memb_join_tx=1
runtime.totem.pg.mrp.srp.memb_join_rx=1
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
[root@ha001a ~]# pcs status
Cluster name: ha001
Stack: cman
Current DC: ha001a (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Wed Aug 23 18:06:04 2017  Last change: Wed Aug 23 
17:51:03 2017 by root via cibadmin on ha001a

2 nodes and 0 resources configured

Online: [ ha001a ]
OFFLINE: [ ha001b ]

No resources


Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

...here is my corosync.conf...

compatibility: whitetank

totem {
version: 2
secauth: off
interface {
member {
memberaddr: 172.28.0.4
}
member {
memberaddr: 172.28.0.5
}
ringnumber: 0
bindnetaddr: 172.28.0.0
mcastport: 5405
ttl: 1
}
transport: udpu
}

logging {
fileline: off
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

I used tcpdump and I see a lot of traffic between them on port 2224, but 
nothing else.

Is there an issue because the bindnetaddr is 172.28.0.0 but the members have a 
/23 mask?

--
Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Cannot connect to the drbdmanaged process using DBus

2017-12-14 Thread Eric Robinson
I'm sure someone has seen this before. What does it mean?

ha11a:~ # drbdmanage init 198.51.100.65

You are going to initialize a new drbdmanage cluster.
CAUTION! Note that:
  * Any previous drbdmanage cluster information may be removed
  * Any remaining resources managed by a previous drbdmanage installation
that still exist on this system will no longer be managed by drbdmanage

Confirm:

  yes/no: yes
Empty drbdmanage control volume initialized on '/dev/drbd0'.
Empty drbdmanage control volume initialized on '/dev/drbd1'.

Error: Cannot connect to the drbdmanaged process using DBus
The DBus subsystem returned the following error description:
org.freedesktop.DBus.Error.Spawn.ChildExited: Launch helper exited with unknown 
return code 1

I'm using...

drbd-9.0.9+git.bffac0d9-72.1.x86_64
drbd-kmp-default-9.0.9+git.bffac0d9_k4.4.76_1-72.1.x86_64
drbdmanage-0.99.5-5.1.noarch
drbd-utils-9.0.0-56.1.x86_64

--Eric

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3

2017-11-06 Thread Eric Robinson
I installed corosync 2.4.3 and pacemaker 1.1.17 from the openSUSE Leap 42.3 
repos, but I can't find pcs or pcsd. Anybody know where to download them from?

--Eric



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3

2017-11-06 Thread Eric Robinson
Thanks much. I am experienced with crmsh because I have been using it for 
years, but I recently tried pcs and I really like the way it handles 
constraints. Would be nice if it worked on openSUSE. Oh well.

--Eric


From: Eric Ren [mailto:z...@suse.com]
Sent: Monday, November 06, 2017 10:28 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>; Eric Robinson <eric.robin...@psmnv.com>
Subject: Re: [ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3


Hi,
On 11/07/2017 05:35 AM, Eric Robinson wrote:
I installed corosync 2.4.3 and pacemaker 1.1.17 from the openSUSE Leap 42.3 
repos, but I can't find pcs or pcsd. Anybody know where to download them from?

openSUSE/SUSE uses the CLI tool "crmsh" and the web UI "hawk" to manage HA clusters.
Please see "quick start" doc [1] and other HA docs under here [2].

[1] 
https://www.suse.com/documentation/sle-ha-12/install-quick/data/install-quick.html
[2] https://www.suse.com/documentation/sle-ha-12/index.html

Eric

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Where to Find pcs and pcsd for openSUSE Leap 42.3

2017-11-07 Thread Eric Robinson
> Which aspects of its constraints handling do you like, and why?  I'm curious,
> since I wasn't aware that it was significantly different from crmsh in this
> respect.
> 

Well, to be fair, in the past I have always configured my clusters by using 
'crm configure edit' and building the config in a full-screen editor, and then 
saving it. Using that method, I have often had trouble getting my colocations 
and orderings to work properly. You have to use parentheses and brackets to 
group things, and when you're done and save it, the cluster sometimes re-writes 
your statements for you. When you edit the config again, it looks different 
than what you typed and the resource dependencies are not what you wanted. It's 
very frustrating.

With pcs, the colocation syntax 'constraint add resource1 with resource2' and 
order syntax 'resource2 then resource1' are very intuitive and cumulative. I 
always get exactly what I want. The first time I configured a cluster with pcs 
I fell in love with it.  
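For example, with a couple of hypothetical resources p_vip and p_mysql it ends 
up as simply:

pcs constraint colocation add p_mysql with p_vip
pcs constraint order p_vip then p_mysql

Each additional command just adds another constraint, so the configuration 
grows exactly the way you typed it.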

--Eric

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] One volume is trimmable but the other is not?

2018-01-26 Thread Eric Robinson
> > I sent this to the drbd list too, but it’s possible that someone here
> > may know.
> >
> >
> >
> > This is a WEIRD one.
> >
> >
> >
> > Why would one drbd volume be trimmable and the other one not?
> >
> 
> iirc drbd stores some of the config in the meta-data as well - like e.g. some
> block-size I remember in particular - and that doesn't just depend on the
> content of the current config-files but as well on the history (like already
> connected and to whom).
> Don't know if that helps in particular - just saying taking a look at 
> differences
> on the replication-partners might be worth while.
> 
> I know that it shows the maximum discard block-size 0 on one of the drbds
> but that might be a configuration passed down by the lvm layer as well.
> (provisioning_mode?) So searching for differences in the volume-groups or
> volumes might make sense as well.
> 
> Regards,
> Klaus

Thanks for your reply, Klaus. However, I don't think it's possible that 
anything could be getting "passed down" from LVM because the drbd devices are 
built directly on top of the raid arrays, with no LVM layer between...

{
on ha11a {
device /dev/drbd1;
disk /dev/md3;
address 198.51.100.65:7789;
meta-disk internal;
}

on ha11b {
device /dev/drbd1;
disk /dev/md3;
address 198.51.100.66:7789;
meta-disk internal;
}
}

--Eric
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] One volume is trimmable but the other is not?

2018-01-25 Thread Eric Robinson
I sent this to the drbd list too, but it's possible that someone here may know.

This is a WEIRD one.

Why would one drbd volume be trimmable and the other one not?

Here you can see me issuing the trim command against two different filesystems. 
It works on one but fails on the other.

ha11a:~ # fstrim -v /ha01_mysql
/ha01_mysql: 0 B (0 bytes) trimmed

ha11a:~ # fstrim -v /ha02_mysql
fstrim: /ha02_mysql: the discard operation is not supported

Both filesystems are on the same server, two different drbd devices on two 
different mdraid arrays, but the same underlying physical drives.

Yet it can be seen that discard is enabled on drbd0 but not on drbd1...

NAME                              DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                                      0      512B       4G         1
├─sda1                                   0      512B       4G         1
│ └─md0                                  0      128K     256M         0
├─sda2                                   0      512B       4G         1
│ └─md1                                  0      128K     256M         0
├─sda3                                   0      512B       4G         1
├─sda4                                   0      512B       4G         1
├─sda5                                   0      512B       4G         1
│ └─md2                                  0        1M     256M         0
│   └─drbd0                              0        1M     128M         0
│     └─vg_on_drbd0-lv_on_drbd0     393216        1M     128M         0
└─sda6                                   0      512B       4G         1
  └─md3                                  0        1M     256M         0
    └─drbd1                              0        0B       0B         0
      └─vg_on_drbd1-lv_on_drbd1          0        0B       0B         0


The filesystems are set up the same. (Note that I do not want automatic discard, 
so that option is not enabled on either filesystem. But the problem is not the 
filesystem, since that relies on drbd, and you can see from lsblk that the drbd 
volume is the problem.)

ha11a:~ # mount|grep drbd
/dev/mapper/vg_on_drbd1-lv_on_drbd1 on /ha02_mysql type ext4 
(rw,relatime,stripe=160,data=ordered)
/dev/mapper/vg_on_drbd0-lv_on_drbd0 on /ha01_mysql type ext4 
(rw,relatime,stripe=160,data=ordered)





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Does CMAN Still Not Support Multiple CoroSync Rings?

2018-02-12 Thread Eric Robinson
General question. I tried to set up a cman + corosync + pacemaker cluster using 
two corosync rings. When I start the cluster, everything works fine, except 
when I do a 'corosync-cfgtool -s' it only shows one ring. I tried manually 
editing the /etc/cluster/cluster.conf file adding two  sections, but 
then cman complained that I didn't have a multicast address specified, even 
though I did. I tried editing the /etc/corosync/corosync.conf file, and then I 
could get two rings, but the nodes would not both join the cluster. Bah! I did 
some reading and saw that cman didn't support multiple rings years ago. Did it 
never get updated?

[sig]

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Does CMAN Still Not Support Multiple CoroSync Rings?

2018-02-13 Thread Eric Robinson
Thanks for the suggestions, everyone. I'll give that a try.
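
For the archives, my understanding of the altname approach Honza points to below 
is something like the following in /etc/cluster/cluster.conf (a sketch based on my 
reading of the Red Hat RRP documentation; node names are hypothetical and I have 
not verified the exact attribute names yet):

<clusternodes>
  <clusternode name="node1" nodeid="1">
    <altname name="node1-alt"/>
  </clusternode>
  <clusternode name="node2" nodeid="2">
    <altname name="node2-alt"/>
  </clusternode>
</clusternodes>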

> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Monday, February 12, 2018 8:49 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users@clusterlabs.org>; Eric Robinson <eric.robin...@psmnv.com>
> Subject: Re: [ClusterLabs] Does CMAN Still Not Support Multiple CoroSync
> Rings?
> 
> Eric,
> 
> > General question. I tried to set up a cman + corosync + pacemaker
> > cluster using two corosync rings. When I start the cluster, everything
> > works fine, except when I do a 'corosync-cfgtool -s' it only shows one
> > ring. I tried manually editing the /etc/cluster/cluster.conf file
> > adding two 
> 
> AFAIK cluster.conf should be edited so altname is used. So something like in
> this example:
> https://access.redhat.com/documentation/en-
> us/red_hat_enterprise_linux/6/html/cluster_administration/s1-config-rrp-
> cli-ca
> 
> I don't think you have to add altmulticast.
> 
> Honza
> 
> sections, but then cman complained that I didn't have a multicast address
> specified, even though I did. I tried editing the /etc/corosync/corosync.conf
> file, and then I could get two rings, but the nodes would not both join the
> cluster. Bah! I did some reading and saw that cman didn't support multiple
> rings years ago. Did it never get updated?
> >
> > [sig]
> >
> >
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Does CMAN Still Not Support Multiple CoroSync Rings?

2018-02-14 Thread Eric Robinson
> > Thanks for the  suggestion everyone. I'll give that a try.
> 
> > Sorry, I'm late on this, but I wrote a quick start doc describing this (amongst
> other things) some time ago. See the following chapter:
> 
> https://clusterlabs.github.io/PAF/Quick_Start-CentOS-6.html#cluster-
> creation
> 

I scanned through that page but I did not see where it talks about setting up 
multiple corosync rings.

--Eric


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Why Won't Resources Move?

2018-07-31 Thread Eric Robinson
I have what seems to be a healthy cluster, but I can't get resources to move.

Here's what's installed...

[root@001db01a cluster]# yum list installed|egrep "pacem|coro"
corosync.x86_64  2.4.3-2.el7_5.1 @updates
corosynclib.x86_64   2.4.3-2.el7_5.1 @updates
pacemaker.x86_64 1.1.18-11.el7_5.3   @updates
pacemaker-cli.x86_64 1.1.18-11.el7_5.3   @updates
pacemaker-cluster-libs.x86_641.1.18-11.el7_5.3   @updates
pacemaker-libs.x86_641.1.18-11.el7_5.3   @updates

Cluster status looks good...

[root@001db01b cluster]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01b (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with 
quorum
Last updated: Wed Aug  1 03:44:47 2018
Last change: Wed Aug  1 03:22:18 2018 by root via cibadmin on 001db01a

2 nodes configured
11 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

p_vip_clust01  (ocf::heartbeat:IPaddr2):   Started 001db01b
p_azip_clust01 (ocf::heartbeat:AZaddr2):   Started 001db01b
Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db01b ]
 Slaves: [ 001db01a ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db01b ]
 Slaves: [ 001db01a ]
p_fs_clust01   (ocf::heartbeat:Filesystem):Started 001db01b
p_fs_clust02   (ocf::heartbeat:Filesystem):Started 001db01b
p_vip_clust02  (ocf::heartbeat:IPaddr2):   Started 001db01b
p_azip_clust02 (ocf::heartbeat:AZaddr2):   Started 001db01b
p_mysql_001(lsb:mysql_001):Started 001db01b

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Constraints look like this...

[root@001db01b cluster]# pcs constraint
Location Constraints:
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_vip_clust01 then start p_azip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust02 then start p_azip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
Colocation Constraints:
  p_azip_clust01 with p_vip_clust01 (score:INFINITY)
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_azip_clust02 with p_vip_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

But when I issue a move command, nothing at all happens.
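
For reference, the move command in question was of this form (a reconstructed 
sketch; it matches the cli-prefer constraint that shows up in the log below):

pcs resource move ms_drbd0 001db01a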

I see this in the log on one node...

Aug 01 03:21:57 [16550] 001db01bcib: info: cib_perform_op:  ++ 
/cib/configuration/constraints:  
Aug 01 03:21:57 [16550] 001db01bcib: info: cib_process_request: 
Completed cib_modify operation for section constraints: OK (rc=0, 
origin=001db01a/crm_resource/4, version=0.138.0)
Aug 01 03:21:57 [16555] 001db01b   crmd: info: abort_transition_graph:  
Transition aborted by rsc_location.cli-prefer-ms_drbd0 'create': Configuration 
change | cib=0.138.0 source=te_update_diff:456 
path=/cib/configuration/constraints complete=true

And I see this in the log on the other node...

notice: p_drbd1_monitor_6:69196:stderr [ Error signing on to the CIB 
service: Transport endpoint is not connected ]

Any thoughts?

--Eric



[sig]

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Won't Resources Move?

2018-08-01 Thread Eric Robinson
> -Original Message-
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ken Gaillot
> Sent: Wednesday, August 01, 2018 2:17 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Why Won't Resources Move?
> 
> On Wed, 2018-08-01 at 03:49 +, Eric Robinson wrote:
> > I have what seems to be a healthy cluster, but I can’t get resources
> > to move.
> >
> > Here’s what’s installed…
> >
> > [root@001db01a cluster]# yum list installed|egrep "pacem|coro"
> > corosync.x86_64  2.4.3-2.el7_5.1 @updates
> > corosynclib.x86_64   2.4.3-2.el7_5.1 @updates
> > pacemaker.x86_64 1.1.18-11.el7_5.3 @updates
> > pacemaker-cli.x86_64 1.1.18-11.el7_5.3 @updates
> > pacemaker-cluster-libs.x86_64    1.1.18-11.el7_5.3 @updates
> > pacemaker-libs.x86_64    1.1.18-11.el7_5.3 @updates
> >
> > Cluster status looks good…
> >
> > [root@001db01b cluster]# pcs status
> > Cluster name: 001db01ab
> > Stack: corosync
> > Current DC: 001db01b (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
> > partition with quorum Last updated: Wed Aug  1 03:44:47 2018 Last
> > change: Wed Aug  1 03:22:18 2018 by root via cibadmin on 001db01a
> >
> > 2 nodes configured
> > 11 resources configured
> >
> > Online: [ 001db01a 001db01b ]
> >
> > Full list of resources:
> >
> > p_vip_clust01  (ocf::heartbeat:IPaddr2):   Started 001db01b
> > p_azip_clust01 (ocf::heartbeat:AZaddr2):   Started 001db01b
> > Master/Slave Set: ms_drbd0 [p_drbd0]
> >  Masters: [ 001db01b ]
> >  Slaves: [ 001db01a ]
> > Master/Slave Set: ms_drbd1 [p_drbd1]
> >  Masters: [ 001db01b ]
> >  Slaves: [ 001db01a ]
> > p_fs_clust01   (ocf::heartbeat:Filesystem):    Started 001db01b
> > p_fs_clust02   (ocf::heartbeat:Filesystem):    Started 001db01b
> > p_vip_clust02  (ocf::heartbeat:IPaddr2):   Started 001db01b
> > p_azip_clust02 (ocf::heartbeat:AZaddr2):   Started 001db01b
> > p_mysql_001    (lsb:mysql_001):    Started 001db01b
> >
> > Daemon Status:
> >   corosync: active/disabled
> >   pacemaker: active/disabled
> >   pcsd: active/enabled
> >
> > Constraints look like this…
> >
> > [root@001db01b cluster]# pcs constraint Location Constraints:
> > Ordering Constraints:
> >   promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
> >   promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
> >   start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
> >   start p_vip_clust01 then start p_azip_clust01 (kind:Mandatory)
> >   start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
> >   start p_vip_clust02 then start p_azip_clust02 (kind:Mandatory)
> >   start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
> > Colocation Constraints:
> >   p_azip_clust01 with p_vip_clust01 (score:INFINITY)
> >   p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
> >   p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
> >   p_vip_clust01 with p_fs_clust01 (score:INFINITY)
> >   p_vip_clust02 with p_fs_clust02 (score:INFINITY)
> >   p_azip_clust02 with p_vip_clust02 (score:INFINITY)
> >   p_mysql_001 with p_vip_clust01 (score:INFINITY) Ticket Constraints:
> >
> > But when I issue a move command, nothing at all happens.
> >
> > I see this in the log on one node…
> >
> > Aug 01 03:21:57 [16550] 001db01b    cib: info:
> > cib_perform_op:  ++ /cib/configuration/constraints:   > id="cli-prefer-ms_drbd0" rsc="ms_drbd0" role="Started"
> > node="001db01a" score="INFINITY"/>
> > Aug 01 03:21:57 [16550] 001db01b    cib: info:
> > cib_process_request: Completed cib_modify operation for section
> > constraints: OK (rc=0, origin=001db01a/crm_resource/4,
> > version=0.138.0)
> > Aug 01 03:21:57 [16555] 001db01b   crmd: info:
> > abort_transition_graph:  Transition aborted by rsc_location.cli-
> > prefer-ms_drbd0 'create': Configuration change | cib=0.138.0
> > source=te_update_diff:456 path=/cib/configuration/constraints
> > complete=true
> >
> > And I see this in the log on the other node…
> >
> > notice: p_drbd1_monitor_6:69196:stderr [ Error signing on to the
> > CIB service: Transport endpoint is not connected ]
> 
> The message likely came from the resource agent calling crm_attribute to set
> a node attribute. That message usually means the

Re: [ClusterLabs] Why Won't Resources Move?

2018-08-01 Thread Eric Robinson

> > The message likely came from the resource agent calling crm_attribute
> > to set a node attribute. That message usually means the cluster isn't
> > running on that node, so it's highly suspect. The cib might have
> > crashed, which should be in the log as well. I'd look into that first.
> 
> 
> I rebooted the server and afterwards I'm still getting tons of these...
> 
> Aug  2 01:43:40 001db01a drbd(p_drbd1)[18628]: ERROR: ha02_mysql: Called
> /usr/sbin/crm_master -Q -l reboot -v 1 Aug  2 01:43:40 001db01a
> drbd(p_drbd0)[18627]: ERROR: ha01_mysql: Called /usr/sbin/crm_master -Q -l
> reboot -v 1 Aug  2 01:43:40 001db01a drbd(p_drbd0)[18627]: ERROR:
> ha01_mysql: Exit code 107 Aug  2 01:43:40 001db01a drbd(p_drbd1)[18628]:
> ERROR: ha02_mysql: Exit code 107 Aug  2 01:43:40 001db01a
> drbd(p_drbd0)[18627]: ERROR: ha01_mysql: Command output:
> Aug  2 01:43:40 001db01a drbd(p_drbd1)[18628]: ERROR: ha02_mysql:
> Command output:
> Aug  2 01:43:40 001db01a lrmd[2025]:  notice:
> p_drbd0_monitor_6:18627:stderr [ Error signing on to the CIB service:
> Transport endpoint is not connected ] Aug  2 01:43:40 001db01a lrmd[2025]:
> notice: p_drbd1_monitor_6:18628:stderr [ Error signing on to the CIB
> service: Transport endpoint is not connected ]
> 
> 

Ken, 

Ironically, while researching this problem, I ran across the same question 
being asked back in November of 2017, and you made the same comment back then.

https://lists.clusterlabs.org/pipermail/users/2017-November/013975.html

And the solution turned out to be the same for me as it was for that guy. On 
the node where I was getting the errors, SELINUX was enforcing. I set it to 
permissive and the errors went away. 
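
For anyone who finds this thread later, the check and the temporary fix were 
simply the standard SELinux tooling:

getenforce      # reported "Enforcing" on the node throwing the errors
setenforce 0    # switch to permissive immediately
# to make it persistent, set SELINUX=permissive in /etc/selinux/config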

--Eric
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Why Won't Resources Move?

2018-08-02 Thread Eric Robinson
> Hi!
> 
> I'm not familiar with Redhat, but is this normal?:
> 
> > >   corosync: active/disabled
> > >   pacemaker: active/disabled
> 
> Regards,
> Ulrich

That's the default after a new install. I had not enabled them to start 
automatically yet. 


> 
> >>> Eric Robinson  schrieb am 02.08.2018 um
> >>> 03:44 in
> Nachricht
>  rd03.prod.outlook.com>
> 
> >>  -Original Message-
> >> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ken
> Gaillot
> >> Sent: Wednesday, August 01, 2018 2:17 PM
> >> To: Cluster Labs - All topics related to open-source clustering
> >> welcomed 
> >> Subject: Re: [ClusterLabs] Why Won't Resources Move?
> >>
> >> On Wed, 2018-08-01 at 03:49 +, Eric Robinson wrote:
> >> > I have what seems to be a healthy cluster, but I can’t get
> >> > resources to move.
> >> >
> >> > Here’s what’s installed…
> >> >
> >> > [root@001db01a cluster]# yum list installed|egrep "pacem|coro"
> >> > corosync.x86_64  2.4.3-2.el7_5.1 @updates
> >> > corosynclib.x86_64   2.4.3-2.el7_5.1 @updates
> >> > pacemaker.x86_64 1.1.18-11.el7_5.3 @updates
> >> > pacemaker-cli.x86_64 1.1.18-11.el7_5.3 @updates
> >> > pacemaker-cluster-libs.x86_641.1.18-11.el7_5.3 @updates
> >> > pacemaker-libs.x86_641.1.18-11.el7_5.3 @updates
> >> >
> >> > Cluster status looks good…
> >> >
> >> > [root@001db01b cluster]# pcs status Cluster name: 001db01ab
> >> > Stack: corosync
> >> > Current DC: 001db01b (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
> >> > partition with quorum Last updated: Wed Aug  1 03:44:47 2018 Last
> >> > change: Wed Aug  1 03:22:18 2018 by root via cibadmin on 001db01a
> >> >
> >> > 2 nodes configured
> >> > 11 resources configured
> >> >
> >> > Online: [ 001db01a 001db01b ]
> >> >
> >> > Full list of resources:
> >> >
> >> > p_vip_clust01  (ocf::heartbeat:IPaddr2):   Started 001db01b
> >> > p_azip_clust01 (ocf::heartbeat:AZaddr2):   Started 001db01b
> >> > Master/Slave Set: ms_drbd0 [p_drbd0]
> >> >  Masters: [ 001db01b ]
> >> >  Slaves: [ 001db01a ]
> >> > Master/Slave Set: ms_drbd1 [p_drbd1]
> >> >  Masters: [ 001db01b ]
> >> >  Slaves: [ 001db01a ]
> >> > p_fs_clust01   (ocf::heartbeat:Filesystem):Started 001db01b
> >> > p_fs_clust02   (ocf::heartbeat:Filesystem):Started 001db01b
> >> > p_vip_clust02  (ocf::heartbeat:IPaddr2):   Started 001db01b
> >> > p_azip_clust02 (ocf::heartbeat:AZaddr2):   Started 001db01b
> >> > p_mysql_001(lsb:mysql_001):Started 001db01b
> >> >
> >> > Daemon Status:
> >> >   corosync: active/disabled
> >> >   pacemaker: active/disabled
> >> >   pcsd: active/enabled
> >> >
> >> > Constraints look like this…
> >> >
> >> > [root@001db01b cluster]# pcs constraint Location Constraints:
> >> > Ordering Constraints:
> >> >   promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
> >> >   promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
> >> >   start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
> >> >   start p_vip_clust01 then start p_azip_clust01 (kind:Mandatory)
> >> >   start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
> >> >   start p_vip_clust02 then start p_azip_clust02 (kind:Mandatory)
> >> >   start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
> >> > Colocation Constraints:
> >> >   p_azip_clust01 with p_vip_clust01 (score:INFINITY)
> >> >   p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
> >> >   p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
> >> >   p_vip_clust01 with p_fs_clust01 (score:INFINITY)
> >> >   p_vip_clust02 with p_fs_clust02 (score:INFINITY)
> >> >   p_azip_clust02 with p_vip_clust02 (score:INFINITY)
> >> >   p_mysql_001 with p_vip_clust01 (score:INFINITY) Ticket Constraints:
> >> >
> >> > But when I issue a move command, nothing at all happens.
> >> >
> >> > I see this in the log on one node…
> >> >
> >> > Aug 01 03:21:57 [1655

[ClusterLabs] What am I Doing Wrong with Constraints?

2018-08-06 Thread Eric Robinson
I don't understand why a problem with a resource causes other resources above 
it in the dependency stack (or on the same level as it) to fail over.

My dependency stack is:

drbd -> filesystem -> floating_ip -> Azure virtual IP
                           |
                           +-> MySQL_instance_1
                           +-> MySQL_instance_2

Note that the MySQL instances are dependent on the floating IP, but not on each 
other. However, if one of the MySQL instances has a problem that causes it to 
go into a FAIL status, the whole cluster fails over. Or if the Azure virtual IP 
resource has a problem and I need to run a cleanup, the whole cluster fails 
over. 
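
(By "run a cleanup" I mean something like the following, using the resource name 
from the status output below:)

pcs resource cleanup p_azip_clust01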

Here's what my resources look like...

[root@001db01b mysql]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01b (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with 
quorum
Last updated: Mon Aug  6 12:52:44 2018
Last change: Mon Aug  6 12:18:38 2018 by root via cibadmin on 001db01a

2 nodes configured
11 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

 p_vip_clust01  (ocf::heartbeat:IPaddr2):   Started 001db01b
 p_azip_clust01 (ocf::heartbeat:AZaddr2):   Started 001db01b
 Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db01b ]
 Slaves: [ 001db01a ]
 Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db01a ]
 Slaves: [ 001db01b ]
 p_fs_clust01   (ocf::heartbeat:Filesystem):Started 001db01b
 p_fs_clust02   (ocf::heartbeat:Filesystem):Started 001db01a
 p_vip_clust02  (ocf::heartbeat:IPaddr2):   Started 001db01a
 p_azip_clust02 (ocf::heartbeat:AZaddr2):   Started 001db01a
 p_mysql_001(lsb:mysql_001):Started 001db01b

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled

Here's what my constraints look like...

[root@001db01b mysql]# pcs constraint --full
Location Constraints:
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory) 
(id:order-ms_drbd0-p_fs_clust01-mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory) 
(id:order-ms_drbd1-p_fs_clust02-mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory) 
(id:order-p_fs_clust01-p_vip_clust01-mandatory)
  start p_vip_clust01 then start p_azip_clust01 (kind:Mandatory) 
(id:order-p_vip_clust01-p_azip_clust01-mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory) 
(id:order-p_fs_clust02-p_vip_clust02-mandatory)
  start p_vip_clust02 then start p_azip_clust02 (kind:Mandatory) 
(id:order-p_vip_clust02-p_azip_clust02-mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory) 
(id:order-p_vip_clust01-p_mysql_001-mandatory)
Colocation Constraints:
  p_azip_clust01 with p_vip_clust01 (score:INFINITY) 
(id:colocation-p_azip_clust01-p_vip_clust01-INFINITY)
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master) 
(id:colocation-p_fs_clust01-ms_drbd0-INFINITY)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master) 
(id:colocation-p_fs_clust02-ms_drbd1-INFINITY)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY) 
(id:colocation-p_vip_clust01-p_fs_clust01-INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY) 
(id:colocation-p_vip_clust02-p_fs_clust02-INFINITY)
  p_azip_clust02 with p_vip_clust02 (score:INFINITY) 
(id:colocation-p_azip_clust02-p_vip_clust02-INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY) 
(id:colocation-p_mysql_001-p_vip_clust01-INFINITY)
Ticket Constraints:





___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Different Times in the Corosync Log?

2018-08-20 Thread Eric Robinson
The corosync log shows different times for lrmd messages than for cib or crmd 
messages. Note the 4 hour difference. What?


Aug 20 13:08:27 [107884] 001store01acib: info: cib_perform_op:  
+  
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_replicator']/lrm_rsc_op[@id='p_replicator_monitor_6']:
  @transition-magic=0:0;9:251:0:283f3d6c-2e91-4f61-95dd-306d3e1eb052, 
@call-id=361, @rc-code=0, @op-status=0, @exec-time=5
Aug 20 13:08:27 [107884] 001store01acib: info: cib_process_request: 
Completed cib_modify operation for section status: OK (rc=0, 
origin=001store01a/crmd/2451, version=0.30.5)
Aug 20 13:08:32 [107884] 001store01acib: info: cib_process_ping:
Reporting our current digest to 001store01b: f33ca1999ceeb68d22f3171f03be2638 
for 0.30.5 (0x55b38bd96280 0)
Aug 20 17:08:45 [107886] 001store01a   lrmd:  warning: 
child_timeout_callback:  p_azip_ftpclust01_monitor_0 process (PID 52488) 
timed out
Aug 20 17:08:45 [107886] 001store01a   lrmd:  warning: operation_finished:  
p_azip_ftpclust01_monitor_0:52488 - timed out after 2ms
Aug 20 13:08:45 [107889] 001store01a   crmd:error: process_lrm_event:   
Result of probe operation for p_azip_ftpclust01 on 001store01a: Timed Out | 
call=359 key=p_azip_ftpclust01_monitor_0 timeout=2ms
Aug 20 13:08:45 [107884] 001store01acib: info: cib_process_request: 
Forwarding cib_modify operation for section status to all 
(origin=local/crmd/2452)
Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:  
Diff: --- 0.30.5 2
Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:  
Diff: +++ 0.30.6 (null)
Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:  
+  /cib:  @num_updates=6
Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:  
+  
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_azip_ftpclust01']/lrm_rsc_op[@id='p_azip_ftpclust01_last_0']:
  @transition-magic=2:1;3:251:7:283f3d6c-2e91-4f61-95dd-306d3e1eb052, 
@call-id=359, @rc-code=1, @op-status=2, @exec-time=20002
Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:  
++ 
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='p_azip_ftpclust01']:

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Different Times in the Corosync Log?

2018-08-21 Thread Eric Robinson
> Hi!
> 
> I could guess that the processes run with different timezone settings (for
> whatever reason).
> 
> Regards,
> Ulrich


That would be my guess, too, but I cannot imagine how they ended up in that 
condition. 
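
If it helps anyone suggest a cause, here is roughly how I plan to compare the 
daemons' environments (a sketch; the PIDs are the ones from the log excerpt):

tr '\0' '\n' < /proc/107884/environ | grep -i '^TZ='   # cib
tr '\0' '\n' < /proc/107886/environ | grep -i '^TZ='   # lrmd
timedatectl                                            # system-wide setting
ls -l /etc/localtime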

> 
> >>> Eric Robinson  schrieb am 21.08.2018 um
> >>> 02:43 in
> Nachricht
>  3.prod.outlook.com>
> 
> > The corosync log show different times for lrmd messages than for cib
> > or crmd
> 
> > messages. Note the 4 hour difference. What?
> >
> >
> > Aug 20 13:08:27 [107884] 001store01acib: info: cib_perform_op:
> 
> >+
> >
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@i
> d='
> > p_replicator']/lrm_rsc_op[@id='p_replicator_monitor_6']:
> > @transition‑magic=0:0;9:251:0:283f3d6c‑2e91‑4f61‑95dd‑306d3e1eb052,
> @call‑id=361,
> > @rc‑code=0, @op‑status=0, @exec‑time=5
> > Aug 20 13:08:27 [107884] 001store01acib: info:
> > cib_process_request: Completed cib_modify operation for section
> > status: OK (rc=0, origin=001store01a/crmd/2451, version=0.30.5)
> > Aug 20 13:08:32 [107884] 001store01acib: info: cib_process_ping:
> 
> >Reporting our current digest to 001store01b:
> > f33ca1999ceeb68d22f3171f03be2638 for 0.30.5 (0x55b38bd96280 0)
> > Aug 20 17:08:45 [107886] 001store01a   lrmd:  warning:
> > child_timeout_callback:  p_azip_ftpclust01_monitor_0 process (PID 52488)
> 
> > timed out
> > Aug 20 17:08:45 [107886] 001store01a   lrmd:  warning:
> > operation_finished:  p_azip_ftpclust01_monitor_0:52488 ‑ timed out
> > after 2ms
> > Aug 20 13:08:45 [107889] 001store01a   crmd:error:
> > process_lrm_event:   Result of probe operation for p_azip_ftpclust01 on
> > 001store01a: Timed Out | call=359 key=p_azip_ftpclust01_monitor_0
> > timeout=2ms
> > Aug 20 13:08:45 [107884] 001store01acib: info:
> > cib_process_request: Forwarding cib_modify operation for section
> > status to all (origin=local/crmd/2452)
> > Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:
> 
> >Diff: ‑‑‑ 0.30.5 2
> > Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:
> 
> >Diff: +++ 0.30.6 (null)
> > Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:
> 
> >+  /cib:  @num_updates=6
> > Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:
> 
> >+
> >
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@i
> d='
> > p_azip_ftpclust01']/lrm_rsc_op[@id='p_azip_ftpclust01_last_0']:
> > @transition‑magic=2:1;3:251:7:283f3d6c‑2e91‑4f61‑95dd‑306d3e1eb052,
> @call‑id=359,
> > @rc‑code=1, @op‑status=2, @exec‑time=20002
> > Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:
> 
> >++
> >
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@i
> d='
> > p_azip_ftpclust01']:   > operation_key="p_azip_ftpclust01_monitor_0" operation="monitor"
> > crm‑debug‑origin="do_update_resource" crm_feature_set="3.0.14"
> > transition‑key="3:251:7:283f3d6c‑2e91‑4f61‑95dd‑306d3e1eb052"
> > transition‑magic="2:1;3:251:7:283f3d6c‑2e91‑4f61‑95dd‑306d3e1eb052"
> > exit‑reason="" on_node="001
> > Aug 20 13:08:45 [107884] 001store01acib: info:
> > cib_process_request: Completed cib_modify operation for section
> > status: OK (rc=0, origin=001store01a/crmd/2452, version=0.30.6)
> > Aug 20 13:08:45 [107889] 001store01a   crmd: info: do_lrm_rsc_op:
> 
> >Performing key=3:252:0:283f3d6c‑2e91‑4f61‑95dd‑306d3e1eb052
> > op=p_azip_ftpclust01_stop_0
> > Aug 20 13:08:45 [107884] 001store01acib: info:
> > cib_process_request: Forwarding cib_modify operation for section
> > status to all (origin=local/crmd/2453)
> > Aug 20 17:08:45 [107886] 001store01a   lrmd: info: log_execute:
> > executing ‑ rsc:p_azip_ftpclust01 action:stop call_id:362
> > Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:
> 
> >Diff: ‑‑‑ 0.30.6 2
> > Aug 20 13:08:45 [107884] 001store01acib: info: cib_perform_op:
> 
> >Diff: +++ 0.30.7 (null)
> >
> > [sig]
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Different Times in the Corosync Log?

2018-08-21 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Jan Pokorný
> Sent: Tuesday, August 21, 2018 2:45 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Different Times in the Corosync Log?
> 
> On 21/08/18 08:43 +, Eric Robinson wrote:
> >> I could guess that the processes run with different timezone settings
> >> (for whatever reason).
> >
> > That would be my guess, too, but I cannot imagine how they ended up in
> > that condition.
> 
> Hard to guess, the PIDs indicate the expected state of covering a very short
> interval sequentially (i.e. no intermittent failure recovered with a restart 
> of
> lrmd, AFAICT).  In case it can have any bearing, how do you start pacemaker --
> systemd, initscript, as a corosync plugin, something else?

Depends on how new the cluster is. With these, I start it with 'pcs cluster 
start'.

> 
> --
> Nazdar,
> Jan (Poki)
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Different Times in the Corosync Log?

2018-08-21 Thread Eric Robinson
> 
> Whoa, I think you win some sort of fubar prize. :-)

It's always nice to feel special. 

> 
> AFAIK, any OS-level time or timezone change affects all processes equally. (I
> occasionally deal with cluster logs where the OS time jumped backward or
> forward, and all logs system-wide are equally
> affected.)
> 

Except when you're visiting insane-world, which I seem to be. 

> Some applications have their own timezone setting that can override the
> system default, but pacemaker isn't one of them. It's even more bizarre when
> you consider that the daemons here are the children of the same process
> (pacemakerd), and thus have an identical set of environment variables and so
> forth. (And as Jan pointed out, they appear to have been started within a
> fraction of a second of each other.)
> 
> Apparently there is a dateshift kernel module that can put particular 
> processes
> in different apparent times, but I assume you'd know if you did that on 
> purpose.
> :-) It does occur to me that the module would be a great prank to play on
> someone (especially combined with a cron job that randomly altered the
> configuration).
> 
> If you figure this out, I'd love to hear what it was. Gremlins ...

You'll be the second to know after me!

> 
> On Tue, 2018-08-21 at 11:45 +0200, Jan Pokorný wrote:
> > On 21/08/18 08:43 +, Eric Robinson wrote:
> > > > I could guess that the processes run with different timezone
> > > > settings (for whatever reason).
> > >
> > > That would be my guess, too, but I cannot imagine how they ended up
> > > in that condition.
> >
> > Hard to guess, the PIDs indicate the expected state of covering a very
> > short interval sequentially (i.e. no intermittent failure recovered
> > with a restart of lrmd, AFAICT).  In case it can have any bearing, how
> > do you start pacemaker -- systemd, initscript, as a corosync plugin,
> > something else?
> --
> Ken Gaillot 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Increasing Token Timeout Safe By Itself?

2019-01-20 Thread Eric Robinson
I have a few corosync+pacemaker clusters in Azure. Occasionally, cluster nodes 
failover, possibly because of intermittent connectivity loss, but more likely 
because one or more nodes experiences high load and is not able to respond in a 
timely fashion. I want to make the clusters a little more resilient to such 
conditions (i.e., allow clusters more time to recover naturally before failing 
over). Is it a simple matter of increasing the totem.token timeout from the 
default value? Or are there other things that should be changed as well? And 
once the value is increased, how do I make it active without restarting the 
cluster?

--Eric



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Increasing Token Timeout Safe By Itself?

2019-01-22 Thread Eric Robinson
> -Original Message-
> From: Jan Friesse 
> Sent: Sunday, January 20, 2019 11:57 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Eric Robinson 
> Subject: Re: [ClusterLabs] Increasing Token Timeout Safe By Itself?
> 
> Eric Robinson wrote:
> > I have a few corosync+pacemaker clusters in Azure. Occasionally,
> > cluster nodes failover, possibly because of intermittent connectivity
> > loss, but more likely because one or more nodes experiences high load
> > and is not able to respond in a timely fashion. I want to make the
> > clusters a little more resilient to such conditions (i.e., allow
> > clusters more time to recover naturally before failing over). Is it a
> > simple matter of increasing the totem.token timeout from the default
> > value? Or are
> there other things that should be changes as well? And once the value is
> increased, how do I make it
> 
> Usually it is really enough to increase totem.token. Used token timeout is
> computed based on this value (see corosync.conf man page for more
> details). It's possible to get used value by executing "corosync-cmapctl
>   -g runtime.config.totem.token" command.
> 
> active without restarting the cluster?
> 
> You can either edit the config file (ideally on all nodes) and exec "corosync-cfgtool
> "corosync-cfgtool
> -R" (just on one node) or you can use "corosync-cmapctl  -s totem.token u32
> $REQUIRED_VALUE" (ideally on all nodes). Also pcs/crmshell may also
> support this functionality.
> 
> Honza
> 

Thank you very much for the feedback!
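
For the record, the change I plan to make looks roughly like this (a sketch; the 
10000 ms token value is just an example, not a recommendation):

# /etc/corosync/corosync.conf, edited on all nodes
totem {
    ...
    token: 10000
}

# push the new config without restarting the cluster (run once)
corosync-cfgtool -R

# verify the value corosync is actually using
corosync-cmapctl -g runtime.config.totem.token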

--Eric
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Eric Robinson




> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Wednesday, February 20, 2019 8:51 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
> 
> 20.02.2019 21:51, Eric Robinson wrote:
> >
> > The following should show OK in a fixed font like Consolas, but the
> following setup is supposed to be possible, and is even referenced in the
> ClusterLabs documentation.
> >
> >
> >
> >
> >
> > +--+
> >
> > |   mysql001   +--+
> >
> > +--+  |
> >
> > +--+  |
> >
> > |   mysql002   +--+
> >
> > +--+  |
> >
> > +--+  |   +-+   ++   +--+
> >
> > |   mysql003   +->+ floating ip +-->+ filesystem +-->+ blockdev |
> >
> > +--+  |   +-+   ++   +--+
> >
> > +--+  |
> >
> > |   mysql004   +--+
> >
> > +--+  |
> >
> > +--+  |
> >
> > |   mysql005   +--+
> >
> > +--+
> >
> >
> >
> > In the layout above, the MySQL instances are dependent on the same
> underlying service stack, but they are not dependent on each other.
> Therefore, as I understand it, the failure of one MySQL instance should not
> cause the failure of other MySQL instances if on-fail=ignore or on-fail=stop is set. At
> least, that’s the way it seems to me, but based on the thread, I guess it does
> not behave that way.
> >
> 
> This works this way for monitor operation if you set on-fail=block.
> Failed resource is left "as is". The only case when it does not work seems to
> be stop operation; even with explicit on-fail=block it still attempts to 
> initiate
> follow up actions. I still consider this a bug.
> 
> If this is not a bug, this needs clear explanation in documentation.
> 
> But please understand that assuming on-fail=block works you effectively
> reduce your cluster to controlled start of resources during boot. As we have

Or failover, correct?

> seen, stopping of resource IP is blocked, meaning pacemaker also cannot
> perform resource level recovery at all. And for mysql resources you explicitly
> ignore any result of monitoring or failure to stop it.
> And not having stonith also prevents pacemaker from handling node failure.
> What leaves is at most restart of resources on another node during graceful
> shutdown.
> 
> It begs a question - what do you need such "cluster" for at all?

Mainly to manage the other relevant resources: drbd, filesystem, and floating 
IP. I'm content to forego resource level recovery for MySQL services and 
monitor their health from outside the cluster and remediate them manually if 
necessary. I don't see another option if I want to avoid the sort of deadlock 
situation we talked about earlier. 

> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Eric Robinson








> -Original Message-

> From: Users  On Behalf Of Ulrich Windl

> Sent: Tuesday, February 19, 2019 11:35 PM

> To: users@clusterlabs.org

> Subject: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When

> Just One Fails?

>

> >>> Eric Robinson mailto:eric.robin...@psmnv.com>> 
> >>> schrieb am 19.02.2019 um

> >>> 21:06 in

> Nachricht

> mailto:mn2pr03mb4845be22fada30b472174b79fa...@mn2pr03mb4845.namprd03.prod.outlook.com>

> d03.prod.outlook.com<mailto:mn2pr03mb4845be22fada30b472174b79fa...@mn2pr03mb4845.namprd03.prod.outlook.com>>

>

> >>  -Original Message-

> >> From: Users 
> >> mailto:users-boun...@clusterlabs.org>> On 
> >> Behalf Of Ken Gaillot

> >> Sent: Tuesday, February 19, 2019 10:31 AM

> >> To: Cluster Labs - All topics related to open-source clustering

> >> welcomed mailto:users@clusterlabs.org>>

> >> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just

> >> One Fails?

> >>

> >> On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote:

> >> > > -Original Message-

> >> > > From: Users 
> >> > > mailto:users-boun...@clusterlabs.org>> 
> >> > > On Behalf Of Andrei

> >> > > Borzenkov

> >> > > Sent: Sunday, February 17, 2019 11:56 AM

> >> > > To: users@clusterlabs.org<mailto:users@clusterlabs.org>

> >> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When

> >> > > Just One Fails?

> >> > >

> >> > > 17.02.2019 0:44, Eric Robinson пишет:

> >> > > > Thanks for the feedback, Andrei.

> >> > > >

> >> > > > I only want cluster failover to occur if the filesystem or drbd

> >> > > > resources fail,

> >> > >

> >> > > or if the cluster messaging layer detects a complete node failure.

> >> > > Is there a

> >> > > way to tell PaceMaker not to trigger a cluster failover if any of

> >> > > the p_mysql resources fail?

> >> > > >

> >> > >

> >> > > Let's look at this differently. If all these applications depend

> >> > > on each other, you should not be able to stop individual resource

> >> > > in the first place - you need to group them or define dependency

> >> > > so that stopping any resource would stop everything.

> >> > >

> >> > > If these applications are independent, they should not share

> >> > > resources.

> >> > > Each MySQL application should have own IP and own FS and own

> >> > > block device for this FS so that they can be moved between

> >> > > cluster nodes independently.

> >> > >

> >> > > Anything else will lead to troubles as you already observed.

> >> >

> >> > FYI, the MySQL services do not depend on each other. All of them

> >> > depend on the floating IP, which depends on the filesystem, which

> >> > depends on DRBD, but they do not depend on each other. Ideally, the

> >> > failure of p_mysql_002 should not cause failure of other mysql

> >> > resources, but now I understand why it happened. Pacemaker wanted

> >> > to start it on the other node, so it needed to move the floating

> >> > IP, filesystem, and DRBD primary, which had the cascade effect of

> >> > stopping the other MySQL resources.

> >> >

> >> > I think I also understand why the p_vip_clust01 resource blocked.

> >> >

> >> > FWIW, we've been using Linux HA since 2006, originally Heartbeat,

> >> > but then Corosync+Pacemaker. The past 12 years have been relatively

> >> > problem free. This symptom is new for us, only within the past year.

> >> > Our cluster nodes have many separate instances of MySQL running, so

> >> > it is not practical to have that many filesystems, IPs, etc. We are

> >> > content with the way things are, except for this new troubling

> >> > behavior.

> >> >

> >> > If I understand the thread correctly, op-fail=stop will not work

> >> > because the cluster will still try to stop the resources that are

> >> > implied dependencies.

> >> >

> >> > Bottom line is, how do we configure the cluster in such a way that

> >> > there are no cascading circumsta

Re: [ClusterLabs] Simulate Failure Behavior

2019-02-22 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Ken Gaillot
> Sent: Friday, February 22, 2019 5:06 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Simulate Failure Behavior
> 
> On Sat, 2019-02-23 at 00:28 +, Eric Robinson wrote:
> > I want to mess around with different on-fail options and see how the
> > cluster responds. I’m looking through the documentation, but I don’t
> > see a way to simulate resource failure and observe behavior without
> > actually failing over the node. Isn’t there a way to have the cluster
> > MODEL failure and simply report what it WOULD do?
> >
> > --Eric
> 
> Yes, appropriately enough it is called crm_simulate :)

Thanks. I knew about crm_simulate, but I thought that was really old stuff and 
might not apply in the pcs world. 
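
For the archives, the kind of check I have in mind looks something like this (a 
sketch pieced together from the wiki page and man page, so the exact option 
spellings may vary by version; rc=7 means "not running"):

crm_simulate --live-check --simulate --show-scores \
    --op-inject=p_mysql_002_monitor_15000@001db01a=7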

> 
> The documentation is not exactly great, but you see:
> 
> https://wiki.clusterlabs.org/wiki/Using_crm_simulate
> 
> along with the man page and:
> 
> http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-
> single/Pacemaker_Administration/index.html#s-config-testing-changes
> 
> --
> Ken Gaillot 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Simulate Failure Behavior

2019-02-22 Thread Eric Robinson
I want to mess around with different on-fail options and see how the cluster 
responds. I'm looking through the documentation, but I don't see a way to 
simulate resource failure and observe behavior without actually failing over 
the node. Isn't there a way to have the cluster MODEL failure and simply report 
what it WOULD do?

--Eric



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
: p_mysql_004 (class=lsb type=mysql_004)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_004-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_004-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_004-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_004-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_004-stop-interval-0s)
Resource: p_mysql_005 (class=lsb type=mysql_005)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_005-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_005-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_005-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_005-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_005-stop-interval-0s)
Resource: p_mysql_006 (class=lsb type=mysql_006)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_006-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_006-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_006-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_006-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_006-stop-interval-0s)
Resource: p_mysql_007 (class=lsb type=mysql_007)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_007-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_007-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_007-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_007-start-interval-0s)
 stop interval=0s timeout=15 (p_mysql_007-stop-interval-0s)
Resource: p_mysql_008 (class=lsb type=mysql_008)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_008-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_008-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_008-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_008-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_008-stop-interval-0s)
Resource: p_mysql_622 (class=lsb type=mysql_622)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_622-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_622-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_622-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_622-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_622-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: p_vip_clust02
Enabled on: 001db01b (score:INFINITY) (role: Started) 
(id:cli-prefer-p_vip_clust02)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

Alerts:
No alerts defined

Resources Defaults:
resource-stickiness: 100
Operations Defaults:
No defaults set

Cluster Properties:
cluster-infrastructure: corosync
cluster-name: 001db01ab
dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
have-watchdog: false
last-lrm-refresh: 1550347798
maintenance-mode: false
no-quorum-policy: ignore
stonith-enabled: false

--Eric


From: Users  On Behalf Of Eric Robinson
Sent: Saturday, February 16, 2019 12:34 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: [ClusterLabs] Why Do All The Services Go

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
Here are the relevant corosync logs.

It appears that the stop action for resource p_mysql_002 failed, and that 
caused a cascading series of service changes. However, I don't understand why, 
since no other resources are dependent on p_mysql_002.

[root@001db01a cluster]# cat corosync_filtered.log
Feb 16 14:06:24 [3908] 001db01acib: info: cib_process_request:  
Forwarding cib_apply_diff operation for section 'all' to all 
(origin=local/cibadmin/2)
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   Diff: 
--- 0.345.30 2
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   Diff: 
+++ 0.346.0 cc0da1b030418ec8b7c72db1115e2af1
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   +  
/cib:  @epoch=346, @num_updates=0
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   ++ 
/cib/configuration/resources/primitive[@id='p_mysql_002']:  
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   ++  
 
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   ++  
   
Feb 16 14:06:24 [3908] 001db01acib: info: cib_process_request:  
Completed cib_apply_diff operation for section 'all': OK (rc=0, 
origin=001db01a/cibadmin/2, version=0.346.0)
Feb 16 14:06:24 [3913] 001db01a   crmd: info: abort_transition_graph:   
Transition aborted by meta_attributes.p_mysql_002-meta_attributes 'create': 
Configuration change | cib=0.346.0 source=te_update_diff:456 
path=/cib/configuration/resources/primitive[@id='p_mysql_002'] complete=true
Feb 16 14:06:24 [3913] 001db01a   crmd:   notice: do_state_transition:  
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph
Feb 16 14:06:24 [3912] 001db01apengine:   notice: unpack_config:On loss 
of CCM Quorum: Ignore
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_online_status:  
Node 001db01b is online
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_online_status:  
Node 001db01a is online
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_drbd0:0 active in master mode on 001db01b
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_drbd1:0 active on 001db01b
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_004 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_005 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_drbd0:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_drbd1:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_001 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: unpack_node_loop: Node 2 
is already processed
Feb 16 14:06:24 [3912] 001db01apengine: info: unpack_node_loop: Node 1 
is already processed
Feb 16 14:06:24 [3912] 001db01apengine: info: unpack_node_loop: Node 2 
is already processed
Feb 16 14:06:24 [3912] 001db01apengine: info: unpack_node_loop: Node 1 
is already processed
Feb 16 14:06:24 [3912] 001db01apengine: info: common_print: 
p_vip_clust01   (ocf::heartbeat:IPaddr2):   Started 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: clone_print:   
Master/Slave Set: ms_drbd0 [p_drbd0]
Feb 16 14:06:24 [3912] 001db01apengine: info: short_print:   
Masters: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01apengine: info: short_print:   
Slaves: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01apengine: info: clone_print:   
Master/Slave Set: ms_drbd1 [p_drbd1]
Feb 16 14:06:24 [3912] 001db01apengine: info: short_print:   
Masters: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01apengine: info: short_print:   
Slaves: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01apengine: info: common_print: 
p_fs_clust01(ocf::heartbeat:Filesystem):Started 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: common_print: 
p_fs_clust02(ocf::heartbeat:Filesystem):Started 001db01b
Feb 16 14:06:24 [3912] 001db01apengine: 

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
Thanks for the feedback, Andrei.

I only want cluster failover to occur if the filesystem or drbd resources fail, 
or if the cluster messaging layer detects a complete node failure. Is there a 
way to tell PaceMaker not to trigger a cluster failover if any of the p_mysql 
resources fail?  

> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Saturday, February 16, 2019 1:34 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> 17.02.2019 0:44, Eric Robinson wrote:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and that
> caused a cascading series of service changes. However, I don't understand
> why, since no other resources are dependent on p_mysql_002.
> >
> 
> You have mandatory colocation constraints for each SQL resource with VIP. it
> means that to move SQL resource to another node pacemaker also must
> move VIP to another node which in turn means it needs to move all other
> dependent resources as well.
> ...
> > Feb 16 14:06:39 [3912] 001db01apengine:  warning:
> check_migration_threshold:Forcing p_mysql_002 away from 001db01a
> after 100 failures (max=100)
> ...
> > Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
> > Stop
> p_vip_clust01 (   001db01a )   blocked
> ...
> > Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
> > Stop
> p_mysql_001   (   001db01a )   due to colocation with 
> p_vip_clust01
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
> On Sat, Feb 16, 2019 at 09:33:42PM +0000, Eric Robinson wrote:
> > I just noticed that. I also noticed that the lsb init script has a
> > hard-coded stop timeout of 30 seconds. So if the init script waits
> > longer than the cluster resource timeout of 15s, that would cause the
> 
> Yes, you should use higher timeouts in pacemaker (45s for example).
> 
> > resource to fail. However, I don't want cluster failover to be
> > triggered by the failure of one of the MySQL resources. I only want
> > cluster failover to occur if the filesystem or drbd resources fail, or
> > if the cluster messaging layer detects a complete node failure. Is
> > there a way to tell Pacemaker not to trigger cluster failover if any
> > of the p_mysql resources fail?
> 
> You can try playing with the on-fail option but I'm not sure how reliably this
> whole setup will work without some form of fencing/stonith.
> 
> https://clusterlabs.org/pacemaker/doc/en-
> US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html

Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what 
I'm looking for, at least for the MySQL resources. 
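
If I'm reading the docs right, on-fail is an operation option, so I'd expect to 
set it on the monitor op with something like the following (pcs syntax from 
memory, interval just an example -- not yet tested):

# tell the cluster to ignore monitor failures for this resource
pcs resource update p_mysql_002 op monitor interval=30s on-fail=ignore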

> 
> --
> Valentin
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
These are the resources on our cluster.

[root@001db01a ~]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01a (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with 
quorum
Last updated: Sat Feb 16 15:24:55 2019
Last change: Sat Feb 16 15:10:21 2019 by root via cibadmin on 001db01b

2 nodes configured
18 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

p_vip_clust01  (ocf::heartbeat:IPaddr2):   Started 001db01a
Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db01a ]
 Slaves: [ 001db01b ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db01b ]
 Slaves: [ 001db01a ]
p_fs_clust01   (ocf::heartbeat:Filesystem):Started 001db01a
p_fs_clust02   (ocf::heartbeat:Filesystem):Started 001db01b
p_vip_clust02  (ocf::heartbeat:IPaddr2):   Started 001db01b
p_mysql_001(lsb:mysql_001):Started 001db01a
p_mysql_000(lsb:mysql_000):Started 001db01a
p_mysql_002(lsb:mysql_002):Started 001db01a
p_mysql_003(lsb:mysql_003):Started 001db01a
p_mysql_004(lsb:mysql_004):Started 001db01a
p_mysql_005(lsb:mysql_005):Started 001db01a
p_mysql_006(lsb:mysql_006):Started 001db01b
p_mysql_007(lsb:mysql_007):Started 001db01b
p_mysql_008(lsb:mysql_008):Started 001db01b
p_mysql_622(lsb:mysql_622):Started 001db01a

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Why is it that when one of the resources that start with p_mysql_* goes into a 
FAILED state, all the other MySQL services also stop?

[root@001db01a ~]# pcs constraint
Location Constraints:
  Resource: p_vip_clust02
Enabled on: 001db01b (score:INFINITY) (role: Started)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:
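
(In case it helps with the diagnosis, I've also been looking at the placement 
scores with crm_simulate -- as I understand it, -L uses the live cluster state 
and -s shows the scores:)

crm_simulate -sL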

--Eric





___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Saturday, February 16, 2019 1:28 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> On Sat, Feb 16, 2019 at 09:03:43PM +0000, Eric Robinson wrote:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and
> > that caused a cascading series of service changes. However, I don't
> > understand why, since no other resources are dependent on p_mysql_002.
> 
> The stop failed because of a timeout (15s), so you can try to update that
> value:
> 


I just noticed that. I also noticed that the lsb init script has a hard-coded 
stop timeout of 30 seconds. So if the init script waits longer than the cluster 
resource timeout of 15s, that would cause the resource to fail. However, I 
don't want cluster failover to be triggered by the failure of one of the MySQL 
resources. I only want cluster failover to occur if the filesystem or drbd 
resources fail, or if the cluster messaging layer detects a complete node 
failure. Is there a way to tell Pacemaker not to trigger cluster failover if 
any of the p_mysql resources fail?  
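
If raising the timeout turns out to be the right fix, I assume it would be 
something like this (45s per Valentin's suggestion; pcs syntax from memory, not 
verified):

# interval=0s identifies the stop operation; only the timeout changes
pcs resource update p_mysql_002 op stop interval=0s timeout=45s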


>   Result of stop operation for p_mysql_002 on 001db01a: Timed Out |
> call=1094 key=p_mysql_002_stop_0 timeout=15000ms
> 
> After the stop failed it should have fenced that node, but you don't have
> fencing configured so it tries to move mysql_002 and all the other resources
> related to it (vip, fs, drbd) to the other node.
> Since other mysql resources depend on the same (vip, fs, drbd) they need to
> be stopped first.
> 
> --
> Valentin
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
I'm looking through the docs but I don't see how to set the on-fail value for a 
resource. 


> -Original Message-
> From: Users  On Behalf Of Eric Robinson
> Sent: Saturday, February 16, 2019 1:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> > On Sat, Feb 16, 2019 at 09:33:42PM +0000, Eric Robinson wrote:
> > > I just noticed that. I also noticed that the lsb init script has a
> > > hard-coded stop timeout of 30 seconds. So if the init script waits
> > > longer than the cluster resource timeout of 15s, that would cause
> > > the
> >
> > Yes, you should use higher timeouts in pacemaker (45s for example).
> >
> > > resource to fail. However, I don't want cluster failover to be
> > > triggered by the failure of one of the MySQL resources. I only want
> > > cluster failover to occur if the filesystem or drbd resources fail,
> > > or if the cluster messaging layer detects a complete node failure.
> > > Is there a way to tell Pacemaker not to trigger cluster failover if
> > > any of the p_mysql resources fail?
> >
> > You can try playing with the on-fail option but I'm not sure how
> > reliably this whole setup will work without some form of fencing/stonith.
> >
> > https://clusterlabs.org/pacemaker/doc/en-
> >
> US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
> 
> Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what 
> I'm
> looking for, at least for the MySQL resources.
> 
> >
> > --
> > Valentin
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Ken Gaillot
> Sent: Tuesday, February 19, 2019 10:31 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote:
> > > -Original Message-
> > > From: Users  On Behalf Of Andrei
> > > Borzenkov
> > > Sent: Sunday, February 17, 2019 11:56 AM
> > > To: users@clusterlabs.org
> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> > > One Fails?
> > >
> > > 17.02.2019 0:44, Eric Robinson wrote:
> > > > Thanks for the feedback, Andrei.
> > > >
> > > > I only want cluster failover to occur if the filesystem or drbd
> > > > resources fail,
> > >
> > > or if the cluster messaging layer detects a complete node failure.
> > > Is there a
> > > way to tell Pacemaker not to trigger a cluster failover if any of
> > > the p_mysql resources fail?
> > > >
> > >
> > > Let's look at this differently. If all these applications depend on
> > > each other, you should not be able to stop individual resource in
> > > the first place - you need to group them or define dependency so
> > > that stopping any resource would stop everything.
> > >
> > > If these applications are independent, they should not share
> > > resources.
> > > Each MySQL application should have own IP and own FS and own block
> > > device for this FS so that they can be moved between cluster nodes
> > > independently.
> > >
> > > Anything else will lead to troubles as you already observed.
> >
> > FYI, the MySQL services do not depend on each other. All of them
> > depend on the floating IP, which depends on the filesystem, which
> > depends on DRBD, but they do not depend on each other. Ideally, the
> > failure of p_mysql_002 should not cause failure of other mysql
> > resources, but now I understand why it happened. Pacemaker wanted to
> > start it on the other node, so it needed to move the floating IP,
> > filesystem, and DRBD primary, which had the cascade effect of stopping
> > the other MySQL resources.
> >
> > I think I also understand why the p_vip_clust01 resource blocked.
> >
> > FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
> > then Corosync+Pacemaker. The past 12 years have been relatively
> > problem free. This symptom is new for us, only within the past year.
> > Our cluster nodes have many separate instances of MySQL running, so it
> > is not practical to have that many filesystems, IPs, etc. We are
> > content with the way things are, except for this new troubling
> > behavior.
> >
> > If I understand the thread correctly, on-fail=stop will not work
> > because the cluster will still try to stop the resources that are
> > implied dependencies.
> >
> > Bottom line is, how do we configure the cluster in such a way that
> > there are no cascading circumstances when a MySQL resource fails?
> > Basically, if a MySQL resource fails, it fails. We'll deal with that
> > on an ad-hoc basis. I don't want the whole cluster to barf. What about
> > on-fail=ignore? Earlier, you suggested symmetrical=false might also do
> > the trick, but you said it comes with its own can of worms.
> > What are the downsides with on-fail=ignore or symmetrical=false?
> >
> > --Eric
> 
> Even adding on-fail=ignore to the recurring monitors may not do what you
> want, because I suspect that even an ignored failure will make the node less
> preferable for all the other resources. But it's worth testing.
> 
> Otherwise, your best option is to remove all the recurring monitors from the
> mysql resources, and rely on external monitoring (e.g. nagios, icinga, monit,
> ...) to detect problems.

This is probably a dumb question, but can we remove just the monitor operation 
but leave the resource configured in the cluster? If a node fails over, we do 
want the resources to start automatically on the new primary node.
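
If removing just the monitor is the way to go, I'm guessing it would be 
something like the following, leaving start/stop in place so failover still 
works (not tested; the interval may need to be specified if there are several 
monitors):

pcs resource op remove p_mysql_001 monitor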

> --
> Ken Gaillot 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Sunday, February 17, 2019 11:56 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> 17.02.2019 0:44, Eric Robinson wrote:
> > Thanks for the feedback, Andrei.
> >
> > I only want cluster failover to occur if the filesystem or drbd resources 
> > fail,
> or if the cluster messaging layer detects a complete node failure. Is there a
> way to tell Pacemaker not to trigger a cluster failover if any of the p_mysql
> resources fail?
> >
> 
> Let's look at this differently. If all these applications depend on each 
> other,
> you should not be able to stop individual resource in the first place - you
> need to group them or define dependency so that stopping any resource
> would stop everything.
> 
> If these applications are independent, they should not share resources.
> Each MySQL application should have own IP and own FS and own block
> device for this FS so that they can be moved between cluster nodes
> independently.
> 
> Anything else will lead to troubles as you already observed.

FYI, the MySQL services do not depend on each other. All of them depend on the 
floating IP, which depends on the filesystem, which depends on DRBD, but they 
do not depend on each other. Ideally, the failure of p_mysql_002 should not 
cause failure of other mysql resources, but now I understand why it happened. 
Pacemaker wanted to start it on the other node, so it needed to move the 
floating IP, filesystem, and DRBD primary, which had the cascade effect of 
stopping the other MySQL resources.

I think I also understand why the p_vip_clust01 resource blocked. 

FWIW, we've been using Linux HA since 2006, originally Heartbeat, but then 
Corosync+Pacemaker. The past 12 years have been relatively problem free. This 
symptom is new for us, only within the past year. Our cluster nodes have many 
separate instances of MySQL running, so it is not practical to have that many 
filesystems, IPs, etc. We are content with the way things are, except for this 
new troubling behavior.

If I understand the thread correctly, on-fail=stop will not work because the 
cluster will still try to stop the resources that are implied dependencies.

Bottom line is, how do we configure the cluster in such a way that there are no 
cascading circumstances when a MySQL resource fails? Basically, if a MySQL 
resource fails, it fails. We'll deal with that on an ad-hoc basis. I don't want 
the whole cluster to barf. What about on-fail=ignore? Earlier, you suggested 
symmetrical=false might also do the trick, but you said it comes with its own 
can of worms. What are the downsides with on-fail=ignore or symmetrical=false?

--Eric






> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stupid DRBD/LVM Global Filter Question

2019-10-30 Thread Eric Robinson
Roger --

Thank you, sir. That does help.

-Original Message-
From: Roger Zhou 
Sent: Wednesday, October 30, 2019 2:56 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] Stupid DRBD/LVM Global Filter Question


On 10/30/19 6:17 AM, Eric Robinson wrote:
> If I have an LV as a backing device for a DRBD disk, can someone
> explain why I need an LVM filter? It seems to me that we would want
> the LV to be always active under both the primary and secondary DRBD
> devices, and there should be no need or desire to have the LV
> activated or deactivated by Pacemaker. What am I missing?

Your understanding is correct. No need to use LVM resource agent from Pacemaker 
in your case.

--Roger

>
> --Eric
>
> Disclaimer : This email and any files transmitted with it are
> confidential and intended solely for intended recipients. If you are
> not the named addressee you should not disseminate, distribute, copy
> or alter this email. Any views or opinions presented in this email are
> solely those of the author and might not represent those of Physician
> Select Management. Warning: Although Physician Select Management has
> taken reasonable precautions to ensure no viruses are present in this
> email, the company cannot accept responsibility for any loss or damage
> arising from the use of this email or attachments.
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Stupid DRBD/LVM Global Filter Question

2019-10-29 Thread Eric Robinson
If I have an LV as a backing device for a DRBD disk, can someone explain why I 
need an LVM filter? It seems to me that we would want the LV to be always 
active under both the primary and secondary DRBD devices, and there should be 
no need or desire to have the LV activated or deactivated by Pacemaker. What am 
I missing?
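
For context, the kind of filter I keep seeing recommended looks something like 
this in /etc/lvm/lvm.conf (an illustrative sketch only -- it keeps LVM from 
scanning the DRBD devices themselves), and I'm trying to understand whether 
anything like it is actually required in this layout:

devices {
    # reject DRBD devices from LVM scans, accept everything else (example only)
    global_filter = [ "r|^/dev/drbd.*|", "a|.*|" ]
}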

--Eric




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson

> -Original Message-
> From: Users  On Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov ; users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
>  wrote:
> >05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb
> >> 5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer
> >with id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> >-> S_POLICY_ENGINE
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> >Quorum: Ignore
> >>
> >> From 001db01b:
> >>
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> >(10.51.14.34:960) was formed. Members left: 1
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC
> >(001db01a) is dead
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive
> >the leave message. failed: 1
> >> Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2 Feb
> >> 5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer
> >with id=1 and/or uname=001db01a from the membership cache
> >> Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> >S_NOT_DC -> S_ELECTION
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is
> >now lost
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is
> >now lost
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer
> >001db01a
> >> Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with
> >id=1 and/or uname=00

[ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
The two servers 001db01a and 001db01b were up and responsive. Neither had been 
rebooted and neither were under heavy load. There's no indication in the logs 
of loss of network connectivity. Any ideas on why both nodes seem to think the 
other one is at fault?

(Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option 
at this time.)

Log from 001db01a:

Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming 
new configuration.
Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership 
(10.51.14.33:960) was formed. Members left: 2
Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave 
message. failed: 2
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b attributes 
for peer loss
Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or 
uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with id=2 and/or 
uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be 
down
Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b state is now 
lost
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not 
matched
Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
Feb  5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service 
synchronization, ready to provide service.
Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with id=2 
and/or uname=001db01b from the membership cache
Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b state is now 
lost
Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is now lost
Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be 
down
Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not 
matched
Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM Quorum: Ignore

>From 001db01b:

Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership 
(10.51.14.34:960) was formed. Members left: 1
Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC (001db01a) is 
dead
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is now 
lost
Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave 
message. failed: 1
Feb  5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
Feb  5 08:01:03 001db01b corosync[1455]: [MAIN  ] Completed service 
synchronization, ready to provide service.
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with id=1 
and/or uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b pacemakerd[1678]:  notice: Node 001db01a state is now 
lost
Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_NOT_DC -> 
S_ELECTION
Feb  5 08:01:03 001db01b crmd[1693]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Removing all 001db01a attributes 
for peer loss
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer 001db01a
Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer with id=1 and/or 
uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION -> 
S_INTEGRATION
Feb  5 08:01:03 001db01b cib[1688]:  notice: Node 001db01a state is now lost
Feb  5 08:01:03 001db01b cib[1688]:  notice: Purged 1 peer with id=1 and/or 
uname=001db01a from the membership cache
Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: [cib_diff_notify] Patch 
aborted: Application of an update diff failed (-206)
Feb  5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in 
state S_INTEGRATION from do_election_check
Feb  5 08:01:03 001db01b pengine[1692]:  notice: On loss of CCM Quorum: Ignore


-Eric



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson




> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Wednesday, February 5, 2020 12:14 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> 05.02.2020 20:55, Eric Robinson wrote:
> > The two servers 001db01a and 001db01b were up and responsive. Neither
> had been rebooted and neither were under heavy load. There's no indication
> in the logs of loss of network connectivity. Any ideas on why both nodes
> seem to think the other one is at fault?
>
> The very fact that nodes lost connection to each other *is* indication of
> network problems. Your logs start too late, after any problem already
> happened.
>

All the log messages before those are just normal repetitive stuff that always 
gets logged, even during normal production. The snippet I provided shows the 
first indication of anything unusual. Also, there is no other indication of 
network connectivity loss, and both servers are in Azure.

> >
> > (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an
> > option at this time.)
> >
> > Log from 001db01a:
> >
> > Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> forming new configuration.
> > Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> > (10.51.14.33:960) was formed. Members left: 2 Feb  5 08:01:03 001db01a
> > corosync[1306]: [TOTEM ] Failed to receive the leave message. failed:
> > 2 Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state
> > is now lost Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing
> > all 001db01b attributes for peer loss Feb  5 08:01:03 001db01a
> > cib[1522]:  notice: Node 001db01b state is now lost Feb  5 08:01:03
> > 001db01a cib[1522]:  notice: Purged 1 peer with id=2 and/or
> > uname=001db01b from the membership cache Feb  5 08:01:03 001db01a
> > attrd[1525]:  notice: Purged 1 peer with id=2 and/or uname=001db01b
> > from the membership cache Feb  5 08:01:03 001db01a crmd[1527]:
> > warning: No reason to expect node 2 to be down Feb  5 08:01:03 001db01a
> stonith-ng[1523]:  notice: Node 001db01b state is now lost Feb  5 08:01:03
> 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not matched
> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb  5
> 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> > Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer with
> > id=2 and/or uname=001db01b from the membership cache Feb  5 08:01:03
> > 001db01a pacemakerd[1491]:  notice: Node 001db01b state is now lost
> > Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> > -> S_POLICY_ENGINE Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node
> > 001db01b state is now lost Feb  5 08:01:03 001db01a crmd[1527]:
> > warning: No reason to expect node 2 to be down Feb  5 08:01:03
> > 001db01a crmd[1527]:  notice: Stonith/shutdown of 001db01b not matched
> > Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> > Quorum: Ignore
> >
> > From 001db01b:
> >
> > Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> > (10.51.14.34:960) was formed. Members left: 1 Feb  5 08:01:03 001db01b
> > crmd[1693]:  notice: Our peer on the DC (001db01a) is dead Feb  5
> > 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a state is
> > now lost Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to
> > receive the leave message. failed: 1 Feb  5 08:01:03 001db01b
> corosync[1455]: [QUORUM] Members[1]: 2 Feb  5 08:01:03 001db01b
> corosync[1455]: [MAIN  ] Completed service synchronization, ready to
> provide service.
> > Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Purged 1 peer with
> > id=1 and/or uname=001db01a from the membership cache Feb  5 08:01:03
> > 001db01b pacemakerd[1678]:  notice: Node 001db01a state is now lost
> > Feb  5 08:01:03 001db01b crmd[1693]:  notice: State transition
> > S_NOT_DC -> S_ELECTION Feb  5 08:01:03 001db01b crmd[1693]:  notice:
> > Node 001db01a state is now lost Feb  5 08:01:03 001db01b attrd[1691]:
> > notice: Node 001db01a state is now lost Feb  5 08:01:03 001db01b
> > attrd[1691]:  notice: Removing all 001db01a attributes for peer loss
> > Feb  5 08:01:03 001db01b attrd[1691]:  notice: Lost attribute writer
> > 001db01a Feb  5 08:01:03 001db01b attrd[1691]:  notice: Purged 1 peer
> > with id=1 and/or uname=001db01a from the membership cache Feb  5
> > 08:01:03 001db01b crmd[1693]:  notice: State transition S_ELECTION ->
> > S_INTEGRATION Feb  5 08:01:03 001db01b cib[1688

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
Hi Strahil –

I can’t prove there was no network loss, but:


  1.  There were no dmesg indications of ethernet link loss.
  2.  Other than corosync, there are no other log messages about connectivity 
issues.
  3.  Wouldn’t pcsd say something about connectivity loss?
  4.  Both servers are in Azure.
  5.  There are many other servers in the same Azure subscription, including 
other corosync clusters, none of which had issues.

So I guess it’s possible, but it seems unlikely.

--Eric

From: Users  On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Andrei Borzenkov 
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Eric,

what has led you to think that there was no network loss ?

Best Regards,
Strahil Nikolov

On Wednesday, February 5, 2020, 22:59:56 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:



> -Original Message-
> From: Users 
> mailto:users-boun...@clusterlabs.org>> On 
> Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov mailto:arvidj...@gmail.com>>; 
> users@clusterlabs.org<mailto:users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> mailto:arvidj...@gmail.com>> wrote:
> >05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1 Feb
> >> 5 08:01:03 001db01a corosync[1306]: [MAIN  ] Completed service
> >synchronization, ready to provide service.
> >> Feb  5 08:01:03 001db01a stonith-ng[1523]:  notice: Purged 1 peer
> >with id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a pacemakerd[1491]:  notice: Node 001db01b
> >state is now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: State transition S_IDLE
> >-> S_POLICY_ENGINE
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expect
> >node 2 to be down
> >> Feb  5 08:01:03 001db01a crmd[1527]:  notice: Stonith/shutdown of
> >001db01b not matched
> >> Feb  5 08:01:03 001db01a pengine[1526]:  notice: On loss of CCM
> >Quorum: Ignore
> >>
> >> From 001db01b:
> >>
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership
> >(10.51.14.34:960) was formed. Members left: 1
> >> Feb  5 08:01:03 001db01b crmd[1693]:  notice: Our peer on the DC
> >(001db01a) is dead
> >> Feb  5 08:01:03 001db01b stonith-ng[1689]:  notice: Node 001db01a
> >state is now lost
> >> Feb  5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive
> >the leave message. fai

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Eric Robinson
Hi Strahil –

I think you may be right about the token timeouts being too short. I’ve also 
noticed that periods of high load can cause drbd to disconnect. What would you 
recommend for changes to the timeouts?

I’m running Red Hat’s Corosync Cluster Engine, version 2.4.3. The config is 
relatively simple.

Corosync config looks like this…

totem {
version: 2
cluster_name: 001db01ab
secauth: off
transport: udpu
}

nodelist {
node {
ring0_addr: 001db01a
nodeid: 1
}

node {
ring0_addr: 001db01b
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
two_node: 1
}

logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}
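
If I adopt your numbers, I assume the totem section would end up looking 
something like this (token 10s, consensus 12s, values in milliseconds -- not 
yet applied here), followed by a pcs cluster sync and a corosync restart while 
the cluster is in maintenance mode:

totem {
    version: 2
    cluster_name: 001db01ab
    secauth: off
    transport: udpu
    token: 10000
    consensus: 12000
}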


From: Users  On Behalf Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 6:39 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Andrei Borzenkov 
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

Hi Andrei,

don't trust Azure so much :D . I've seen stuff that was way more unbelievable.
Can you check whether other systems in the same subnet reported any issues? Yet, pcs 
most probably won't report any short-term issues. I have noticed that RHEL7 
defaults for token and consensus are quite small and any short-term disruption 
could cause an issue.
Actually when I tested live migration on oVirt - the other hosts fenced the 
node that was migrated.
What is your corosync config and OS version ?

Best Regards,
Strahil Nikolov

On Thursday, February 6, 2020, 01:44:55 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:



Hi Strahil –



I can’t prove there was no network loss, but:



  1.  There were no dmesg indications of ethernet link loss.
  2.  Other than corosync, there are no other log messages about connectivity 
issues.
  3.  Wouldn’t pcsd say something about connectivity loss?
  4.  Both servers are in Azure.
  5.  There are many other servers in the same Azure subscription, including 
other corosync clusters, none of which had issues.



So I guess it’s possible, but it seems unlikely.



--Eric



From: Users 
mailto:users-boun...@clusterlabs.org>> On Behalf 
Of Strahil Nikolov
Sent: Wednesday, February 5, 2020 3:13 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 
mailto:users@clusterlabs.org>>; Andrei Borzenkov 
mailto:arvidj...@gmail.com>>
Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?



Hi Erik,



what has led you to think that there was no network loss ?



Best Regards,

Strahil Nikolov



On Wednesday, February 5, 2020, 22:59:56 GMT+2, Eric Robinson 
<eric.robin...@psmnv.com> wrote:





> -Original Message-
> From: Users 
> mailto:users-boun...@clusterlabs.org>> On 
> Behalf Of Strahil Nikolov
> Sent: Wednesday, February 5, 2020 1:59 PM
> To: Andrei Borzenkov mailto:arvidj...@gmail.com>>; 
> users@clusterlabs.org<mailto:users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov
> mailto:arvidj...@gmail.com>> wrote:
> >05.02.2020 20:55, Eric Robinson wrote:
> >> The two servers 001db01a and 001db01b were up and responsive. Neither
> >had been rebooted and neither were under heavy load. There's no
> >indication in the logs of loss of network connectivity. Any ideas on
> >why both nodes seem to think the other one is at fault?
> >
> >The very fact that nodes lost connection to each other *is* indication
> >of network problems. Your logs start too late, after any problem
> >already happened.
> >
> >>
> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not
> >an option at this time.)
> >>
> >> Log from 001db01a:
> >>
> >> Feb  5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed,
> >forming new configuration.
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership
> >(10.51.14.33:960) was formed. Members left: 2
> >> Feb  5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive
> >the leave message. failed: 2
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Removing all 001db01b
> >attributes for peer loss
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Node 001db01b state is
> >now lost
> >> Feb  5 08:01:03 001db01a cib[1522]:  notice: Purged 1 peer with id=2
> >and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a attrd[1525]:  notice: Purged 1 peer with
> >id=2 and/or uname=001db01b from the membership cache
> >> Feb  5 08:01:03 001db01a crmd[1527]: warning: No reason to expe

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-06 Thread Eric Robinson
Hi Nikolov --

> Defaults are 1s  token,  1.2s  consensus which is too small.
> In Suse, token is 10s, while consensus  is 1.2 * token -> 12s.
> With these settings, cluster  will not react   for 22s.
>
> I think it's a good start for your cluster .
> Don't forget to put  the cluster  in maintenance (pcs property set
> maintenance-mode=true) before restarting the stack ,  or  even better - get
> some downtime.
>
> You can use the following article to run a simulation before removing the
> maintenance:
> https://www.suse.com/support/kb/doc/?id=7022764
>


Thanks for the suggestions. Any thoughts on timeouts for DRBD?
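
The knobs I'm aware of are in the resource's net section -- something like the 
following (the numbers are just the defaults as I remember them, not a 
recommendation):

net {
    timeout     60;   # tenths of a second before the peer is declared dead
    ping-int    10;   # seconds between keep-alive pings on an idle connection
    connect-int 10;   # seconds between connection retry attempts
}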

--Eric

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Why Do Nodes Leave the Cluster?

2020-02-06 Thread Eric Robinson
> >
> > I've done that with all my other clusters, but these two servers are
> > in Azure, so the network is out of our control.
>
> Is a normal cluster supported to use corosync over Internet? I'm not sure
> (because of the delays and possible packet losses).
>
>

As with most things, the main concern is latency and loss. The latency between 
these two nodes is < 1ms, and loss is always 0%.

--Eric


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Verifying DRBD Run-Time Configuration

2020-04-11 Thread Eric Robinson
If I want to know the current DRBD runtime settings such as timeout, ping-int, 
or connect-int, how do I check that? I'm assuming they may not be the same as 
what shows in the config file.
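
The closest I've come up with is drbdsetup, which I believe reports what the 
kernel module is actually using rather than what the config file says (r0 is 
just an example resource name -- please correct me if this is wrong):

# runtime settings as loaded in the kernel
drbdsetup show r0

# for comparison, the parsed on-disk configuration
drbdadm dump r0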

--Eric




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-04-11 Thread Eric Robinson


Hi Strahil --

I hope you won't mind if I revive this old question. In your comments below, 
you suggested using a 1s  token with a 1.2s consensus. I currently have 2-node 
clusters (will soon install a qdevice). I was reading in the corosync.conf man 
page where it says...

"For  two  node  clusters,  a  consensus larger than the join timeout but less 
than token is safe.  For three node or larger clusters, consensus should be 
larger than token."

Do you still think the consensus should be 1.2 * token in a 2-node cluster? Why 
is a smaller consensus considered safe for 2-node clusters? Should I use a 
larger consensus anyway?
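
(Side note: to see what the running cluster is actually using, I've been 
querying cmap, on the assumption that corosync 2.x exposes the effective values 
under runtime.config:)

corosync-cmapctl | grep totem.token
corosync-cmapctl | grep totem.consensus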

--Eric


> -Original Message-
> From: Strahil Nikolov 
> Sent: Thursday, February 6, 2020 1:07 PM
> To: Eric Robinson ; Cluster Labs - All topics
> related to open-source clustering welcomed ;
> Andrei Borzenkov 
> Subject: RE: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
> On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson
>  wrote:
> >Hi Nikolov --
> >
> >> Defaults are 1s  token,  1.2s  consensus which is too small.
> >> In Suse, token is 10s, while consensus  is 1.2 * token -> 12s.
> >> With these settings, cluster  will not react   for 22s.
> >>
> >> I think it's a good start for your cluster .
> >> Don't forget to put  the cluster  in maintenance (pcs property set
> >> maintenance-mode=true) before restarting the stack ,  or  even better
> >- get
> >> some downtime.
> >>
> >> You can use the following article to run a simulation before removing
> >the
> >> maintenance:
> >> https://www.suse.com/support/kb/doc/?id=7022764
> >>
> >
> >
> >Thanks for the suggestions. Any thoughts on timeouts for DRBD?
> >
> >--Eric
> >
> >Disclaimer : This email and any files transmitted with it are
> >confidential and intended solely for intended recipients. If you are
> >not the named addressee you should not disseminate, distribute, copy or
> >alter this email. Any views or opinions presented in this email are
> >solely those of the author and might not represent those of Physician
> >Select Management. Warning: Although Physician Select Management has
> >taken reasonable precautions to ensure no viruses are present in this
> >email, the company cannot accept responsibility for any loss or damage
> >arising from the use of this email or attachments.
>
> Hi Eric,
>
> The timeouts can be treated as 'how much time to wait before  taking any
> action'. The workload is not very important (HANA  is something different).
>
> You can try with 10s (token) , 12s (consensus) and if needed  you can adjust.
>
> Warning: Use a 3 node cluster or at least 2 drbd nodes + qdisk. The 2 node
> cluster is vulnerable to split brain, especially when one of the nodes  is
> syncing  (for example after a patching) and the source is
> fenced/lost/disconnected. It's very hard to extract data from a semi-synced
> drbd.
>
> Also, if you need guidance for the SELINUX, I can point you to my guide in the
> centos forum.
>
> Best Regards,
> Strahil Nikolov
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] qdevice up and running -- but questions

2020-04-11 Thread Eric Robinson
  1.  What command can I execute on the qdevice node which tells me which
client nodes are connected and alive? (Best guess below.)


  2.  In the output of the pcs qdevice status command, what is the meaning of...


Vote:   ACK (ACK)


  3.  In the output of the pcs quorum status command, what is the meaning of...

Membership information
--
Nodeid  VotesQdevice Name
 1  1A,V,NMW 001db03a
 2  1A,V,NMW 001db03b (local)
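
Regarding the first question, my best guess so far (unverified) is the qnetd 
tool on the qdevice host itself:

# run on the qnetd host -- lists the clusters and nodes currently connected
corosync-qnetd-tool -l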


--Eric

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

