Re: [ClusterLabs] Node add doesn't add node?

2019-01-17 Thread Israel Brewster

On Jan 11, 2019, at 3:53 AM, Jan Pokorný <jpoko...@redhat.com> wrote:

On 11/01/19 00:16 +, Israel Brewster wrote:
On Jan 10, 2019, at 10:57 AM, Israel Brewster <ibrews...@flyravn.com> wrote:

So in my ongoing work to upgrade my cluster to CentOS 7, I got one
box up and running on CentOS 7, with the cluster fully configured
and functional, and moved all my services over to it. Now I'm trying
to add a second node, following the directions here:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-clusternodemanage-haar#s2-nodeadd-HAAR

However, it doesn't appear to be working. The existing node is named
"follow3", and the new node I am trying to add is named "follow1":

- The auth command run from follow3 returns "follow1: Authorized", so that 
looks good.
- The "pcs cluster node add follow1" command, again run on follow3, gives the 
following output:

Disabling SBD service...
follow1: sbd disabled
Sending remote node configuration files to 'follow1'
follow1: successful distribution of the file 'pacemaker_remote authkey'
follow3: Corosync updated
Setting up corosync...
follow1: Succeeded
Synchronizing pcsd certificates on nodes follow1...
follow1: Success
Restarting pcsd on the nodes in order to reload the certificates...
follow1: Success

...So it would appear that that worked as well. I then issued the
"pcs cluster start --all" command, which gave the following output:

[root@follow3 ~]# pcs cluster start --all
follow3: Starting Cluster (corosync)...
follow1: Starting Cluster (corosync)...
follow3: Starting Cluster (pacemaker)...
follow1: Starting Cluster (pacemaker)...

So again, everything looks good (to me). However, when I run "pcs
status" on the existing node, I get the following:

[root@follow3 ~]# pcs status
Cluster name: follow
Stack: corosync
Current DC: follow3 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with 
quorum
Last updated: Thu Jan 10 10:47:33 2019
Last change: Wed Jan  9 21:39:37 2019 by root via cibadmin on follow3

1 node configured
29 resources configured

Online: [ follow3 ]

Full list of resources:

which would seem to indicate that it doesn't know about the node I
just added (follow1). Meanwhile, follow1 "pcs status" shows this:

[root@follow1 ~]# pcs status
Cluster name: follow
Stack: corosync
Current DC: follow1 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition WITHOUT 
quorum
Last updated: Thu Jan 10 10:54:25 2019
Last change: Thu Jan 10 10:54:13 2019 by root via cibadmin on follow1

2 nodes configured
0 resources configured

Online: [ follow1 ]
OFFLINE: [ follow3 ]

No resources


Daemon Status:
 corosync: active/disabled
 pacemaker: active/disabled
 pcsd: active/enabled

So it got at least *some* of the config, but apparently not the full
thing (no resources), and it shows follow3 as offline, even though
it is online and reachable. Oddly "pcs cluster status" shows both
follow1 and follow3 pcsd status as online. What am I missing here?
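
For reference, corosync's own view of membership can be cross-checked directly; a quick sketch, assuming the stock CentOS 7 tooling:

    corosync-cmapctl | grep members   # runtime membership as corosync itself sees it
    pcs status corosync               # corosync node list as pcs reports it

If follow1 never shows up in the members list on follow3, the two corosync instances are not actually talking to each other.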

As a follow-up to the above, restarting corosync on the functioning
node (follow3) at least allows the second node (follow1) to show up
when I do a pcs status, however the second node still shows as
OFFLINE (and follow3 shows as offline on follow1), and follow1 is
still missing pretty much all of the config. If I try to remove and
re-add follow1, the removal works as expected (node count on follow3
drops to 1), but the add behaves exactly the same as before, with
pcs status not acknowledging the added node.

What do the logs on follow1 have to say about this?
E.g. journalctl -b --no-hostname -u corosync -u pacemaker, focusing
on the respective suspect time.

If there's nothing sufficiently explaining what actually happened,
you can still review the underlying pcs communication itself if you
pass --debug to it.

I suspect that simply one corosync instance doesn't see the other
for whatever reason (firewall, bad addresses or not on the same
network at all, addresses out of sync between particular nodes,
in corosync.conf, or possibly even in /etc/hosts or DNS source,
...).
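
A few quick checks along those lines; a sketch, since the exact ports and interfaces depend on the setup:

    corosync-cfgtool -s              # ring status and the local address corosync bound to
    getent hosts follow1 follow3     # confirm both names resolve the same way on both nodes
    firewall-cmd --list-all          # corosync typically needs UDP 5404-5406 open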


So apparently this was something messed up on follow3, although I don't know 
what. I ended up doing the following, which worked:

1) Set up a new VM ('follow4')
2) Cluster it with follow1
3) Dump JUST the resources and constraints from follow3
4) Load the above .xml files to the new cluster (follow1 and follow4)

Once I did the above, I was able to add an additional node (follow2) to the new 
follow1/follow4 cluster with no problems. So while I don't know what was going 
on with follow3, at least I now have a properly functioning cluster again!
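
Roughly how steps 3 and 4 might look with cibadmin; file names here are illustrative, not the exact commands used:

    # On follow3:
    cibadmin --query --scope resources   > resources.xml
    cibadmin --query --scope constraints > constraints.xml
    # On the new follow1/follow4 cluster:
    cibadmin --replace --scope resources   --xml-file resources.xml
    cibadmin --replace --scope constraints --xml-file constraints.xml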

--
Nazdar,
Jan (Poki)
Re: [ClusterLabs] Node add doesn't add node?

2019-01-10 Thread Israel Brewster
On Jan 10, 2019, at 10:57 AM, Israel Brewster <ibrews...@flyravn.com> wrote:

So in my ongoing work to upgrade my cluster to CentOS 7, I got one box up and 
running on CentOS 7, with the cluster fully configured and functional, and 
moved all my services over to it. Now I'm trying to add a second node, 
following the directions here:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-clusternodemanage-haar#s2-nodeadd-HAAR

However, it doesn't appear to be working. The existing node is named "follow3", 
and the new node I am trying to add is named "follow1":

- The auth command run from follow3 returns "follow1: Authorized", so that 
looks good.
- The "pcs cluster node add follow1" command, again run on follow3, gives the 
following output:

Disabling SBD service...
follow1: sbd disabled
Sending remote node configuration files to 'follow1'
follow1: successful distribution of the file 'pacemaker_remote authkey'
follow3: Corosync updated
Setting up corosync...
follow1: Succeeded
Synchronizing pcsd certificates on nodes follow1...
follow1: Success
Restarting pcsd on the nodes in order to reload the certificates...
follow1: Success

...So it would appear that that worked as well. I then issued the "pcs cluster 
start --all" command, which gave the following output:

[root@follow3 ~]# pcs cluster start --all
follow3: Starting Cluster (corosync)...
follow1: Starting Cluster (corosync)...
follow3: Starting Cluster (pacemaker)...
follow1: Starting Cluster (pacemaker)...

So again, everything looks good (to me). However, when I run "pcs status" on 
the existing node, I get the following:

[root@follow3 ~]# pcs status
Cluster name: follow
Stack: corosync
Current DC: follow3 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with 
quorum
Last updated: Thu Jan 10 10:47:33 2019
Last change: Wed Jan  9 21:39:37 2019 by root via cibadmin on follow3

1 node configured
29 resources configured

Online: [ follow3 ]

Full list of resources:

which would seem to indicate that it doesn't know about the node I just added 
(follow1). Meanwhile, follow1 "pcs status" shows this:

[root@follow1 ~]# pcs status
Cluster name: follow
Stack: corosync
Current DC: follow1 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition WITHOUT 
quorum
Last updated: Thu Jan 10 10:54:25 2019
Last change: Thu Jan 10 10:54:13 2019 by root via cibadmin on follow1

2 nodes configured
0 resources configured

Online: [ follow1 ]
OFFLINE: [ follow3 ]

No resources


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

So it got at least *some* of the config, but apparently not the full thing (no 
resources), and it shows follow3 as offline, even though it is online and 
reachable. Oddly "pcs cluster status" shows both follow1 and follow3 pcsd 
status as online. What am I missing here?
---
Israel Brewster
Systems Analyst II
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---

As a follow-up to the above, restarting corosync on the functioning node 
(follow3) at least allows the second node (follow1) to show up when I do a pcs 
status, however the second node still shows as OFFLINE (and follow3 shows as 
offline on follow1), and follow1 is still missing pretty much all of the 
config. If I try to remove and re-add follow1, the removal works as expected 
(node count on follow3 drops to 1), but the add behaves exactly the same as 
before, with pcs status not acknowledging the added node.
---
Israel Brewster
Systems Analyst II
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---
___
Users mailing list: Users@clusterlabs.org<mailto:Users@clusterlabs.org>
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



[ClusterLabs] Node add doesn't add node?

2019-01-10 Thread Israel Brewster
So in my ongoing work to upgrade my cluster to CentOS 7, I got one box up and 
running on CentOS 7, with the cluster fully configured and functional, and 
moved all my services over to it. Now I'm trying to add a second node, 
following the directions here:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-clusternodemanage-haar#s2-nodeadd-HAAR

However, it doesn't appear to be working. The existing node is named "follow3", 
and the new node I am trying to add is named "follow1":

- The auth command run from follow3 returns "follow1: Authorized", so that 
looks good.
- The "pcs cluster node add follow1" command, again run on follow3, gives the 
following output:

Disabling SBD service...
follow1: sbd disabled
Sending remote node configuration files to 'follow1'
follow1: successful distribution of the file 'pacemaker_remote authkey'
follow3: Corosync updated
Setting up corosync...
follow1: Succeeded
Synchronizing pcsd certificates on nodes follow1...
follow1: Success
Restarting pcsd on the nodes in order to reload the certificates...
follow1: Success

...So it would appear that that worked as well. I then issued the "pcs cluster 
start --all" command, which gave the following output:

[root@follow3 ~]# pcs cluster start --all
follow3: Starting Cluster (corosync)...
follow1: Starting Cluster (corosync)...
follow3: Starting Cluster (pacemaker)...
follow1: Starting Cluster (pacemaker)...

So again, everything looks good (to me). However, when I run "pcs status" on 
the existing node, I get the following:

[root@follow3 ~]# pcs status
Cluster name: follow
Stack: corosync
Current DC: follow3 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with 
quorum
Last updated: Thu Jan 10 10:47:33 2019
Last change: Wed Jan  9 21:39:37 2019 by root via cibadmin on follow3

1 node configured
29 resources configured

Online: [ follow3 ]

Full list of resources:

which would seem to indicate that it doesn't know about the node I just added 
(follow1). Meanwhile, follow1 "pcs status" shows this:

[root@follow1 ~]# pcs status
Cluster name: follow
Stack: corosync
Current DC: follow1 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition WITHOUT 
quorum
Last updated: Thu Jan 10 10:54:25 2019
Last change: Thu Jan 10 10:54:13 2019 by root via cibadmin on follow1

2 nodes configured
0 resources configured

Online: [ follow1 ]
OFFLINE: [ follow3 ]

No resources


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

So it got at least *some* of the config, but apparently not the full thing (no 
resources), and it shows follow3 as offline, even though it is online and 
reachable. Oddly "pcs cluster status" shows both follow1 and follow3 pcsd 
status as online. What am I missing here?
---
Israel Brewster
Systems Analyst II
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---



Re: [ClusterLabs] Upgrading from CentOS 6 to CentOS 7

2019-01-09 Thread Israel Brewster
> On Jan 3, 2019, at 1:56 PM, Ken Gaillot  wrote:
> 
> On Thu, 2019-01-03 at 19:40 +0000, Israel Brewster wrote:
> 
>> If I do need to build a new CentOS cluster, how can I get it fully
>> set up with all the resources, but NOT let it start anything until I
>> perform the cutover? Obviously it would be a bad thing for both the
>> CentOS 7 box and the existing CentOS 6 boxes to have the same IP's! 
> 
> I'd make the new configuration as identical as possible to the old one,
> but use some test IPs until it's ready to go live.
> 
> You can get the old config with "pcs cluster cib <filename> --config",
> copy that to the new nodes, edit it as needed with "pcs -f <filename>
> ...", then activate it with "pcs cluster cib-push <filename> --config".

Thanks, that worked like a charm. I actually edited the XML file directly to 
replace all instances of "lsb" with "systemd", and changed the IP addresses to 
a different subnet, and the edited XML file loaded properly and (upon fixing 
the launch of a couple of services) everything came up as expected. If I 
understand things correctly, as I move forward I can simply add new nodes to 
the cluster, and the config will copy across automatically, so that should be 
good!
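
For reference, a minimal sketch of that workflow; the file name and the blanket lsb-to-systemd substitution are illustrative:

    pcs cluster cib old-config.xml --config        # dump the configuration-only CIB
    sed -i 's/\blsb:/systemd:/g' old-config.xml    # swap agent classes, as described above
    # edit IP addresses etc. in old-config.xml, then on the new cluster:
    pcs cluster cib-push old-config.xml --config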

> 
>> Can I copy *any* of the config from the existing CentOS 6 boxes, or
>> do I have to fully re-create all the resources from scratch on the
>> CentOS 7 box? I'm assuming that initially having a "single node"
>> cluster (until I can rebuild the other CentOS 6 box to CentOS 7)
>> won't be an issue.
> 
> Single-node clusters are fine.
> 
> I don't know what your resources are, but you probably have some data
> that will need to be copied from the old to the new when going live.
> (Stop old -> sync data -> start new.)

Actually, the data is all stored on a separate database server, so no issues 
there. In fact, everything is designed so multiple nodes can use the same data 
at the same time, so I can have the new one up and running (aside from IP's) at 
the same time as the old one with no problems.

> 
> Everything will be the same whether you want to upgrade the existing
> cluster one node at a time, or replace it with a new cluster. But one
> node at a time would mean you don't have high availability during the
> upgrade.

Right. Which *should* be fine. Of course, famous last words...

> 
> For more tips, see:
> 
> http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#_upgrading
> 
> (You're stuck with the "complete cluster shutdown" method since you're
> updating the OS and changing corosync major versions.)

Thanks. Good information.

> 
> 
>> Thanks for any input you can provide!
>> ---
>> Israel Brewster
>> Systems Analyst II
>> 5245 Airport Industrial Rd
>> Fairbanks, AK 99709
>> (907) 450-7293
>> ---
> 



[ClusterLabs] Upgrading from CentOS 6 to CentOS 7

2019-01-03 Thread Israel Brewster
I currently have a cluster set up using pacemaker/corosync on two CentOS 6 
boxes (fully updated). It has been requested that I "upgrade" the boxes to 
CentOS 7 (full rebuild on separate machine, obviously). What is it going to 
take to migrate the cluster with minimal downtime?

The cluster is currently a mix of ipaddr2 and lsb type resources, with a couple 
of other built-in ocf types (redis, for example). Presumably the IPAddr and 
redis resources would move across no problem, but the lsb type resources would 
need to be converted to systemd resources. Given that (and perhaps other 
things?) would I be correct in assuming that I will not be able to simply add 
the new CentOS 7 box to the existing CentOS 6 cluster and have the config and 
everything move across automatically?

If I do need to build a new CentOS cluster, how can I get it fully set up with 
all the resources, but NOT let it start anything until I perform the cutover? 
Obviously it would be a bad thing for both the CentOS 7 box and the existing 
CentOS 6 boxes to have the same IP's! Can I copy *any* of the config from the 
existing CentOS 6 boxes, or do I have to fully re-create all the resources from 
scratch on the CentOS 7 box? I'm assuming that initially having a "single node" 
cluster (until I can rebuild the other CentOS 6 box to CentOS 7) won't be an 
issue.

Thanks for any input you can provide!
---
Israel Brewster
Systems Analyst II
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---



Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-19 Thread Israel Brewster

> On Dec 19, 2016, at 11:36 AM, al...@amisw.com wrote:
> 
>> Maybe I'm missing something here, and if so, my apologies, but to me it
>> looks like you are trying to put the same IP address on three different
>> machines SIMULTANEOUSLY.
> 
> Yes it what I do. But it's seem normal for me, I just follow guide like
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_clone_the_ip_address.html

Ah, I see. I was missing something. Specifically this:

"The IPaddr2 resource agent has built-in intelligence for when it is configured 
as a clone..."

I was unaware of that. So yes, it does seem that it should be working, given 
the way the resource agent works. Sorry about that. Disregard my earlier 
comments.
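
For reference, a pcs rendering of the crm configuration quoted later in this thread; a sketch, since clone-option syntax varies by pcs version:

    pcs resource create ip_apache_localnet ocf:heartbeat:IPaddr2 \
        ip=10.0.0.99 cidr_netmask=32 op monitor interval=30s
    pcs resource clone ip_apache_localnet \
        globally-unique=true clone-max=3 clone-node-max=3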

> 
> and work fine in a 2 nodes configurations. For me, this work with arp
> multicast, who give same "virtual" arp to different hosts, and work with
> iptable CLUSTERIP special rule (in very shortcut). But may be I totally
> misunderstand the stuff, but I work fine with that for the last 4 years so
> ... ?
> 



Re: [ClusterLabs] Random failure with clone of IPaddr2

2016-12-19 Thread Israel Brewster
Maybe I'm missing something here, and if so, my apologies, but to me it looks 
like you are trying to put the same IP address on three different machines 
SIMULTANEOUSLY. This will never work from a networking standpoint - it has 
nothing to do with pacemaker, etc, other than that it is responsible for 
creating the situation (since you told it to). The three machines will 
constantly be arguing over who is really responding to that IP address. 
Depending on your network hardware and IP stack, the result may vary, but 
"random failures" is a good description of the behavior I would expect.

A given IP address should be assigned to one, and only one, machine at any 
given time. Feel free to move it around to other machines at will, but it 
should never be active on more than one machine (on a given network segment) 
at any given time, or you *will* have issues. As such, a clone set is not 
good for use with an IPAddr resource.
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---


On Dec 19, 2016, at 5:41 AM, al...@amisw.com wrote:

Hi,

My problem is still here. I search but don't find. I try to change network 
cable to put the 3 hosts together on same switch, but same problem.

So with this:

primitive ip_apache_localnet ocf:heartbeat:IPaddr2 \
  params ip="10.0.0.99" \
  cidr_netmask="32" op monitor interval="30s"
clone cl_ip_apache_localnet ip_apache_localnet \
  meta globally-unique="true" clone-max="3" clone-node-max="3"

3 Nodes A B C.

If resource on:
A + B => ok
Only A => ok
Only B => ok
Only C => ok
A + C => random fail
B + C => random fail
A + B + C => random fail

When I say random fail, I do a curl http://10.0.0.99. I can see request 
with tcpdump. I can reach all the three hosts. But 1 time on 6 or 7, the 
curl request hang. I see with tcpdump the request get in, but no host 
answer. I suspect host C but can't find why he don't do the job. If I 
ctrl-c & redo the request, I got answer.

I check all firewall / log and don't see any error msg. If someone have a 
clue, he's very welcome!


[ClusterLabs] Bug in ocf-shellfuncs, ocf_local_nodename function?

2016-11-17 Thread Israel Brewster
This refers specifically to build version 5434e9646462d2c3c8f7aad2609d0ef1875839c7 
of the ocf-shellfuncs file, on CentOS 6.8, so it might not be an issue on later 
builds (if any) or different operating systems, but it would appear that the 
ocf_local_nodename function can have issues with certain configurations. 
Specifically, I was debugging an issue I was having with a resource agent that 
I traced down to that function returning the FQDN of the machine rather than 
the actual node name, which in my case was a short name.

In looking at the code, I see that the function is looking for a pacemaker 
version greater than 1.1.8, in which case it uses crm_node (which works); 
otherwise it just uses "uname -n", which returns the FQDN (at least in my 
configuration). To get the current version, it runs the command:

local version=$(pacemakerd -$ | grep "Pacemaker .*" | awk '{ print $2 }')

Which on CentOS 6.8 returns (as of today, at least):

1.1.14-8.el6_8.1

Unfortunately, when that string is passed to the ocf_version_cmp function to 
compare against 1.1.8, it returns 3, for "bad format", and so falls back to 
using "uname -n", even though the version *is* greater than 1.1.8, and 
crm_node would return the proper value.

Of course, if you always set up your cluster to use the FQDN of the servers 
as the node name, or more specifically always set them up such that the 
output of uname -n is the node name, then there isn't an issue other than 
perhaps an undetectably slight loss of efficiency. However, as I accidentally 
proved by doing otherwise, there is no actual requirement when setting up a 
cluster that the node names match uname -n (although perhaps it is considered 
"best practice"?), as long as they resolve to an IP.

I've worked around this in my installation by simply modifying the resource 
agent to call crm_node directly (since I know I am running on a version 
greater than 1.1.8), but I figured I might mention it, since I don't get any 
results when trying to google the issue.
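
One possible fix, sketched here and untested: strip the package-release suffix before handing the string to ocf_version_cmp (this assumes the version format shown above):

    # "1.1.14-8.el6_8.1" -> "1.1.14", a format ocf_version_cmp can parse
    local version=$(pacemakerd -$ | grep "Pacemaker .*" | awk '{ print $2 }' | cut -d- -f1)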
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---




[ClusterLabs] Locate resource with functioning member of clone set?

2016-11-17 Thread Israel Brewster
I have a resource that is set up as a clone set across my cluster, partly for 
pseudo-load balancing (if someone wants to perform an action that will take a 
lot of resources, I can have them do it on a different node than the primary 
one), but also simply because the resource can take several seconds to start, 
and by having it already running as a clone set, I can fail over in the time 
it takes to move an IP resource - essentially zero down time.

This is all well and good, but I ran into a problem the other day where the 
process on one of the nodes stopped working properly. Pacemaker caught the 
issue, and tried to fix it by restarting the resource, but was unable to 
because the old instance hadn't actually exited completely and was still 
tying up the TCP port, thereby preventing the new instance that pacemaker 
launched from being able to start.

So this leaves me with two questions:

1) Is there a way to set up a "kill script", such that before trying to 
launch a new copy of a process, pacemaker will run this script, which would 
be responsible for making sure that there are no other instances of the 
process running?

2) Even in the above situation, where pacemaker couldn't launch a good copy 
of the resource on the one node, the situation could have been easily 
"resolved" by pacemaker moving the virtual IP resource to another node where 
the cloned resource was running correctly, and notifying me of the problem. 
I know how to make colocation constraints in general, but how do I do a 
colocation constraint with a cloned resource where I just need the virtual 
IP running on *any* node where the clone is working properly? Or is it the 
same as any other colocation resource, and pacemaker is simply smart enough 
to both try to restart the failed resource and move the virtual IP resource 
at the same time?

As an addendum to question 2, I'd be interested in any methods there may be 
to be notified of changes in the cluster state, specifically things like 
when a resource fails on a node - my current nagios/icinga setup doesn't 
catch that when pacemaker properly moves the resource to a different node, 
because the resource remains up (which, of course, is the whole point), but 
it would still be good to know something happened so I could look into it 
and see if something needs fixing on the failed node to allow the resource 
to run there properly.

Thanks!
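
For what it's worth, question 2 is ordinarily expressed as a plain colocation with the clone; a sketch with illustrative resource names:

    pcs constraint colocation add virtual_ip with my_service-clone INFINITY

With a mandatory (INFINITY) score, the IP should only be placed on nodes where a clone instance is actually running.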
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---




[ClusterLabs] set start-failure-is-fatal per resource?

2016-10-17 Thread Israel Brewster
I have one resource agent (redis, to be exact) that sometimes apparently 
fails to start on the first attempt. In every case, simply running a 'pcs 
resource cleanup' such that pacemaker tries to start it again successfully 
starts the process. Now, obviously, the proper thing to do is to figure out 
why redis sometimes fails to start, and fix that (it sort of feels like 
there may be a race condition going on somewhere to me), which I fully 
intend to spend some time investigating. However, in the meantime the quick 
"fix" is to simply have Pacemaker try starting the resource again if it 
fails the first time. This is easily accomplished by setting the property 
start-failure-is-fatal to false, which works beautifully for the redis 
resource. However, this is a global setting, and for most resources it 
doesn't make sense - if they don't start on the first try, I'm probably 
going to need to fix something before they will start.

So the question is, is there a way to set the start-failure-is-fatal 
property to false only for one resource, or in some other way configure the 
resources such that the effect is the same - i.e. the one resource I want 
will retry start, but the others will give up after a single failure? Thanks.
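
One workaround sometimes suggested, sketched here rather than confirmed: disable start-failure-is-fatal globally and approximate fail-fast behavior on every other resource with migration-threshold=1:

    pcs property set start-failure-is-fatal=false
    pcs resource meta some_other_resource migration-threshold=1   # hypothetical resource name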
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---




Re: [ClusterLabs] Replicated PGSQL woes [solved]

2016-10-14 Thread Israel Brewster
> 
> On Oct 14, 2016, at 12:30 AM, Keisuke MORI <keisuke.mori...@gmail.com> wrote:
>> 
>> 2016-10-14 2:04 GMT+09:00 Israel Brewster <isr...@ravnalaska.net>:
>>> Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql
>>> starts initially, but failover never happens.
>> 
>>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not
>>> exist.
>>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is
>>> out-of-date. status=DISCONNECT
>>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not
>>> exist.
>>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is
>>> out-of-date. status=DISCONNECT
>>> 
>>> Those last two lines repeat indefinitely, but there is no indication that
>>> the cluster ever tries to promote centtest1 to master. Even if I completely
>>> shut down the cluster, and bring it back up only on centtest1, pacemaker
>>> refuses to start postgresql on centtest1 as a master.
>> 
>> This is because the data on centtest1 is considered "out-of-date"-ed
>> (as it says :) and promoting the node to master might corrupt your
>> database.
> 
> Ok, that makes sense. So the problem is why the cluster thinks the data is 
> out-of-date

Turns out the problem was a simple typo in my resource creation command: I had 
typed centest1.ravnalaska.net in the node_list rather than 
centtest1.ravnalaska.net (note the missing t in the middle). So when 
trying to get the status, it never got a status for centtest1, which meant it 
defaulted to DISCONNECT and HS:alone. Once I fixed that typo, failover worked, 
at least so far, and I can even bring the old master back up as a slave after 
deleting the lock file that the RA leaves behind. Wow, that was annoying to 
track down! Maybe I need to be more careful about picking machine names - 
choose something that's harder to mess up :-)

Thanks for all the suggestions and help!
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---

> 
>> 
>>> 
>>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>>> 
>> 
>> It seems that the latest data should be only on centtest2 so the
>> recovering steps should be something like:
>> - start centtest2 as master
>> - take the basebackup from centtest2 to centtest1
>> - start centtest1 as slave
>> - make sure the replications is working properly
> 
> I've done that. Several times. The replication works properly with either 
> node as the master. Initially I had started centtest1 as master, because 
> that's where I was planning to *have* the master, however when pacemaker keep 
> insisting on starting centtest2 as the master, I also tried setting things up 
> that way. No luck: everything works fine, but no failover.
> 
>> 
>> see below for details.
>> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster
> Yep, that's where I started from on this little adventure :-)
> 
>> 
>> 
>> Also, it would be helpful to check 'pgsql-data-status' and
>> 'pgsql-status' attributes displayed by 'crm_mon -A' to diagnose
>> whether the replications is going well or not.
>> 
>> The slave node should have the attributes like below, otherwise the
>> replications is going something wrong and the node will never be
>> promoted because it does not have the proper data.
>> 
>> ```
>> * Node node2:
>>+ master-pgsql  : 100
>>+ pgsql-data-status : STREAMING|SYNC
>>+ pgsql-status  : HS:sync
>> ```
> 
> Now THAT is interesting. I get this:
> 
> Node Attributes:
> * Node centtest1.ravnalaska.net:
> + master-pgsql_96   : -INFINITY 
> + pgsql_96-data-status  : DISCONNECT
> + pgsql_96-status   : HS:alone  
> * Node centtest2.ravnalaska.net:
> + master-pgsql_96   : 1000
> + pgsql_96-data-status  : LATEST
> + pgsql_96-master-baseline  : 070171D0
> + pgsql_96-stat

Re: [ClusterLabs] Antw: Replicated PGSQL woes

2016-10-14 Thread Israel Brewster
On Oct 13, 2016, at 11:36 PM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> 
wrote:
> 
>>>> Israel Brewster <isr...@ravnalaska.net> wrote on 13.10.2016 at 19:04 in
> message <34091524-d35e-4e28-9c3e-dda6c6a1e...@ravnalaska.net>:
> [...]
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: State transition S_IDLE -> 
>> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
>> origin=abort_transition_graph ]
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: On loss of CCM Quorum: 
>> Ignore
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Stop
>> virtual_ip#011(centtest2.ravnalaska.net)
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Demote  
>> pgsql_96:0#011(Master -> Stopped centtest2.ravnalaska.net)
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Calculated Transition 
>> 193: /var/lib/pacemaker/pengine/pe-input-500.bz2
> 
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 43: 
>> notify pgsql_96_pre_notify_demote_0 on centtest2.ravnalaska.net
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 45: 
>> notify pgsql_96_pre_notify_demote_0 on centtest1.ravnalaska.net (local)
> 
> The above section looks wrong, because if one resource is master and the 
> other is slave, both cannot be demoted (AFAIK).. I'm also surprised that the 
> cluster tries to demote a failed master; maybe you have no fencing configured?

Well, technically it's not a "failure" the way I'm testing, it's a clean 
shutdown. So no fencing is needed, because the system knows I shut down that 
node. Effectively, I manually fenced the node. FWIW, I've also tried doing a 
complete shutdown of the node (not just the cluster software, but the actual 
OS). Still never promotes on the other machine. From further investigation, it 
*looks* like it might be because it doesn't think the other machine is 
replicating properly, and as such shouldn't be trusted to be a master.
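
One way to see what pacemaker thinks about the replication state (crm_mon's node attributes, as also suggested elsewhere in these threads):

    crm_mon -A -1    # one-shot status including attributes such as pgsql-status and pgsql-data-status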

And no, I don't have fencing configured yet. I know it is important, but these 
are just test VM's I'm working on without fencing hardware, trying to get the 
basic operation working. The final deployment will, of course, have proper 
fencing (once I figure out how to make *that* work, but that's a different 
subject)

> 
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Operation 
>> pgsql_96_notify_0: ok (node=centtest1.ravnalaska.net, call=230, rc=0, 
>> cib-update=0, confirmed=true)
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 6: demote 
>> pgsql_96_demote_0 on centtest2.ravnalaska.net
> 
> "action 6": Where does it come from? We had 43 and 45!
> 
> [...]
> 
> Ulrich
> 
> 
> 




Re: [ClusterLabs] Replicated PGSQL woes

2016-10-14 Thread Israel Brewster
On Oct 14, 2016, at 12:30 AM, Keisuke MORI <keisuke.mori...@gmail.com> wrote:
> 
> 2016-10-14 2:04 GMT+09:00 Israel Brewster <isr...@ravnalaska.net>:
>> Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql
>> starts initially, but failover never happens.
> 
>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not
>> exist.
>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is
>> out-of-date. status=DISCONNECT
>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not
>> exist.
>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is
>> out-of-date. status=DISCONNECT
>> 
>> Those last two lines repeat indefinitely, but there is no indication that
>> the cluster ever tries to promote centtest1 to master. Even if I completely
>> shut down the cluster, and bring it back up only on centtest1, pacemaker
>> refuses to start postgresql on centtest1 as a master.
> 
> This is because the data on centtest1 is considered "out-of-date"-ed
> (as it says :) and promoting the node to master might corrupt your
> database.

Ok, that makes sense. So the problem is why the cluster thinks the data is 
out-of-date

> 
>> 
>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>> 
> 
> It seems that the latest data should be only on centtest2 so the
> recovering steps should be something like:
> - start centtest2 as master
> - take the basebackup from centtest2 to centtest1
> - start centtest1 as slave
> - make sure the replications is working properly

I've done that. Several times. The replication works properly with either node 
as the master. Initially I had started centtest1 as master, because that's 
where I was planning to *have* the master, however when pacemaker kept 
insisting on starting centtest2 as the master, I also tried setting things up 
that way. No luck: everything works fine, but no failover.
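
For reference, the basebackup step in the quoted list is typically something along these lines (flags and paths illustrative):

    pg_basebackup -h centtest2 -U postgres -D /pgsql96/data -X stream -P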

> 
> see below for details.
> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster
Yep, that's where I started from on this little adventure :-)

> 
> 
> Also, it would be helpful to check 'pgsql-data-status' and
> 'pgsql-status' attributes displayed by 'crm_mon -A' to diagnose
> whether the replications is going well or not.
> 
> The slave node should have the attributes like below, otherwise the
> replications is going something wrong and the node will never be
> promoted because it does not have the proper data.
> 
> ```
> * Node node2:
>+ master-pgsql  : 100
>+ pgsql-data-status : STREAMING|SYNC
>+ pgsql-status  : HS:sync
> ```

Now THAT is interesting. I get this:

Node Attributes:
* Node centtest1.ravnalaska.net:
+ master-pgsql_96   : -INFINITY 
+ pgsql_96-data-status  : DISCONNECT
+ pgsql_96-status   : HS:alone  
* Node centtest2.ravnalaska.net:
+ master-pgsql_96   : 1000
+ pgsql_96-data-status  : LATEST
+ pgsql_96-master-baseline  : 070171D0
+ pgsql_96-status   : PRI

...Which seems to indicate that pacemaker doesn't think centtest1 is connected 
to or replicating centtest2 (if I am interpreting that correctly). And yet, it 
is: From postgres itself:

[root@CentTest2 ~]# /usr/pgsql-9.6/bin/psql -h centtest2 -U postgres
psql (9.6.0)
Type "help" for help.

postgres=# SELECT * FROM pg_replication_slots;
    slot_name    | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 centtest_2_slot |        | physical  |        |          | t      |      27230 | 1685 |              | 0/7017438   |
(1 row)

postgres=# 

Notice that "active" is true, indicating that the slot is connected and, well, 
active. Plus, from the postgresql log on centtest1:

< 2016-10-14 08:19:38.278 AKDT > LOG:  entering standby mode
< 2016-10-14 08:19:38.285 AKDT > LOG:  consistent recovery state reached at 
0/7017358
< 2016-10-14 08:19:38.285 AKDT > LOG:  redo starts at 0/7017358
< 2016-10-14 08:19:38.285 AKDT > LOG:  invalid record length at 0/7017438: 
wanted 24, got 0
< 2016-10-14 08:19:38.286 AKDT > LOG:  database system is ready to accept read 
only connections
< 2016-10-14 08:19:38.292 AKDT > LOG:  started streaming WAL from primary at 
0/700 on timeline 1

And furthermore, if I insert/change records on centtest2, those changes *do* 
show up on centtest1. So

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-14 Thread Israel Brewster
On Oct 14, 2016, at 1:39 AM, Jehan-Guillaume de Rorthais <j...@dalibo.com> 
wrote:
> 
> On Thu, 13 Oct 2016 14:11:06 -0800
> Israel Brewster <isr...@ravnalaska.net> wrote:
> 
>> On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais <j...@dalibo.com>
>> wrote:
>>> 
>>> On Thu, 13 Oct 2016 10:05:33 -0800
>>> Israel Brewster <isr...@ravnalaska.net> wrote:
>>> 
>>>> On Oct 13, 2016, at 9:41 AM, Ken Gaillot <kgail...@redhat.com> wrote:  
>>>>> 
>>>>> On 10/13/2016 12:04 PM, Israel Brewster wrote:
>>> [...]
>>> 
>>>>>> But whatever- this is a cluster, it doesn't really matter which node
>>>>>> things are running on, as long as they are running. So the cluster is
>>>>>> working - postgresql starts, the master process is on the same node as
>>>>>> the IP, you can connect, etc, everything looks good. Obviously the next
>>>>>> thing to try is failover - should the master node fail, the slave node
>>>>>> should be promoted to master. So I try testing this by shutting down the
>>>>>> cluster on the primary server: "pcs cluster stop"
>>>>>> ...and nothing happens. The master shuts down (uncleanly, I might add -
>>>>>> it leaves behind a lock file that prevents it from starting again until
>>>>>> I manually remove said lock file), but the slave is never promoted to
>>>>> 
>>>>> This definitely needs to be corrected. What creates the lock file, and
>>>>> how is that entity managed?
>>>> 
>>>> The lock file entity is created/managed by the postgresql process itself.
>>>> On launch, postgres creates the lock file to say it is running, and
>>>> deletes said lock file when it shuts down. To my understanding, its role
>>>> in life is to prevent a restart after an unclean shutdown so the admin is
>>>> reminded to make sure that the data is in a consistent state before
>>>> starting the server again.  
>>> 
>>> What is the name of this lock file? Where is it?
>>> 
>>> PostgreSQL does not create lock file. It creates a "postmaster.pid" file,
>>> but it does not forbid a startup if the new process doesn't find another
>>> process with the pid and shm shown in the postmaster.pid.
>>> 
>>> As far as I know, the pgsql resource agent create such a lock file on
>>> promote and delete it on graceful stop. If the PostgreSQL instance couldn't
>>> be stopped correctly, the lock files stays and the RA refuse to start it
>>> the next time.  
>> 
>> Ah, you're right. Looking at the RA I see where it creates the file in
>> question. The delete appears to be in the pgsql_real_stop() function (which
>> makes sense), wrapped in an if block that checks for $1 being master and
>> $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
>> debugging code in there I see that when it hits that block on a cluster stop,
>> $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net
>> <http://centtest1.ravnalaska.net/>, not a space, so the lock file is not
>> removed:
>> 
>>if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " "
>> ]; then ocf_log info "Removing $PGSQL_LOCK."
>>rm -f $PGSQL_LOCK
>>fi 
>> 
>> It doesn't look like there is anywhere else where the file would be removed.
> 
> This is quite wrong to me for two reasons (I'll try to be clear):
> 
> 1) the resource agent (RA) make sure the timeline (TL) will not be incremented
> during promotion.
> 
> As there is no documentation about that, I'm pretty sure this contortion comes
> from limitations in very old versions of PostgreSQL (<= 9.1):
> 
>  * a slave wasn't able to cross a timeline (TL) from streaming replication,
>only from WAL archives. That means crossing a TL was requiring to restart
>the slave or cutting the streaming rep temporary to force it to get back to
>the archives
>  * moreover, it was possible a standby miss some transactions on after a clean
>master shutdown. That means the old master couldn't get back to the
>cluster as a slave safely, as the TL is still the same...
> 
> See slide 35->37: 
> http://www.slideshare.net/takmatsuo/2012929-pg-study-16012253
> 
> In my understanding, that's why we make sure there's no slave around before
> shutting down the master: should 

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-13 Thread Israel Brewster
On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais <j...@dalibo.com> 
wrote:
> 
> On Thu, 13 Oct 2016 10:05:33 -0800
> Israel Brewster <isr...@ravnalaska.net> wrote:
> 
>> On Oct 13, 2016, at 9:41 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>> 
>>> On 10/13/2016 12:04 PM, Israel Brewster wrote:  
> [...]
> 
>>>> But whatever- this is a cluster, it doesn't really matter which node
>>>> things are running on, as long as they are running. So the cluster is
>>>> working - postgresql starts, the master process is on the same node as
>>>> the IP, you can connect, etc, everything looks good. Obviously the next
>>>> thing to try is failover - should the master node fail, the slave node
>>>> should be promoted to master. So I try testing this by shutting down the
>>>> cluster on the primary server: "pcs cluster stop"
>>>> ...and nothing happens. The master shuts down (uncleanly, I might add -
>>>> it leaves behind a lock file that prevents it from starting again until
>>>> I manually remove said lock file), but the slave is never promoted to  
>>> 
>>> This definitely needs to be corrected. What creates the lock file, and
>>> how is that entity managed?  
>> 
>> The lock file entity is created/managed by the postgresql process itself. On
>> launch, postgres creates the lock file to say it is running, and deletes said
>> lock file when it shuts down. To my understanding, its role in life is to
>> prevent a restart after an unclean shutdown so the admin is reminded to make
>> sure that the data is in a consistent state before starting the server again.
> 
> What is the name of this lock file? Where is it?
> 
> PostgreSQL does not create lock file. It creates a "postmaster.pid" file, but
> it does not forbid a startup if the new process doesn't find another process
> with the pid and shm shown in the postmaster.pid.
> 
> As far as I know, the pgsql resource agent create such a lock file on promote
> and delete it on graceful stop. If the PostgreSQL instance couldn't be stopped
> correctly, the lock files stays and the RA refuse to start it the next time.

Ah, you're right. Looking at the RA I see where it creates the file in 
question. The delete appears to be in the pgsql_real_stop() function (which 
makes sense), wrapped in an if block that checks for $1 being master and 
$OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little 
debugging code in there I see that when it hits that block on a cluster stop, 
$OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net, not a 
space, so the lock file is not 
removed:

if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; 
then
ocf_log info "Removing $PGSQL_LOCK."
rm -f $PGSQL_LOCK
fi 

It doesn't look like there is anywhere else where the file would be removed.
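
The debugging mentioned above can be as simple as logging the variable just before that test; an illustrative line:

    ocf_log info "notify_slave_uname='$OCF_RESKEY_CRM_meta_notify_slave_uname'"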

> 
> [...]
>>>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
> 
> I can not find the result of the stop operation in your log files, maybe the
> log from CentTest2 would be more useful.

Sure. I was looking at centtest1 because I was trying to figure out why it 
wouldn't promote, but if centtest2 never really stopped (properly) that could 
explain things. Here's the log from 2 when calling pcs cluster stop:

Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sending flush op to all hosts 
for: standby (true)
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sent update 26: standby=true
Oct 13 14:05:14 CentTest2 pacemaker: Waiting for shutdown of managed resources
Oct 13 14:05:14 CentTest2 crmd[9426]:   notice: Operation pgsql_96_notify_0: ok 
(node=centtest2.ravnalaska.net, call=21, rc=0, cib-update=0, confirmed=true)
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sending flush op to all hosts 
for: master-pgsql_96 (-INFINITY)
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sent update 28: 
master-pgsql_96=-INFINITY
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sending flush op to all hosts 
for: pgsql_96-master-baseline ()
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sent delete 30: 
node=centtest2.ravnalaska.net, attr=pgsql_96-master-baseline, id=, 
set=(null), section=status
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sent delete 32: 
node=centtest2.ravnalaska.net, attr=pgsql_96-master-baseline, id=, 
set=(null), section=status
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: Stopping PostgreSQL on 
demote.
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: stop_escalate(or 
stop_escalate_in_slave) time is adjusted to 50 based on the configured timeout.
Oct 13 14:05:14 Cen

[ClusterLabs] Replicated PGSQL woes

2016-10-13 Thread Israel Brewster
Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql 
starts initially, but failover never happens.

Details:

I'm trying to get a cluster set up with Postgresql 9.6 in a streaming 
replication using named slots scenario. I'm using the latest pgsql Resource 
Agent, which does appear to support the named replication slot feature, and 
I've pulled in the various utility functions the RA uses that weren't 
available in my base install, so the RA itself no longer gives me errors.

Setup: Two machines, centtest1 and centtest2. Both are running CentOS 6.8. 
Centtest1 has an IP of 10.211.55.100, and centtest2 has an IP of 
10.211.55.101. The cluster is set up and functioning, with a shared virtual 
IP resource at 10.211.55.200. Postgresql has been set up and tested 
functioning properly on both nodes with centtest1 as the master and 
centtest2 as the streaming replica slave. I then set up the postgresql 
master/slave resource using the following commands:

pcs resource create pgsql_96 pgsql \
    pgctl="/usr/pgsql-9.6/bin/pg_ctl" \
    logfile="/var/log/pgsql/test2.log" \
    psql="/usr/pgsql-9.6/bin/psql" \
    pgdata="/pgsql96/data" \
    rep_mode="async" \
    repuser="postgres" \
    node_list="tcentest1.ravnalaska.net centtest2.ravnalaska.net" \
    master_ip="10.211.55.200" \
    archive_cleanup_command="" \
    restart_on_promote="true" \
    replication_slot_name="centtest_2_slot" \
    monitor_user="postgres" \
    monitor_password="SuperSecret" \
    op start timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="4s" on-fail="restart" \
    op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \
    op promote timeout="60s" interval="0s" on-fail="restart" \
    op demote timeout="60s" interval="0s" on-fail=stop \
    op stop timeout="60s" interval="0s" on-fail="block" \
    op notify timeout="60s" interval="0s";

pcs resource master msPostgresql pgsql_96 master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
pcs constraint colocation add virtual_ip with Master msPostgresql INFINITY
pcs constraint order promote msPostgresql then start virtual_ip symmetrical=false score=INFINITY
pcs constraint order demote msPostgresql then stop virtual_ip symmetrical=false score=0

My preference would be that the master runs on centtest1, so I add the 
following constraint as well:

pcs constraint location --master msPostgresql prefers centtest1.ravnalaska.net=50

When I then start the cluster, I first see *both* machines come up as 
"slave", which I feel is somewhat odd, however the cluster software quickly 
figures things out and promotes centtest2 to master. I've tried this a 
dozen different times, and it *always* promotes centtest2 to master - even 
if I put INFINITY in for the location constraint.

But whatever - this is a cluster, it doesn't really matter which node things 
are running on, as long as they are running. So the cluster is working - 
postgresql starts, the master process is on the same node as the IP, you 
can connect, etc, everything looks good. Obviously the next thing to try is 
failover - should the master node fail, the slave node should be promoted 
to master. So I try testing this by shutting down the cluster on the 
primary server: "pcs cluster stop"

...and nothing happens. The master shuts down (uncleanly, I might add - it 
leaves behind a lock file that prevents it from starting again until I 
manually remove said lock file), but the slave is never promoted to master. 
Neither pcs status nor crm_mon show any errors, but centtest1 never becomes 
master.

If instead of stopping the cluster on centtest2, I try to simply move the 
master using the command "pcs resource move --master msPostgresql", I first 
run into the aforementioned unclean shutdown issue (lock file left behind 
that has to be manually removed), and after removing the lock file, I wind 
up with *both* nodes being slaves, and no master node. "pcs resource clear 
--master msPostgresql" re-promotes centtest2 to master.

What it looks like is that for some reason pacemaker/corosync is absolutely 
refusing to ever make centtest1 a master - even when I explicitly tell it 
to, or when it is the only node left.

Looking at the messages log when I do the node shutdown test I see this:

Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: On loss of CCM Quorum: Ignore
Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Stop    virtual_ip#011(centtest2.ravnalaska.net)
Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Demote  pgsql_96:0#011(Master -> Stopped centtest2.ravnalaska.net)
Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Calculated Transition 193: /var/lib/pacemaker/pengine/pe-input-500.bz2
Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 43: notify pgsql_96_pre_notify_demote_0 on centtest2.ravnalaska.net
Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 45: notify pgsql_96_pre_notify_demote_0 on 

Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Israel Brewster
On Oct 4, 2016, at 3:38 PM, Digimer <li...@alteeve.ca> wrote:
> 
> On 04/10/16 07:09 PM, Israel Brewster wrote:
>> On Oct 4, 2016, at 3:03 PM, Digimer <li...@alteeve.ca> wrote:
>>> 
>>> On 04/10/16 06:50 PM, Israel Brewster wrote:
>>>> On Oct 4, 2016, at 2:26 PM, Ken Gaillot <kgail...@redhat.com> wrote:
>>>>> 
>>>>> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>>>>>> I sent this a week ago, but never got a response, so I'm sending it
>>>>>> again in the hopes that it just slipped through the cracks. It seems to
>>>>>> me that this should just be a simple mis-configuration on my part
>>>>>> causing the issue, but I suppose it could be a bug as well.
>>>>>> 
>>>>>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>>>>>> 6.8. One cluster is simply sharing an IP, while the other one has
>>>>>> numerous services and IP's set up between the two machines in the
>>>>>> cluster. Both appear to be working fine. However, I was poking around
>>>>>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>>>>>> and fenced were using "significant" amounts of processing power - 25%
>>>>>> for corosync on the current primary node, with fenced and stonithd often
>>>>>> showing 1-2% (not horrible, but more than any other process). In looking
>>>>>> at my logs, I see that they are dumping messages like the following to
>>>>>> the messages log every second or two:
>>>>>> 
>>>>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>>>>>> No match for //@st_delegate in /st-reply
>>>>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>>>>>> Operation reboot of fai-dbs1 by fai-dbs2 for
>>>>>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>>>>>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>>>>>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>>>>>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>>>>>> stonith_admin.cman.15835
>>>>>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>>>>>> fai-dbs2 (reset)
>>>>> 
>>>>> The above shows that CMAN is asking pacemaker to fence a node. Even
>>>>> though fencing is disabled in pacemaker itself, CMAN is configured to
>>>>> use pacemaker for fencing (fence_pcmk).
>>>> 
>>>> I never did any specific configuring of CMAN. Perhaps that's the
>>>> problem? Did I miss some configuration steps on setup? I just followed
>>>> the directions here:
>>>> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
>>>> which disabled stonith in pacemaker via the
>>>> "pcs property set stonith-enabled=false" command. Are there separate CMAN
>>>> configs I need to do to get everything copacetic? If so, can you point
>>>> me to some sort of guide/tutorial for that?
>>> 
>>> Disabling stonith is not possible in cman, and very ill advised in
>>> pacemaker. This is a mistake a lot of "tutorials" make when the author
>>> doesn't understand the role of fencing.
>>> 
>>> In your case, pcs set up cman to use the fence_pcmk "passthrough" fence
>>> agent, as it should. So when something went wrong, corosync detected it,
>>> informed cman which then requested pacemaker to fence the peer. With
>>> pacemaker not having stonith configured and enabled, it could do
>>> nothing. So pacemaker returned that the fence failed and cman went into
>>> an infinite loop trying again and again to fence (as it should have).
>>> 
>>> You must configure stonith (exactly how depends on your hardware), then
>>> enable stonith in pacemaker.
>>> 
>> 
>> Gotcha. There is nothing special about the hardware; it's just two physical 
>> boxes connected to the network. So I guess I've got a choice of either a) 
>> live with the logging/load situation (since the system does work perfectly 
>> as-is other than the excessive logging), or b) spend some time researching 
>> stonith to figure out what it does and how to configu
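(For a concrete picture of what "configure stonith, then enable it" means in pcs terms: assuming IPMI-capable boards - purely an assumption, since the hardware is unspecified here, and the addresses and credentials below are placeholders - a minimal two-node setup might look like:

# One fence device per node; each entry tells pacemaker how to
# power-cycle the named host (placeholder IPs and credentials)
pcs stonith create fence-dbs1 fence_ipmilan ipaddr="10.0.0.1" login="admin" passwd="secret" pcmk_host_list="fai-dbs1"
pcs stonith create fence-dbs2 fence_ipmilan ipaddr="10.0.0.2" login="admin" passwd="secret" pcmk_host_list="fai-dbs2"
# Only once real devices exist should fencing be switched back on
pcs property set stonith-enabled=true

With devices like these in place, the fence_pcmk requests coming from CMAN would have something real to act on instead of looping forever.)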

Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Israel Brewster
On Oct 4, 2016, at 2:26 PM, Ken Gaillot <kgail...@redhat.com> wrote:
> 
> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>> I sent this a week ago, but never got a response, so I'm sending it
>> again in the hopes that it just slipped through the cracks. It seems to
>> me that this should just be a simple mis-configuration on my part
>> causing the issue, but I suppose it could be a bug as well.
>> 
>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>> 6.8. One cluster is simply sharing an IP, while the other one has
>> numerous services and IP's set up between the two machines in the
>> cluster. Both appear to be working fine. However, I was poking around
>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>> and fenced were using "significant" amounts of processing power - 25%
>> for corosync on the current primary node, with fenced and stonithd often
>> showing 1-2% (not horrible, but more than any other process). In looking
>> at my logs, I see that they are dumping messages like the following to
>> the messages log every second or two:
>> 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>> Operation reboot of fai-dbs1 by fai-dbs2 for
>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>> stonith_admin.cman.15835
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>> fai-dbs2 (reset)
> 
> The above shows that CMAN is asking pacemaker to fence a node. Even
> though fencing is disabled in pacemaker itself, CMAN is configured to
> use pacemaker for fencing (fence_pcmk).

I never did any specific configuring of CMAN. Perhaps that's the problem? Did 
I miss some configuration steps on setup? I just followed the directions here: 
http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
which disabled stonith in pacemaker via the "pcs property set 
stonith-enabled=false" command. Are there separate CMAN configs I need to do 
to get everything copacetic? If so, can you point me to some sort of 
guide/tutorial for that?
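For reference, if I'm reading the Pacemaker 1.1 docs right, pcs should already have generated an /etc/cluster/cluster.conf that routes all fencing through fence_pcmk - something roughly like the sketch below (reconstructed from the docs, not copied from my machines):

<cluster config_version="1" name="dbs_cluster">
  <clusternodes>
    <clusternode name="fai-dbs1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="fai-dbs1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fai-dbs2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="fai-dbs2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>

If that's right, the CMAN side shouldn't need hand-editing, and whatever is missing would be on the pacemaker side.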

> 
>> Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args:
>> Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request:
>> Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot)
>> 'fai-dbs2' with device '(any)'
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
>> initiate_remote_stonith_op: Initiating remote operation reboot for
>> fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
>> stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:    error: remote_op_done:
>> Operation reboot of fai-dbs2 by fai-dbs1 for
>> stonith_admin.cman.15394@fai-dbs1.bc3f5d73: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No
>> such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client
>> stonith_admin.cman.15394
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2
>> (reset) failed with rc=237
>> 
>> After seeing this on the one cluster, I checked the logs on the other
>> and sure enough I'm seeing the same thing there. As I mentioned, both
>> nodes in both clusters *appear* to be operating correctly. For example,
>> the output of "pcs status" on the small cluster is this:
>> 
>> [root@fai-dbs1 ~]# pcs status
>> Cluster name: dbs_cluster
>> Last updated: Tue Sep 27 08:59:44 2016
>> Last change: Thu Mar  3 06:11:00 2016
>> Stack: cman
>> Current DC: fai-dbs1 - partition with quorum
>> Version: 1.1.11-97629de
>> 2 Nodes configured
>> 1 Resources configured
>> 
>> 
>> Online: [ fai-dbs1 fai-dbs2 ]
>> 
>> Full list of resources:
>> 
>> virtual_ip(ocf::heartbeat:IPaddr2):Started fai-dbs1
>> 

[ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Israel Brewster
I sent this a week ago, but never got a response, so I'm sending it again in the hopes that it just slipped through the cracks. It seems to me that this should just be a simple mis-configuration on my part causing the issue, but I suppose it could be a bug as well.

I have two two-node clusters set up using corosync/pacemaker on CentOS 6.8. One cluster is simply sharing an IP, while the other one has numerous services and IPs set up between the two machines in the cluster. Both appear to be working fine. However, I was poking around today, and I noticed that on the single-IP cluster, corosync, stonithd, and fenced were using "significant" amounts of processing power - 25% for corosync on the current primary node, with fenced and stonithd often showing 1-2% (not horrible, but more than any other process). In looking at my logs, I see that they are dumping messages like the following to the messages log every second or two:

Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done: Operation reboot of fai-dbs1 by fai-dbs2 for stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify: Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client stonith_admin.cman.15835
Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence fai-dbs2 (reset)
Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args: Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request: Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot) 'fai-dbs2' with device '(any)'
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:    error: remote_op_done: Operation reboot of fai-dbs2 by fai-dbs1 for stonith_admin.cman.15394@fai-dbs1.bc3f5d73: No such device
Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify: Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client stonith_admin.cman.15394
Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2 (reset) failed with rc=237

After seeing this on the one cluster, I checked the logs on the other, and sure enough I'm seeing the same thing there. As I mentioned, both nodes in both clusters *appear* to be operating correctly. For example, the output of "pcs status" on the small cluster is this:

[root@fai-dbs1 ~]# pcs status
Cluster name: dbs_cluster
Last updated: Tue Sep 27 08:59:44 2016
Last change: Thu Mar  3 06:11:00 2016
Stack: cman
Current DC: fai-dbs1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured
1 Resources configured

Online: [ fai-dbs1 fai-dbs2 ]

Full list of resources:

 virtual_ip	(ocf::heartbeat:IPaddr2):	Started fai-dbs1

And on the larger cluster, it has services running across both nodes, and I've been able to move stuff back and forth without issue. Both nodes have the stonith-enabled property set to false, and no-quorum-policy set to ignore (since there are only two nodes in the cluster).

What could be causing the log messages? Is the CPU usage normal, or might there be something I can do about that as well? Thanks.
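In case the exact configuration state matters, here are the checks I can run on both clusters (standard pcs commands; happy to post their full output if it would help):

pcs stonith show     # expect: no stonith devices configured
pcs property list    # shows stonith-enabled=false and no-quorum-policy=ignore as set above
grep -cE 'fence_pcmk|stonith-ng' /var/log/messages    # rough count of the fence/stonith chatter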
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

