[ClusterLabs] Two-node cluster stops resources when second node is running alone

2020-02-22 Thread Reynolds, John F - San Mateo, CA - Contractor
I have a two-node cluster, meqfc0 and meqfc1.  When both nodes are up, the 
cluster will run OK on either meqfc0 or meqfc1.   

My practice for OS patching is to patch the inactive node, migrate, then patch 
the formerly active node.  Patching requires a reboot.

The cluster has run peacefully with meqfc0 active, so I patched and rebooted 
meqfc1.  The cluster stayed active.

Then I migrated the resources to meqfc1.  The cluster stabilized and ran OK.

I patched and rebooted meqfc0.  As soon as it shut down, all the cluster 
resources on meqfc1 stopped.  The cluster was still up; crm status listed 
meqfc1 as online and meqfc0 as offline.  All the resources showed Stopped on 
meqfc1.

When meqfc0 finished rebooting and rejoined the cluster, the resources migrated 
themselves over to meqfc0 and started up.

This does not make sense to me: the cluster can run as A+B, or as A alone, but not as B alone.
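
One pattern that produces exactly this behaviour is quorum: in a two-node
cluster, the moment one node goes away the survivor holds only one of two
votes, and with the default no-quorum-policy=stop Pacemaker stops every
resource it is running.  Corosync's votequorum has a switch for the two-node
case.  A sketch of the relevant corosync.conf stanza, assuming corosync 2.x
(values are illustrative, not taken from this cluster):

    # /etc/corosync/corosync.conf
    quorum {
        provider: corosync_votequorum
        two_node: 1    # lets the surviving node keep quorum when its peer is down
    }

The Pacemaker-level alternative is 'crm configure property
no-quorum-policy=ignore', but two_node is the usual choice on corosync 2.x.
Whether quorum really is the culprit here would show up in the logs from the
moment meqfc0 went down.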

Basic configuration from cib.xml:

<cib crm_feature_set="3.0.13" validate-with="pacemaker-2.7" epoch="93" 
num_updates="0" admin_epoch="0" cib-last-written="Wed Feb 19 15:36:41 2020" 
update-origin="meqfc1" update-client="crmd" update-user="hacluster" 
have-quorum="1" dc-uuid="1">

  [the XML body of the excerpt was stripped by the list archive; the original
  post noted "about 200 lines removed" from the middle of it]

There is one anomalous entry in cib.xml, the line:

[the quoted line did not survive the archive's XML stripping]
That syntax is wrong, and there should be an opening and closing constraint, 
shouldn't there?
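
For what it's worth, an empty element is legal XML in the CIB, so a constraint
written as a single self-closing tag does not need a separate opening and
closing pair.  A minimal sketch of a location constraint in that style (the id,
resource and node names are made up for illustration):

    <rsc_location id="loc-grp_example-prefer-meqfc0" rsc="grp_example" node="meqfc0" score="100"/>

Whether the entry is semantically what was intended is another matter, but a
one-line constraint is not by itself a syntax error.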

Please advise, thank you.

John Reynolds 


Re: [ClusterLabs] SLES cluster join fails with TLS handshake error

2020-01-01 Thread Reynolds, John F - San Mateo, CA - Contractor
I have reworked csync2's SSL keys, and I was able to use ha-cluster-join to add 
the second node to the cluster.   Thank you for the guidance!
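
For anyone who hits the same GNUTLS_E_FILE_ERROR during a join, "reworking the
SSL keys" comes down to regenerating the certificate/key pair csync2 expects.
A sketch, assuming the default paths named in the error message below (the
openssl options are mine, not from the original fix):

    # regenerate a self-signed pair for csync2 on the node that reported the error
    openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
        -subj "/CN=$(hostname -f)" \
        -keyout /etc/csync2/csync2_ssl_key.pem \
        -out /etc/csync2/csync2_ssl_cert.pem
    chmod 600 /etc/csync2/csync2_ssl_key.pem

    # then re-run the sync to confirm the TLS handshake succeeds
    csync2 -xv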


However, not all the resources are happy with this.  

eagnmnmeqfc1:/var/lib/pacemaker/cib # crm status
Stack: corosync
Current DC: eagnmnmeqfc0 (version 1.1.16-4.8-77ea74d) - partition with quorum
Last updated: Tue Dec 31 15:16:14 2019
Last change: Tue Dec 31 15:01:34 2019 by hacluster via crmd on eagnmnmeqfc0

2 nodes configured
16 resources configured

Online: [ eagnmnmeqfc0 eagnmnmeqfc1 ]

Full list of resources:

 Resource Group: grp_ncoa
 ncoa_dg_mqm(ocf::heartbeat:LVM):   Started eagnmnmeqfc0
 ncoa_dg_a01(ocf::heartbeat:LVM):   Started eagnmnmeqfc0
 ncoa_dg_a02(ocf::heartbeat:LVM):   Started eagnmnmeqfc0
 ncoa_dg_a03(ocf::heartbeat:LVM):   Started eagnmnmeqfc0
 ncoa_dg_a04(ocf::heartbeat:LVM):   Started eagnmnmeqfc0
 ncoa_dg_a05(ocf::heartbeat:LVM):   Started eagnmnmeqfc0
 ncoa_mqm   (ocf::heartbeat:Filesystem):Started eagnmnmeqfc0
 ncoa_a01shared (ocf::heartbeat:Filesystem):Started eagnmnmeqfc0
 ncoa_a02shared (ocf::heartbeat:Filesystem):Started eagnmnmeqfc0
 ncoa_a03shared (ocf::heartbeat:Filesystem):Started eagnmnmeqfc0
 ncoa_a04shared (ocf::heartbeat:Filesystem):Started eagnmnmeqfc0
 ncoa_a05shared (ocf::heartbeat:Filesystem):Started eagnmnmeqfc0
 IP_56.76.161.36(ocf::heartbeat:IPaddr2):   Started eagnmnmeqfc0
 ncoa_apache(systemd:apache2):  Started eagnmnmeqfc0
 ncoa_dg_a00(ocf::heartbeat:LVM):   FAILED [ eagnmnmeqfc0 eagnmnmeqfc1 ]
 ncoa_a00shared (ocf::heartbeat:Filesystem):FAILED eagnmnmeqfc0 (blocked)

Failed Actions:
* ncoa_a00shared_stop_0 on eagnmnmeqfc0 'unknown error' (1): call=206, 
status=complete, exitreason='Couldn't unmount /ncoa/qncoa/a00shared, giving 
up!',
last-rc-change='Tue Dec 31 15:01:35 2019', queued=0ms, exec=7478ms
* ncoa_dg_a00_monitor_0 on eagnmnmeqfc1 'unknown error' (1): call=141, 
status=complete, exitreason='WARNING: vg_qncoa_noncloned-a00 is active without 
the cluster tag, "pacemaker"',
last-rc-change='Tue Dec 31 15:01:34 2019', queued=0ms, exec=287ms

eagnmnmeqfc1:/var/lib

The PV and VG are present on both servers.  The resource is defined in cib.xml 
as:

[the resource definition XML was stripped by the list archive]

Which is, as far as I can see, the same as one of the resources that is 
working:

[that resource's XML was likewise stripped by the list archive]

"crm resource cleanup' doesn’t fix the problem.


Now, the /ncoa/qncoa/a00shared filesystem can't be unmounted because there are 
open files on it.  Could the problem simply be that the cluster join wanted to 
unmount and remount all the disk resources and, since it couldn't, flagged that 
as an error?
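
A sketch of what I'd check on both nodes before retrying, assuming the stock
LVM and Filesystem agents (the VG name is taken from the error above; the
commands themselves are generic):

    # see what is holding the filesystem open -- the reason the stop/unmount failed
    fuser -vm /ncoa/qncoa/a00shared

    # check whether the VG carries the cluster tag the LVM agent complains about
    vgs -o vg_name,vg_tags

    # if the VG was activated outside the cluster on this node, deactivate it
    vgchange -an vg_qncoa_noncloned-a00

    # then let the cluster forget the failures and re-probe
    crm resource cleanup ncoa_dg_a00
    crm resource cleanup ncoa_a00shared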


Thank you.

John Reynolds

[ClusterLabs] SLES cluster join fails with TLS handshake error

2019-11-26 Thread Reynolds, John F - San Mateo, CA - Contractor
Hello.

I am trying to set up a two-node cluster of SLES12SP4 servers.  The two nodes 
are named 'eagnmnmeqfc0', IP 56.76.161.34, and 'eagnmnmeqfc1', IP 56.76.161.35

The ha-cluster-init on fc0 went fine.  It is set up for unicast, as multicast 
is blocked on our networks.

The cluster-join on fc1 failed.  It looks OK, but at the end, there is a TLS 
handshake error.  The log is:


eagnmnmeqfc1:/var/log # cat  ha-cluster-bootstrap.log
+ systemctl reload rsyslog.service

2019-11-25 15:28:52-06:00 /usr/sbin/crm cluster join -c 56.76.161.34 
--interface=bond0 -y

+ systemctl enable sshd.service
+ mkdir -m 700 -p /root/.ssh
# Retrieving SSH keys - This may prompt for root@56.76.161.34:
+ scp -oStrictHostKeyChecking=no  root@56.76.161.34:'/root/.ssh/id_*' 
/tmp/crmsh_IlBXAY/
[login header]
+ mv /tmp/crmsh_IlBXAY/id_rsa* /root/.ssh/
+ cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
# One new SSH key installed
+ ssh root@56.76.161.34 ha-cluster-init ssh_remote
Done (log saved to /var/log/ha-cluster-bootstrap.log)
[login header]
# Configuring csync2
+ rm -f /var/lib/csync2/eagnmnmeqfc1.db3
+ ssh root@56.76.161.34 ha-cluster-init csync2_remote eagnmnmeqfc1
Done (log saved to /var/log/ha-cluster-bootstrap.log)
[login header]
+ scp root@56.76.161.34:'/etc/csync2/{csync2.cfg,key_hagroup}' /etc/csync2
[login header]
+ systemctl enable csync2.socket
+ ssh root@56.76.161.34 "csync2 -mr / ; csync2 -fr / ; csync2 -xv"
[login header]
Marking file as dirty: /etc/corosync/authkey
Connecting to host eagnmnmeqfc1 (SSL) ...
Connect to 56.76.161.35:30865 (eagnmnmeqfc1).
SSL: failed to use key file /etc/csync2/csync2_ssl_key.pem and/or certificate 
file /etc/csync2/csync2_ssl_cert.pem: Error while reading file. 
(GNUTLS_E_FILE_ERROR)
WARNING: csync2 run failed - some files may not be sync'd
# Merging known_hosts
parallax.call ['eagnmnmeqfc0', 'eagnmnmeqfc1'] : [ -e /root/.ssh/known_hosts ] 
&& cat /root/.ssh/known_hosts || true
parallax.copy ['eagnmnmeqfc0', 'eagnmnmeqfc1'] : 56.76.161.35 
ecdsa-sha2-nistp256 
E2VjZHNhLXNoYTItbmlzdHAyNTYIbmlzdHAyNTYAAABBBA1NplEqVWzby0/wwQED0s8wPrNhk0zzkZz4NIWOlU/Z4td75heNmPgpEhh5z6i9Jdc3hWnuhPbiP9Wso5qsJMs=
eagnmnmeqfc0,56.76.161.34 ecdsa-sha2-nistp256 
E2VjZHNhLXNoYTItbmlzdHAyNTYIbmlzdHAyNTYAAABBBA1NplEqVWzby0/wwQED0s8wPrNhk0zzkZz4NIWOlU/Z4td75heNmPgpEhh5z6i9Jdc3hWnuhPbiP9Wso5qsJMs=
eagnmnmeqfc1 ecdsa-sha2-nistp256 
E2VjZHNhLXNoYTItbmlzdHAyNTYIbmlzdHAyNTYAAABBBA1NplEqVWzby0/wwQED0s8wPrNhk0zzkZz4NIWOlU/Z4td75heNmPgpEhh5z6i9Jdc3hWnuhPbiP9Wso5qsJMs=
eagnmnmeqfc1,56.76.161.35 ecdsa-sha2-nistp256 
E2VjZHNhLXNoYTItbmlzdHAyNTYIbmlzdHAyNTYAAABBBA1NplEqVWzby0/wwQED0s8wPrNhk0zzkZz4NIWOlU/Z4td75heNmPgpEhh5z6i9Jdc3hWnuhPbiP9Wso5qsJMs=
eagnmnmeqfca,56.76.161.44 ecdsa-sha2-nistp256 
E2VjZHNhLXNoYTItbmlzdHAyNTYIbmlzdHAyNTYAAABBBA1NplEqVWzby0/wwQED0s8wPrNhk0zzkZz4NIWOlU/Z4td75heNmPgpEhh5z6i9Jdc3hWnuhPbiP9Wso5qsJMs=
eagnmnmeqfcb,56.76.161.45 ecdsa-sha2-nistp256 
E2VjZHNhLXNoYTItbmlzdHAyNTYIbmlzdHAyNTYAAABBBA1NplEqVWzby0/wwQED0s8wPrNhk0zzkZz4NIWOlU/Z4td75heNmPgpEhh5z6i9Jdc3hWnuhPbiP9Wso5qsJMs=
# Probing for new partitions...
+ partprobe /dev/sde /dev/sdf /dev/sdb /dev/sdc /dev/sda /dev/sdd /dev/sdg 
/dev/sdm /dev/sdn /dev/sdq /dev/sdr /dev/sdh /dev/sdk /dev/sdi /dev/sdl 
/dev/sdp /dev/sds /dev/sdj /dev/sdu /dev/sdt /dev/sdv /dev/sdo /dev/sdx 
/dev/sdw /dev/mapper/3697197200928533030333644 
/dev/mapper/3697197200928533030324134 
/dev/mapper/3697197200928533030324135 
/dev/mapper/3697197200928533030333645 
/dev/mapper/3697197200498533031374344 
/dev/mapper/3697197200498533030324637 
/dev/mapper/3697197200498533030324639 
/dev/mapper/3697197200498533030324638 /dev/sdy /dev/sdz /dev/sdaa 
/dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf 
/dev/mapper/vg_qncoa_noncloned--a00-lv_a00shared 
/dev/mapper/vg_rootdisk-lv_export /dev/mapper/vg_rootdisk-lv_patrol 
/dev/mapper/vg_rootdisk-lv_root /dev/mapper/vg_rootdisk-lv_swap 
/dev/mapper/vg_rootdisk-lv_var /dev/mapper/vg_rootdisk-lv_var_log
# done
+ mkdir -p /ncoa/qncoa/a00shared
+ mkdir -p /mqm/qncoa/u00
+ mkdir -p /ncoa/qncoa/a01shared
+ mkdir -p /ncoa/qncoa/a02shared
+ mkdir -p /ncoa/qncoa/a03shared
+ mkdir -p /ncoa/qncoa/a04shared
+ mkdir -p /ncoa/qncoa/a05shared
+ ssh root@56.76.161.34 systemctl is-enabled sbd.service
disabled
[login header]
+ rm -f /var/lib/heartbeat/crm/* /var/lib/pacemaker/cib/*
+ systemctl enable hawk.service
+ systemctl start hawk.service
#   Hawk cluster interface is now running. To see cluster status, open:
# https://56.76.161.35:7630/
#   Log in with username 'hacluster'
+ systemctl disable sbd.service
+ systemctl enable pacemaker.service
+ systemctl start pacemaker.service
# Waiting for cluster...
# done
+ csync2 -rm /etc/corosync/corosync.conf
+ csync2 -rf /etc/corosync/cor

Re: [ClusterLabs] Apache doesn't start under corosync with systemd

2019-10-18 Thread Reynolds, John F - San Mateo, CA - Contractor
With respect, I've given up on the ocf:heartbeat:apache module.

I've set up my Apache resource with:

# systemctl disable apache2
# crm configure primitive ncoa_apache systemd:apache2
# crm configure modgroup grp_ncoa add ncoa_apache

# crm configure show ncoa_apache
primitive ncoa_apache systemd:apache2
#

Apache doesn't start until the cluster is up.  When the cluster starts, 
Apache starts up on the active node.  The webserver migrates with the cluster 
when I move it from one node to another.  That's really all I want.

Tell me why this is a bad idea.

What options or other configurations should I add to the primitive?  Please 
give the command syntax.
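
Nothing about the approach is inherently bad; systemd is a first-class resource
class in Pacemaker.  The one gap in a bare 'primitive ncoa_apache
systemd:apache2' is that no recurring monitor is defined, so a crashed unit is
only noticed on a probe.  A sketch of what I'd add (interval and timeout values
are suggestions, not requirements):

    crm configure primitive ncoa_apache systemd:apache2 \
        op start timeout=100s \
        op stop timeout=100s \
        op monitor interval=30s timeout=100s
    crm configure modgroup grp_ncoa add ncoa_apache

Systemd-class resources generally want generous start/stop timeouts, since
Pacemaker has to wait for systemd to report the unit's final state.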

John Reynolds




Re: [ClusterLabs] Apache doesn't start under corosync with systemd

2019-10-16 Thread Reynolds, John F - San Mateo, CA - Contractor


> From: Ken Gaillot [mailto:kgail...@redhat.com] 
>Sent: Monday, October 14, 2019 12:02 PM
>
>If you have SELinux enabled, check for denials. The cluster processes have a 
>different SELinux context than systemd, so policies might not be set up 
>correctly.
>--
>Ken Gaillot 

Alas, SELinux is not in use.


I am thinking that the apache OCF module is not starting up apache with the 
modules that it needs.  

 Again, startup with 'systemctl start apache' brings up the http daemons, so we 
know that the Apache configuration is clean.  

But if I enable tracing and run the OCF script by hand:

export OCF_TRACE_RA=1
/usr/lib/ocf/resource.d/heartbeat/apache start ; echo $?

Part of the output is Apache syntax errors that aren't flagged in the regular 
startup:

+ 14:57:10: ocf_run:443: ocf_log err 'AH00526: Syntax error on line 22 of 
/etc/apache2/vhosts.d/aqvslookup.conf: Invalid command '\''Order'\'', perhaps 
misspelled or defined by a module not included in the server configuration '

The 'Allow' and ' AuthLDAPURL' commands are also flagged as invalid.

The APACHE_MODULES parameter in /etc/sysconfig/apache2 includes the relevant modules:

APACHE_MODULES="actions alias auth_basic authn_file authz_host authz_groupfile 
authz_core authz_user autoindex cgi dir env expires include log_config mime 
negotiation setenvif ssl socache_shmcb userdir reqtimeout authn_core php5 
rewrite ldap authnz_ldap status access_compat"


Why are the modules loaded properly when Apache starts from systemctl, but not from the OCF agent?
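
If the difference really is SUSE's sysconfig wrapper -- on SLES the apache2
service normally goes through start_apache2, which turns APACHE_MODULES into
LoadModule directives, while the OCF agent execs httpd directly against the
configured configfile -- then the modules have to be loadable from that config
file on their own.  A sketch of explicit LoadModule lines; the file name
assumes httpd.conf includes /etc/apache2/conf.d/*.conf, and the module paths
are typical SLES locations to verify, not copy:

    # e.g. /etc/apache2/conf.d/cluster-modules.conf
    LoadModule access_compat_module  /usr/lib64/apache2/mod_access_compat.so
    LoadModule ldap_module           /usr/lib64/apache2/mod_ldap.so
    LoadModule authnz_ldap_module    /usr/lib64/apache2/mod_authnz_ldap.so

access_compat covers the 'Order'/'Allow' directives and authnz_ldap (with its
ldap dependency) covers AuthLDAPURL, which matches the errors the trace shows.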

John Reynolds 



Re: [ClusterLabs] Apache doesn't start under corosync with systemd

2019-10-12 Thread Reynolds, John F - San Mateo, CA - Contractor
>  If pacemaker is managing a resource, the service should not be enabled to 
> start on boot (regardless of init or systemd). Pacemaker will start and stop 
> the service as needed according to the cluster configuration.

Apache startup is disabled in systemctl, and there is no apache script in 
/etc/init.d

>Additionally, your pacemaker configuration is using the apache OCF script, so 
>the cluster won't use /etc/init.d/apache2 at all (it invokes the httpd binary 
>directly).
>
>Keep in mind that the httpd monitor action requires the status module to be 
>enabled -- I assume that's already in place.

Yes, that is enabled, according to apache2ctl -M.
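
For reference, apache2ctl -M only proves the status module is loaded; the
agent's monitor also needs a handler location it can reach, and by default it
checks a server-status URL on localhost.  The stanza it expects looks roughly
like this (the file path and access rule are assumptions):

    # e.g. /etc/apache2/conf.d/status.conf
    <Location "/server-status">
        SetHandler server-status
        Require local
    </Location>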


The resource configuration is:

primitive ncoa_apache apache \
    params configfile="/etc/apache2/httpd.conf" \
    op monitor interval=40s timeout=60s \
    meta target-role=Started

When I start the resource, crm status shows it as 'Starting', but it never 
gets to 'Started'.

There is one process running "/bin/sh /usr/lib/ocf/resource.d/heartbeat/apache 
start", but the httpd processes never come up.  What's worse, with that process 
running, the cluster resource can't migrate; I have to kill it before the 
cluster will finish cleanup and start on the new node.  'crm resource cleanup 
ncoa_apache' hangs as well.

Apache starts up just fine from the systemctl command, so it's not the Apache 
config that's broken.

Suggestions?

John Reynolds SMUnix






[ClusterLabs] Apache doesn't start under corosync with systemd

2019-10-04 Thread Reynolds, John F - San Mateo, CA - Contractor
Good morning.

I've just upgraded a two-node active-passive cluster from SLES11 to SLES12.  
This means that I've gone from /etc/init.d scripts to systemd services.

On the SLES11 server, this worked:

[the SLES11 resource definition (XML) was stripped by the list archive]

I had to tweak /etc/init.d/apache2 to make sure it only started on the active 
node, but that's OK.

On the SLES12 server, the resource is the same:

[the SLES12 resource definition (XML) was likewise stripped by the list archive]

and the cluster believes the resource is started:


eagnmnmep19c1:/var/lib/pacemaker/cib # crm status
Stack: corosync
Current DC: eagnmnmep19c0 (version 1.1.16-4.8-77ea74d) - partition with quorum
Last updated: Fri Oct  4 09:02:52 2019
Last change: Thu Oct  3 10:55:03 2019 by root via crm_resource on eagnmnmep19c0

2 nodes configured
16 resources configured

Online: [ eagnmnmep19c0 eagnmnmep19c1 ]

Full list of resources:

Resource Group: grp_ncoa
  (edited out for brevity)
 ncoa_a05shared (ocf::heartbeat:Filesystem):Started eagnmnmep19c1
 IP_56.201.217.146  (ocf::heartbeat:IPaddr2):   Started eagnmnmep19c1
 ncoa_apache(ocf::heartbeat:apache):Started eagnmnmep19c1

eagnmnmep19c1:/var/lib/pacemaker/cib #


But the httpd daemons aren't started.  I can start them by hand, but that's not 
what I need.

I have gone through the ClusterLabs and SLES docs for setting up apache 
resources, and through this list's archive, but haven't found my answer.  I'm 
missing something in corosync, apache, or systemd.  Please advise.
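
A sketch of how to separate the cluster's view from the agent's, assuming the
agent itself is what is misreporting (the crm_resource --force-* options exist
in pacemaker 1.1.x, but are worth double-checking on this build):

    # run the agent's start and monitor actions outside the cluster, with output visible
    crm_resource --resource ncoa_apache --force-start
    crm_resource --resource ncoa_apache --force-check

    # or call the agent directly, roughly the way the cluster would
    OCF_TRACE_RA=1 OCF_RESKEY_configfile=/etc/apache2/httpd.conf \
        /usr/lib/ocf/resource.d/heartbeat/apache monitor; echo $?

If those commands fail or hang, the problem is in the agent/Apache interaction
rather than in corosync or systemd.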


John Reynolds, Contractor
San Mateo Unix
