Hi, I've created a 2-node active-active HA cluster with an NFS resource. The resources are active on both nodes. The cluster passes a failover test driven by the pcs standby command, but it does not survive a "real" node shutdown.
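The graceful test that always works boils down to this command sequence (a sketch using my node and resource names):

```shell
# Drain nfsnode2 gracefully, then bring it back.
# Resources migrate off the node, the rsync on client1 keeps running,
# and DRBD resynchronises after unstandby.
pcs cluster standby nfsnode2
pcs status                      # nfsnode2 shows as standby, copy continues
pcs cluster unstandby nfsnode2
drbdadm status storage          # wait for the resync to finish
```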
Test scenario with cluster standby:
- start the cluster
- mount the nfs share on client1
- start copying a file from client1 to the nfs share
- during the copy, put node1/node2 into standby mode (pcs cluster standby nfsnode2)
- the copy continues
- unstandby node1/node2
- the copy continues and the storage re-syncs (drbd)
- the copy finishes with no errors

I can standby and unstandby the cluster many times and it works. The problem begins when I do a "true" failover test by hard-shutting down one of the nodes.

Test results:
- start the cluster
- mount the nfs share on client1
- start copying a file from client1 to the nfs share
- during the copy, shut down node2 by stopping the node's virtual machine (hard stop)
- the system hangs:

<Start copying a file at client1>
# rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/

<Everything works ok; there is a temp file .testfile.dat.9780fH>
[root@nfsnode1 nfs]# ls -lah
total 9,8M
drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
-rw-r--r-- 1 root root    9 07-10 08:20 client1.txt
-rw-r----- 1 root root    0 07-10 11:07 .rmtab
-rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH

[root@nfsnode1 nfs]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 11:07:29 2017
Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
2 nodes and 15 resources configured

Online: [ nfsnode1 nfsnode2 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Masters: [ nfsnode1 nfsnode2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing (stonith:fence_vbox): Started nfsnode1
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
     Started: [ nfsnode1 nfsnode2 ]

<Hard poweroff of vm machine nfsnode2>

[root@nfsnode1 nfs]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 11:07:43 2017
Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
2 nodes and 15 resources configured

Node nfsnode2: UNCLEAN (offline)
Online: [ nfsnode1 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Storage (ocf::linbit:drbd): Master nfsnode2 (UNCLEAN)
     Masters: [ nfsnode1 ]
 Clone Set: dlm-clone [dlm]
     dlm (ocf::pacemaker:controld): Started nfsnode2 (UNCLEAN)
     Started: [ nfsnode1 ]
 vbox-fencing (stonith:fence_vbox): Started nfsnode1
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2 (UNCLEAN)
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
     StorageFS (ocf::heartbeat:Filesystem): Started nfsnode2 (UNCLEAN)
     Started: [ nfsnode1 ]
 Clone Set: WebSite-clone [WebSite]
     WebSite (ocf::heartbeat:apache): Started nfsnode2 (UNCLEAN)
     Started: [ nfsnode1 ]
 Clone Set: nfs-group-clone [nfs-group]
     Resource Group: nfs-group:1
         nfs (ocf::heartbeat:nfsserver): Started nfsnode2 (UNCLEAN)
         nfs-export (ocf::heartbeat:exportfs): Started nfsnode2 (UNCLEAN)
     Started: [ nfsnode1 ]

<ssh console hangs on client1>

[root@nfsnode1 nfs]# ls -lah
<nothing happens>

<drbd status is ok in this situation>
[root@nfsnode1 ~]# drbdadm status
storage role:Primary
  disk:UpToDate
  nfsnode2 connection:Connecting

<the nfs export is still active on node1>
[root@nfsnode1 ~]# exportfs
/mnt/drbd/nfs 10.0.2.0/255.255.255.0

<After ssh to client1 the nfs mount is not accessible>
login as: root
root@127.0.0.1's password:
Last login: Mon Jul 10 07:48:17 2017 from 10.0.2.2
# cd /mnt/
# ls
<console hangs>
# mount
10.0.2.7:/ on /mnt/nfsshare type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.20,local_lock=none,addr=10.0.2.7)

<Power on vm machine nfsnode2>
<After nfsnode2 boots, the console on nfsnode1 starts to respond, but the copy does not proceed>
<The temp file is visible but not active>
[root@nfsnode1 ~]# ls -lah
total 9,8M
drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
-rw-r--r-- 1 root root    9 07-10 08:20 client1.txt
-rw-r----- 1 root root    0 07-10 11:16 .rmtab
-rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH

<Copying at client1 hangs>

<Cluster status:>
[root@nfsnode1 ~]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 11:17:19 2017
Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
2 nodes and 15 resources configured

Online: [ nfsnode1 nfsnode2 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Masters: [ nfsnode1 ]
     Stopped: [ nfsnode2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing (stonith:fence_vbox): Started nfsnode1
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
     Started: [ nfsnode1 ]
     Stopped: [ nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
     Started: [ nfsnode1 ]
     Stopped: [ nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
     Resource Group: nfs-group:0
         nfs (ocf::heartbeat:nfsserver): Started nfsnode1
         nfs-export (ocf::heartbeat:exportfs): FAILED nfsnode1
     Stopped: [ nfsnode2 ]

Failed Actions:
* nfs-export_monitor_30000 on nfsnode1 'unknown error' (1): call=61, status=Timed Out, exitreason='none', last-rc-change='Mon Jul 10 11:11:50 2017', queued=0ms, exec=0ms
* vbox-fencing_monitor_60000 on nfsnode1 'unknown error' (1): call=22, status=Error, exitreason='none', last-rc-change='Mon Jul 10 11:06:41 2017', queued=0ms, exec=11988ms

<Try to clean up>
# pcs resource cleanup
# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 11:20:38 2017
Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
2 nodes and 15 resources configured

Online: [ nfsnode1 nfsnode2 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Masters: [ nfsnode1 ]
     Stopped: [ nfsnode2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing (stonith:fence_vbox): Stopped
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Stopped
 Clone Set: StorageFS-clone [StorageFS]
     Stopped: [ nfsnode1 nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
     Stopped: [ nfsnode1 nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
     Stopped: [ nfsnode1 nfsnode2 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

<Reboot of both nfsnode1 and nfsnode2>
<After reboot:>
# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 11:24:10 2017
Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
2 nodes and 15 resources configured

Online: [ nfsnode1 nfsnode2 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Slaves: [ nfsnode2 ]
     Stopped: [ nfsnode1 ]
 Clone Set: dlm-clone [dlm]
     Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing (stonith:fence_vbox): Stopped
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Stopped
 Clone Set: StorageFS-clone [StorageFS]
     Stopped: [ nfsnode1 nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
     Stopped: [ nfsnode1 nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
     Stopped: [ nfsnode1 nfsnode2 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

<Eventually the cluster was recovered with:>
pcs cluster stop --all
<solve the drbd split-brain>
pcs cluster start --all

client1 could not be rebooted with 'reboot', presumably because of the hung mount; it had to be rebooted the hard way from the virtualbox hypervisor.

What's wrong with this configuration? I can send the CIB configuration if necessary.

---------------

Full cluster configuration (working state):

# pcs status --full
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode1 (1) (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 12:44:03 2017
Last change: Mon Jul 10 11:37:13 2017 by root via crm_attribute on nfsnode1
2 nodes and 15 resources configured

Online: [ nfsnode1 (1) nfsnode2 (2) ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Storage (ocf::linbit:drbd): Master nfsnode1
     Storage (ocf::linbit:drbd): Master nfsnode2
     Masters: [ nfsnode1 nfsnode2 ]
 Clone Set: dlm-clone [dlm]
     dlm (ocf::pacemaker:controld): Started nfsnode1
     dlm (ocf::pacemaker:controld): Started nfsnode2
     Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing (stonith:fence_vbox): Started nfsnode1
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
     StorageFS (ocf::heartbeat:Filesystem): Started nfsnode1
     StorageFS (ocf::heartbeat:Filesystem): Started nfsnode2
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
     WebSite (ocf::heartbeat:apache): Started nfsnode1
     WebSite (ocf::heartbeat:apache): Started nfsnode2
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
     Resource Group: nfs-group:0
         nfs (ocf::heartbeat:nfsserver): Started nfsnode1
         nfs-export (ocf::heartbeat:exportfs): Started nfsnode1
     Resource Group: nfs-group:1
         nfs (ocf::heartbeat:nfsserver): Started nfsnode2
         nfs-export (ocf::heartbeat:exportfs): Started nfsnode2
     Started: [ nfsnode1 nfsnode2 ]

Node Attributes:
* Node nfsnode1 (1):
    + master-Storage : 10000
* Node nfsnode2 (2):
    + master-Storage : 10000

Migration Summary:
* Node nfsnode1 (1):
* Node nfsnode2 (2):

PCSD Status:
  nfsnode1: Online
  nfsnode2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

# pcs resource --full
 Master: StorageClone
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=2 clone-node-max=1
  Resource: Storage (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=storage
   Operations: start interval=0s timeout=240 (Storage-start-interval-0s)
               promote interval=0s timeout=90 (Storage-promote-interval-0s)
               demote interval=0s timeout=90 (Storage-demote-interval-0s)
               stop interval=0s timeout=100 (Storage-stop-interval-0s)
               monitor interval=60s (Storage-monitor-interval-60s)
 Clone: dlm-clone
  Meta Attrs: clone-max=2 clone-node-max=1
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
               monitor interval=60s (dlm-monitor-interval-60s)
 Clone: ClusterIP-clone
  Meta Attrs: clona-node-max=2 clone-max=2 globally-unique=true clone-node-max=2
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.0.2.7 cidr_netmask=32 clusterip_hash=sourceip
   Meta Attrs: resource-stickiness=0
   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
               monitor interval=5s (ClusterIP-monitor-interval-5s)
 Clone: StorageFS-clone
  Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
   Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
               stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
               monitor interval=20 timeout=40 (StorageFS-monitor-interval-20)
 Clone: WebSite-clone
  Resource: WebSite (class=ocf provider=heartbeat type=apache)
   Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status
   Operations: start interval=0s timeout=40s (WebSite-start-interval-0s)
               stop interval=0s timeout=60s (WebSite-stop-interval-0s)
               monitor interval=1min (WebSite-monitor-interval-1min)
 Clone: nfs-group-clone
  Meta Attrs: interleave=true
  Group: nfs-group
   Resource: nfs (class=ocf provider=heartbeat type=nfsserver)
    Attributes: nfs_ip=10.0.2.7 nfs_no_notify=true
    Operations: start interval=0s timeout=40 (nfs-start-interval-0s)
                stop interval=0s timeout=20s (nfs-stop-interval-0s)
                monitor interval=30s (nfs-monitor-interval-30s)
   Resource: nfs-export (class=ocf provider=heartbeat type=exportfs)
    Attributes: clientspec=10.0.2.0/255.255.255.0 options=rw,sync,no_root_squash directory=/mnt/drbd/nfs fsid=0
    Operations: start interval=0s timeout=40 (nfs-export-start-interval-0s)
                stop interval=0s timeout=120 (nfs-export-stop-interval-0s)
                monitor interval=30s (nfs-export-monitor-interval-30s)

# pcs constraint --full
Location Constraints:
Ordering Constraints:
  start ClusterIP-clone then start WebSite-clone (kind:Mandatory) (id:order-ClusterIP-WebSite-mandatory)
  promote StorageClone then start StorageFS-clone (kind:Mandatory) (id:order-StorageClone-StorageFS-mandatory)
  start StorageFS-clone then start WebSite-clone (kind:Mandatory) (id:order-StorageFS-WebSite-mandatory)
  start dlm-clone then start StorageFS-clone (kind:Mandatory) (id:order-dlm-clone-StorageFS-mandatory)
  start StorageFS-clone then start nfs-group-clone (kind:Mandatory) (id:order-StorageFS-clone-nfs-group-clone-mandatory)
Colocation Constraints:
  WebSite-clone with ClusterIP-clone (score:INFINITY) (id:colocation-WebSite-ClusterIP-INFINITY)
  StorageFS-clone with StorageClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-StorageFS-StorageClone-INFINITY)
  WebSite-clone with StorageFS-clone (score:INFINITY) (id:colocation-WebSite-StorageFS-INFINITY)
  StorageFS-clone with dlm-clone (score:INFINITY) (id:colocation-StorageFS-dlm-clone-INFINITY)
  StorageFS-clone with nfs-group-clone (score:INFINITY) (id:colocation-StorageFS-clone-nfs-group-clone-INFINITY)
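For completeness, the "solve the drbd split-brain" step in the recovery was essentially the standard manual split-brain procedure from the DRBD documentation (a sketch for my resource name "storage"; I chose nfsnode2 as the victim whose modifications are discarded):

```shell
# On the split-brain victim (here nfsnode2): give up the local changes.
drbdadm disconnect storage
drbdadm secondary storage
drbdadm connect --discard-my-data storage

# On the survivor (nfsnode1): reconnect if it dropped to StandAlone.
drbdadm connect storage

# Verify both sides report UpToDate before starting the cluster again.
drbdadm status storage
```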
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org