Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.
On Tue, Aug 1, 2017 at 2:05 AM, Stephen Carville (HA List) <62d2a...@opayq.com> wrote:

> On 07/31/2017 11:13 PM, Ulrich Windl [Masked] wrote:
> > I guess you have no fencing configured, right?
>
> No. I didn't realize it was necessary unless there was shared storage
> involved. I guess it is time to go back to the drawing board. Can
> clustering even be done reliably on CentOS 6?

Yes, it can. I have a number of CentOS 6 clusters running with corosync and pacemaker, and CentOS 6, while obviously not the latest version, is still maintained and will be for at least a couple more years.

But yes, you have to have fencing to have a cluster. I believe there is a way to manually tell one node of the cluster that the other node has been reset (using stonith_admin, I think), but without fencing you are likely to end up in the state where you have to manually reset things to get the cluster going again any time something goes wrong, which is not exactly the high availability that you build a cluster for in the first place.

--Greg

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
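[Editor's note: a minimal sketch of the manual acknowledgment mentioned above. The node name "node2" is a placeholder, and the exact option spelling should be checked against your pacemaker version's stonith_admin man page.]

```shell
# Manually tell the cluster that a peer node has already been reset by
# hand, so pacemaker stops waiting for a fence result.
# WARNING: only run this after you have physically verified the node is
# down; confirming a node that is actually still running risks split-brain
# and data corruption.
stonith_admin --confirm node2
```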
[ClusterLabs] odd cluster failure
(Apologies if this is a duplicate. I accidentally posted to the old linux-ha.org address, and I couldn't tell from the auto-reply whether my message was actually posted to the list or not.)

For the second time in a few weeks, we have had one node of a particular cluster getting fenced. It isn't totally clear why this is happening. On the surviving node I see:

Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb 2 16:49:00 vmc1 kernel: igb :03:00.1 eth3: igb: eth3 NIC Link is Down
Feb 2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
Feb 2 16:49:01 vmc1 corosync[2846]: [TOTEM ] A processor failed, forming new configuration.

OK, so from this point of view, it looks like the link was lost between the two hosts, resulting in fencing. The link is a crossover cable, so no networking hardware is involved other than the host NICs and the cable.
On the other side I see:

Feb 2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:49 vmc2 crmd[4191]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: fail-count-VM-radnets (1)
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 37: fail-count-VM-radnets=1
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: last-failure-VM-radnets (1486079209)
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 39: last-failure-VM-radnets=1486079209
Feb 2 16:46:50 vmc2 pengine[4190]: notice: On loss of CCM Quorum: Ignore
Feb 2 16:46:50 vmc2 pengine[4190]: warning: Processing failed op monitor for VM-radnets on vmc2.ucar.edu: not running (7)
Feb 2 16:46:50 vmc2 pengine[4190]: notice: Recover VM-radnets#011(Started vmc2.ucar.edu)
Feb 2 16:46:50 vmc2 pengine[4190]: notice: Calculated Transition 2914: /var/lib/pacemaker/pengine/pe-input-317.bz2
Feb 2 16:46:50 vmc2 crmd[4191]: notice: Initiating action 15: stop VM-radnets_stop_0 on vmc2.ucar.edu (local)
Feb 2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will be stopped (timeout: 80s)
Feb 2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
Feb 2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
Feb 2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10, protocol 1 (x86_64-abi)
Feb 2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
Feb 2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb 2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state

(And then there are a bunch of null bytes, and the log resumes with the reboot.)

There are more messages about networking, except that xenbr1 is not the bridge device associated with the NIC in question. I don't see any reason why the link between the hosts should suddenly stop working, so I suspect a hardware problem that only crops up rarely (but will most likely get worse over time). Is there anything anyone can see in the log that would suggest otherwise?

Thank you,
--Greg
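[Editor's note: for a hardware suspicion like the one above, the NIC's own counters are worth checking on both hosts. A sketch, using the interface name eth3 taken from the surviving node's log; adjust for each host:]

```shell
# Is the link currently up at the PHY level?
ethtool eth3 | grep -i 'link detected'

# Per-interface packet/error counters as the kernel sees them.
ip -s link show eth3

# Driver-level statistics: a climbing carrier/error count with otherwise
# clean traffic counters points at the cable or NIC rather than software.
ethtool -S eth3 | grep -Ei 'err|drop|carrier'
```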
Re: [ClusterLabs] DRBD failover in Pacemaker
On Tue, Sep 6, 2016 at 1:04 PM, Devin Ortner <devin.ort...@gtshq.onmicrosoft.com> wrote:

> Master/Slave Set: ClusterDBclone [ClusterDB]
>     Masters: [ node1 ]
>     Slaves: [ node2 ]
> ClusterFS (ocf::heartbeat:Filesystem):    Started node1

As Digimer said, you really need fencing when you are using DRBD; otherwise it's only a matter of time before your shared filesystem gets corrupted. You also need an order constraint to be sure that the ClusterFS Filesystem does not start until after the Master DRBD resource, and a colocation constraint to ensure these are on the same node.

--Greg
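[Editor's note: a sketch of the two constraints described above, using pcs and the resource names from the quoted status output. The syntax is the pcs 0.9-era form shipped with CentOS 6/7; verify against your installed version.]

```shell
# Order: promote the DRBD master before the filesystem starts.
pcs constraint order promote ClusterDBclone then start ClusterFS

# Colocation: keep the filesystem on whichever node holds the DRBD master.
pcs constraint colocation add ClusterFS with master ClusterDBclone INFINITY
```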
Re: [ClusterLabs] Error When Creating LVM Resource
On Fri, Aug 26, 2016 at 9:32 AM, Jason A Ramsey wrote:

> Failed Actions:
> * gctvanas-lvm_start_0 on node1 'not running' (7): call=42,
>   status=complete, exitreason='LVM: targetfs did not activate correctly',
>   last-rc-change='Fri Aug 26 10:57:22 2016', queued=0ms, exec=577ms
> * gctvanas-lvm_start_0 on node2 'unknown error' (1): call=34,
>   status=complete, exitreason='Volume group [targetfs] does not exist or
>   contains error! Volume group "targetfs" not found',
>   last-rc-change='Fri Aug 26 10:57:21 2016', queued=0ms, exec=322ms

I think you need a colocation constraint to prevent it from trying to start the LVM resource on the DRBD secondary node. I used to run LVM-over-DRBD clusters but don't any more (I switched to NFS backend storage), so I don't remember the exact syntax, but you certainly don't want the LVM resource to start on node2 at this point, because it will certainly fail. It may not be running on node1 because it failed on node2, so if you can get the proper colocation constraint in place, things may work after you do a resource cleanup. (I stand ready to be corrected by someone more knowledgeable who can spot a configuration problem that I missed.)

If you still get a failure and the constraint is correct, then I would try running the lvcreate command manually on the DRBD primary node to make sure that works.

--Greg
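[Editor's note: a sketch of the colocation constraint suggested above. The DRBD master/slave resource name "gctvanas-drbd-clone" is a placeholder invented for illustration (the poster's actual DRBD resource name is not shown in the thread); only "gctvanas-lvm" comes from the quoted output.]

```shell
# Keep the LVM resource on the DRBD primary only.
pcs constraint colocation add gctvanas-lvm with master gctvanas-drbd-clone INFINITY

# And make sure LVM activation waits for DRBD promotion.
pcs constraint order promote gctvanas-drbd-clone then start gctvanas-lvm

# Clear the accumulated failures so pacemaker retries the start.
pcs resource cleanup gctvanas-lvm
```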
Re: [ClusterLabs] Setup problem: couldn't find command: tcm_node
On Wed, Jul 20, 2016 at 10:09 AM, Andrei Borzenkov wrote:

> tcm_node is part of lio-utils. I am not familiar with RedHat packages,
> but I presume that searching for "lio" should reveal something.

I checked on both Fedora and CentOS, and there is no such package, and no package provides a file called "tcm_node". I also looked at rpmfind.net, and the only RPMs I found are for various versions of OpenSUSE. Looks like something slipped in that is SuSE-specific.

--Greg
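[Editor's note: the file-to-package search described above can be done directly with yum. A sketch; note that on RHEL/CentOS the LIO target is administered with targetcli rather than the lio-utils tools, which is one way to proceed when tcm_node is unavailable.]

```shell
# Ask yum which package, if any, ships the tcm_node binary; on CentOS this
# search comes back empty, matching the result reported above.
yum provides '*/tcm_node'

# The LIO tooling packaged for RHEL/CentOS is targetcli (rtslib-based).
yum install targetcli
```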