Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.

2017-08-01 Thread Greg Woods
On Tue, Aug 1, 2017 at 2:05 AM, Stephen Carville (HA List) <
62d2a...@opayq.com> wrote:

> On 07/31/2017 11:13 PM, Ulrich Windl [Masked] wrote:
>
>  I guess you have no fencing configured, right?
>
> No. I didn't realize it was necessary unless there was shared storage
> involved.  I guess it is time to go back to the drawing board.  Can
> clustering even be done reliably on CentOS 6?


Yes, it can. I have a number of CentOS 6 clusters running with corosync and
pacemaker, and CentOS 6, while obviously not the latest version, is still
maintained and will be for at least a couple more years. But yes, you have
to have fencing to have a cluster. There is a way to manually tell one node
of the cluster that the other node has been reset (using stonith_admin, I
believe), but without fencing you are likely to end up in a state where you
have to manually reset things to get the cluster going again any time
something goes wrong, which is not exactly the high availability you build
a cluster for in the first place.
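
If the other node really has been reset by hand, the confirmation command
is along these lines (the node name is a placeholder; check stonith_admin(8)
on your version for the exact option):

  # tell the cluster that node2 has been manually reset and is safely down
  stonith_admin --confirm node2

And a minimal fencing device via pcs might look something like the sketch
below; fence_ipmilan is only an example agent, and the device name, address,
and credentials are placeholders:

  pcs stonith create fence-node2 fence_ipmilan \
      pcmk_host_list="node2" ipaddr="192.168.1.12" \
      login="admin" passwd="secret" op monitor interval=60s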

--Greg
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] odd cluster failure

2017-02-03 Thread Greg Woods
(Apologies if this is a duplicate. I accidentally posted to the old
linux-ha.org address, and I couldn't tell from the auto-reply whether my
message was actually posted to the list or not).

For the second time in a few weeks, we have had one node of a particular
cluster get fenced, and it isn't entirely clear why this is happening. On
the surviving node I see:

Feb  2 16:48:52 vmc1 stonith-ng[4331]:   notice: stonith-vm2 can fence
(reboot) vmc2.ucar.edu: static-list
Feb  2 16:48:52 vmc1 stonith-ng[4331]:   notice: stonith-vm2 can fence
(reboot) vmc2.ucar.edu: static-list
Feb  2 16:49:00 vmc1 kernel: igb :03:00.1 eth3: igb: eth3 NIC Link is
Down
Feb  2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
Feb  2 16:49:01 vmc1 corosync[2846]:   [TOTEM ] A processor failed, forming
new configuration.

OK, so from this point of view, it looks like the link was lost between the
two hosts, resulting in fencing. The link is a crossover cable, so no
networking hardware other than the host NICs and the cable.

On the other side I see:

Feb  2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
Feb  2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:49 vmc2 crmd[4191]:   notice: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sending flush op to all hosts
for: fail-count-VM-radnets (1)
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sent update 37:
fail-count-VM-radnets=1
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sending flush op to all hosts
for: last-failure-VM-radnets (1486079209)
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sent update 39:
last-failure-VM-radnets=1486079209
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: On loss of CCM Quorum: Ignore
Feb  2 16:46:50 vmc2 pengine[4190]:  warning: Processing failed op monitor
for VM-radnets on vmc2.ucar.edu: not running (7)
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: Recover
VM-radnets#011(Started vmc2.ucar.edu)
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: Calculated Transition 2914:
/var/lib/pacemaker/pengine/pe-input-317.bz2
Feb  2 16:46:50 vmc2 crmd[4191]:   notice: Initiating action 15: stop
VM-radnets_stop_0 on vmc2.ucar.edu (local)
Feb  2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will
be stopped (timeout: 80s)
Feb  2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
Feb  2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is
not ready
Feb  2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10,
protocol 1 (x86_64-abi)
Feb  2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
Feb  2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link
becomes ready
Feb  2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding
state
Feb  2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding
state
Feb  2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding
state

(and then there are a bunch of null bytes, and the log resumes with the reboot)

After that there are more messages about networking, except that xenbr1 is
not the bridge device associated with the NIC in question.

I don't see any reason why the link between the hosts should suddenly stop
working, so I am suspecting a hardware problem that only crops up rarely
(but will most likely get worse over time).
Is there anything anyone can see in the log that would suggest otherwise?
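
One way to check for a flaky NIC or cable (assuming the interconnect is
eth3 on both hosts, as it is in the vmc1 log above) would be to look at the
interface error counters and the corosync ring status, for example:

  # link state, speed, and duplex as the driver sees them
  ethtool eth3

  # per-interface error and drop counters; rising CRC errors usually
  # point at a bad cable or switch/NIC port
  ethtool -S eth3 | egrep -i 'err|drop|crc'

  # corosync's view of the ring(s) on this node
  corosync-cfgtool -s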

Thank you,
--Greg


Re: [ClusterLabs] DRBD failover in Pacemaker

2016-09-07 Thread Greg Woods
On Tue, Sep 6, 2016 at 1:04 PM, Devin Ortner <
devin.ort...@gtshq.onmicrosoft.com> wrote:

> Master/Slave Set: ClusterDBclone [ClusterDB]
>  Masters: [ node1 ]
>  Slaves: [ node2 ]
>  ClusterFS  (ocf::heartbeat:Filesystem): Started node1
>

As Digimer said, you really need fencing when you are using DRBD. Otherwise
it's only a matter of time before your shared filesystem gets corrupted.

You also need an order constraint so that the ClusterFS Filesystem resource
does not start until after the DRBD resource has been promoted to Master,
and a colocation constraint to ensure they run on the same node.
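
With pcs, and using the resource names from the status output above, the
constraints would look something like this (a sketch; adjust to your
configuration and verify against the pcs documentation for your release):

  # ClusterFS must run on the node where DRBD is master...
  pcs constraint colocation add ClusterFS with master ClusterDBclone INFINITY

  # ...and may only start after DRBD has been promoted there
  pcs constraint order promote ClusterDBclone then start ClusterFS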

--Greg


Re: [ClusterLabs] Error When Creating LVM Resource

2016-08-26 Thread Greg Woods
On Fri, Aug 26, 2016 at 9:32 AM, Jason A Ramsey  wrote:

> Failed Actions:
>
> * gctvanas-lvm_start_0 on node1 'not running' (7): call=42,
> status=complete, exitreason='LVM: targetfs did not activate correctly',
>
> last-rc-change='Fri Aug 26 10:57:22 2016', queued=0ms, exec=577ms
>
> * gctvanas-lvm_start_0 on node2 'unknown error' (1): call=34,
> status=complete, exitreason='Volume group [targetfs] does not exist or
> contains error!   Volume group "targetfs" not found',
>
> last-rc-change='Fri Aug 26 10:57:21 2016', queued=0ms, exec=322ms
>
>
>

I think you need a colocation constraint to prevent the cluster from trying
to start the LVM resource on the DRBD secondary node. I used to run
LVM-over-DRBD clusters but don't any more (I switched to NFS backend
storage), so I don't remember the exact syntax, but you certainly don't want
the LVM resource to start on node2 at this point, because it is bound to
fail there.

It may not be running on node1 because it failed on node2, so if you can
get the proper colocation constraint in place, things may work after you do
a resource cleanup. (I stand ready to be corrected by someone more
knowledgeable who can spot a configuration problem that I missed).
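
With pcs, and assuming the DRBD master/slave resource is called
gctvanas-drbd-clone (a placeholder; substitute whatever it is actually
named in your cluster), that would look something like:

  # keep the LVM resource with the DRBD master, and start it only after promotion
  pcs constraint colocation add gctvanas-lvm with master gctvanas-drbd-clone INFINITY
  pcs constraint order promote gctvanas-drbd-clone then start gctvanas-lvm

  # then clear the failure history and let the cluster retry
  pcs resource cleanup gctvanas-lvm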

If it still fails with the correct constraint in place, I would try running
the lvcreate command manually on the DRBD primary node to make sure that
works.

--Greg


Re: [ClusterLabs] Setup problem: couldn't find command: tcm_node

2016-07-20 Thread Greg Woods
On Wed, Jul 20, 2016 at 10:09 AM, Andrei Borzenkov 
wrote:

> tcm_node is part of lio-utils. I am not familiar with RedHat packages,
> but I presume that searching for "lio" should reveal something.
>

I checked on both Fedora and CentOS: there is no such package, and no
package provides a file called "tcm_node". I also looked at rpmfind.net,
and the only RPMs I found are for various versions of openSUSE. It looks
like something SUSE-specific slipped in.
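
A quick way to double-check on any RPM-based system is to ask yum which
package, if any, owns the file:

  # look for a package shipping the tcm_node binary
  yum provides '*/tcm_node'

  # and search package names and summaries for lio
  yum search lio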

--Greg