6. Problem with failover/failback under Ubuntu 10.04 for
Active/Passive OpenNMS (Sven Wick)
Hi,
I am trying to build an Active/Passive OpenNMS installation
but have problems with failover and failback.
Doing everything manually (DRBD, VIP, filesystem, ...) works just fine.
I read many tutorials, the official book and the FAQ
but am still stuck with my problem.
I use Ubuntu 10.04 with "pacemaker" and "corosync".
The hosts are two HP ProLiant G6 servers connected through a Cisco switch.
The switch was configured to allow multicast as described in [1].
The hostnames are "monitoring-node-01" and "monitoring-node-02",
and it seems I can fail over to "monitoring-node-02" but not back.
The DRBD Init Script is disabled on both nodes.
First my current config:
crm configure show
node monitoring-node-01 \
attributes standby="off"
node monitoring-node-02 \
attributes standby="off"
primitive drbd-opennms-config ocf:linbit:drbd \
params drbd_resource="config" \
op start interval="0" timeout="300s" \
op stop interval="0" timeout="300s"
primitive drbd-opennms-data ocf:linbit:drbd \
params drbd_resource="data" \
op start interval="0" timeout="300s" \
op stop interval="0" timeout="300s"
primitive drbd-opennms-db ocf:linbit:drbd \
params drbd_resource="db" \
op start interval="0" timeout="300s" \
op stop interval="0" timeout="300s"
primitive fs-opennms-config ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/config" directory="/etc/opennms" fstype="xfs" \
op stop interval="0" timeout="300s" \
op start interval="0" timeout="300s"
primitive fs-opennms-data ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/data" directory="/var/lib/opennms" fstype="xfs" \
op stop interval="0" timeout="300s" \
op start interval="0" timeout="300s"
primitive fs-opennms-db ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/db" directory="/var/lib/postgresql" fstype="xfs" \
op stop interval="0" timeout="300s" \
op start interval="0" timeout="300s"
primitive opennms lsb:opennms \
op start interval="0" timeout="300s" \
op stop interval="0" timeout="300s" \
meta target-role="Started"
primitive postgres lsb:postgresql-8.4
primitive vip ocf:heartbeat:IPaddr2 \
params ip="172.24.25.20" cidr_netmask="24" nic="bond0"
group dependencies vip fs-opennms-config fs-opennms-db fs-opennms-data postgres opennms
ms ms-opennms-config drbd-opennms-config \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms-opennms-data drbd-opennms-data \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms-opennms-db drbd-opennms-db \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
order promote-before-drbd-config inf: ms-opennms-config:promote fs-opennms-config:start
order promote-before-drbd-data inf: ms-opennms-data:promote fs-opennms-data:start
order promote-before-drbd-db inf: ms-opennms-db:promote fs-opennms-db:start
property $id="cib-bootstrap-options" \
dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1278329430"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
Current state when everything works:
monitoring-node-01
cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by r...@monitoring-node-01, 2010-06-22 20:00:41
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
ns:0 nr:0 dw:2052 dr:29542 al:2 bm:8 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
ns:0 nr:0 dw:0 dr:200 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r----
ns:0 nr:0 dw:0 dr:200 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p1 270G 3.1G 253G 2% /
none 16G 212K 16G 1% /dev
none 16G 3.5M 16G 1% /dev/shm
none 16G 84K 16G 1% /var/run
none 16G 12K 16G 1% /var/lock
none 16G 0 16G 0% /lib/init/rw
/dev/drbd0 10G 34M 10G 1% /etc/opennms
/dev/drbd1 50G 67M 50G 1% /var/lib/postgresql
/dev/drbd2 214G 178M 214G 1% /var/lib/opennms
crm_mon
============
Last updated: Mon Jul 5 13:02:54 2010
Stack: openais
Current DC: monitoring-node-02 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ monitoring-node-01 monitoring-node-02 ]
Master/Slave Set: ms-opennms-data
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-db
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-config
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Resource Group: dependencies
vip (ocf::heartbeat:IPaddr2): Started monitoring-node-01
fs-opennms-config (ocf::heartbeat:Filesystem): Started monitoring-node-01
fs-opennms-db (ocf::heartbeat:Filesystem): Started monitoring-node-01
fs-opennms-data (ocf::heartbeat:Filesystem): Started monitoring-node-01
postgres (lsb:postgresql-8.4): Started monitoring-node-01
opennms (lsb:opennms): Started monitoring-node-01
monitoring-node-02
cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by r...@monitoring-node-02, 2010-06-22 19:59:35
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
ns:0 nr:0 dw:24620 dr:400 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
ns:0 nr:302898 dw:302898 dr:400 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r----
ns:0 nr:19712467 dw:19712467 dr:400 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p1 270G 1.8G 254G 1% /
none 16G 212K 16G 1% /dev
none 16G 3.6M 16G 1% /dev/shm
none 16G 72K 16G 1% /var/run
none 16G 12K 16G 1% /var/lock
none 16G 0 16G 0% /lib/init/rw
Rebooting "monitoring-node-01" seems to work: all services now run
on "monitoring-node-02" and were started in the correct order.
But when I reboot "monitoring-node-02":
crm_mon
============
Last updated: Mon Jul 5 13:21:31 2010
Stack: openais
Current DC: monitoring-node-01 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ monitoring-node-01 monitoring-node-02 ]
Master/Slave Set: ms-opennms-data
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-db
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-config
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Resource Group: dependencies
vip (ocf::heartbeat:IPaddr2): Started monitoring-node-02
fs-opennms-config (ocf::heartbeat:Filesystem): Stopped
fs-opennms-db (ocf::heartbeat:Filesystem): Stopped
fs-opennms-data (ocf::heartbeat:Filesystem): Stopped
postgres (lsb:postgresql-8.4): Stopped
opennms (lsb:opennms): Stopped
Failed actions:
fs-opennms-config_start_0 (node=monitoring-node-02, call=18, rc=1, status=complete): unknown error
When I then run the following command:
crm resource cleanup fs-opennms-config
the failback goes through and the services are started on monitoring-node-01:
crm_mon
============
Last updated: Mon Jul 5 13:31:31 2010
Stack: openais
Current DC: monitoring-node-01 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ monitoring-node-01 monitoring-node-02 ]
Master/Slave Set: ms-opennms-data
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-db
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Master/Slave Set: ms-opennms-config
Masters: [ monitoring-node-01 ]
Slaves: [ monitoring-node-02 ]
Resource Group: dependencies
vip (ocf::heartbeat:IPaddr2): Started monitoring-node-01
fs-opennms-config (ocf::heartbeat:Filesystem): Started monitoring-node-01
fs-opennms-db (ocf::heartbeat:Filesystem): Started monitoring-node-01
fs-opennms-data (ocf::heartbeat:Filesystem): Started monitoring-node-01
postgres (lsb:postgresql-8.4): Started monitoring-node-01
opennms (lsb:opennms): Started monitoring-node-01
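The cleanup apparently works because it resets the resource's fail count, which otherwise keeps the resource away from the node where the start failed. The fail count can also be inspected and cleared directly; a sketch, assuming the crm_failcount tool shipped with pacemaker 1.0 (option names may differ slightly between versions; resource and node names are the ones from the config above):

```shell
# Show the fail count for the filesystem resource on node 2
crm_failcount -G -r fs-opennms-config -N monitoring-node-02

# Clear it; "crm resource cleanup" does this plus a re-probe
crm_failcount -D -r fs-opennms-config -N monitoring-node-02
```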
There are some entries in my daemon.log [2] which look
as if they have something to do with my problem...
monitoring-node-01 lrmd: [994]: info: rsc:drbd-opennms-data:1:30: promote
monitoring-node-01 crmd: [998]: info: do_lrm_rsc_op: Performing key=93:25:0:d49b62be-1e33-48ca-a8c3-cb128676d444 op=fs-opennms-config_start_0 )
monitoring-node-01 lrmd: [994]: info: rsc:fs-opennms-config:31: start
monitoring-node-01 Filesystem[2464]: INFO: Running start for /dev/drbd/by-res/config on /etc/opennms
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) FATAL: Module scsi_hostadapter not found.
monitoring-node-01 lrmd: [994]: info: RA output: (drbd-opennms-data:1:promote:stdout)
monitoring-node-01 crmd: [998]: info: process_lrm_event: LRM operation drbd-opennms-data:1_promote_0 (call=30, rc=0, cib-update=35, confirmed=true) ok
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) /dev/drbd/by-res/config: Wrong medium type
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) mount: block device /dev/drbd0 is write-protected, mounting read-only
monitoring-node-01 lrmd: [994]: info: RA output: (fs-opennms-config:start:stderr) mount: Wrong medium type
monitoring-node-01 Filesystem[2464]: ERROR: Couldn't mount filesystem /dev/drbd/by-res/config on /etc/opennms
monitoring-node-01 crmd: [998]: info: process_lrm_event: LRM operation fs-opennms-config_start_0 (call=31, rc=1, cib-update=36, confirmed=true) unknown error
...but I don't know how to troubleshoot it.
[1] http://www.corosync.org/doku.php?id=faq:cisco_switches
[2] http://pastebin.com/DKLjXtx8
Hi,
First, you might want to look at the following error and check whether
the module is available on both servers:
(fs-opennms-config:start:stderr) FATAL: Module scsi_hostadapter not found.
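A quick way to check for that module on each node (a sketch; note that scsi_hostadapter is often just a modprobe alias rather than a real module, so its absence may be harmless, but the check is cheap):

```shell
# Look for scsi_hostadapter as a real module or as a configured alias;
# fall back to a clear message if neither exists on this host
modinfo scsi_hostadapter 2>/dev/null \
  || grep -s scsi_hostadapter /etc/modules /etc/modprobe.d/* \
  || echo "scsi_hostadapter not found on $(hostname)"
```

(grep -s just suppresses errors about missing files.)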
Then try to run the resource manually:
- go to /usr/lib/ocf/resource.d/heartbeat
- export OCF_ROOT=/usr/lib/ocf
- export OCF_RESKEY_device="/dev/drbd/by-res/config"
- export OCF_RESKEY_options=rw
- export OCF_RESKEY_fstype=xfs
- export OCF_RESKEY_directory="/etc/opennms"
- ./Filesystem start
See if you encounter any errors here. Run the steps on both servers.
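The steps above can be run as a single session, e.g. (device, mount point, and fstype taken from the config in the original post; an exit code of 0 means the action succeeded):

```shell
# Exercise the Filesystem resource agent by hand, with the same
# parameters pacemaker would pass via OCF_RESKEY_* environment variables
cd /usr/lib/ocf/resource.d/heartbeat
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_device="/dev/drbd/by-res/config"
export OCF_RESKEY_directory="/etc/opennms"
export OCF_RESKEY_fstype="xfs"
export OCF_RESKEY_options="rw"
./Filesystem start; echo "start rc=$?"
./Filesystem stop;  echo "stop rc=$?"
```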
Make sure to move the DRBD resource between the servers so that the
mount works on each. You do that via:
- go to the server where the DRBD device is currently mounted and in the Primary role
- umount /etc/opennms
- drbdadm secondary config
- move to other server
- drbdadm primary config
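Scripted, the switch looks like this ("config" is the DRBD resource name from the setup above; the promote on the second node fails unless the first one has been demoted):

```shell
# On the node that is currently Primary and has the mount:
umount /etc/opennms
drbdadm secondary config   # demote this node

# Then on the other node:
drbdadm primary config     # promote; now run the Filesystem agent test here
```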
Also, make sure that pacemaker doesn't interfere with these operations :)
Cheers.
--
Dan FRINCU
Systems Engineer
CCNA, RHCE
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker