Hi all,
I have an issue with my cluster environment. First, my setup:
Two-node CentOS 5.5 cluster (active/standby) with one DRBD partition backing a Nagios
service, a virtual IP, and storage.
The config files are at the bottom.
I'm testing the fencing options to prevent split brain and concurrent access
to the DRBD partition.
Starting from a healthy state, everything works fine when I manually switch the resources or simulate a kernel
panic, a process crash, and so on. But if I shut down eth1 (the 192.168.100.0 network, i.e. the crossover cable used
for DRBD mirroring), the active node stays active and calls the fence handler, which adds this entry to the crm config:
location drbd-fence-by-handler-ServerData ServerData \
rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf:
#uname ne opsview-core01-tn
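For context, that constraint is created by DRBD's fence-peer handler. This is a sketch of the kind of drbd.conf fencing setup that triggers it (resource name "data" is an assumption; adjust to your own resource):

resource data {
  disk {
    # tell DRBD to fence the peer's resource via the cluster manager
    fencing resource-only;   # or resource-and-stonith if STONITH is configured
  }
  handlers {
    # adds the drbd-fence-by-handler constraint when the peer is unreachable
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    # removes the constraint again once resync has completed
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}

With resource-only fencing and no STONITH, the constraint is the only thing preventing a stale node from promoting, which matters for what happens below.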
But on the standby node, corosync fails and takes the Pacemaker daemons down with it:
*** STANDBY NODE LOG ***
Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14158
iface 192.168.100.12 to [1 of 10]
Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14160
iface 192.168.100.12 to [2 of 10]
Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14162
iface 192.168.100.12 to [3 of 10]
Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14164
iface 192.168.100.12 to [4 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Decrementing problem counter for iface
192.168.100.12 to [3 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14166
iface 192.168.100.12 to [4 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14168
iface 192.168.100.12 to [5 of 10]
Dec 24 11:00:07 corosync [TOTEM ] Incrementing problem counter for seqid 14170
iface 192.168.100.12 to [6 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14172
iface 192.168.100.12 to [7 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Decrementing problem counter for iface
192.168.100.12 to [6 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14174
iface 192.168.100.12 to [7 of 10]
Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14176
iface 192.168.100.12 to [8 of 10]
Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14178
iface 192.168.100.12 to [9 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Decrementing problem counter for iface
192.168.100.12 to [8 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14180
iface 192.168.100.12 to [9 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14182
iface 192.168.100.12 to [10 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Marking seqid 14182 ringid 0 interface
192.168.100.12 FAULTY - adminisrtative intervention required.
Dec 24 11:00:11 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: No such
file or directory (2)
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: AIS connection
terminated
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: crm_ais_destroy: AIS
connection terminated
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: cib_ais_destroy: AIS
connection terminated
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: CRIT: attrd_ais_destroy: Lost
connection to OpenAIS service!
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: info: main: Exiting...
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR:
attrd_cib_connection_destroy: Connection to the CIB terminated...
*** STANDBY NODE LOG ***
The issues don't end there.
If I bring eth1 back up, start corosync again, and check that both rings are online
(corosync-cfgtool -r to re-enable them, then -s to verify), the standby node tries to take over
the services even though resource-stickiness is set. It then goes into an error state, maybe because of the fence script.
crm status:
============
Last updated: Fri Dec 24 11:06:40 2010
Stack: openais
Current DC: opsview-core01-tn - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ opsview-core01-tn opsview-core02-tn ]
Master/Slave Set: ServerData
drbd_data:0 (ocf::linbit:drbd): Slave opsview-core02-tn
(unmanaged) FAILED
Stopped: [ drbd_data:1 ]
Failed actions:
drbd_data:0_stop_0 (node=opsview-core02-tn, call=9, rc=6, status=complete):
not configured
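For what it's worth, rc=6 from an OCF agent means OCF_ERR_CONFIGURED. Once the underlying problem is fixed, my understanding of the recovery steps would be something like the following (a sketch; assumes the constraint name from the config entry above and the crm shell that ships with Pacemaker 1.0):

```shell
# re-enable the faulty redundant ring, then verify both rings are active
corosync-cfgtool -r
corosync-cfgtool -s

# remove the location constraint left behind by crm-fence-peer.sh
# (normally crm-unfence-peer.sh does this after a successful resync)
crm configure delete drbd-fence-by-handler-ServerData

# clear the failed stop so the node can be re-probed and managed again
crm resource cleanup ServerData
```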
Logs on the standby node:
****************************************
Dec 24 11:06:13 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started
and ready to provide service.
Dec 24 11:06:13 corosync [MAIN ] Corosync built-in features: nss rdma
Dec 24 11:06:13 corosync [MAIN ] Successfully read main configuration file
'/etc/corosync/corosync.conf'.
Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security:
libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security:
libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 24 11:06:13 corosync [TOTEM ] The network interface [192.168.100.12] is now
up.
Dec 24 11:06:13 corosync [pcmk ] info: process_ais_conf: Reading configure
Set r/w permissions for uid=0, gid=0 on /var/log/cluster/corosync.log
Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle:
4730966301143465986 for logging
Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional
logging options...
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'off' for option:
debug
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option:
to_logfile
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found
'/var/log/cluster/corosync.log' for option: logfile
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option:
to_syslog
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon'
for option: syslog_facility
Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle:
7739444317642555395 for service
Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional
service options...
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk'
for option: clustername
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for
option: use_logd
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for
option: use_mgmtd
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Dec 24 11:06:13 corosync [pcmk ] Logging: Initialized pcmk_startup
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Maximum core file size
is: 18446744073709551615
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Service: 9
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Local hostname:
opsview-core02-tn
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_update_nodeid: Local node id:
207923392
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node
207923392 born on 0
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node
207923392 now known as opsview-core02-tn (was: (null))
Dec 24 11:06:13 opsview-core02-tn lrmd: [5153]: info: lrmd is shutting down
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
G_main_add_SignalHandler: Added signal handler for signal 10
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: Invoked:
/usr/lib64/heartbeat/attrd
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Signal sent to pid=5153,
waiting for process to exit
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn
now has 1 quorum votes (was 0)
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
G_main_add_SignalHandler: Added signal handler for signal 12
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting up
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler:
Added signal handler for signal 15
Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: info: Invoked:
/usr/lib64/heartbeat/pengine
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node
207923392/opsview-core02-tn is now: member
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: Invoked:
/usr/lib64/heartbeat/cib
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_cluster_connect:
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_cluster_connect:
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: Invoked:
/usr/lib64/heartbeat/crmd
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6762 for
process stonithd
Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: WARN: main: Terminating
previous PE instance
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_TriggerHandler:
Added signal manual handler
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info:
init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler:
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: main: CRM Hg Version:
da7075976b5ff0bee71074385f8fd02f296ec8a3
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6763 for
process cib
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_SignalHandler:
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn pengine: [5155]: WARN: process_pe_message:
Received quit message, terminating
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: enabling coredumps
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6764 for
process lrmd
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting crmd
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler:
Added signal handler for signal 10
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6765 for
process attrd
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: G_main_add_SignalHandler:
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler:
Added signal handler for signal 12
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6766 for
process pengine
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Started.
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6767 for
process crmd
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: Pacemaker Cluster
Manager 1.0.9
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync extended
virtual synchrony service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync configuration
service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster
closed process group service v1.01
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster
config database access v1.01
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync profile
loading service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster
quorum service v0.1
Dec 24 11:06:13 corosync [MAIN ] Compatibility mode set to whitetank. Using
V1 and V2 of the synchronization engine.
Dec 24 11:06:13 corosync [TOTEM ] The network interface [172.18.17.12] is now
up.
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info:
init_ais_connection_once: AIS connection established
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
init_ais_connection_once: AIS connection established
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x868c90
for attrd/6765
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: get_ais_nodeid: Server
details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x86d0a0
for stonithd/6762
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Cluster connection
active
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn
cname=pcmk
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Accepting
attribute updates
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting
mainloop...
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: notice:
/usr/lib64/heartbeat/stonithd start up successfully.
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: startCib: CIB
Initialization completed successfully
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_cluster_connect:
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once:
Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once:
AIS connection established
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x872fa0
for cib/6763
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn now has process list:
00000000000000000000000000013312 (78610)
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Sending membership update 0
to cib
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: get_ais_nodeid: Server
details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_init: Starting cib
mainloop
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership
0: quorum still lost
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member (new)
addr=(null) votes=1 (new) born=0 seen=0 proc=00000000000000000000000000013312 (new)
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-26.raw
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Wrote version 0.473.0 of the CIB to disk (digest:
3c7be90920e86222ad6102a0f01d9efd)
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.UxVZY6 (digest: /var/lib/heartbeat/crm/cib.76RIND)
Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 1
iface 172.18.17.12 to [1 of 10]
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional
membership event on ring 13032: memb=0, new=0, lost=0
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership
event on ring 13032: memb=1, new=1, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW:
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB:
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and
a new membership was formed.
Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to
provide service.
Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 2
iface 192.168.100.12 to [1 of 10]
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional
membership event on ring 13036: memb=1, new=0, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: memb:
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership
event on ring 13036: memb=2, new=1, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node
191146176 born on 13036
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node 191146176/unknown
is now: member
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending.
191146176
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending.
191146176
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB:
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending
membership update 13036 to 1 children
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node
207923392 ((null)) born on: 13036
Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and
a new membership was formed.
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership
13036: quorum still lost
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node <null>
now has id: 191146176
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node (null): id=191146176 state=member (new) addr=r(0)
ip(192.168.100.11) r(1) ip(172.18.17.11) votes=0 born=0 seen=13036 proc=00000000000000000000000000000000
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member addr=r(0)
ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 born=0 seen=13036 proc=00000000000000000000000000013312
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176
(opsview-core01-tn) born on: 13028
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: notice: ais_dispatch: Membership
13036: quorum acquired
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176
now known as opsview-core01-tn (was: (null))
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_get_peer: Node
191146176 is now known as opsview-core01-tn
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn now has process list:
00000000000000000000000000013312 (78610)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core01-tn: id=191146176 state=member addr=r(0)
ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 (new) born=13028 seen=13036 proc=00000000000000000000000000013312 (new)
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn
now has 1 quorum votes (was 0)
Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending
membership update 13036 to 1 children
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to
provide service.
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_process_diff: Diff 0.475.1 -> 0.475.2 not applied to 0.473.0: current
"epoch" is less than required
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_server_process_diff:
Requesting re-sync from peer
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_diff_notify: Local-only Change (client:crmd, call: 105): -1.-1.-1
(Application of an update diff failed, requesting a full refresh)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not
applying diff 0.475.2 -> 0.475.3 (sync in progress)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not
applying diff 0.475.3 -> 0.475.4 (sync in progress)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not
applying diff 0.475.4 -> 0.476.1 (sync in progress)
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_replace_notify:
Local-only Replace: -1.-1.-1 from opsview-core01-tn
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-27.raw
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Wrote version 0.476.0 of the CIB to disk (digest:
c348ac643cfe3b370e5eca03ff7f180c)
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.FYgzJ8 (digest: /var/lib/heartbeat/crm/cib.VrDRiH)
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_cib_control: CIB
connection established
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_cluster_connect:
Connecting to OpenAIS
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once:
Creating connection to our AIS plugin
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once:
AIS connection established
Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x878020
for crmd/6767
Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Sending membership update
13036 to crmd
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: get_ais_nodeid: Server
details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node
opsview-core02-tn now has id: 207923392
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node
207923392 is now known as opsview-core02-tn
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_ha_control: Connected
to the cluster
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: Delaying
start, CCM (0000000000100000) not connected
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting
crmd's mainloop
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback:
Checking for expired actions every 900000ms
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback:
Sending expected-votes=2 to corosync
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: notice: ais_dispatch:
Membership 13036: quorum acquired
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node
opsview-core01-tn now has id: 191146176
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node
191146176 is now known as opsview-core01-tn
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node opsview-core01-tn: id=191146176 state=member (new)
addr=r(0) ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 born=13028 seen=13036 proc=00000000000000000000000000013312
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member (new)
addr=r(0) ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 (new) born=13036 seen=13036
proc=00000000000000000000000000013312 (new)
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: The local CRM
is operational
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_state_transition: State transition S_STARTING -> S_PENDING [
input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec 24 11:06:15 opsview-core02-tn pengine: [6766]: info: main: Starting pengine
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: ais_dispatch: Membership
13036: quorum retained
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_dc: Set DC to
opsview-core01-tn (3.0.1)
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_attrd: Connecting
to attrd...
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC
cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for terminate
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for shutdown
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_local_callback:
Sending full refresh (origin=crmd)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending
flush op to all hosts for: terminate (<null>)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending
flush op to all hosts for: shutdown (<null>)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: erase_xpath_callback: Deletion of
"//node_sta...@uname='opsview-core02-tn']/transient_attributes": ok (rc=0)
Dec 24 11:06:15 corosync [TOTEM ] ring 0 active with no faults
Dec 24 11:06:15 corosync [TOTEM ] ring 1 active with no faults
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node
opsview-core01-tn now has id: 191146176
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node
191146176 is now known as opsview-core01-tn
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for master-drbd_data:0
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:0=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for probe_complete
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation probe_complete=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=9:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:2: probe
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=10:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=ServerFS_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ServerFS:3: probe
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=11:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=ClusterIP01_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ClusterIP01:4: probe
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=12:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=opsview-core_lsb_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-core_lsb:5:
probe
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=13:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=opsview-web_lsb_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=14:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=WebSite_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for master-drbd_data:1
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:1=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation ClusterIP01_monitor_0 (call=4, rc=7,
cib-update=7, confirmed=true) not running
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation ServerFS_monitor_0 (call=3, rc=7,
cib-update=8, confirmed=true) not running
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:0
(1000)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:0=1000: cib not
connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_monitor_0 (call=2, rc=0,
cib-update=9, confirmed=true) ok
Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-web_lsb:6:
probe
Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:WebSite:7: probe
Dec 24 11:06:16 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation WebSite_monitor_0 (call=7, rc=7,
cib-update=10, confirmed=true) not running
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Connected
to the CIB after 1 signon attempts
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Sending
full refresh
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:0
(1000)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_perform_update:
Sent update 4: master-drbd_data:0=1000
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete
(<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:1
(<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending
flush op to all hosts for: terminate (<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending
flush op to all hosts for: shutdown (<null>)
Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-core_lsb:probe:stderr) su: warning: cannot change
directory to /var/log/nagios: No such file or directory
Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-core_lsb:probe:stderr) /etc/init.d/opsview: line 262:
/usr/local/nagios/bin/profile: No such file or directory
Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-web_lsb:probe:stderr) su: warning: cannot change
directory to /var/log/nagios: No such file or directory
Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-web_lsb:probe:stderr) /etc/init.d/opsview-web: line 171:
/usr/local/nagios/bin/opsview.sh: No such file or directory
Dec 24 11:06:27 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation opsview-core_lsb_monitor_0 (call=5, rc=7,
cib-update=11, confirmed=true) not running
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation opsview-web_lsb_monitor_0 (call=6, rc=7,
cib-update=12, confirmed=true) not running
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update:
Sent update 15: probe_complete=true
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=61:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_notify_0 )
Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:8: notify
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_notify_0 (call=8, rc=0,
cib-update=13, confirmed=true) ok
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=13:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_stop_0 )
Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:9: stop
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_stop_0 (call=9, rc=6,
cib-update=14, confirmed=true) not configured
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch:
Update relayed from opsview-core01-tn
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for fail-count-drbd_data:0
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for:
fail-count-drbd_data:0 (INFINITY)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update:
Sent update 18: fail-count-drbd_data:0=INFINITY
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch:
Update relayed from opsview-core01-tn
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for last-failure-drbd_data:0
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for:
last-failure-drbd_data:0 (1293185188)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update:
Sent update 21: last-failure-drbd_data:0=1293185188
****************************************
Now all the services are DOWN.
At this point my only way to recover is to reboot cluster02; after corosync
starts again it does NOT try to take over the services.
The fence constraint is still there!
Now DRBD is in this state:
Master/Slave Set: ServerData
Masters: [ opsview-core01-tn ]
Stopped: [ drbd_data:1 ]
because of the fence constraint.
If I try 'drbdadm -- --discard-my-data connect all' on cluster02 I obtain:
[r...@core02-tn ~]# drbdadm -- --discard-my-data connect all
Could not stat("/proc/drbd"): No such file or directory
do you need to load the module?
try: modprobe drbd
Command 'drbdsetup 1 net 192.168.100.12:7789 192.168.100.11:7789 C --set-defaults --create-device --rr-conflict=disconnect
--after-sb-2pri=disconnect --after-sb-1pri=disconnect --after-sb-0pri=disconnect --discard-my-data' terminated with exit code 20
drbdadm connect cluster_data: exited with code 20
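For the record, the "Could not stat /proc/drbd" part just means the drbd kernel module is not loaded on cluster02 after the reboot, so the discard-my-data attempt never reaches the peer. A minimal recovery sketch, assuming the module only needs loading and using the resource name from drbd.conf below:

```shell
#!/bin/sh
# Sketch: resync the split-brain victim. Run on cluster02 only.
res=cluster_data    # resource name from /etc/drbd.conf

# /proc/drbd missing => module not loaded
if [ ! -e /proc/drbd ]; then
    modprobe drbd || echo "modprobe drbd failed; is the kmod installed?" >&2
fi

if [ -e /proc/drbd ]; then
    drbdadm up "$res"                              # attach disk + start networking
    drbdadm -- --discard-my-data connect "$res"    # drop local changes, resync from peer
fi
```

This only removes the "do you need to load the module?" error; the fence constraint in the CIB still has to go away before Pacemaker will promote anything.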
I have to remove the entry manually:
location drbd-fence-by-handler-ServerData ServerData \
rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf:
#uname ne opsview-core01-tn
because I have no idea HOW to unfence the cluster so that the line above gets
removed automatically.
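Normally the after-resync-target handler in drbd.conf (crm-unfence-peer.sh) removes that constraint once a resync completes; since the node never resyncs here, it stays. A sketch of the manual removal, assuming the crm shell is available and using the constraint id shown in the config dump:

```shell
#!/bin/sh
# Manual unfence sketch: delete the location constraint that
# crm-fence-peer.sh added. The id comes from the pasted crm config.
constraint=drbd-fence-by-handler-ServerData

if command -v crm >/dev/null 2>&1; then
    crm configure delete "$constraint"
else
    echo "crm shell not available; run this on a cluster node" >&2
fi
```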
After removing the line, cluster02 reconnects to DRBD:
Master/Slave Set: ServerData
Masters: [ opsview-core01-tn ]
Slaves: [ opsview-core02-tn ]
While writing this I tested the inverse situation, which only half works: if
cluster02 is master and I disconnect eth1, the fence entry is added to the CRM,
but cluster01 does *NOT* crash. So to get back to a standard situation I start
by removing "location drbd-fence-by-handler-ServerData...". However, on
removing the entry, cluster01 hits the same error and corosync dies:
********* cluster01 logs **********
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: info: update_dc: Unset DC
opsview-core01-tn
Dec 24 12:01:31 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: info: cib_process_request: Operation complete: op cib_modify for section nodes
(origin=local/crmd/165, version=0.491.1): ok (rc=0)
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: cib_ais_destroy: AIS
connection terminated
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: crm_ais_destroy: AIS
connection terminated
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error:
Resource temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: AIS connection
terminated
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-23.raw
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Wrote version 0.491.0 of the CIB to disk (digest:
ad222fed7ff40dc7093ffc6411079df4)
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.R3dVbk (digest: /var/lib/heartbeat/crm/cib.EllYEu)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_text: Sending message 44: FAILED (rc=2): Library error:
Connection timed out (110)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC
Channel to 22670 is not connected
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op:
Sending message to CIB service FAILED
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update:
Sent update -5: probe_complete=true
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: attrd_cib_callback:
Update -5 for probe_complete=true failed: send failed
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_message: Not
connected to AIS
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update: Sending flush op to all hosts for:
master-drbd_data:1 (<null>)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC
Channel to 22670 is not connected
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op:
Sending message to CIB service FAILED
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update: Delete operation failed: node=opsview-core01-tn,
attr=master-drbd_data:1, id=<n/a>, set=(null), section=status: send failed (-5)
***********************
So, the questions:
What's wrong? It seems everything starts when corosync on the secondary node
crashes (or stops) after I disconnect the cable (because of the "Library
error"?!?!?).
If I solve the crash issue, how (and when) should the unfence be executed?
Shouldn't it happen automatically?
Do I always have to remove the entry (location ...) from the CRM manually?
Sorry for the long mail and thanks for the support!
Simon
Config files:
*************************************
cat /etc/corosync/corosync.conf
compatibility: whitetank
totem {
version: 2
# How long before declaring a token lost (ms)
token: 2000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 10
# How long to wait for join messages in the membership protocol (ms)
join: 200
# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
consensus: 1000
vsftype: none
# Number of messages that may be sent by one processor on receipt of the token
max_messages: 20
send_join: 0
# Limit generated nodeids to 31-bits (positive signed integers)
clear_node_high_bit: yes
secauth: off
threads: 0
rrp_mode: active
interface {
ringnumber: 0
bindnetaddr: 192.168.100.0
mcastaddr: 226.100.1.1
mcastport: 4000
}
interface {
ringnumber: 1
bindnetaddr: 172.18.17.0
#broadcast: yes
mcastaddr: 227.100.1.2
mcastport: 4001
}
}
logging {
fileline: off
to_stderr: no
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
amf {
mode: disabled
}
aisexec {
user: root
group: root
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 0
}
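One thing worth checking with this config: rrp_mode is "active" and there are two rings, so unplugging eth1 should only mark ring 0 (192.168.100.0) as faulty while ring 1 (172.18.17.0) keeps the membership alive — corosync should not die. A quick sanity-check sketch, run on each node before and after pulling the cable:

```shell
#!/bin/sh
# Sketch: verify redundant-ring failover before testing fencing.
rings=2   # ringnumber 0 and ringnumber 1 from the config above

if command -v corosync-cfgtool >/dev/null 2>&1; then
    corosync-cfgtool -s   # print the status of both rings
    # once the cable is back, clear the faulty flag:
    # corosync-cfgtool -r
else
    echo "corosync-cfgtool not found; run on a cluster node" >&2
fi
```

If `-s` shows ring 1 as faulty too (or corosync exits) with the crossover cable pulled, the problem is in the redundant-ring setup rather than in the DRBD fencing.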
*************************************
cat /etc/drbd.conf
global {
usage-count no;
}
common {
protocol C;
syncer {
rate 70M;
verify-alg sha1;
}
net {
after-sb-0pri disconnect;
after-sb-1pri disconnect;
after-sb-2pri disconnect;
rr-conflict disconnect;
}
handlers {
pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
startup {
degr-wfc-timeout 120; # 2 minutes.
}
disk {
fencing resource-only;
on-io-error call-local-io-error;
}
}
resource cluster_data {
device /dev/drbd1;
disk /dev/sda4;
meta-disk internal;
on opsview-core01-tn {
address 192.168.100.11:7789;
}
on opsview-core02-tn {
address 192.168.100.12:7789;
}
}
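Given the handlers above, the unfence (crm-unfence-peer.sh) only fires after a successful resync, i.e. once the connection state is back to Connected. A small sketch to inspect that state on each node, using the resource name from this file:

```shell
#!/bin/sh
# Sketch: inspect the DRBD resource state; the unfence handler only
# runs after a resync completes (cstate back to Connected).
res=cluster_data

if [ -e /proc/drbd ]; then
    drbdadm cstate "$res"   # e.g. Connected / StandAlone / WFConnection
    drbdadm role "$res"     # e.g. Primary/Secondary
else
    echo "drbd module not loaded" >&2
fi
```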
*************************************
crm configure show
node opsview-core01-tn \
attributes standby="off"
node opsview-core02-tn \
attributes standby="off"
primitive ClusterIP01 ocf:heartbeat:IPaddr2 \
params ip="172.18.17.10" cidr_netmask="32" \
op monitor interval="30"
primitive ServerFS ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/data" fstype="ext3"
primitive WebSite ocf:heartbeat:apache \
params configfile="/etc/httpd/conf/httpd.conf" \
op monitor interval="1min" \
meta target-role="Started"
primitive drbd_data ocf:linbit:drbd \
params drbd_resource="cluster_data" \
op monitor interval="60s"
primitive opsview-core_lsb lsb:opsview \
op start interval="0" timeout="350s" \
op stop interval="0" timeout="350s" \
op monitor interval="60s" timeout="350s"
primitive opsview-web_lsb lsb:opsview-web \
op start interval="0" timeout="350s" start-delay="15s" \
op stop interval="0" timeout="350s" \
op monitor interval="60s" timeout="350s" \
meta target-role="Started"
group OPSView-Apps ServerFS ClusterIP01 opsview-core_lsb opsview-web_lsb WebSite \
meta target-role="Started"
ms ServerData drbd_data \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
colocation fs_on_drbd inf: OPSView-Apps ServerData:Master
order ServerFS-after-ServerData inf: ServerData:promote OPSView-Apps:start
property $id="cib-bootstrap-options" \
dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker