Hi all,
I have an issue with my cluster environment. First, my setup:
Two-node CentOS 5.5 cluster (active/standby) with one DRBD partition backing a Nagios
service, a virtual IP, and storage.
The config files are at the bottom.
I'm testing the fencing options to prevent split brain and concurrent access
to the DRBD partition.
Starting from a healthy state, everything works fine when I manually switch the resources or simulate a kernel
panic, a process crash, and so on. But if I shut down eth1 (the 192.168.100.0 network, i.e. the crossover cable used
for DRBD mirroring), the active node stays active and calls the fence handler, which adds this entry to the crm config:
location drbd-fence-by-handler-ServerData ServerData \
rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf:
#uname ne opsview-core01-tn
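For context, that constraint is created by DRBD's fence-peer handler. This is a sketch of the kind of drbd.conf fencing setup that triggers it (resource name "data" is an assumption; adjust to your own resource):

resource data {
  disk {
    # tell DRBD to fence the peer's resource via the cluster manager
    fencing resource-only;   # or resource-and-stonith if STONITH is configured
  }
  handlers {
    # adds the drbd-fence-by-handler constraint when the peer is unreachable
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    # removes the constraint again once resync has completed
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}

With resource-only fencing and no STONITH, the constraint is the only thing preventing a stale node from promoting, which matters for what happens below.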
But on the standby node, corosync fails and takes the Pacemaker daemons down with it:
*** STANDBY NODE LOG ***
Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14158
iface 192.168.100.12 to [1 of 10]
Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14160
iface 192.168.100.12 to [2 of 10]
Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14162
iface 192.168.100.12 to [3 of 10]
Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14164
iface 192.168.100.12 to [4 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Decrementing problem counter for iface
192.168.100.12 to [3 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14166
iface 192.168.100.12 to [4 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14168
iface 192.168.100.12 to [5 of 10]
Dec 24 11:00:07 corosync [TOTEM ] Incrementing problem counter for seqid 14170
iface 192.168.100.12 to [6 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14172
iface 192.168.100.12 to [7 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Decrementing problem counter for iface
192.168.100.12 to [6 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14174
iface 192.168.100.12 to [7 of 10]
Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14176
iface 192.168.100.12 to [8 of 10]
Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14178
iface 192.168.100.12 to [9 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Decrementing problem counter for iface
192.168.100.12 to [8 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14180
iface 192.168.100.12 to [9 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14182
iface 192.168.100.12 to [10 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Marking seqid 14182 ringid 0 interface
192.168.100.12 FAULTY - adminisrtative intervention required.
Dec 24 11:00:11 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: No such
file or directory (2)
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: AIS connection
terminated
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: crm_ais_destroy: AIS
connection terminated
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: cib_ais_destroy: AIS
connection terminated
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: CRIT: attrd_ais_destroy: Lost
connection to OpenAIS service!
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: info: main: Exiting...
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR:
attrd_cib_connection_destroy: Connection to the CIB terminated...
*** STANDBY NODE LOG ***
The issues don't end there.
If I bring eth1 back up, start corosync again, and check that both rings are online
(corosync-cfgtool -r to re-enable them, then -s to verify), the standby node tries to take over
the services even though resource-stickiness is set. It then goes into an error state, maybe because of the fence script.
crm status:
============
Last updated: Fri Dec 24 11:06:40 2010
Stack: openais
Current DC: opsview-core01-tn - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ opsview-core01-tn opsview-core02-tn ]
Master/Slave Set: ServerData
drbd_data:0 (ocf::linbit:drbd): Slave opsview-core02-tn
(unmanaged) FAILED
Stopped: [ drbd_data:1 ]
Failed actions:
drbd_data:0_stop_0 (node=opsview-core02-tn, call=9, rc=6, status=complete):
not configured
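For what it's worth, rc=6 from an OCF agent means OCF_ERR_CONFIGURED. Once the underlying problem is fixed, my understanding of the recovery steps would be something like the following (a sketch; assumes the constraint name from the config entry above and the crm shell that ships with Pacemaker 1.0):

```shell
# re-enable the faulty redundant ring, then verify both rings are active
corosync-cfgtool -r
corosync-cfgtool -s

# remove the location constraint left behind by crm-fence-peer.sh
# (normally crm-unfence-peer.sh does this after a successful resync)
crm configure delete drbd-fence-by-handler-ServerData

# clear the failed stop so the node can be re-probed and managed again
crm resource cleanup ServerData
```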
Logs on the standby node:
****************************************
Dec 24 11:06:13 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started
and ready to provide service.
Dec 24 11:06:13 corosync [MAIN ] Corosync built-in features: nss rdma
Dec 24 11:06:13 corosync [MAIN ] Successfully read main configuration file
'/etc/corosync/corosync.conf'.
Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security:
libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security:
libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 24 11:06:13 corosync [TOTEM ] The network interface [192.168.100.12] is now
up.
Dec 24 11:06:13 corosync [pcmk ] info: process_ais_conf: Reading configure
Set r/w permissions for uid=0, gid=0 on /var/log/cluster/corosync.log
Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle:
4730966301143465986 for logging
Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional
logging options...
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'off' for option:
debug
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option:
to_logfile
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found
'/var/log/cluster/corosync.log' for option: logfile
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option:
to_syslog
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon'
for option: syslog_facility
Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle:
7739444317642555395 for service
Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional
service options...
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk'
for option: clustername
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for
option: use_logd
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for
option: use_mgmtd
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Dec 24 11:06:13 corosync [pcmk ] Logging: Initialized pcmk_startup
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Maximum core file size
is: 18446744073709551615
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Service: 9
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Local hostname:
opsview-core02-tn
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_update_nodeid: Local node id:
207923392
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node
207923392 born on 0
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node
207923392 now known as opsview-core02-tn (was: (null))
Dec 24 11:06:13 opsview-core02-tn lrmd: [5153]: info: lrmd is shutting down
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
G_main_add_SignalHandler: Added signal handler for signal 10
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: Invoked:
/usr/lib64/heartbeat/attrd
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Signal sent to pid=5153,
waiting for process to exit
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn
now has 1 quorum votes (was 0)
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
G_main_add_SignalHandler: Added signal handler for signal 12
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting up
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler:
Added signal handler for signal 15
Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: info: Invoked:
/usr/lib64/heartbeat/pengine
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node
207923392/opsview-core02-tn is now: member
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: Invoked:
/usr/lib64/heartbeat/cib
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_cluster_connect:
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_cluster_connect:
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: Invoked:
/usr/lib64/heartbeat/crmd
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6762 for
process stonithd
Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: WARN: main: Terminating
previous PE instance
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_TriggerHandler:
Added signal manual handler
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info:
init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler:
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: main: CRM Hg Version:
da7075976b5ff0bee71074385f8fd02f296ec8a3
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6763 for
process cib
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_SignalHandler:
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn pengine: [5155]: WARN: process_pe_message:
Received quit message, terminating
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: enabling coredumps
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6764 for
process lrmd
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting crmd
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler:
Added signal handler for signal 10
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6765 for
process attrd
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: G_main_add_SignalHandler:
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler:
Added signal handler for signal 12
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6766 for
process pengine
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Started.
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6767 for
process crmd
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: Pacemaker Cluster
Manager 1.0.9
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync extended
virtual synchrony service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync configuration
service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster
closed process group service v1.01
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster
config database access v1.01
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync profile
loading service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster
quorum service v0.1
Dec 24 11:06:13 corosync [MAIN ] Compatibility mode set to whitetank. Using
V1 and V2 of the synchronization engine.
Dec 24 11:06:13 corosync [TOTEM ] The network interface [172.18.17.12] is now
up.
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info:
init_ais_connection_once: AIS connection established
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
init_ais_connection_once: AIS connection established
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x868c90
for attrd/6765
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: get_ais_nodeid: Server
details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x86d0a0
for stonithd/6762
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Cluster connection
active
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn
cname=pcmk
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Accepting
attribute updates
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting
mainloop...
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: notice:
/usr/lib64/heartbeat/stonithd start up successfully.
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: startCib: CIB
Initialization completed successfully
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_cluster_connect:
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once:
Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once:
AIS connection established
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x872fa0
for cib/6763
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn now has process list:
00000000000000000000000000013312 (78610)
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Sending membership update 0
to cib
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: get_ais_nodeid: Server
details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_init: Starting cib
mainloop
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership
0: quorum still lost
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member (new)
addr=(null) votes=1 (new) born=0 seen=0 proc=00000000000000000000000000013312 (new)
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-26.raw
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Wrote version 0.473.0 of the CIB to disk (digest:
3c7be90920e86222ad6102a0f01d9efd)
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.UxVZY6 (digest: /var/lib/heartbeat/crm/cib.76RIND)
Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 1
iface 172.18.17.12 to [1 of 10]
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional
membership event on ring 13032: memb=0, new=0, lost=0
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership
event on ring 13032: memb=1, new=1, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW:
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB:
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and
a new membership was formed.
Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to
provide service.
Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 2
iface 192.168.100.12 to [1 of 10]
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional
membership event on ring 13036: memb=1, new=0, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: memb:
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership
event on ring 13036: memb=2, new=1, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node
191146176 born on 13036
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node 191146176/unknown
is now: member
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending.
191146176
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending.
191146176
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB:
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending
membership update 13036 to 1 children
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node
207923392 ((null)) born on: 13036
Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and
a new membership was formed.
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership
13036: quorum still lost
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node <null>
now has id: 191146176
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node (null): id=191146176 state=member (new) addr=r(0)
ip(192.168.100.11) r(1) ip(172.18.17.11) votes=0 born=0 seen=13036 proc=00000000000000000000000000000000
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member addr=r(0)
ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 born=0 seen=13036 proc=00000000000000000000000000013312
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176
(opsview-core01-tn) born on: 13028
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: notice: ais_dispatch: Membership
13036: quorum acquired
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176
now known as opsview-core01-tn (was: (null))
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_get_peer: Node
191146176 is now known as opsview-core01-tn
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn now has process list:
00000000000000000000000000013312 (78610)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core01-tn: id=191146176 state=member addr=r(0)
ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 (new) born=13028 seen=13036 proc=00000000000000000000000000013312 (new)
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn
now has 1 quorum votes (was 0)
Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending
membership update 13036 to 1 children
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to
provide service.
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_process_diff: Diff 0.475.1 -> 0.475.2 not applied to 0.473.0: current
"epoch" is less than required
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_server_process_diff:
Requesting re-sync from peer
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_diff_notify: Local-only Change (client:crmd, call: 105): -1.-1.-1
(Application of an update diff failed, requesting a full refresh)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not
applying diff 0.475.2 -> 0.475.3 (sync in progress)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not
applying diff 0.475.3 -> 0.475.4 (sync in progress)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not
applying diff 0.475.4 -> 0.476.1 (sync in progress)
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_replace_notify:
Local-only Replace: -1.-1.-1 from opsview-core01-tn
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-27.raw
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Wrote version 0.476.0 of the CIB to disk (digest:
c348ac643cfe3b370e5eca03ff7f180c)
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.FYgzJ8 (digest: /var/lib/heartbeat/crm/cib.VrDRiH)
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_cib_control: CIB
connection established
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_cluster_connect:
Connecting to OpenAIS
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once:
Creating connection to our AIS plugin
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once:
AIS connection established
Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x878020
for crmd/6767
Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Sending membership update
13036 to crmd
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: get_ais_nodeid: Server
details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node
opsview-core02-tn now has id: 207923392
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node
207923392 is now known as opsview-core02-tn
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_ha_control: Connected
to the cluster
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: Delaying
start, CCM (0000000000100000) not connected
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting
crmd's mainloop
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback:
Checking for expired actions every 900000ms
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback:
Sending expected-votes=2 to corosync
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: notice: ais_dispatch:
Membership 13036: quorum acquired
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node
opsview-core01-tn now has id: 191146176
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node
191146176 is now known as opsview-core01-tn
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node opsview-core01-tn: id=191146176 state=member (new)
addr=r(0) ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 born=13028 seen=13036 proc=00000000000000000000000000013312
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member (new)
addr=r(0) ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 (new) born=13036 seen=13036
proc=00000000000000000000000000013312 (new)
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: The local CRM
is operational
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_state_transition: State transition S_STARTING -> S_PENDING [
input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec 24 11:06:15 opsview-core02-tn pengine: [6766]: info: main: Starting pengine
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: ais_dispatch: Membership
13036: quorum retained
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_dc: Set DC to
opsview-core01-tn (3.0.1)
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_attrd: Connecting
to attrd...
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC
cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for terminate
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for shutdown
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_local_callback:
Sending full refresh (origin=crmd)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending
flush op to all hosts for: terminate (<null>)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending
flush op to all hosts for: shutdown (<null>)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: erase_xpath_callback: Deletion of
"//node_sta...@uname='opsview-core02-tn']/transient_attributes": ok (rc=0)
Dec 24 11:06:15 corosync [TOTEM ] ring 0 active with no faults
Dec 24 11:06:15 corosync [TOTEM ] ring 1 active with no faults
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node
opsview-core01-tn now has id: 191146176
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node
191146176 is now known as opsview-core01-tn
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for master-drbd_data:0
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:0=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for probe_complete
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation probe_complete=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=9:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:2: probe
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=10:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=ServerFS_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ServerFS:3: probe
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=11:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=ClusterIP01_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ClusterIP01:4: probe
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=12:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=opsview-core_lsb_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-core_lsb:5:
probe
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=13:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=opsview-web_lsb_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=14:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=WebSite_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for master-drbd_data:1
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:1=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation ClusterIP01_monitor_0 (call=4, rc=7,
cib-update=7, confirmed=true) not running
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation ServerFS_monitor_0 (call=3, rc=7,
cib-update=8, confirmed=true) not running
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:0
(1000)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:0=1000: cib not
connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_monitor_0 (call=2, rc=0,
cib-update=9, confirmed=true) ok
Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-web_lsb:6:
probe
Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:WebSite:7: probe
Dec 24 11:06:16 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation WebSite_monitor_0 (call=7, rc=7,
cib-update=10, confirmed=true) not running
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Connected
to the CIB after 1 signon attempts
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Sending
full refresh
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:0
(1000)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_perform_update:
Sent update 4: master-drbd_data:0=1000
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete
(<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:1
(<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending
flush op to all hosts for: terminate (<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending
flush op to all hosts for: shutdown (<null>)
Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-core_lsb:probe:stderr) su: warning: cannot change
directory to /var/log/nagios: No such file or directory
Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-core_lsb:probe:stderr) /etc/init.d/opsview: line 262:
/usr/local/nagios/bin/profile: No such file or directory
Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-web_lsb:probe:stderr) su: warning: cannot change
directory to /var/log/nagios: No such file or directory
Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-web_lsb:probe:stderr) /etc/init.d/opsview-web: line 171:
/usr/local/nagios/bin/opsview.sh: No such file or directory
Dec 24 11:06:27 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation opsview-core_lsb_monitor_0 (call=5, rc=7,
cib-update=11, confirmed=true) not running
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation opsview-web_lsb_monitor_0 (call=6, rc=7,
cib-update=12, confirmed=true) not running
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update:
Sent update 15: probe_complete=true
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=61:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_notify_0 )
Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:8: notify
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_notify_0 (call=8, rc=0,
cib-update=13, confirmed=true) ok
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=13:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_stop_0 )
Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:9: stop
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_stop_0 (call=9, rc=6,
cib-update=14, confirmed=true) not configured
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch:
Update relayed from opsview-core01-tn
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for fail-count-drbd_data:0
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for:
fail-count-drbd_data:0 (INFINITY)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update:
Sent update 18: fail-count-drbd_data:0=INFINITY
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch:
Update relayed from opsview-core01-tn
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry:
Creating hash entry for last-failure-drbd_data:0
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for:
last-failure-drbd_data:0 (1293185188)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update:
Sent update 21: last-failure-drbd_data:0=1293185188
****************************************
Now all the services are DOWN.
At this point my only way to recover is to reboot cluster02; after corosync
starts again it does NOT try to take over the services.
The fence constraint is still there!
Now DRBD is in this state:
Master/Slave Set: ServerData
Masters: [ opsview-core01-tn ]
Stopped: [ drbd_data:1 ]
because of the fence constraint.
If I try 'drbdadm -- --discard-my-data connect all' on cluster02 I obtain:
[r...@core02-tn ~]# drbdadm -- --discard-my-data connect all
Could not stat("/proc/drbd"): No such file or directory
do you need to load the module?
try: modprobe drbd
Command 'drbdsetup 1 net 192.168.100.12:7789 192.168.100.11:7789 C --set-defaults --create-device --rr-conflict=disconnect
--after-sb-2pri=disconnect --after-sb-1pri=disconnect --after-sb-0pri=disconnect --discard-my-data' terminated with exit code 20
drbdadm connect cluster_data: exited with code 20
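For the record, the "Could not stat /proc/drbd" part just means the drbd kernel module is not loaded on cluster02 after the reboot, so the discard-my-data attempt never reaches the peer. A minimal recovery sketch, assuming the module only needs loading and using the resource name from drbd.conf below:

```shell
#!/bin/sh
# Sketch: resync the split-brain victim. Run on cluster02 only.
res=cluster_data    # resource name from /etc/drbd.conf

# /proc/drbd missing => module not loaded
if [ ! -e /proc/drbd ]; then
    modprobe drbd || echo "modprobe drbd failed; is the kmod installed?" >&2
fi

if [ -e /proc/drbd ]; then
    drbdadm up "$res"                              # attach disk + start networking
    drbdadm -- --discard-my-data connect "$res"    # drop local changes, resync from peer
fi
```

This only removes the "do you need to load the module?" error; the fence constraint in the CIB still has to go away before Pacemaker will promote anything.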
I have to remove the entry manually:
location drbd-fence-by-handler-ServerData ServerData \
rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf:
#uname ne opsview-core01-tn
because I have no idea HOW to unfence the cluster so that the line above gets
removed automatically.
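Normally the after-resync-target handler in drbd.conf (crm-unfence-peer.sh) removes that constraint once a resync completes; since the node never resyncs here, it stays. A sketch of the manual removal, assuming the crm shell is available and using the constraint id shown in the config dump:

```shell
#!/bin/sh
# Manual unfence sketch: delete the location constraint that
# crm-fence-peer.sh added. The id comes from the pasted crm config.
constraint=drbd-fence-by-handler-ServerData

if command -v crm >/dev/null 2>&1; then
    crm configure delete "$constraint"
else
    echo "crm shell not available; run this on a cluster node" >&2
fi
```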
After removing the line, cluster02 reconnects to DRBD:
Master/Slave Set: ServerData
Masters: [ opsview-core01-tn ]
Slaves: [ opsview-core02-tn ]
While writing this I tested the inverse situation, which only half works: if
cluster02 is master and I disconnect eth1, the fence entry is added to the CRM,
but cluster01 does *NOT* crash. So to get back to a standard situation I start
by removing "location drbd-fence-by-handler-ServerData...". However, on
removing the entry, cluster01 hits the same error and corosync dies:
********* cluster01 logs **********
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: info: update_dc: Unset DC
opsview-core01-tn
Dec 24 12:01:31 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: info: cib_process_request: Operation complete: op cib_modify for section nodes
(origin=local/crmd/165, version=0.491.1): ok (rc=0)
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: cib_ais_destroy: AIS
connection terminated
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: crm_ais_destroy: AIS
connection terminated
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error:
Resource temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: AIS
connection failed
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: AIS connection
terminated
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-23.raw
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Wrote version 0.491.0 of the CIB to disk (digest:
ad222fed7ff40dc7093ffc6411079df4)
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.R3dVbk (digest: /var/lib/heartbeat/crm/cib.EllYEu)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_text: Sending message 44: FAILED (rc=2): Library error:
Connection timed out (110)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC
Channel to 22670 is not connected
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op:
Sending message to CIB service FAILED
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update:
Sent update -5: probe_complete=true
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: attrd_cib_callback:
Update -5 for probe_complete=true failed: send failed
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_message: Not
connected to AIS
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update: Sending flush op to all hosts for:
master-drbd_data:1 (<null>)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC
Channel to 22670 is not connected
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op:
Sending message to CIB service FAILED
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update: Delete operation failed: node=opsview-core01-tn,
attr=master-drbd_data:1, id=<n/a>, set=(null), section=status: send failed (-5)
***********************
So, the questions:
What's wrong? It seems everything starts when corosync on the secondary node
crashes (or stops) after I disconnect the cable (because of the "Library
error"?!?!?).
If I solve the crash issue, how (and when) should the unfence be executed?
Shouldn't it happen automatically?
Do I always have to remove the entry (location ...) from the CRM manually?
Sorry for the long mail and thanks for the support!
Simon
Config files:
*************************************
cat /etc/corosync/corosync.conf
compatibility: whitetank
totem {
version: 2
# How long before declaring a token lost (ms)
token: 2000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 10
# How long to wait for join messages in the membership protocol (ms)
join: 200
# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
consensus: 1000
vsftype: none
# Number of messages that may be sent by one processor on receipt of the token
max_messages: 20
send_join: 0
# Limit generated nodeids to 31-bits (positive signed integers)
clear_node_high_bit: yes
secauth: off
threads: 0
rrp_mode: active
interface {
ringnumber: 0
bindnetaddr: 192.168.100.0
mcastaddr: 226.100.1.1
mcastport: 4000
}
interface {
ringnumber: 1
bindnetaddr: 172.18.17.0
#broadcast: yes
mcastaddr: 227.100.1.2
mcastport: 4001
}
}
logging {
fileline: off
to_stderr: no
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
amf {
mode: disabled
}
aisexec {
user: root
group: root
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 0
}
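One thing worth checking with this config: rrp_mode is "active" and there are two rings, so unplugging eth1 should only mark ring 0 (192.168.100.0) as faulty while ring 1 (172.18.17.0) keeps the membership alive — corosync should not die. A quick sanity-check sketch, run on each node before and after pulling the cable:

```shell
#!/bin/sh
# Sketch: verify redundant-ring failover before testing fencing.
rings=2   # ringnumber 0 and ringnumber 1 from the config above

if command -v corosync-cfgtool >/dev/null 2>&1; then
    corosync-cfgtool -s   # print the status of both rings
    # once the cable is back, clear the faulty flag:
    # corosync-cfgtool -r
else
    echo "corosync-cfgtool not found; run on a cluster node" >&2
fi
```

If `-s` shows ring 1 as faulty too (or corosync exits) with the crossover cable pulled, the problem is in the redundant-ring setup rather than in the DRBD fencing.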
*************************************
cat /etc/drbd.conf
global {
usage-count no;
}
common {
protocol C;
syncer {
rate 70M;
verify-alg sha1;
}
net {
after-sb-0pri disconnect;
after-sb-1pri disconnect;
after-sb-2pri disconnect;
rr-conflict disconnect;
}
handlers {
pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
startup {
degr-wfc-timeout 120; # 2 minutes.
}
disk {
fencing resource-only;
on-io-error call-local-io-error;
}
}
resource cluster_data {
device /dev/drbd1;
disk /dev/sda4;
meta-disk internal;
on opsview-core01-tn {
address 192.168.100.11:7789;
}
on opsview-core02-tn {
address 192.168.100.12:7789;
}
}
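Given the handlers above, the unfence (crm-unfence-peer.sh) only fires after a successful resync, i.e. once the connection state is back to Connected. A small sketch to inspect that state on each node, using the resource name from this file:

```shell
#!/bin/sh
# Sketch: inspect the DRBD resource state; the unfence handler only
# runs after a resync completes (cstate back to Connected).
res=cluster_data

if [ -e /proc/drbd ]; then
    drbdadm cstate "$res"   # e.g. Connected / StandAlone / WFConnection
    drbdadm role "$res"     # e.g. Primary/Secondary
else
    echo "drbd module not loaded" >&2
fi
```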
*************************************
crm configure show
node opsview-core01-tn \
attributes standby="off"
node opsview-core02-tn \
attributes standby="off"
primitive ClusterIP01 ocf:heartbeat:IPaddr2 \
params ip="172.18.17.10" cidr_netmask="32" \
op monitor interval="30"
primitive ServerFS ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/data" fstype="ext3"
primitive WebSite ocf:heartbeat:apache \
params configfile="/etc/httpd/conf/httpd.conf" \
op monitor interval="1min" \
meta target-role="Started"
primitive drbd_data ocf:linbit:drbd \
params drbd_resource="cluster_data" \
op monitor interval="60s"
primitive opsview-core_lsb lsb:opsview \
op start interval="0" timeout="350s" \
op stop interval="0" timeout="350s" \
op monitor interval="60s" timeout="350s"
primitive opsview-web_lsb lsb:opsview-web \
op start interval="0" timeout="350s" start-delay="15s" \
op stop interval="0" timeout="350s" \
op monitor interval="60s" timeout="350s" \
meta target-role="Started"
group OPSView-Apps ServerFS ClusterIP01 opsview-core_lsb opsview-web_lsb WebSite \
meta target-role="Started"
ms ServerData drbd_data \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
colocation fs_on_drbd inf: OPSView-Apps ServerData:Master
order ServerFS-after-ServerData inf: ServerData:promote OPSView-Apps:start
property $id="cib-bootstrap-options" \
dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker