Hello,

I know my mail is really long; by the way, could someone help me at least with the error '[22670]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11) Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)' and point me to the right place to understand how the unfence procedure should work (i.e. automatically)? For now I have to remove the 'location' directive manually every time.
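
Just for context, what I do by hand today is to delete the constraint by its id once the DRBD resync has finished (my current workaround, which I'd like to get rid of):

crm configure delete drbd-fence-by-handler-ServerData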

Thanks a lot!

Simon

On 24/12/2010 12:05, Simone Felici wrote:

Hi to all!

I have an issue with my cluster environment. First of all, my config:

Two CentOS 5.5 nodes in an active/standby cluster, with one DRBD partition; the cluster manages a Nagios service, an IP and the storage.
The config files are at the bottom.

I'm trying to test the fencing option to prevent split brain and double access to the DRBD partition.
Starting from a sane situation, manually switching the resources or simulating a kernel panic, a process crash or whatever, everything works well. If I shut down eth1 (the 192.168.100.0 network, i.e. the crossover cable used for DRBD mirroring), the active node stays as it is; it calls the fence handler, which adds this entry to the crm config:
location drbd-fence-by-handler-ServerData ServerData \
rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf: #uname ne opsview-core01-tn
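
For completeness, this is just how I look at the constraint afterwards (nothing fancy, assuming I'm using the tools as intended):

crm configure show drbd-fence-by-handler-ServerData
cibadmin -Q -o constraints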

But on the standby node corosync dies:

*** STANDBY NODE LOG ***
Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14158 
iface 192.168.100.12 to [1 of 10]
Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14160 
iface 192.168.100.12 to [2 of 10]
Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14162 
iface 192.168.100.12 to [3 of 10]
Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14164 
iface 192.168.100.12 to [4 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Decrementing problem counter for iface 
192.168.100.12 to [3 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14166 
iface 192.168.100.12 to [4 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14168 
iface 192.168.100.12 to [5 of 10]
Dec 24 11:00:07 corosync [TOTEM ] Incrementing problem counter for seqid 14170 
iface 192.168.100.12 to [6 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14172 
iface 192.168.100.12 to [7 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Decrementing problem counter for iface 
192.168.100.12 to [6 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14174 
iface 192.168.100.12 to [7 of 10]
Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14176 
iface 192.168.100.12 to [8 of 10]
Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14178 
iface 192.168.100.12 to [9 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Decrementing problem counter for iface 
192.168.100.12 to [8 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14180 
iface 192.168.100.12 to [9 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14182 
iface 192.168.100.12 to [10 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Marking seqid 14182 ringid 0 interface 
192.168.100.12 FAULTY - adminisrtative intervention
required.
Dec 24 11:00:11 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: 
Receiving message body failed: (2) Library error: No such
file or directory (2)
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: AIS 
connection failed
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: Receiving 
message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: AIS connection 
terminated
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: AIS 
connection failed
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: crm_ais_destroy: AIS 
connection terminated
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: Receiving 
message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: Receiving 
message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: AIS 
connection failed
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: AIS 
connection failed
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: cib_ais_destroy: AIS 
connection terminated
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: CRIT: attrd_ais_destroy: Lost 
connection to OpenAIS service!
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: info: main: Exiting...
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: 
attrd_cib_connection_destroy: Connection to the CIB terminated...
*** STANDBY NODE LOG ***

The issues don't end there.
If I bring eth1 back up, start corosync again and check that both rings are online (corosync-cfgtool -r), the standby node tries to take over the services even though resource-stickiness is set. It goes into error, maybe because of the fence script.
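
For reference, the exact recovery sequence I use after re-plugging the cable is more or less the following (assuming I understand the tools correctly):

ifup eth1                 # bring the replication link back up
service corosync start    # on the node where corosync died
corosync-cfgtool -s       # show the status of both rings
corosync-cfgtool -r       # re-enable rings marked FAULTY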

crm status:
============
Last updated: Fri Dec 24 11:06:40 2010
Stack: openais
Current DC: opsview-core01-tn - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ opsview-core01-tn opsview-core02-tn ]

Master/Slave Set: ServerData
drbd_data:0 (ocf::linbit:drbd): Slave opsview-core02-tn (unmanaged) FAILED
Stopped: [ drbd_data:1 ]

Failed actions:
drbd_data:0_stop_0 (node=opsview-core02-tn, call=9, rc=6, status=complete): not 
configured

LOGS on slave:
****************************************
Dec 24 11:06:13 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started and 
ready to provide service.
Dec 24 11:06:13 corosync [MAIN ] Corosync built-in features: nss rdma
Dec 24 11:06:13 corosync [MAIN ] Successfully read main configuration file 
'/etc/corosync/corosync.conf'.
Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security: 
libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security: 
libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 24 11:06:13 corosync [TOTEM ] The network interface [192.168.100.12] is now 
up.
Dec 24 11:06:13 corosync [pcmk ] info: process_ais_conf: Reading configure
Set r/w permissions for uid=0, gid=0 on /var/log/cluster/corosync.log
Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle: 
4730966301143465986 for logging
Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional 
logging options...
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'off' for option: 
debug
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: 
to_logfile
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 
'/var/log/cluster/corosync.log' for option: logfile
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: 
to_syslog
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' 
for option: syslog_facility
Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle: 
7739444317642555395 for service
Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional 
service options...
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for 
option: clustername
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for 
option: use_logd
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for 
option: use_mgmtd
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Dec 24 11:06:13 corosync [pcmk ] Logging: Initialized pcmk_startup
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 
18446744073709551615
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Service: 9
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Local hostname: 
opsview-core02-tn
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_update_nodeid: Local node id: 
207923392
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node 
207923392 born on 0
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node 
207923392 now known as opsview-core02-tn (was: (null))
Dec 24 11:06:13 opsview-core02-tn lrmd: [5153]: info: lrmd is shutting down
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: 
G_main_add_SignalHandler: Added signal handler for signal 10
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: Invoked: 
/usr/lib64/heartbeat/attrd
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Signal sent to pid=5153, 
waiting for process to exit
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn 
now has 1 quorum votes (was 0)
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: 
G_main_add_SignalHandler: Added signal handler for signal 12
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting up
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: 
Added signal handler for signal 15
Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: info: Invoked: 
/usr/lib64/heartbeat/pengine
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node 
207923392/opsview-core02-tn is now: member
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: Invoked: 
/usr/lib64/heartbeat/cib
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_cluster_connect: 
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_cluster_connect: 
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: Invoked: 
/usr/lib64/heartbeat/crmd
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6762 for 
process stonithd
Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: WARN: main: Terminating 
previous PE instance
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_TriggerHandler: 
Added signal manual handler
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: 
init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: 
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: 
init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: main: CRM Hg Version: 
da7075976b5ff0bee71074385f8fd02f296ec8a3

Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6763 for 
process cib
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_SignalHandler: 
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn pengine: [5155]: WARN: process_pe_message: 
Received quit message, terminating
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: enabling coredumps
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6764 for 
process lrmd
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting crmd
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: retrieveCib: Reading 
cluster configuration from:
/var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: 
Added signal handler for signal 10
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6765 for 
process attrd
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: G_main_add_SignalHandler: 
Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: 
Added signal handler for signal 12
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6766 for 
process pengine
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Started.
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6767 for 
process crmd
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: Pacemaker Cluster 
Manager 1.0.9
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync extended 
virtual synchrony service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync configuration 
service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster closed 
process group service v1.01
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster config 
database access v1.01
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync profile 
loading service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster quorum 
service v0.1
Dec 24 11:06:13 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 
and V2 of the synchronization engine.
Dec 24 11:06:13 corosync [TOTEM ] The network interface [172.18.17.12] is now 
up.
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: 
init_ais_connection_once: AIS connection established
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: 
init_ais_connection_once: AIS connection established
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x868c90 
for attrd/6765
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: get_ais_nodeid: Server 
details: id=207923392 uname=opsview-core02-tn
cname=pcmk
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x86d0a0 
for stonithd/6762
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node 
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node 
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Cluster connection 
active
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: get_ais_nodeid: 
Server details: id=207923392 uname=opsview-core02-tn
cname=pcmk
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node 
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Accepting 
attribute updates
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node 
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting 
mainloop...
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: notice: 
/usr/lib64/heartbeat/stonithd start up successfully.
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: 
G_main_add_SignalHandler: Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: startCib: CIB 
Initialization completed successfully
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_cluster_connect: 
Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once: 
Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once: 
AIS connection established
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x872fa0 
for cib/6763
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn 
now has process list:
00000000000000000000000000013312 (78610)
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Sending membership update 0 to 
cib
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: get_ais_nodeid: Server 
details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node 
opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node 
207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_init: Starting cib 
mainloop
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership 
0: quorum still lost
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node 
opsview-core02-tn: id=207923392 state=member (new)
addr=(null) votes=1 (new) born=0 seen=0 proc=00000000000000000000000000013312 
(new)
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: 
Archived previous version as
/var/lib/heartbeat/crm/cib-26.raw
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Wrote 
version 0.473.0 of the CIB to disk (digest:
3c7be90920e86222ad6102a0f01d9efd)
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: retrieveCib: Reading 
cluster configuration from:
/var/lib/heartbeat/crm/cib.UxVZY6 (digest: /var/lib/heartbeat/crm/cib.76RIND)
Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 1 
iface 172.18.17.12 to [1 of 10]
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional 
membership event on ring 13032: memb=0, new=0, lost=0
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership 
event on ring 13032: memb=1, new=1, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW: opsview-core02-tn 
207923392
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: 
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and 
a new membership was formed.
Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to 
provide service.
Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 2 
iface 192.168.100.12 to [1 of 10]
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional 
membership event on ring 13036: memb=1, new=0, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: memb: 
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership 
event on ring 13036: memb=2, new=1, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node 
191146176 born on 13036
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node 191146176/unknown is 
now: member
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 
191146176
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 
191146176
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: 
opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending 
membership update 13036 to 1 children
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node 
207923392 ((null)) born on: 13036
Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and 
a new membership was formed.
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership 
13036: quorum still lost
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node <null> 
now has id: 191146176
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node 
(null): id=191146176 state=member (new) addr=r(0)
ip(192.168.100.11) r(1) ip(172.18.17.11) votes=0 born=0 seen=13036 
proc=00000000000000000000000000000000
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node 
opsview-core02-tn: id=207923392 state=member addr=r(0)
ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 born=0 seen=13036 
proc=00000000000000000000000000013312
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176 
(opsview-core01-tn) born on: 13028
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: notice: ais_dispatch: Membership 
13036: quorum acquired
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176 
now known as opsview-core01-tn (was: (null))
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_get_peer: Node 
191146176 is now known as opsview-core01-tn
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn 
now has process list:
00000000000000000000000000013312 (78610)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node 
opsview-core01-tn: id=191146176 state=member addr=r(0)
ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 (new) born=13028 seen=13036 
proc=00000000000000000000000000013312 (new)
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn 
now has 1 quorum votes (was 0)
Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending 
membership update 13036 to 1 children
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to 
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to 
provide service.
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_process_diff: Diff 
0.475.1 -> 0.475.2 not applied to 0.473.0: current
"epoch" is less than required
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_server_process_diff: 
Requesting re-sync from peer
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_diff_notify: 
Local-only Change (client:crmd, call: 105): -1.-1.-1
(Application of an update diff failed, requesting a full refresh)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not 
applying diff 0.475.2 -> 0.475.3 (sync in progress)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not 
applying diff 0.475.3 -> 0.475.4 (sync in progress)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not 
applying diff 0.475.4 -> 0.476.1 (sync in progress)
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to 
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_replace_notify: 
Local-only Replace: -1.-1.-1 from opsview-core01-tn
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: 
Archived previous version as
/var/lib/heartbeat/crm/cib-27.raw
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Wrote 
version 0.476.0 of the CIB to disk (digest:
c348ac643cfe3b370e5eca03ff7f180c)
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: retrieveCib: Reading 
cluster configuration from:
/var/lib/heartbeat/crm/cib.FYgzJ8 (digest: /var/lib/heartbeat/crm/cib.VrDRiH)
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to 
local.crmd failed: unknown (rc=-2)
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_cib_control: CIB 
connection established
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_cluster_connect: 
Connecting to OpenAIS
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once: 
Creating connection to our AIS plugin
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once: 
AIS connection established
Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x878020 
for crmd/6767
Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Sending membership update 
13036 to crmd
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: get_ais_nodeid: Server 
details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node 
opsview-core02-tn now has id: 207923392
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node 
207923392 is now known as opsview-core02-tn
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_ha_control: Connected 
to the cluster
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: Delaying 
start, CCM (0000000000100000) not connected
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting 
crmd's mainloop
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback: 
Checking for expired actions every 900000ms
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback: 
Sending expected-votes=2 to corosync
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: notice: ais_dispatch: 
Membership 13036: quorum acquired
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node 
opsview-core01-tn now has id: 191146176
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node 
191146176 is now known as opsview-core01-tn
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node 
opsview-core01-tn: id=191146176 state=member (new)
addr=r(0) ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 born=13028 
seen=13036 proc=00000000000000000000000000013312
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node 
opsview-core02-tn: id=207923392 state=member (new)
addr=r(0) ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 (new) 
born=13036 seen=13036 proc=00000000000000000000000000013312
(new)
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: The local CRM 
is operational
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_state_transition: State 
transition S_STARTING -> S_PENDING [
input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec 24 11:06:15 opsview-core02-tn pengine: [6766]: info: main: Starting pengine
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: ais_dispatch: Membership 
13036: quorum retained
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_dc: Set DC to 
opsview-core01-tn (3.0.1)
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_attrd: Connecting 
to attrd...
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_state_transition: State 
transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC
cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: 
Creating hash entry for terminate
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: 
Creating hash entry for shutdown
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_local_callback: 
Sending full refresh (origin=crmd)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending 
flush op to all hosts for: terminate (<null>)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending 
flush op to all hosts for: shutdown (<null>)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: erase_xpath_callback: 
Deletion of
"//node_sta...@uname='opsview-core02-tn']/transient_attributes": ok (rc=0)
Dec 24 11:06:15 corosync [TOTEM ] ring 0 active with no faults
Dec 24 11:06:15 corosync [TOTEM ] ring 1 active with no faults
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node 
opsview-core01-tn now has id: 191146176
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node 
191146176 is now known as opsview-core01-tn
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: 
Creating hash entry for master-drbd_data:0
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation master-drbd_data:0=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: 
Creating hash entry for probe_complete
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation probe_complete=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing 
key=9:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:2: probe
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing 
key=10:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=ServerFS_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ServerFS:3: probe
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing 
key=11:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=ClusterIP01_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ClusterIP01:4: probe
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No 
lrm_rprovider field in message
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing 
key=12:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=opsview-core_lsb_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-core_lsb:5: 
probe
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No 
lrm_rprovider field in message
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing 
key=13:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=opsview-web_lsb_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing 
key=14:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=WebSite_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: 
Creating hash entry for master-drbd_data:1
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation master-drbd_data:1=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying 
operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM 
operation ClusterIP01_monitor_0 (call=4, rc=7,
cib-update=7, confirmed=true) not running
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM 
operation ServerFS_monitor_0 (call=3, rc=7,
cib-update=8, confirmed=true) not running
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: 
Sending flush op to all hosts for: master-drbd_data:0
(1000)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: 
Delaying operation master-drbd_data:0=1000: cib not
connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM 
operation drbd_data:0_monitor_0 (call=2, rc=0,
cib-update=9, confirmed=true) ok
Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-web_lsb:6: 
probe
Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:WebSite:7: probe
Dec 24 11:06:16 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM 
operation WebSite_monitor_0 (call=7, rc=7,
cib-update=10, confirmed=true) not running
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Connected 
to the CIB after 1 signon attempts
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Sending 
full refresh
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: 
Sending flush op to all hosts for: master-drbd_data:0
(1000)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: 
Sent update 4: master-drbd_data:0=1000
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete
(<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: 
Sending flush op to all hosts for: master-drbd_data:1
(<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending 
flush op to all hosts for: terminate (<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending 
flush op to all hosts for: shutdown (<null>)
Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: 
(opsview-core_lsb:probe:stderr) su: warning: cannot change
directory to /var/log/nagios: No such file or directory

Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: 
(opsview-core_lsb:probe:stderr) /etc/init.d/opsview: line 262:
/usr/local/nagios/bin/profile: No such file or directory

Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: 
(opsview-web_lsb:probe:stderr) su: warning: cannot change
directory to /var/log/nagios: No such file or directory

Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: 
(opsview-web_lsb:probe:stderr) /etc/init.d/opsview-web: line 171:
/usr/local/nagios/bin/opsview.sh: No such file or directory

Dec 24 11:06:27 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM 
operation opsview-core_lsb_monitor_0 (call=5, rc=7,
cib-update=11, confirmed=true) not running
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM 
operation opsview-web_lsb_monitor_0 (call=6, rc=7,
cib-update=12, confirmed=true) not running
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: 
Sent update 15: probe_complete=true
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing 
key=61:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_notify_0 )
Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:8: notify
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM 
operation drbd_data:0_notify_0 (call=8, rc=0,
cib-update=13, confirmed=true) ok
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing 
key=13:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_stop_0 )
Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:9: stop
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM 
operation drbd_data:0_stop_0 (call=9, rc=6,
cib-update=14, confirmed=true) not configured
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch: 
Update relayed from opsview-core01-tn
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry: 
Creating hash entry for fail-count-drbd_data:0
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: 
Sending flush op to all hosts for:
fail-count-drbd_data:0 (INFINITY)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: 
Sent update 18: fail-count-drbd_data:0=INFINITY
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch: 
Update relayed from opsview-core01-tn
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry: 
Creating hash entry for last-failure-drbd_data:0
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: 
Sending flush op to all hosts for:
last-failure-drbd_data:0 (1293185188)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: 
Sent update 21: last-failure-drbd_data:0=1293185188
****************************************

Now the services are all DOWN.
At this point my only way out is to reboot cluster02; after the reboot, when corosync starts it does NOT try to take over the services again.
The fence constraint is still there!
Now DRBD is in this state:
Master/Slave Set: ServerData
Masters: [ opsview-core01-tn ]
Stopped: [ drbd_data:1 ]
because of the fence constraint.
If I try 'drbdadm -- --discard-my-data connect all' on cluster02, I get:
[r...@core02-tn ~]# drbdadm -- --discard-my-data connect all
Could not stat("/proc/drbd"): No such file or directory
do you need to load the module?
try: modprobe drbd
Command 'drbdsetup 1 net 192.168.100.12:7789 192.168.100.11:7789 C 
--set-defaults --create-device --rr-conflict=disconnect
--after-sb-2pri=disconnect --after-sb-1pri=disconnect 
--after-sb-0pri=disconnect --discard-my-data' terminated with exit code 20
drbdadm connect cluster_data: exited with code 20
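
(For what it's worth, my understanding — untested here, since Pacemaker is supposed to manage the resource itself — is that a purely manual recovery on the node that must discard its data would look roughly like this:

modprobe drbd
drbdadm attach cluster_data                          # attach the backing device
drbdadm -- --discard-my-data connect cluster_data    # reconnect, throwing away local changes
)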

I have to remove the entry manually:

location drbd-fence-by-handler-ServerData ServerData \
rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf: #uname ne opsview-core01-tn

because I have no idea HOW to unfence the cluster so that the line above gets removed automatically.

After removing the line, cluster02 reconnects to DRBD:

Master/Slave Set: ServerData
Masters: [ opsview-core01-tn ]
Slaves: [ opsview-core02-tn ]


While writing this I also tested the inverse situation, and it only half works: if cluster02 is master and I disconnect eth1, the fence entry is added to the crm config, but cluster01 does *NOT* crash. I still have to remove the "location drbd-fence-by-handler-ServerData..." entry to get back to a normal situation. However, when I remove that entry, cluster01 hits the same error and corosync dies:

********* cluster01 logs **********
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: info: update_dc: Unset DC 
opsview-core01-tn
Dec 24 12:01:31 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: info: cib_process_request: 
Operation complete: op cib_modify for section nodes
(origin=local/crmd/165, version=0.491.1): ok (rc=0)
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: Receiving 
message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: Receiving 
message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: AIS 
connection failed
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: AIS 
connection failed
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: cib_ais_destroy: AIS 
connection terminated
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: crm_ais_destroy: AIS 
connection terminated
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: 
Receiving message body failed: (2) Library error:
Resource temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: AIS 
connection failed
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: AIS connection 
terminated
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: 
Archived previous version as
/var/lib/heartbeat/crm/cib-23.raw
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Wrote 
version 0.491.0 of the CIB to disk (digest:
ad222fed7ff40dc7093ffc6411079df4)
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: retrieveCib: Reading 
cluster configuration from:
/var/lib/heartbeat/crm/cib.R3dVbk (digest: /var/lib/heartbeat/crm/cib.EllYEu)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_text: Sending 
message 44: FAILED (rc=2): Library error:
Connection timed out (110)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete
(true)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC 
Channel to 22670 is not connected
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op: 
Sending message to CIB service FAILED
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update: 
Sent update -5: probe_complete=true
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: attrd_cib_callback: 
Update -5 for probe_complete=true failed: send failed
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_message: Not 
connected to AIS
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update: 
Sending flush op to all hosts for:
master-drbd_data:1 (<null>)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC 
Channel to 22670 is not connected
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op: 
Sending message to CIB service FAILED
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update: 
Delete operation failed: node=opsview-core01-tn,
attr=master-drbd_data:1, id=<n/a>, set=(null), section=status: send failed (-5)

***********************

So, the questions:

What's wrong? It seems everything starts when corosync on the secondary node crashes (or stops) after I disconnect the cable (because of the "Library error"?).

If I solve the crash issue, then how (and when) should the unfence operation be executed? Shouldn't it happen automatically?

Do I always have to remove the entry (location ...) manually from the crm config?

Sorry for the long mail and thanks for the support!


Simon

Config files:

*************************************
cat /etc/corosync/corosync.conf


compatibility: whitetank

totem {
    version: 2
    # How long before declaring a token lost (ms)
    token: 2000
    # How many token retransmits before forming a new configuration
    token_retransmits_before_loss_const: 10
    # How long to wait for join messages in the membership protocol (ms)
    join: 200
    # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
    consensus: 1000
    vsftype: none
    # Number of messages that may be sent by one processor on receipt of the token
    max_messages: 20
    send_join: 0
    # Limit generated nodeids to 31-bits (positive signed integers)
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: active
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.100.0
        mcastaddr: 226.100.1.1
        mcastport: 4000
    }
    interface {
        ringnumber: 1
        bindnetaddr: 172.18.17.0
        #broadcast: yes
        mcastaddr: 227.100.1.2
        mcastport: 4001
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

aisexec {
    user: root
    group: root
}

service {
    # Load the Pacemaker Cluster Resource Manager
    name: pacemaker
    ver: 0
}

*************************************
cat /etc/drbd.conf

global {
    usage-count no;
}

common {
    protocol C;

    syncer {
        rate 70M;
        verify-alg sha1;
    }

    net {
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
    }

    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    startup {
        degr-wfc-timeout 120; # 2 minutes.
    }

    disk {
        fencing resource-only;
        on-io-error call-local-io-error;
    }
}

resource cluster_data {
    device /dev/drbd1;
    disk /dev/sda4;
    meta-disk internal;

    on opsview-core01-tn {
        address 192.168.100.11:7789;
    }

    on opsview-core02-tn {
        address 192.168.100.12:7789;
    }
}
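
(My understanding of the two handlers above, in case I've got it wrong: crm-fence-peer.sh adds the location constraint shown earlier when replication to the peer is interrupted, and after-resync-target runs crm-unfence-peer.sh once the former peer has fully resynced, which should remove it again. Expressed as cibadmin calls it would be roughly the following — my own sketch, the real scripts may build the XML differently:

# what crm-fence-peer.sh adds when the peer becomes unreachable
# (constraint/rule ids match my cluster; the expression id is just illustrative)
cibadmin -C -o constraints -X \
  '<rsc_location rsc="ServerData" id="drbd-fence-by-handler-ServerData">
     <rule id="drbd-fence-by-handler-rule-ServerData" role="Master" score="-INFINITY">
       <expression id="drbd-fence-by-handler-expr-ServerData" attribute="#uname" operation="ne" value="opsview-core01-tn"/>
     </rule>
   </rsc_location>'

# what crm-unfence-peer.sh removes after the resync completes
cibadmin -D -X '<rsc_location id="drbd-fence-by-handler-ServerData"/>'
)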

*************************************

crm configure show
node opsview-core01-tn \
attributes standby="off"
node opsview-core02-tn \
attributes standby="off"
primitive ClusterIP01 ocf:heartbeat:IPaddr2 \
params ip="172.18.17.10" cidr_netmask="32" \
op monitor interval="30"
primitive ServerFS ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/data" fstype="ext3"
primitive WebSite ocf:heartbeat:apache \
params configfile="/etc/httpd/conf/httpd.conf" \
op monitor interval="1min" \
meta target-role="Started"
primitive drbd_data ocf:linbit:drbd \
params drbd_resource="cluster_data" \
op monitor interval="60s"
primitive opsview-core_lsb lsb:opsview \
op start interval="0" timeout="350s" \
op stop interval="0" timeout="350s" \
op monitor interval="60s" timeout="350s"
primitive opsview-web_lsb lsb:opsview-web \
op start interval="0" timeout="350s" start-delay="15s" \
op stop interval="0" timeout="350s" \
op monitor interval="60s" timeout="350s" \
meta target-role="Started"
group OPSView-Apps ServerFS ClusterIP01 opsview-core_lsb opsview-web_lsb WebSite \
meta target-role="Started"
ms ServerData drbd_data \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
colocation fs_on_drbd inf: OPSView-Apps ServerData:Master
order ServerFS-after-ServerData inf: ServerData:promote OPSView-Apps:start
property $id="cib-bootstrap-options" \
dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


--
Simone Felici
Technical Division: Design and Development

tel. +39 0461.030.111
fax. +39 0461 030.112
Via Fersina, 23 - 38123 Trento

-------------
MC-link S.p.A.
Head and Administrative Office
Via Carlo Perrier, 9/a - 00157 Roma
Registered Office
Via Fersina, 23 - 38123 Trento

http://www.mclink.it

