Thank you, Vladislav. I have configured resource-level fencing on DRBD and removed wfc-timeout and degr-wfc-timeout (is removing these required?). My DRBD configuration is now:
resource pg {
    device /dev/drbd0;
    disk /dev/vdb;
    meta-disk internal;
    disk {
        fencing resource-only;
        on-io-error detach;
        resync-rate 40M;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm";
    }
    on node01 {
        address 10.2.136.52:7789;
    }
    on node02 {
        address 10.2.136.55:7789;
    }
    net {
        verify-alg md5;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}

Failover works in my initial test (restarting both nodes alternately - this always works). I will wait a couple of hours and then run the failover test again (which always failed on my previous setup).

Thank you!
Kiam

On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> 11.09.2014 05:57, Norbert Kiam Maclang wrote:
> > Is this something to do with quorum? But I already set
>
> You'd need to configure fencing at the drbd resources level.
> http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
>
> > property no-quorum-policy="ignore" \
> >     expected-quorum-votes="1"
> >
> > Thanks in advance,
> > Kiam
> >
> > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang
> > <norbert.kiam.macl...@gmail.com> wrote:
> >
> > Hi,
> >
> > Please help me understand what is causing the problem. I have a 2-node
> > cluster running on VMs using KVM. Each VM (I am using Ubuntu 14.04) runs
> > on a separate hypervisor on separate machines. All worked well during
> > testing (I restarted the VMs alternately), but after a day, when I kill
> > the other node, corosync and pacemaker always hang on the surviving node.
> > Date and time on the VMs are in sync, I use unicast, tcpdump shows both
> > nodes exchanging traffic, DRBD is confirmed healthy, and crm_mon shows a
> > good status before I kill the other node.
> > Below are my configurations and the versions I used:
> >
> > corosync 2.3.3-1ubuntu1
> > crmsh 1.2.5+hg1034-1ubuntu3
> > drbd8-utils 2:8.4.4-1ubuntu1
> > libcorosync-common4 2.3.3-1ubuntu1
> > libcrmcluster4 1.1.10+git20130802-1ubuntu2
> > libcrmcommon3 1.1.10+git20130802-1ubuntu2
> > libcrmservice1 1.1.10+git20130802-1ubuntu2
> > pacemaker 1.1.10+git20130802-1ubuntu2
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
> > postgresql-9.3 9.3.5-0ubuntu0.14.04.1
> >
> > # /etc/corosync/corosync.conf:
> > totem {
> >     version: 2
> >     token: 3000
> >     token_retransmits_before_loss_const: 10
> >     join: 60
> >     consensus: 3600
> >     vsftype: none
> >     max_messages: 20
> >     clear_node_high_bit: yes
> >     secauth: off
> >     threads: 0
> >     rrp_mode: none
> >     interface {
> >         member {
> >             memberaddr: 10.2.136.56
> >         }
> >         member {
> >             memberaddr: 10.2.136.57
> >         }
> >         ringnumber: 0
> >         bindnetaddr: 10.2.136.0
> >         mcastport: 5405
> >     }
> >     transport: udpu
> > }
> > amf {
> >     mode: disabled
> > }
> > quorum {
> >     provider: corosync_votequorum
> >     expected_votes: 1
> > }
> > aisexec {
> >     user: root
> >     group: root
> > }
> > logging {
> >     fileline: off
> >     to_stderr: yes
> >     to_logfile: no
> >     to_syslog: yes
> >     syslog_facility: daemon
> >     debug: off
> >     timestamp: on
> >     logger_subsys {
> >         subsys: AMF
> >         debug: off
> >         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> >     }
> > }
> >
> > # /etc/corosync/service.d/pcmk:
> > service {
> >     name: pacemaker
> >     ver: 1
> > }
> >
> > # /etc/drbd.d/global_common.conf:
> > global {
> >     usage-count no;
> > }
> >
> > common {
> >     net {
> >         protocol C;
> >     }
> > }
> >
> > # /etc/drbd.d/pg.res:
> > resource pg {
> >     device /dev/drbd0;
> >     disk /dev/vdb;
> >     meta-disk internal;
> >     startup {
> >         wfc-timeout 15;
> >         degr-wfc-timeout 60;
> >     }
> >     disk {
> >         on-io-error detach;
> >         resync-rate 40M;
> >     }
> >     on node01 {
> >         address 10.2.136.56:7789;
> >     }
> >     on node02 {
> >         address 10.2.136.57:7789;
> >     }
> >     net {
> >         verify-alg md5;
> >         after-sb-0pri discard-zero-changes;
> >         after-sb-1pri discard-secondary;
> >         after-sb-2pri disconnect;
> >     }
> > }
> >
> > # Pacemaker configuration:
> > node $id="167938104" node01
> > node $id="167938105" node02
> > primitive drbd_pg ocf:linbit:drbd \
> >     params drbd_resource="pg" \
> >     op monitor interval="29s" role="Master" \
> >     op monitor interval="31s" role="Slave"
> > primitive fs_pg ocf:heartbeat:Filesystem \
> >     params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
> > primitive ip_pg ocf:heartbeat:IPaddr2 \
> >     params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
> > primitive lsb_pg lsb:postgresql
> > group PGServer fs_pg lsb_pg ip_pg
> > ms ms_drbd_pg drbd_pg \
> >     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
> > order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.1.10-42f2063" \
> >     cluster-infrastructure="corosync" \
> >     stonith-enabled="false" \
> >     no-quorum-policy="ignore"
> > rsc_defaults $id="rsc-options" \
> >     resource-stickiness="100"
> >
> > # Logs on node01
> > Sep 10 10:25:33 node01 crmd[1019]: notice: peer_update_callback: Our peer on the DC is dead
> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> > Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new membership (10.2.136.56:52) was formed. Members left: 167938105
> > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.
> > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender terminated
> > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg
> > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection closed
> > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected )
> > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver terminated
> > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting receiver thread
> > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver (re)started
> > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection )
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
> > Sep 10 10:26:12 node01 crmd[1019]: error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
> > Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired
> > Sep 10 10:27:03 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out
> > Sep 10 10:27:03 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms
> > Sep 10 10:27:54 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out
> > Sep 10 10:27:54 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8938 - timed out after 20000ms
> > Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node02=none
> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node01=welcomed
> > Sep 10 10:28:45 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
> > Sep 10 10:28:45 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
> > Sep 10 10:29:36 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
> > Sep 10 10:29:36 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
> > Sep 10 10:30:27 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
> > Sep 10 10:30:27 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
> > Sep 10 10:31:18 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
> > Sep 10 10:31:18 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
> > Sep 10 10:32:09 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
> > Sep 10 10:32:09 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
> >
> > # crm_mon on node01 before I kill the other vm:
> > Stack: corosync
> > Current DC: node02 (167938104) - partition with quorum
> > Version: 1.1.10-42f2063
> > 2 Nodes configured
> > 5 Resources configured
> >
> > Online: [ node01 node02 ]
> >
> > Resource Group: PGServer
> >     fs_pg   (ocf::heartbeat:Filesystem):    Started node02
> >     lsb_pg  (lsb:postgresql):               Started node02
> >     ip_pg   (ocf::heartbeat:IPaddr2):       Started node02
> > Master/Slave Set: ms_drbd_pg [drbd_pg]
> >     Masters: [ node02 ]
> >     Slaves: [ node01 ]
> >
> > Thank you,
> > Kiam
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
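P.S. For anyone following along: with `fencing resource-only`, crm-fence-peer.sh reacts to a lost peer by injecting a location constraint into the CIB that forbids promoting DRBD anywhere the data might be outdated, and crm-unfence-peer.sh removes it after resync. For this pg/ms_drbd_pg setup the injected constraint should look roughly like the sketch below (the id follows the handler's naming scheme; the node name depends on which peer was fenced):

```
location drbd-fence-by-handler-pg-ms_drbd_pg ms_drbd_pg \
    rule $role="Master" -inf: #uname ne node01
```

If a later failover refuses to promote DRBD, checking `crm configure show` for a leftover drbd-fence-by-handler constraint is a good first step.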
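A note on the logs: the repeated `drbd_pg_monitor_31000 ... timed out after 20000ms` entries show the monitor operations hitting Pacemaker's 20s default operation timeout while the node is busy with the membership change. Giving the DRBD monitor ops an explicit, more generous timeout is a common mitigation; a sketch (the 60s value is illustrative, not from this thread):

```
primitive drbd_pg ocf:linbit:drbd \
    params drbd_resource="pg" \
    op monitor interval="29s" role="Master" timeout="60s" \
    op monitor interval="31s" role="Slave" timeout="60s"
```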
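On the quorum side: rather than forcing `expected_votes: 1` (which makes each node quorate on its own, even when partitioned), corosync 2.x votequorum has a dedicated two-node mode. A sketch of the quorum section under that assumption (`two_node: 1` implies `wait_for_all`, so both nodes must be seen once at startup):

```
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
```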