On 11 Sep 2014, at 6:51 pm, Norbert Kiam Maclang <norbert.kiam.macl...@gmail.com> wrote:
> Thank you for spending time looking at my problem, really appreciate it.
>
> Additional information on my problem:
>
> Before doing a restart on the primary node, tcpdump shows a good exchange:
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
> IP node02.5405 > node01.5405: UDP, length 74
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
> IP node02.5405 > node01.5405: UDP, length 74
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
>
> On the surviving node, pacemaker and corosync seem to be unresponsive.
> # /etc/init.d/pacemaker stop

Pacemaker can't stop until the resources stop. And they can't stop because you've not configured fencing (either at the cluster or resource level).

> Signaling Pacemaker Cluster Manager to terminate: [ OK ]
> Waiting for cluster services to unload:...................
> ..........................................................
> ..........................................................
> .........^C
>
> # logs:
> node01 pacemakerd[9486]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
> node01 pacemakerd[9486]: notice: stop_child: Stopping crmd: Sent -15 to process 9493
>
> # /etc/init.d/corosync stop
> * Stopping corosync daemon corosync
> ^C
>
> # service drbd stop
> * Stopping all DRBD resources [ OK ]
>
> I cannot reboot the vm unless I kill corosync and pacemaker.
>
> tcpdump on node01 and node02 after node02 came up again:
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
>
> HA works again after I reboot node01, and the exchange is good again:
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
> IP node02.5405 > node01.5405: UDP, length 74
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
> IP node02.5405 > node01.5405: UDP, length 74
>
> Is it required to do resource-level fencing, as Vladislav Bogdanov mentioned?
>
> You'd need to configure fencing at the drbd resources level.
>
> http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
>
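The link above boils down to a fencing policy plus the crm-fence-peer handlers on the DRBD side. A minimal sketch (assuming the helper scripts shipped with drbd-utils live in /usr/lib/drbd; adjust paths and scope to your own setup):

    # /etc/drbd.d/global_common.conf (sketch only)
    common {
        disk {
            # on loss of replication, call the fence-peer handler before
            # carrying on as a disconnected Primary
            fencing resource-only;
        }
        handlers {
            # adds a -INF location constraint so the outdated peer cannot be promoted
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # removes that constraint again once the peer has resynced
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }

This only protects the DRBD data from a stale promotion; it does not replace node-level STONITH.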
> Thanks,
> Kiam
>
> On Thu, Sep 11, 2014 at 12:23 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>
> On 11 Sep 2014, at 12:57 pm, Norbert Kiam Maclang <norbert.kiam.macl...@gmail.com> wrote:
>
> > Is this something to do with quorum? But I already set
> >
> > property no-quorum-policy="ignore" \
> >     expected-quorum-votes="1"
>
> No fencing wouldn't be helping.
> And it looks like drbd resources are hanging, not pacemaker/corosync.
>
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
> >
> > Thanks in advance,
> > Kiam
> >
> > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang <norbert.kiam.macl...@gmail.com> wrote:
> > Hi,
> >
> > Please help me understand what is causing the problem. I have a 2-node
> > cluster running on VMs using KVM. Each VM (I am using Ubuntu 14.04) runs on
> > a separate hypervisor on separate machines. Everything worked well during
> > testing (I restarted the VMs alternately), but after a day, when I kill the
> > other node, corosync and pacemaker always end up hanging on the surviving
> > node. Date and time on the VMs are in sync, I use unicast, tcpdump shows
> > both nodes exchanging packets, DRBD is confirmed healthy, and crm_mon shows
> > good status before I kill the other node. Below are the configurations and
> > versions I use:
> >
> > corosync 2.3.3-1ubuntu1
> > crmsh 1.2.5+hg1034-1ubuntu3
> > drbd8-utils 2:8.4.4-1ubuntu1
> > libcorosync-common4 2.3.3-1ubuntu1
> > libcrmcluster4 1.1.10+git20130802-1ubuntu2
> > libcrmcommon3 1.1.10+git20130802-1ubuntu2
> > libcrmservice1 1.1.10+git20130802-1ubuntu2
> > pacemaker 1.1.10+git20130802-1ubuntu2
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
> > postgresql-9.3 9.3.5-0ubuntu0.14.04.1
> >
> > # /etc/corosync/corosync.conf:
> > totem {
> >     version: 2
> >     token: 3000
> >     token_retransmits_before_loss_const: 10
> >     join: 60
> >     consensus: 3600
> >     vsftype: none
> >     max_messages: 20
> >     clear_node_high_bit: yes
> >     secauth: off
> >     threads: 0
> >     rrp_mode: none
> >     interface {
> >         member {
> >             memberaddr: 10.2.136.56
> >         }
> >         member {
> >             memberaddr: 10.2.136.57
> >         }
> >         ringnumber: 0
> >         bindnetaddr: 10.2.136.0
> >         mcastport: 5405
> >     }
> >     transport: udpu
> > }
> > amf {
> >     mode: disabled
> > }
> > quorum {
> >     provider: corosync_votequorum
> >     expected_votes: 1
> > }
> > aisexec {
> >     user: root
> >     group: root
> > }
> > logging {
> >     fileline: off
> >     to_stderr: yes
> >     to_logfile: no
> >     to_syslog: yes
> >     syslog_facility: daemon
> >     debug: off
> >     timestamp: on
> >     logger_subsys {
> >         subsys: AMF
> >         debug: off
> >         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> >     }
> > }
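A side note on the quorum block above: with corosync 2.x votequorum, a two-node cluster is normally declared with the two_node flag rather than by forcing expected_votes down to 1. A sketch, not a drop-in (two_node also implies wait_for_all, so both nodes must be seen once after a cold start):

    quorum {
        provider: corosync_votequorum
        # tells votequorum this is a 2-node cluster: quorum is retained with
        # a single node up once both nodes have joined at least once
        two_node: 1
    }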
> >
> > # /etc/corosync/service.d/pcmk:
> > service {
> >     name: pacemaker
> >     ver: 1
> > }
> >
> > # /etc/drbd.d/global_common.conf:
> > global {
> >     usage-count no;
> > }
> >
> > common {
> >     net {
> >         protocol C;
> >     }
> > }
> >
> > # /etc/drbd.d/pg.res:
> > resource pg {
> >     device /dev/drbd0;
> >     disk /dev/vdb;
> >     meta-disk internal;
> >     startup {
> >         wfc-timeout 15;
> >         degr-wfc-timeout 60;
> >     }
> >     disk {
> >         on-io-error detach;
> >         resync-rate 40M;
> >     }
> >     on node01 {
> >         address 10.2.136.56:7789;
> >     }
> >     on node02 {
> >         address 10.2.136.57:7789;
> >     }
> >     net {
> >         verify-alg md5;
> >         after-sb-0pri discard-zero-changes;
> >         after-sb-1pri discard-secondary;
> >         after-sb-2pri disconnect;
> >     }
> > }
> >
> > # Pacemaker configuration:
> > node $id="167938104" node01
> > node $id="167938105" node02
> > primitive drbd_pg ocf:linbit:drbd \
> >     params drbd_resource="pg" \
> >     op monitor interval="29s" role="Master" \
> >     op monitor interval="31s" role="Slave"
> > primitive fs_pg ocf:heartbeat:Filesystem \
> >     params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
> > primitive ip_pg ocf:heartbeat:IPaddr2 \
> >     params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
> > primitive lsb_pg lsb:postgresql
> > group PGServer fs_pg lsb_pg ip_pg
> > ms ms_drbd_pg drbd_pg \
> >     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
> > order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.1.10-42f2063" \
> >     cluster-infrastructure="corosync" \
> >     stonith-enabled="false" \
> >     no-quorum-policy="ignore"
> > rsc_defaults $id="rsc-options" \
> >     resource-stickiness="100"
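Since both nodes are KVM guests, cluster-level fencing could use a libvirt-backed STONITH agent. A sketch only - the agent (external/libvirt from cluster-glue), the hypervisor URIs and the resource names below are assumptions; substitute whatever fence agent is actually available:

    primitive st_node01 stonith:external/libvirt \
        params hostlist="node01" hypervisor_uri="qemu+ssh://hypervisor1/system" \
        op monitor interval="60s"
    primitive st_node02 stonith:external/libvirt \
        params hostlist="node02" hypervisor_uri="qemu+ssh://hypervisor2/system" \
        op monitor interval="60s"
    # a fencing device must not run on the node it is supposed to kill
    location l_st_node01 st_node01 -inf: node01
    location l_st_node02 st_node02 -inf: node02
    property stonith-enabled="true"

With working STONITH the cluster can fence the lost peer instead of blocking on resources it cannot confirm as stopped, which is what the shutdown described earlier is waiting for.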
> >
> > # Logs on node01
> > Sep 10 10:25:33 node01 crmd[1019]: notice: peer_update_callback: Our peer on the DC is dead
> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> > Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new membership (10.2.136.56:52) was formed. Members left: 167938105
> > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.
> > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender terminated
> > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg
> > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection closed
> > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected )
> > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver terminated
> > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting receiver thread
> > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver (re)started
> > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection )
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
> > Sep 10 10:26:12 node01 crmd[1019]: error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
> > Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired
> > Sep 10 10:27:03 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out
> > Sep 10 10:27:03 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms
> > Sep 10 10:27:54 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out
> > Sep 10 10:27:54 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8938 - timed out after 20000ms
> > Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node02=none
> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node01=welcomed
> > Sep 10 10:28:45 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
> > Sep 10 10:28:45 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
> > Sep 10 10:29:36 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
> > Sep 10 10:29:36 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
> > Sep 10 10:30:27 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
> > Sep 10 10:30:27 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
> > Sep 10 10:31:18 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
> > Sep 10 10:31:18 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
> > Sep 10 10:32:09 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
> > Sep 10 10:32:09 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
> >
> > # crm_mon on node01 before I kill the other vm:
> > Stack: corosync
> > Current DC: node02 (167938104) - partition with quorum
> > Version: 1.1.10-42f2063
> > 2 Nodes configured
> > 5 Resources configured
> >
> > Online: [ node01 node02 ]
> >
> > Resource Group: PGServer
> >     fs_pg (ocf::heartbeat:Filesystem): Started node02
> >     lsb_pg (lsb:postgresql): Started node02
> >     ip_pg (ocf::heartbeat:IPaddr2): Started node02
> > Master/Slave Set: ms_drbd_pg [drbd_pg]
> >     Masters: [ node02 ]
> >     Slaves: [ node01 ]
> >
> > Thank you,
> > Kiam
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org