On 11 Sep 2014, at 6:51 pm, Norbert Kiam Maclang <norbert.kiam.macl...@gmail.com> wrote:
> Thank you for spending time looking at my problem, really appreciate it.
>
> Additional information on my problem:
>
> Before doing a restart on the primary node, tcpdump shows a good exchange:
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
> IP node02.5405 > node01.5405: UDP, length 74
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
> IP node02.5405 > node01.5405: UDP, length 74
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
>
> On the surviving node, pacemaker and corosync seem to be unresponsive.
> # /etc/init.d/pacemaker stop

Pacemaker can't stop until the resources stop. And they can't stop because you've not configured fencing (either at the cluster or resource level).

> Signaling Pacemaker Cluster Manager to terminate: [ OK ]
> Waiting for cluster services to unload:...................
> ..........................................................
> ..........................................................
> .........^C
>
> # logs:
> node01 pacemakerd[9486]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
> node01 pacemakerd[9486]: notice: stop_child: Stopping crmd: Sent -15 to process 9493
>
> # /etc/init.d/corosync stop
> * Stopping corosync daemon corosync
> ^C
>
> # service drbd stop
> * Stopping all DRBD resources [ OK ]
>
> I cannot reboot the vm unless I kill corosync and pacemaker.
>
> tcpdump on node01 and node02 after node02 came up again:
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
> IP node02.52587 > node01.5405: UDP, length 87
>
> HA works again after I reboot node01, and the exchange is good again:
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
> IP node02.5405 > node01.5405: UDP, length 74
> IP node01.55010 > node02.5405: UDP, length 87
> IP node01.5405 > node02.5405: UDP, length 74
> IP node02.5405 > node01.5405: UDP, length 74
>
> Is it required to do resource-level fencing, as Vladislav Bogdanov mentioned?
>
> You'd need to configure fencing at the drbd resources level.
>
> http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
>
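The link above boils down to a fencing policy plus the crm-fence-peer handlers on the DRBD side. A minimal sketch (assuming the helper scripts shipped with drbd-utils live in /usr/lib/drbd; adjust paths and scope to your own setup):

    # /etc/drbd.d/global_common.conf (sketch only)
    common {
        disk {
            # on loss of replication, call the fence-peer handler before
            # carrying on as a disconnected Primary
            fencing resource-only;
        }
        handlers {
            # adds a -INF location constraint so the outdated peer cannot be promoted
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # removes that constraint again once the peer has resynced
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }

This only protects the DRBD data from a stale promotion; it does not replace node-level STONITH.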
> Thanks,
> Kiam
>
> On Thu, Sep 11, 2014 at 12:23 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>
> On 11 Sep 2014, at 12:57 pm, Norbert Kiam Maclang <norbert.kiam.macl...@gmail.com> wrote:
>
> > Is this something to do with quorum? But I already set
> >
> > property no-quorum-policy="ignore" \
> >     expected-quorum-votes="1"
>
> No fencing wouldn't be helping.
> And it looks like drbd resources are hanging, not pacemaker/corosync.
>
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
> >
> > Thanks in advance,
> > Kiam
> >
> > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang <norbert.kiam.macl...@gmail.com> wrote:
> > Hi,
> >
> > Please help me understand what is causing the problem. I have a 2-node
> > cluster running on VMs using KVM. Each VM (I am using Ubuntu 14.04) runs on
> > a separate hypervisor on separate machines. Everything worked well during
> > testing (I restarted the VMs alternately), but after a day, when I kill the
> > other node, corosync and pacemaker always end up hanging on the surviving
> > node. Date and time on the VMs are in sync, I use unicast, tcpdump shows
> > both nodes exchanging packets, DRBD is confirmed healthy, and crm_mon shows
> > good status before I kill the other node. Below are the configurations and
> > versions I use:
> >
> > corosync 2.3.3-1ubuntu1
> > crmsh 1.2.5+hg1034-1ubuntu3
> > drbd8-utils 2:8.4.4-1ubuntu1
> > libcorosync-common4 2.3.3-1ubuntu1
> > libcrmcluster4 1.1.10+git20130802-1ubuntu2
> > libcrmcommon3 1.1.10+git20130802-1ubuntu2
> > libcrmservice1 1.1.10+git20130802-1ubuntu2
> > pacemaker 1.1.10+git20130802-1ubuntu2
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
> > postgresql-9.3 9.3.5-0ubuntu0.14.04.1
> >
> > # /etc/corosync/corosync.conf:
> > totem {
> >     version: 2
> >     token: 3000
> >     token_retransmits_before_loss_const: 10
> >     join: 60
> >     consensus: 3600
> >     vsftype: none
> >     max_messages: 20
> >     clear_node_high_bit: yes
> >     secauth: off
> >     threads: 0
> >     rrp_mode: none
> >     interface {
> >         member {
> >             memberaddr: 10.2.136.56
> >         }
> >         member {
> >             memberaddr: 10.2.136.57
> >         }
> >         ringnumber: 0
> >         bindnetaddr: 10.2.136.0
> >         mcastport: 5405
> >     }
> >     transport: udpu
> > }
> > amf {
> >     mode: disabled
> > }
> > quorum {
> >     provider: corosync_votequorum
> >     expected_votes: 1
> > }
> > aisexec {
> >     user: root
> >     group: root
> > }
> > logging {
> >     fileline: off
> >     to_stderr: yes
> >     to_logfile: no
> >     to_syslog: yes
> >     syslog_facility: daemon
> >     debug: off
> >     timestamp: on
> >     logger_subsys {
> >         subsys: AMF
> >         debug: off
> >         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> >     }
> > }
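A side note on the quorum block above: with corosync 2.x votequorum, a two-node cluster is normally declared with the two_node flag rather than by forcing expected_votes down to 1. A sketch, not a drop-in (two_node also implies wait_for_all, so both nodes must be seen once after a cold start):

    quorum {
        provider: corosync_votequorum
        # tells votequorum this is a 2-node cluster: quorum is retained with
        # a single node up once both nodes have joined at least once
        two_node: 1
    }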
> >
> > # /etc/corosync/service.d/pcmk:
> > service {
> >     name: pacemaker
> >     ver: 1
> > }
> >
> > # /etc/drbd.d/global_common.conf:
> > global {
> >     usage-count no;
> > }
> >
> > common {
> >     net {
> >         protocol C;
> >     }
> > }
> >
> > # /etc/drbd.d/pg.res:
> > resource pg {
> >     device /dev/drbd0;
> >     disk /dev/vdb;
> >     meta-disk internal;
> >     startup {
> >         wfc-timeout 15;
> >         degr-wfc-timeout 60;
> >     }
> >     disk {
> >         on-io-error detach;
> >         resync-rate 40M;
> >     }
> >     on node01 {
> >         address 10.2.136.56:7789;
> >     }
> >     on node02 {
> >         address 10.2.136.57:7789;
> >     }
> >     net {
> >         verify-alg md5;
> >         after-sb-0pri discard-zero-changes;
> >         after-sb-1pri discard-secondary;
> >         after-sb-2pri disconnect;
> >     }
> > }
> >
> > # Pacemaker configuration:
> > node $id="167938104" node01
> > node $id="167938105" node02
> > primitive drbd_pg ocf:linbit:drbd \
> >     params drbd_resource="pg" \
> >     op monitor interval="29s" role="Master" \
> >     op monitor interval="31s" role="Slave"
> > primitive fs_pg ocf:heartbeat:Filesystem \
> >     params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
> > primitive ip_pg ocf:heartbeat:IPaddr2 \
> >     params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
> > primitive lsb_pg lsb:postgresql
> > group PGServer fs_pg lsb_pg ip_pg
> > ms ms_drbd_pg drbd_pg \
> >     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
> > order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.1.10-42f2063" \
> >     cluster-infrastructure="corosync" \
> >     stonith-enabled="false" \
> >     no-quorum-policy="ignore"
> > rsc_defaults $id="rsc-options" \
> >     resource-stickiness="100"
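Since both nodes are KVM guests, cluster-level fencing could use a libvirt-backed STONITH agent. A sketch only - the agent (external/libvirt from cluster-glue), the hypervisor URIs and the resource names below are assumptions; substitute whatever fence agent is actually available:

    primitive st_node01 stonith:external/libvirt \
        params hostlist="node01" hypervisor_uri="qemu+ssh://hypervisor1/system" \
        op monitor interval="60s"
    primitive st_node02 stonith:external/libvirt \
        params hostlist="node02" hypervisor_uri="qemu+ssh://hypervisor2/system" \
        op monitor interval="60s"
    # a fencing device must not run on the node it is supposed to kill
    location l_st_node01 st_node01 -inf: node01
    location l_st_node02 st_node02 -inf: node02
    property stonith-enabled="true"

With working STONITH the cluster can fence the lost peer instead of blocking on resources it cannot confirm as stopped, which is what the shutdown described earlier is waiting for.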
> >
> > # Logs on node01
> > Sep 10 10:25:33 node01 crmd[1019]: notice: peer_update_callback: Our peer on the DC is dead
> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> > Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new membership (10.2.136.56:52) was formed. Members left: 167938105
> > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.
> > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender terminated
> > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg
> > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection closed
> > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected )
> > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver terminated
> > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting receiver thread
> > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver (re)started
> > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection )
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
> > Sep 10 10:26:12 node01 crmd[1019]: error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
> > Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired
> > Sep 10 10:27:03 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out
> > Sep 10 10:27:03 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms
> > Sep 10 10:27:54 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out
> > Sep 10 10:27:54 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8938 - timed out after 20000ms
> > Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node02=none
> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node01=welcomed
> > Sep 10 10:28:45 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
> > Sep 10 10:28:45 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
> > Sep 10 10:29:36 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
> > Sep 10 10:29:36 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
> > Sep 10 10:30:27 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
> > Sep 10 10:30:27 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
> > Sep 10 10:31:18 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
> > Sep 10 10:31:18 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
> > Sep 10 10:32:09 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
> > Sep 10 10:32:09 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
> >
> > # crm_mon on node01 before I kill the other vm:
> > Stack: corosync
> > Current DC: node02 (167938104) - partition with quorum
> > Version: 1.1.10-42f2063
> > 2 Nodes configured
> > 5 Resources configured
> >
> > Online: [ node01 node02 ]
> >
> > Resource Group: PGServer
> >     fs_pg (ocf::heartbeat:Filesystem): Started node02
> >     lsb_pg (lsb:postgresql): Started node02
> >     ip_pg (ocf::heartbeat:IPaddr2): Started node02
> > Master/Slave Set: ms_drbd_pg [drbd_pg]
> >     Masters: [ node02 ]
> >     Slaves: [ node01 ]
> >
> > Thank you,
> > Kiam
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org