Hi Ulrich,

Thanks for the response.
30 sec is the time for detection only, as confirmed by the logs:

++++++++++++++++++++++++++++++++++++
Jan 10 11:06:18 [19261] orana crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
Jan 10 11:06:18 [19261] orana crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 10 11:06:18 [19261] orana crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jan 10 11:06:18 [19260] orana pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Jan 10 11:06:18 [19260] orana pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status_fencing: Node tigana is active
Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status: Node tigana is online
Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status_fencing: Node orana is active
Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status: Node orana is online
Jan 10 11:06:18 [19260] orana pengine: info: clone_print: Master/Slave Set: unicloud-master [unicloud]
Jan 10 11:06:18 [19260] orana pengine: info: short_print: Masters: [ tigana ]
Jan 10 11:06:18 [19260] orana pengine: info: short_print: Slaves: [ orana ]
Jan 10 11:06:18 [19260] orana pengine: info: native_print: fence-uc-orana (stonith:fence_ilo4): Started tigana
Jan 10 11:06:18 [19260] orana pengine: info: native_print: fence-uc-tigana (stonith:fence_ilo4): Started tigana
Jan 10 11:06:18 [19260] orana pengine: info: master_color: Promoting unicloud:0 (Master tigana)
Jan 10 11:06:18 [19260] orana pengine: info: master_color: unicloud-master: Promoted 1 instances of a possible 1 to master
Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave unicloud:0 (Master tigana)
Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave unicloud:1 (Slave orana)
Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave fence-uc-orana (Started tigana)
Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave fence-uc-tigana (Started tigana)
Jan 10 11:06:18 [19260] orana pengine: notice: process_pe_message: Calculated Transition 2390: /var/lib/pacemaker/pengine/pe-input-1655.bz2
Jan 10 11:06:18 [19261] orana crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jan 10 11:06:18 [19261] orana crmd: info: do_te_invoke: Processing graph 2390 (ref=pe_calc-dc-1515562578-2650) derived from /var/lib/pacemaker/pengine/pe-input-1655.bz2
Jan 10 11:06:18 [19261] orana crmd: notice: run_graph: Transition 2390 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1655.bz2): Complete
Jan 10 11:06:18 [19261] orana crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jan 10 11:06:18 [19261] orana crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 10 11:06:31 corosync [TOTEM ] A processor failed, forming new configuration.
Jan 10 11:06:33 corosync [QUORUM] Members[1]: 1
Jan 10 11:06:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 10 11:06:33 [19250] orana pacemakerd: info: cman_event_callback: Membership 2064: quorum retained
Jan 10 11:06:33 [19261] orana crmd: info: cman_event_callback: Membership 2064: quorum retained
Jan 10 11:06:33 [19250] orana pacemakerd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)
Jan 10 11:06:33 [19261] orana crmd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)
Jan 10 11:06:33 [19261] orana crmd: info: peer_update_callback: tigana is now lost (was member)
Jan 10 11:06:33 [19261] orana crmd: warning: match_down_event: No match for shutdown action on tigana
Jan 10 11:06:33 [19261] orana crmd: notice: peer_update_callback: Stonith/shutdown of tigana not matched
Jan 10 11:06:33 [19261] orana crmd: info: crm_update_peer_join: peer_update_callback: Node tigana[2] - join-2 phase 4 -> 0
Jan 10 11:06:33 [19261] orana crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
Jan 10 11:06:33 corosync [CPG ] chosen downlist: sender r(0) ip(7.7.7.1) ; members(old:2 left:1)
++++++++++++++++++++++++++

These are the logs from the standby node (the new active). The kernel panic was triggered at 11:06:00 on the other node, yet the totem membership change is reported here only at 11:06:31. The 30000 ms timer popping at the top of the log is the cluster-recheck timer.
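For reference, a sketch of where the two timers in play live on this stack. The 10000 ms token is the value mentioned in this thread; the file path and the query are the standard ones, but treat the snippet as illustrative rather than as a copy of our configuration:

    # Membership failure detection is governed by the totem token timeout.
    # On a CMAN-based CentOS 6 stack (note the cman_event_callback entries
    # in the log above) it is set in /etc/cluster/cluster.conf rather than
    # in corosync.conf:
    #
    #   <totem token="10000"/>    <!-- 10 s -->
    #
    # The 30000 ms timer popping in the crmd log is Pacemaker's periodic
    # policy-engine recheck, not the failure-detection path. Query it with:
    crm_attribute --type crm_config --name cluster-recheck-interval --query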
Regards,
Ashutosh

On Wed, Jan 10, 2018 at 3:12 PM, <users-requ...@clusterlabs.org> wrote:

> Today's Topics:
>
>    1. corosync taking almost 30 secs to detect node failure in case
>       of kernel panic (ashutosh tiwari)
>    2. Antw: corosync taking almost 30 secs to detect node failure
>       in case of kernel panic (Ulrich Windl)
>    3. pacemaker reports monitor timeout while CPU is high (???)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 10 Jan 2018 12:43:46 +0530
> From: ashutosh tiwari <ashutosh.k...@gmail.com>
> To: users@clusterlabs.org
> Subject: [ClusterLabs] corosync taking almost 30 secs to detect node
>         failure in case of kernel panic
>
> Hi,
>
> We have a two-node cluster running in active/standby mode, with IPMI
> fencing configured.
>
> In case of a kernel panic on the active node, the standby node detects
> the node failure in around 30 secs, which delays the standby node
> taking over the active role.
>
> We have the totem token timeout set to 10000 msecs. Please let us know
> whether there is any other configuration controlling membership
> detection.
>
> s/w versions:
>
> centos 6.7
> corosync-1.4.7-5.el6.x86_64
> pacemaker-1.1.14-8.el6.x86_64
>
> Thanks and Regards,
> Ashutosh Tiwari
>
> ------------------------------
>
> Message: 2
> Date: Wed, 10 Jan 2018 08:32:16 +0100
> From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
> To: <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: corosync taking almost 30 secs to detect
>         node failure in case of kernel panic
>
> Hi!
>
> Maybe define "detecting node failure". Could it be your 30 seconds are
> between detection and reaction? Logs would help here, too.
>
> Regards,
> Ulrich
>
> ------------------------------
>
> Message: 3
> Date: Wed, 10 Jan 2018 09:40:51 +0000
> From: ??? <fanguot...@highgo.com>
> To: Cluster Labs - All topics related to open-source clustering
>         welcomed <users@clusterlabs.org>
> Subject: [ClusterLabs] pacemaker reports monitor timeout while CPU is
>         high
>
> Hello,
>
> This issue only appears when we run a performance test and the CPU
> load is high. The cluster configuration and log are below. Pacemaker
> restarts the slave-side pgsql-ha resource about every two minutes.
>
> Take the following scenario as an example: when the pgsqlms RA is
> called, we print the log "execute the command start (command)"; when
> the command returns, we print the log "execute the command stop
> (command) (result)".
>
> 1. We can see that Pacemaker calls "pgsqlms monitor" about every 15
>    seconds, and it returns $OCF_SUCCESS.
>
> 2. It calls the monitor command again at 13:56:16 and reports a
>    timeout error at 13:56:18. Only 2 seconds have elapsed, yet it
>    reports "timeout=10000ms".
>
> 3. In other logs, sometimes after 15 minutes there is no "execute the
>    command start monitor" printed, and it reports a timeout error
>    directly.
>
> Could you please tell us how to debug or resolve this issue?
> The log:
>
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
> Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.779999
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
> Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
> Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
> Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>
> The cluster configuration:
>
> 2 nodes and 13 resources configured
>
> Online: [ db1 db2 ]
>
> Full list of resources:
>
>  Clone Set: dlm-clone [dlm]
>      Started: [ db1 db2 ]
>  Clone Set: clvmd-clone [clvmd]
>      Started: [ db1 db2 ]
>  ipmi_node1    (stonith:fence_ipmilan):    Started db2
>  ipmi_node2    (stonith:fence_ipmilan):    Started db1
>  Clone Set: clusterfs-clone [clusterfs]
>      Started: [ db1 db2 ]
>  Master/Slave Set: pgsql-ha [pgsqld]
>      Masters: [ db1 ]
>      Slaves: [ db2 ]
>  Resource Group: mastergroup
>      db1-vip    (ocf::heartbeat:IPaddr2):    Started
>      rep-vip    (ocf::heartbeat:IPaddr2):    Started
>  Resource Group: slavegroup
>      db2-vip    (ocf::heartbeat:IPaddr2):    Started
>
> pcs resource show pgsql-ha
>  Master: pgsql-ha
>   Meta Attrs: interleave=true notify=true
>   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>    Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>    Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
>                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>                promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
>                demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>                monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
>                monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
>                notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
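A hedged aside on the question above, not from the original digest: with the load average near 426, a 10 s monitor timeout can expire even though PostgreSQL itself answers (the log shows "accepting connections"). A common first step is to raise the monitor timeout; the sketch reuses the interval/role values from the pcs output above and assumes pcs-0.9-era syntax, so verify it against your pcs version:

    # Raise the Slave-role monitor timeout from 10 s to 30 s. The operation
    # is matched by its interval (16s); values are taken from the pcs
    # output above.
    pcs resource update pgsqld op monitor interval=16s role=Slave timeout=30s

    # lrmd kills any monitor that exceeds its timeout, so a larger timeout
    # only delays failure reporting; it does not hide a genuinely hung monitor.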
> ------------------------------
>
> End of Users Digest, Vol 36, Issue 8
> ************************************