Re: [Openais] occasionally timing out node

Steven Dake Tue, 19 Apr 2011 12:21:11 -0700

On 04/19/2011 11:04 AM, dan clark wrote:
> During a relatively low level of periodic activity on cpg messaging,
> one of three nodes seems to reach some timeout but immediately rejoin
> the group.  What might cause such a transition of the node out of the
> quorum?
>


Not enough information. Please send a copy of your config file so I can
see the timeout values.  A corosync-blackbox run on the timed-out node
right after a timeout would be helpful too.

> Timing cpg communications between applications also shows most
> transactions in the sub 25ms range for a small buffer (<64 bytes) but
> on occasion the timing is > one half a second, which clears up within
> a couple of seconds.  Any suggestions on where to look for periodic
> timing problems?
> 

This occurs because totem "slows down" the token during inactive
periods.  When activity resumes after such a slow-down, it takes some
time to turn off the "braking" of the token.

This is configurable (from corosync.conf.8) via:
       hold   This timeout specifies in milliseconds how long the token should
              be held by the representative when the protocol is under low
              utilization.  It is not recommended to alter this value without
              guidance from the corosync community.

              The default is 180 milliseconds.

A lower value will increase CPU utilization but reduce latency on
lightly loaded networks.
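
For illustration only, a lowered hold value would go in the totem
section of the config quoted below.  The figure of 100 is an arbitrary
example, not a recommendation; the right value depends on your network
and should be tested before you rely on it.

```conf
totem {
        version: 2
        token: 1000
        # hold defaults to 180 ms; 100 is an assumed example value that
        # trades some extra CPU for lower latency when the ring is idle
        hold: 100
        # remaining settings (consensus, interface, ...) unchanged
}
```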

Regards
-steve

> corosync -v
> Corosync Cluster Engine, version '1.3.0'
> 
> corosync.conf
> compatibility: whitetank
> 
> totem {
>       version: 2
>       rrp_mode: active
>       secauth: off
>       threads: 0
>       token: 1000  # default value
>       consensus: 1201       # consensus must be greater than 1.2 * token
>       interface {
>               ringnumber: 0
>               bindnetaddr: 192.168.7.0
>               mcastaddr: 239.192.101.99
>               mcastport: 5407
>       }
> }
> 
> logging {
>       timestamp: on
>       fileline: on
>       function_name: on
>       to_stderr: yes
>       to_logfile: yes
>       to_syslog: yes
>       logfile: /var/log/corosync
>       debug: off
>       trace: none|enter|leave|trace1|trace2|trace3
>       logger_subsys {
>               subsys: AMF
>               debug: off
>       }
> }
> amf {
>       mode: disabled
> }
> 
> Apr 19 10:41:13 Anode crmd: [3626]: info: crm_timer_popped: PEngine
> Recheck Timer (I_PE_CALC) just popped!
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_state_transition:
> Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_state_transition: All 3
> cluster nodes are eligible to run resources.
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_pe_invoke: Query 168:
> Requesting the current CIB: S_POLICY_ENGINE
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_pe_invoke_callback:
> Invoking the PE: query=168, ref=pe_calc-dc-1303234873-144, seq=17252,
> quorate=1
> Apr 19 10:41:13 Anode pengine: [3625]: notice: unpack_config: On loss
> of CCM Quorum: Ignore
> Apr 19 10:41:13 Anode pengine: [3625]: info: unpack_config: Node
> scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Apr 19 10:41:13 Anode pengine: [3625]: info: determine_online_status:
> Node Bnode is online
> Apr 19 10:41:13 Anode pengine: [3625]: info: determine_online_status:
> Node Anode is online
> Apr 19 10:41:13 Anode pengine: [3625]: info: determine_online_status:
> Node Cnode is online
> Apr 19 10:41:13 Anode pengine: [3625]: notice: native_print:
> ClusterIP     (ocf::heartbeat:IPaddr2):       Started Anode
> Apr 19 10:41:13 Anode pengine: [3625]: notice: clone_print:  Clone
> Set: Connected
> Apr 19 10:41:13 Anode pengine: [3625]: notice: short_print:
> Started: [ Bnode Anode Cnode ]
> Apr 19 10:41:13 Anode pengine: [3625]: notice: clone_print:  Clone
> Set: pingClone
> Apr 19 10:41:13 Anode pengine: [3625]: notice: short_print:
> Started: [ Bnode Anode Cnode ]
> Apr 19 10:41:13 Anode pengine: [3625]: info: get_failcount: Connected
> has failed 1 times on Anode
> Apr 19 10:41:13 Anode pengine: [3625]: notice:
> common_apply_stickiness: Connected can fail 999999 more times on Anode
> before being forced off
> Apr 19 10:41:13 Anode pengine: [3625]: info: get_failcount: Connected
> has failed 1 times on Anode
> Apr 19 10:41:13 Anode pengine: [3625]: notice:
> common_apply_stickiness: Connected can fail 999999 more times on Anode
> before being forced off
> Apr 19 10:41:13 Anode pengine: [3625]: info: get_failcount: Connected
> has failed 1 times on Anode
> Apr 19 10:41:13 Anode pengine: [3625]: notice:
> common_apply_stickiness: Connected can fail 999999 more times on Anode
> before being forced off
> Apr 19 10:41:13 Anode pengine: [3625]: notice: LogActions: Leave
> resource ClusterIP    (Started Anode)
> Apr 19 10:41:13 Anode pengine: [3625]: notice: LogActions: Leave
> resource ping:0(Started Bnode)
> Apr 19 10:41:13 Anode pengine: [3625]: notice: LogActions: Leave
> resource ping:1(Started Anode)
> Apr 19 10:41:13 Anode pengine: [3625]: notice: LogActions: Leave
> resource ping:2(Started Cnode)
> Apr 19 10:41:13 Anode pengine: [3625]: notice: LogActions: Leave
> resource pingPrimitive:0      (Started Bnode)
> Apr 19 10:41:13 Anode pengine: [3625]: notice: LogActions: Leave
> resource pingPrimitive:1      (Started Anode)
> Apr 19 10:41:13 Anode pengine: [3625]: notice: LogActions: Leave
> resource pingPrimitive:2      (Started Cnode)
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Apr 19 10:41:13 Anode pengine: [3625]: info: process_pe_message:
> Transition 90: PEngine Input stored in:
> /var/lib/pengine/pe-input-5309.bz2
> Apr 19 10:41:13 Anode crmd: [3626]: info: unpack_graph: Unpacked
> transition 90: 0 actions in 0 synapses
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_te_invoke: Processing
> graph 90 (ref=pe_calc-dc-1303234873-144) derived from
> /var/lib/pengine/pe-input-5309.bz2
> Apr 19 10:41:13 Anode crmd: [3626]: info: run_graph:
> ====================================================
> Apr 19 10:41:13 Anode crmd: [3626]: notice: run_graph: Transition 90
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-5309.bz2): Complete
> Apr 19 10:41:13 Anode crmd: [3626]: info: te_graph_trigger: Transition
> 90 is now complete
> Apr 19 10:41:13 Anode crmd: [3626]: info: notify_crmd: Transition 90
> status: done - <null>
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Apr 19 10:41:13 Anode crmd: [3626]: info: do_state_transition:
> Starting PEngine Recheck Timer
> Apr 19 10:42:05 Anode attrd_updater: [16080]: WARN: Initializing
> connection to logging daemon failed. Logging daemon may not be running
> _______________________________________________
> Openais mailing list
> Openais@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais
