On Fri, 2019-02-15 at 08:55 +0800, ma.jinf...@zte.com.cn wrote:
> There is an issue where pacemaker doesn't schedule a resource that runs
> in a docker container after docker is restarted, yet the pacemaker
> cluster shows the resource as started; it seems to be a bug in
> pacemaker.
> I am very confused about what happened when pengine printed these logs
> (pengine: notice: check_operation_expiry: Clearing failure of
> event_agent on 120_120__fd4 because it expired |
> event_agent_clear_failcount_0). Does anyone know what they mean?
> Thank you very much!
> 1. pacemaker/corosync version: 1.1.16/2.4.3
> 2. corosync logs as follows:
> Feb 06 09:52:19 [58629] node-4 attrd: info: attrd_peer_update: Setting event_agent_status[120_120__fd4]: ok -> fail from 120_120__fd4

This is the attribute manager setting the "event_agent_status" attribute
for node "120_120__fd4" to "fail". That is a user-created attribute;
pacemaker does not do anything with it other than store it. Most likely,
the resource agent's monitor action created it.
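The agent itself isn't shown in this thread, but as an illustrative
sketch only, a monitor action could record such a status with
attrd_updater (the attribute name below is taken from the logs; the
values are whatever the agent chooses to write):

    # hypothetical sketch -- not the actual event_agent code
    attrd_updater -n event_agent_status -U fail   # monitor saw a problem
    attrd_updater -n event_agent_status -U ok     # monitor saw the service healthy

Pacemaker only stores and propagates such a value; it has no effect on
scheduling unless something in the configuration (for example a location
rule) references the attribute.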
> Feb 06 09:52:19 [58629] node-4 attrd: info: write_attribute: Sent update 50 with 1 changes for event_agent_status, id=<n/a>, set=(null)
> Feb 06 09:52:19 [58629] node-4 attrd: info: attrd_cib_callback: Update 50 for event_agent_status: OK (0)
> Feb 06 09:52:19 [58629] node-4 attrd: info: attrd_cib_callback: Update 50 for event_agent_status[120_120__fd4]=fail: OK (0)
> Feb 06 09:52:19 [58630] node-4 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Feb 06 09:52:19 [58630] node-4 pengine: info: determine_online_status: Node 120_120__fd4 is online
> Feb 06 09:52:19 [58630] node-4 pengine: info: get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice: check_operation_expiry: Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0

This indicates that there is a failure-timeout for the event_agent
resource, and the last failure happened more than that much time ago, so
the failure will be ignored (other than displaying it in status).
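The resource configuration isn't shown in the thread, but failure-timeout
is an ordinary resource meta-attribute, so it would have been set with
something along these lines (resource name from the logs; the timeout
value here is assumed purely for illustration):

    # assumed example: let event_agent failures expire after 10 minutes
    crm_resource --resource event_agent --set-parameter failure-timeout \
        --meta --parameter-value 10min

Once the most recent failure is older than that timeout, the scheduler
treats it as expired, which is what the check_operation_expiry messages
above are reporting.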
> Feb 06 09:52:19 [58630] node-4 pengine: notice: unpack_rsc_op: Re-initiated expired calculated failure event_agent_monitor_60000 (rc=1, magic=0:1;9:18:0:9d1d66d2-2cbe-4182-89f6-c90ba008e2b7) on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice: check_operation_expiry: Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0
> Feb 06 09:52:19 [58630] node-4 pengine: info: get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice: check_operation_expiry: Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0
> Feb 06 09:52:19 [58630] node-4 pengine: info: unpack_node_loop: Node 4052 is already processed
> Feb 06 09:52:19 [58630] node-4 pengine: info: unpack_node_loop: Node 4052 is already processed
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print: pm_agent (ocf::heartbeat:pm_agent): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print: event_agent (ocf::heartbeat:event_agent): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print: nwmonitor_vip (ocf::heartbeat:IPaddr2): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print: nwmonitor (ocf::heartbeat:nwmonitor): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions: Leave pm_agent (Started 120_120__fd4)
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions: Leave event_agent (Started 120_120__fd4)

Because the last failure has expired, pacemaker does not need to recover
event_agent.

> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions: Leave nwmonitor_vip (Started 120_120__fd4)
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions: Leave nwmonitor (Started 120_120__fd4)
> 3. The event_agent resource was marked as failed by attrd, which
> triggered a pengine computation, but the PE actually doesn't do anything
> about event_agent afterwards. Is this related to the
> check_operation_expiry function in unpack.c?
> I see some notes in this function as follows:
> /* clearing recurring monitor operation failures automatically
>  * needs to be carefully considered */
> if (safe_str_eq(crm_element_value(xml_op, XML_LRM_ATTR_TASK), "monitor") &&
>     safe_str_neq(crm_element_value(xml_op, XML_LRM_ATTR_INTERVAL), "0")) {
>     /* TODO, in the future we should consider not clearing recurring monitor
>      * op failures unless the last action for a resource was a "stop" action.
>      * otherwise it is possible that clearing the monitor failure will result
>      * in the resource being in an undeterministic state.

Yes, this is relevant -- the event_agent monitor had previously failed,
but the failure has expired due to failure-timeout. The comment here
suggests that we may not want to expire monitor failures unless there has
been a stop since then, but that would defeat the intent of
failure-timeout, so it's not straightforward which is the better
handling.

-- 
Ken Gaillot <kgail...@redhat.com>