On Wed, 24 Jan 2018 17:42:56 -0600 Ken Gaillot <kgail...@redhat.com> wrote:
> On Fri, 2018-01-19 at 00:37 +0100, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 18 Jan 2018 10:54:33 -0600
> > Ken Gaillot <kgail...@redhat.com> wrote:
> >
> > > On Thu, 2018-01-18 at 16:15 +0100, Jehan-Guillaume de Rorthais wrote:
> > > > Hi list,
> > > >
> > > > I was explaining how to use crm_simulate to a colleague when he
> > > > pointed me to an unexpected and buggy output.
> > > >
> > > > Here are some simple steps to reproduce:
> > > >
> > > >   $ pcs cluster setup --name usecase srv1 srv2 srv3
> > > >   $ pcs cluster start --all
> > > >   $ pcs property set stonith-enabled=false
> > > >   $ pcs resource create dummy1 ocf:heartbeat:Dummy \
> > > >       state=/tmp/dummy1.state                      \
> > > >       op monitor interval=10s                      \
> > > >       meta migration-threshold=3 resource-stickiness=1
> > > >
> > > > Now, we inject two soft monitor errors, each triggering a local
> > > > recovery (stop/start):
> > > >
> > > >   $ crm_simulate -S -L -i dummy1_monitor_10@srv1=1 -O /tmp/step1.xml
> > > >   $ crm_simulate -S -x /tmp/step1.xml -i dummy1_monitor_10@srv1=1 -O /tmp/step2.xml
> > > >
> > > > So far so good. A third soft monitor error pushes dummy1 off srv1,
> > > > as expected. However, the final status of the cluster shows dummy1
> > > > as started on both srv1 and srv2!
> > > >
> > > >   $ crm_simulate -S -x /tmp/step2.xml -i dummy1_monitor_10@srv1=1 -O /tmp/step3.xml
> > > >
> > > >   Current cluster status:
> > > >   Online: [ srv1 srv2 srv3 ]
> > > >
> > > >    dummy1 (ocf::heartbeat:Dummy): Started srv1
> > > >
> > > >   Performing requested modifications
> > > >    + Injecting dummy1_monitor_10@srv1=1 into the configuration
> > > >    + Injecting attribute fail-count-dummy1=value++ into /node_state '1'
> > > >    + Injecting attribute last-failure-dummy1=1516287891 into /node_state '1'
> > > >
> > > >   Transition Summary:
> > > >    * Recover dummy1 ( srv1 -> srv2 )
> > > >
> > > >   Executing cluster transition:
> > > >    * Cluster action: clear_failcount for dummy1 on srv1
> > > >    * Resource action: dummy1 stop on srv1
> > > >    * Resource action: dummy1 cancel=10 on srv1
> > > >    * Pseudo action: all_stopped
> > > >    * Resource action: dummy1 start on srv2
> > > >    * Resource action: dummy1 monitor=10000 on srv2
> > > >
> > > >   Revised cluster status:
> > > >   Online: [ srv1 srv2 srv3 ]
> > > >
> > > >    dummy1 (ocf::heartbeat:Dummy): Started[ srv1 srv2 ]
> > > >
> > > > I suppose this is a bug in crm_simulate? Why does it consider dummy1
> > > > still started on srv1 when the transition stopped it there?
> > >
> > > It's definitely a bug, either in crm_simulate or the policy engine
> > > itself. Can you attach step2.xml?
> >
> > Sure, please find step2.xml attached.
>
> I can reproduce the issue with 1.1.16 but not 1.1.17 or later, so
> whatever it was, it got fixed.

Interesting. From a quick search I suspected the bug came from somewhere
around "fake_transition.c", but I haven't had time to dig very far. Moreover,
I was studying the code from master, not from 1.1.16; what a waste of time if
it's already fixed on the master branch :)

Too bad most stable distros still ship 1.1.16 (Debian, CentOS, Ubuntu).

I have another question, related to my tests on this subject.
Keep in mind this is still with 1.1.16, but it might not be related to the
version.

While testing on a live cluster, without crm_simulate, to see what really
happens there, I found in the log file that the pengine produced two distinct
transitions. Find the log file attached. As far as I understand it:

* the crmd on srv3/DC receives the failure result from srv1 for the monitor
  on dummy1
* crmd/srv3 forwards the failcount++ to attrd
* crmd/srv3 invokes the pengine, producing transition 5
* attrd sets the failcount for dummy1/srv1 to 3 (the maximum allowed, so
  dummy1 must move)
* because of the failcount update, crmd/srv3 invokes the pengine a second
  time, producing transition 6
* crmd marks transition 5 as obsolete

Isn't it possible to wait for the attrd result before invoking the pengine,
so that it gets an up-to-date CIB as input? If I understand correctly, attrd
calls are asynchronous; however, when the crmd/DC makes such a call, shouldn't
it be synchronous? Wouldn't that be a little simpler, faster, less error
prone, and produce fewer and easier-to-understand log entries?

This last point is quite important IMO, as reading Pacemaker log files is
still quite a pain for many people...

Thanks,
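To make the race above concrete: transition 5 was computed while the CIB still held the stale fail count (2), and transition 6 after attrd's update landed (3). The check_migration_threshold decision seen in the log can be sketched as plain arithmetic; this is only an illustration, not Pacemaker's actual code, and the function name here is made up:

```shell
#!/bin/sh
# Illustrative sketch of the migration-threshold decision: with
# migration-threshold=3, the resource is forced off its node once the
# fail count reaches 3. "check_threshold" is a made-up helper name.
migration_threshold=3

check_threshold() {
    failcount=$1
    remaining=$((migration_threshold - failcount))
    if [ "$remaining" -le 0 ]; then
        echo "Forcing dummy1 away from srv1 after ${failcount} failures (max=${migration_threshold})"
    else
        echo "dummy1 can fail ${remaining} more time(s) on srv1 before being forced off"
    fi
}

check_threshold 2   # what transition 5 saw: CIB still had the stale fail count
check_threshold 3   # what transition 6 saw: after attrd's update reached the CIB
```

The same input CIB fed twice with only the fail-count nvpair changed is exactly what produces the two different pengine outcomes in the log.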
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: Diff: --- 0.7.24 2
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: Diff: +++ 0.7.25 (null)
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: + /cib: @num_updates=25
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='dummy1']/lrm_rsc_op[@id='dummy1_last_failure_0']: @transition-key=1:4:0:dec40799-30c9-4aca-ab8d-0d7930ef4af0, @transition-magic=0:7;1:4:0:dec40799-30c9-4aca-ab8d-0d7930ef4af0, @call-id=15, @last-rc-change=1516651764
Jan 22 21:09:24 [6862] srv3 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=srv1/crmd/24, version=0.7.25)
Jan 22 21:09:24 [6863] srv3 stonith-ng: debug: xml_patch_version_check: Can apply patch 0.7.25 to 0.7.24
Jan 22 21:09:24 [6867] srv3 crmd: debug: te_update_diff: Processing (cib_modify) diff: 0.7.24 -> 0.7.25 (S_IDLE)
Jan 22 21:09:24 [6867] srv3 crmd: info: abort_transition_graph: Transition aborted by operation dummy1_monitor_10000 'modify' on srv1: Inactive graph | magic=0:7;1:4:0:dec40799-30c9-4aca-ab8d-0d7930ef4af0 cib=0.7.25 source=process_graph_event:503 complete=true
Jan 22 21:09:24 [6867] srv3 crmd: info: update_failcount: Updating failcount for dummy1 on srv1 after failed monitor: rc=7 (update=value++, time=1516651764)
Jan 22 21:09:24 [6867] srv3 crmd: debug: attrd_update_delegate: Asked attrd to update fail-count-dummy1=value++ for srv1: OK (0)
Jan 22 21:09:24 [6867] srv3 crmd: debug: attrd_update_delegate: Asked attrd to update last-failure-dummy1=1516651764 for srv1: OK (0)
Jan 22 21:09:24 [6867] srv3 crmd: info: process_graph_event: Detected action (4.1) dummy1_monitor_10000.15=not running: failed
Jan 22 21:09:24 [6867] srv3 crmd: debug: s_crmd_fsa: Processing I_PE_CALC: [ state=S_IDLE cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jan 22 21:09:24 [6867] srv3 crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Jan 22 21:09:24 [6867] srv3 crmd: debug: do_state_transition: All 3 cluster nodes are eligible to run resources.
Jan 22 21:09:24 [6867] srv3 crmd: debug: do_pe_invoke: Query 47: Requesting the current CIB: S_POLICY_ENGINE
Jan 22 21:09:24 [6865] srv3 attrd: info: attrd_client_update: Expanded fail-count-dummy1=value++ to 3
Jan 22 21:09:24 [6865] srv3 attrd: debug: attrd_client_update: Broadcasting fail-count-dummy1[srv1] = 3 (writer)
Jan 22 21:09:24 [6865] srv3 attrd: debug: attrd_client_update: Broadcasting last-failure-dummy1[srv1] = 1516651764 (writer)
Jan 22 21:09:24 [6865] srv3 attrd: info: attrd_peer_update: Setting fail-count-dummy1[srv1]: 2 -> 3 from srv3
Jan 22 21:09:24 [6865] srv3 attrd: debug: write_attribute: Update: srv1[fail-count-dummy1]=3 (1 1 1 srv1)
Jan 22 21:09:24 [6865] srv3 attrd: info: write_attribute: Sent update 8 with 1 changes for fail-count-dummy1, id=<n/a>, set=(null)
Jan 22 21:09:24 [6865] srv3 attrd: info: attrd_peer_update: Setting last-failure-dummy1[srv1]: 1516651593 -> 1516651764 from srv3
Jan 22 21:09:24 [6865] srv3 attrd: debug: write_attribute: Update: srv1[last-failure-dummy1]=1516651764 (1 1 1 srv1)
Jan 22 21:09:24 [6865] srv3 attrd: info: write_attribute: Sent update 9 with 1 changes for last-failure-dummy1, id=<n/a>, set=(null)
Jan 22 21:09:24 [6862] srv3 cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/attrd/8)
Jan 22 21:09:24 [6862] srv3 cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/attrd/9)
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: Diff: --- 0.7.25 2
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: Diff: +++ 0.7.26 (null)
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: + /cib: @num_updates=26
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-dummy1']: @value=3
Jan 22 21:09:24 [6863] srv3 stonith-ng: debug: xml_patch_version_check: Can apply patch 0.7.26 to 0.7.25
Jan 22 21:09:24 [6862] srv3 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=srv3/attrd/8, version=0.7.26)
Jan 22 21:09:24 [6865] srv3 attrd: info: attrd_cib_callback: Update 8 for fail-count-dummy1: OK (0)
Jan 22 21:09:24 [6865] srv3 attrd: info: attrd_cib_callback: Update 8 for fail-count-dummy1[srv1]=3: OK (0)
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: Diff: --- 0.7.26 2
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: Diff: +++ 0.7.27 (null)
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: + /cib: @num_updates=27
Jan 22 21:09:24 [6862] srv3 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-dummy1']: @value=1516651764
Jan 22 21:09:24 [6862] srv3 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=srv3/attrd/9, version=0.7.27)
Jan 22 21:09:24 [6863] srv3 stonith-ng: debug: xml_patch_version_check: Can apply patch 0.7.27 to 0.7.26
Jan 22 21:09:24 [6865] srv3 attrd: info: attrd_cib_callback: Update 9 for last-failure-dummy1: OK (0)
Jan 22 21:09:24 [6865] srv3 attrd: info: attrd_cib_callback: Update 9 for last-failure-dummy1[srv1]=1516651764: OK (0)
Jan 22 21:09:24 [6867] srv3 crmd: debug: do_pe_invoke_callback: Invoking the PE: query=47, ref=pe_calc-dc-1516651764-34, seq=740, quorate=1
Jan 22 21:09:24 [6867] srv3 crmd: debug: te_update_diff: Processing (cib_modify) diff: 0.7.25 -> 0.7.26 (S_POLICY_ENGINE)
Jan 22 21:09:24 [6867] srv3 crmd: info: abort_transition_graph: Transition aborted by status-1-fail-count-dummy1 doing modify fail-count-dummy1=3: Transient attribute change | cib=0.7.26 source=abort_unless_down:343 path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-dummy1'] complete=true
Jan 22 21:09:24 [6867] srv3 crmd: debug: te_update_diff: Processing (cib_modify) diff: 0.7.26 -> 0.7.27 (S_POLICY_ENGINE)
Jan 22 21:09:24 [6867] srv3 crmd: info: abort_transition_graph: Transition aborted by status-1-last-failure-dummy1 doing modify last-failure-dummy1=1516651764: Transient attribute change | cib=0.7.27 source=abort_unless_down:343 path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-dummy1'] complete=true
Jan 22 21:09:24 [6867] srv3 crmd: debug: s_crmd_fsa: Processing I_PE_CALC: [ state=S_POLICY_ENGINE cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jan 22 21:09:24 [6867] srv3 crmd: debug: do_pe_invoke: Query 48: Requesting the current CIB: S_POLICY_ENGINE
Jan 22 21:09:24 [6867] srv3 crmd: debug: s_crmd_fsa: Processing I_PE_CALC: [ state=S_POLICY_ENGINE cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jan 22 21:09:24 [6867] srv3 crmd: debug: do_pe_invoke: Query 49: Requesting the current CIB: S_POLICY_ENGINE
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: STONITH timeout: 60000
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: STONITH of failed nodes is disabled
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Concurrent fencing is disabled
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Stop all active resources: false
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Cluster is symmetric - resources can run anywhere by default
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Default stickiness: 0
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jan 22 21:09:24 [6866] srv3 pengine: info: determine_online_status: Node srv2 is online
Jan 22 21:09:24 [6866] srv3 pengine: info: determine_online_status: Node srv3 is online
Jan 22 21:09:24 [6866] srv3 pengine: info: determine_online_status: Node srv1 is online
Jan 22 21:09:24 [6866] srv3 pengine: debug: determine_op_status: dummy1_monitor_10000 on srv1 returned 'not running' (7) instead of the expected value: 'ok' (0)
Jan 22 21:09:24 [6866] srv3 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for dummy1 on srv1: not running (7)
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 2 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 3 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 1 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 2 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 3 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 1 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: common_print: dummy1 (ocf::heartbeat:Dummy): FAILED srv1
Jan 22 21:09:24 [6866] srv3 pengine: debug: common_apply_stickiness: Resource dummy1: preferring current location (node=srv1, weight=1)
Jan 22 21:09:24 [6866] srv3 pengine: info: get_failcount_full: dummy1 has failed 2 times on srv1
Jan 22 21:09:24 [6866] srv3 pengine: info: check_migration_threshold: dummy1 can fail 1 more times on srv1 before being forced off
Jan 22 21:09:24 [6866] srv3 pengine: debug: native_assign_node: Assigning srv1 to dummy1
Jan 22 21:09:24 [6866] srv3 pengine: info: RecurringOp: Start recurring monitor (10s) for dummy1 on srv1
Jan 22 21:09:24 [6866] srv3 pengine: notice: LogAction: * Recover dummy1 ( srv1 )
Jan 22 21:09:24 [6866] srv3 pengine: notice: process_pe_message: Calculated transition 5, saving inputs in /var/lib/pacemaker/pengine/pe-input-481.bz2
Jan 22 21:09:24 [6867] srv3 crmd: debug: do_pe_invoke_callback: Invoking the PE: query=49, ref=pe_calc-dc-1516651764-35, seq=740, quorate=1
Jan 22 21:09:24 [6867] srv3 crmd: info: handle_response: pe_calc calculation pe_calc-dc-1516651764-34 is obsolete
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: STONITH timeout: 60000
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: STONITH of failed nodes is disabled
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Concurrent fencing is disabled
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Stop all active resources: false
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Cluster is symmetric - resources can run anywhere by default
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Default stickiness: 0
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
Jan 22 21:09:24 [6866] srv3 pengine: debug: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jan 22 21:09:24 [6866] srv3 pengine: info: determine_online_status: Node srv2 is online
Jan 22 21:09:24 [6866] srv3 pengine: info: determine_online_status: Node srv3 is online
Jan 22 21:09:24 [6866] srv3 pengine: info: determine_online_status: Node srv1 is online
Jan 22 21:09:24 [6866] srv3 pengine: debug: determine_op_status: dummy1_monitor_10000 on srv1 returned 'not running' (7) instead of the expected value: 'ok' (0)
Jan 22 21:09:24 [6866] srv3 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for dummy1 on srv1: not running (7)
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 2 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 3 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 1 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 2 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 3 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: unpack_node_loop: Node 1 is already processed
Jan 22 21:09:24 [6866] srv3 pengine: info: common_print: dummy1 (ocf::heartbeat:Dummy): FAILED srv1
Jan 22 21:09:24 [6866] srv3 pengine: debug: common_apply_stickiness: Resource dummy1: preferring current location (node=srv1, weight=1)
Jan 22 21:09:24 [6866] srv3 pengine: info: get_failcount_full: dummy1 has failed 3 times on srv1
Jan 22 21:09:24 [6866] srv3 pengine: warning: check_migration_threshold: Forcing dummy1 away from srv1 after 3 failures (max=3)
Jan 22 21:09:24 [6866] srv3 pengine: debug: native_assign_node: Assigning srv2 to dummy1
Jan 22 21:09:24 [6866] srv3 pengine: info: RecurringOp: Start recurring monitor (10s) for dummy1 on srv2
Jan 22 21:09:24 [6866] srv3 pengine: notice: LogAction: * Recover dummy1 ( srv1 -> srv2 )
Jan 22 21:09:24 [6866] srv3 pengine: notice: process_pe_message: Calculated transition 6, saving inputs in /var/lib/pacemaker/pengine/pe-input-482.bz2
Jan 22 21:09:24 [6867] srv3 crmd: debug: s_crmd_fsa: Processing I_PE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_IPC_MESSAGE origin=handle_response ]
Jan 22 21:09:24 [6867] srv3 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Jan 22 21:09:24 [6867] srv3 crmd: debug: unpack_graph: Unpacked transition 6: 4 actions in 4 synapses
Jan 22 21:09:24 [6867] srv3 crmd: info: do_te_invoke: Processing graph 6 (ref=pe_calc-dc-1516651764-35) derived from /var/lib/pacemaker/pengine/pe-input-482.bz2
...
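Side note, since readability of these logs came up: filtering on a few function names pulls the scheduler's decisions and the obsolete-transition notice out of the noise. A minimal demo below, using lines copied from the log above; the /tmp/pe-demo.log path is just for the example, on a real node you would point grep at the Pacemaker detail log instead:

```shell
#!/bin/sh
# Write a few sample entries (taken from the log above) to a demo file.
cat > /tmp/pe-demo.log <<'EOF'
Jan 22 21:09:24 [6866] srv3 pengine: info: check_migration_threshold: dummy1 can fail 1 more times on srv1 before being forced off
Jan 22 21:09:24 [6866] srv3 pengine: debug: native_assign_node: Assigning srv1 to dummy1
Jan 22 21:09:24 [6867] srv3 crmd: info: handle_response: pe_calc calculation pe_calc-dc-1516651764-34 is obsolete
Jan 22 21:09:24 [6866] srv3 pengine: warning: check_migration_threshold: Forcing dummy1 away from srv1 after 3 failures (max=3)
EOF

# Keep only the threshold decisions and obsolete-transition notices.
grep -E 'check_migration_threshold|is obsolete' /tmp/pe-demo.log
```

On this sample it keeps three of the four lines, which is enough to see the stale transition 5 decision, its invalidation, and the transition 6 decision at a glance.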
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org