On 02/22/2016 05:23 PM, Jeremy Matthews wrote:
> Thanks for the quick response again, and pardon the delay in responding. A
> colleague of mine and I have been trying some different things today.
>
> From the reboot on Friday, further below are the logs from corosync.log,
> from the time of the reboot command to the constraint being added.
>
> I am not able to perform a "pcs cluster cib-upgrade". The version of pcs
> that I have does not have that option (just "cib [filename]" and "cib-push
> <filename>"). My versions at the time of these logs were:
I'm curious whether you were able to solve your issue.

Regarding cib-upgrade: you can use the "cibadmin --upgrade" command instead,
which is what pcs runs behind the scenes. For a better-safe-than-sorry
how-to, see:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_upgrading_the_configuration

> [root@g5se-f3efce Packages]# pcs --version
> 0.9.90
> [root@g5se-f3efce Packages]# pacemakerd --version
> Pacemaker 1.1.11
> Written by Andrew Beekhof
>
> I think you're right in that we had a script banning the ClusterIP. It is
> called from a message daemon that we created, which acts as middleware
> between the cluster software and our application. The daemon has an exit
> handler that calls a script which runs:
>
>     pcs resource ban ClusterIP $host    # where $host is the result of host=`hostname`
>
> ...because we normally try to push the cluster IP to the other side (though
> in this case we just have one node), but right after that the script calls:
>
>     pcs resource clear ClusterIP
>
> ...but for some reason it doesn't seem to result in the constraint being
> removed (see even further below, where I show a /var/log/messages snippet
> with both the constraint addition and removal; that was with an earlier
> version of pacemaker, 1.1.10-1.el6_4.4). I guess with the earlier pcs or
> pacemaker version these logs went to /var/log/messages rather than to
> corosync.log as they do today.
>
> I am in a bit of a conundrum: if I upgrade pcs to 0.9.149 (retrieved and
> "make install"'ed from github.com, because 0.9.139 had a pcs issue with
> one-node clusters), which does have the cib-upgrade option, then manually
> removing the ClusterIP constraint causes a problem for our message daemon
> in that it thinks neither side of the cluster is active; something to look
> at on our end. So it seems the removal of the constraint affects our daemon
> with the new pcs.
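As an aside, one way to see whether the "clear" actually removes the ban is
to list the constraints between the two calls. A sketch of that check, using
the resource name from your script (this has to run against a live cluster,
so treat it as illustration rather than a tested recipe):

```shell
host=$(hostname)

# Ban the resource from this node; pcs adds a location constraint
# named cli-ban-ClusterIP-on-<host> with score -INFINITY
pcs resource ban ClusterIP "$host"

# The ban should now show up in the constraint list
pcs constraint

# Remove the ban again
pcs resource clear ClusterIP

# If cli-ban-ClusterIP-on-<host> still appears here, the clear
# did not take effect
pcs constraint
```

If the constraint reappears afterwards, something else (for example the exit
handler running a second time, as your later logs suggest) is re-adding it.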
> For the time being, I've rolled back pcs to the above 0.9.90 version.
>
> One other thing to mention: the timing of pacemaker's start may have been
> delayed by what I found was a change to its init-script header (made by
> either our daemon or our application installation script) from "90 1" to
> "70 20". So in /etc/rc3.d there is S70pacemaker rather than S90pacemaker. I
> am not a Linux expert by any means; I guess that may affect startup, but
> I'm not sure about shutdown.
>
> Corosync logs from the time the reboot was issued to the constraint being
> added:
>
> Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: > attrd_trigger_update: Sending flush op to all hosts for: standby (true) > Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: > attrd_perform_update: Sent update 24: standby=true > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/attrd/24) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.2 2 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.3 (null) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=3 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > ++ > /cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']: > <nvpair id="status-g5se-f3efce-standby" name="standby" value="true"/> > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: > abort_transition_graph: Transition aborted by > status-g5se-f3efce-standby, standby=true: Transient attribute change (create > cib=0.291.3, source=te_update_diff:391, > path=/cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce'], > 1) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: do_state_transition: > State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC >
cause=C_FSA_INTERNAL origin=abort_transition_graph ] > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/attrd/24, version=0.291.3) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: update_validation: > pacemaker-1.2-style configuration is also valid for pacemaker-1.3 > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: update_validation: > Transformation upgrade-1.3.xsl successful > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: update_validation: > Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: cli_config_update: > Your configuration was internally updated to the latest version > (pacemaker-2.0) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: unpack_config: > On loss of CCM Quorum: Ignore > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: unpack_status: > Node g5se-f3efce is in standby-mode > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: > determine_online_status: Node g5se-f3efce is standby > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: > sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: > meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: > netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: > ClusterIP (ocf::heartbeat:IPaddr2): Started g5se-f3efce > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: > Resource sw-ready-g5se-f3efce cannot run anywhere > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: > Resource meta-data cannot run anywhere > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: > Resource netmon cannot run anywhere > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: > 
Resource ClusterIP cannot run anywhere > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop > sw-ready-g5se-f3efce (g5se-f3efce) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop > meta-data (g5se-f3efce) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop > netmon (g5se-f3efce) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop > ClusterIP (g5se-f3efce) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_state_transition: > State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ > input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_te_invoke: > Processing graph 8 (ref=pe_calc-dc-1455920542-41) derived from > /var/lib/pacemaker/pengine/pe-input-641.bz2 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: > Initiating action 8: stop sw-ready-g5se-f3efce_stop_0 on g5se-f3efce (local) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: > cancel_recurring_action: Cancelling operation > sw-ready-g5se-f3efce_monitor_10000 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: > Performing key=8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed > op=sw-ready-g5se-f3efce_stop_0 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: > executing - rsc:sw-ready-g5se-f3efce action:stop call_id:31 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: > Initiating action 9: stop meta-data_stop_0 on g5se-f3efce (local) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: > cancel_recurring_action: Cancelling operation meta-data_monitor_60000 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: > Performing key=9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed > op=meta-data_stop_0 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: > executing - rsc:meta-data action:stop call_id:33 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: > Initiating action 10: stop netmon_stop_0 on 
g5se-f3efce (local) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: > cancel_recurring_action: Cancelling operation netmon_monitor_10000 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: > Performing key=10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed op=netmon_stop_0 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: > executing - rsc:netmon action:stop call_id:35 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: > Initiating action 11: stop ClusterIP_stop_0 on g5se-f3efce (local) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: > cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: > Performing key=11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed > op=ClusterIP_stop_0 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: > executing - rsc:ClusterIP action:stop call_id:37 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: > Operation sw-ready-g5se-f3efce_monitor_10000: Cancelled (node=g5se-f3efce, > call=29, confirmed=true) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: > Operation meta-data_monitor_60000: Cancelled (node=g5se-f3efce, call=21, > confirmed=true) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: > Operation netmon_monitor_10000: Cancelled (node=g5se-f3efce, call=23, > confirmed=true) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: > Operation ClusterIP_monitor_30000: Cancelled (node=g5se-f3efce, call=25, > confirmed=true) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: process_pe_message: > Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-641.bz2 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: > finished - rsc:sw-ready-g5se-f3efce action:stop call_id:31 pid:6013 > exit-code:0 exec-time:56ms queue-time:0ms > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: > Operation sw-ready-g5se-f3efce_stop_0: ok 
(node=g5se-f3efce, call=31, rc=0, > cib-update=72, confirmed=true) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/72) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.3 2 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.4 (null) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=4 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + > /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='sw-ready-g5se-f3efce']/lrm_rsc_op[@id='sw-ready-g5se-f3efce_last_0']: > @operation_key=sw-ready-g5se-f3efce_stop_0, @operation=stop, > @transition-key=8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @transition-magic=0:0;8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @call-id=31, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=56 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/crmd/72, version=0.291.4) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: match_graph_event: > Action sw-ready-g5se-f3efce_stop_0 (8) confirmed on g5se-f3efce (rc=0) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: > finished - rsc:meta-data action:stop call_id:33 pid:6014 exit-code:0 > exec-time:72ms queue-time:0ms > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: > Operation meta-data_stop_0: ok (node=g5se-f3efce, call=33, rc=0, > cib-update=73, confirmed=true) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/73) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.4 2 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.5 (null) > Feb 19 
15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=5 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + > /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='meta-data']/lrm_rsc_op[@id='meta-data_last_0']: > @operation_key=meta-data_stop_0, @operation=stop, > @crm-debug-origin=do_update_resource, > @transition-key=9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @transition-magic=0:0;9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @call-id=33, @last-run=1455920542, @last-rc-change=1455920542, @exec-time= > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/crmd/73, version=0.291.5) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: match_graph_event: > Action meta-data_stop_0 (9) confirmed on g5se-f3efce (rc=0) > Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: > attrd_trigger_update: Sending flush op to all hosts for: ethmonitor-eth0 > (<null>) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_delete operation for section status to master > (origin=local/attrd/26) > Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: > attrd_perform_update: Sent delete 26: node=g5se-f3efce, > attr=ethmonitor-eth0, id=<n/a>, set=(null), section=status > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.5 2 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.6 (null) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > -- > /cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']/nvpair[@id='status-g5se-f3efce-ethmonitor-eth0'] > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=6 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: > finished - rsc:netmon action:stop call_id:35 
pid:6015 exit-code:0 > exec-time:99ms queue-time:0ms > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_delete operation for section status: OK (rc=0, > origin=g5se-f3efce/attrd/26, version=0.291.6) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: > abort_transition_graph: Transition aborted by deletion of > nvpair[@id='status-g5se-f3efce-ethmonitor-eth0']: Transient attribute change > (cib=0.291.6, source=te_update_diff:391, > path=/cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']/nvpair[@id='status-g5se-f3efce-ethmonitor-eth0'], > 0) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: > Operation netmon_stop_0: ok (node=g5se-f3efce, call=35, rc=0, > cib-update=74, confirmed=true) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/74) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.6 2 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.7 (null) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=7 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > + > /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='netmon']/lrm_rsc_op[@id='netmon_last_0']: > @operation_key=netmon_stop_0, @operation=stop, > @crm-debug-origin=do_update_resource, > @transition-key=10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @transition-magic=0:0;10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @call-id=35, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=99 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/crmd/74, version=0.291.7) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: 
match_graph_event: > Action netmon_stop_0 (10) confirmed on g5se-f3efce (rc=0) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_delete operation for section constraints to master > (origin=local/crm_resource/3) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_delete operation for section constraints: OK (rc=0, > origin=g5se-f3efce/crm_resource/3, version=0.291.7) > IPaddr2[6016]: 2016/02/19_15:22:23 INFO: IP status = ok, IP_CIP= > Feb 19 15:22:23 [1996] g5se-f3efce lrmd: info: log_finished: > finished - rsc:ClusterIP action:stop call_id:37 pid:6016 exit-code:0 > exec-time:127ms queue-time:0ms > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: process_lrm_event: > Operation ClusterIP_stop_0: ok (node=g5se-f3efce, call=37, rc=0, > cib-update=75, confirmed=true) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/75) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.7 2 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.8 (null) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=8 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > + > /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='ClusterIP']/lrm_rsc_op[@id='ClusterIP_last_0']: > @operation_key=ClusterIP_stop_0, @operation=stop, > @transition-key=11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @transition-magic=0:0;11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @call-id=37, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=127 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/crmd/75, version=0.291.8) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: match_graph_event: > 
Action ClusterIP_stop_0 (11) confirmed on g5se-f3efce (rc=0) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: > Transition 8 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-641.bz2): Stopped > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_state_transition: > State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: > pacemaker-1.2-style configuration is also valid for pacemaker-1.3 > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: > Transformation upgrade-1.3.xsl successful > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: > Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: > Your configuration was internally updated to the latest version > (pacemaker-2.0) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: > On loss of CCM Quorum: Ignore > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: > Node g5se-f3efce is in standby-mode > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: > determine_online_status: Node g5se-f3efce is standby > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: > sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Stopped > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: > meta-data (ocf::pacemaker:GBmon): Stopped > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: > netmon (ocf::heartbeat:ethmonitor): Stopped > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: > ClusterIP (ocf::heartbeat:IPaddr2): Stopped > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: > Resource sw-ready-g5se-f3efce cannot run anywhere > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: > Resource meta-data cannot run anywhere > Feb 
19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: > Resource netmon cannot run anywhere > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: > Resource ClusterIP cannot run anywhere > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave > sw-ready-g5se-f3efce (Stopped) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave > meta-data (Stopped) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave > netmon (Stopped) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave > ClusterIP (Stopped) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_state_transition: > State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ > input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_te_invoke: > Processing graph 9 (ref=pe_calc-dc-1455920543-46) derived from > /var/lib/pacemaker/pengine/pe-input-642.bz2 > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: > Transition 9 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-642.bz2): Complete > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_log: FSA: > Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: do_state_transition: > State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: process_pe_message: > Calculated Transition 9: /var/lib/pacemaker/pengine/pe-input-642.bz2 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section constraints to master > (origin=local/crm_resource/3) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.8 2 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.292.0 (null) > Feb 19 15:22:23 [1994] 
g5se-f3efce cib: info: cib_perform_op: > + /cib: @epoch=292, @num_updates=0 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > ++ /cib/configuration/constraints: <rsc_location > id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" > node="g5se-f3efce" score="-INFINITY"/> > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section constraints: OK (rc=0, > origin=g5se-f3efce/crm_resource/3, version=0.292.0) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: > abort_transition_graph: Transition aborted by > rsc_location.cli-ban-ClusterIP-on-g5se-f3efce 'create': Non-status change > (cib=0.292.0, source=te_update_diff:383, path=/cib/configuration/constraints, > 1) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: do_state_transition: > State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC > cause=C_FSA_INTERNAL origin=abort_transition_graph ] > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: > pacemaker-1.2-style configuration is also valid for pacemaker-1.3 > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: > Transformation upgrade-1.3.xsl successful > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: > Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: > Your configuration was internally updated to the latest version > (pacemaker-2.0) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: > On loss of CCM Quorum: Ignore > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: > Node g5se-f3efce is in standby-mode > > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > /var/log/messages snippet showing at the bottom addition and removal of > constraint (this is with pcs 0.9.90 and pacemakerd 1.1.10-1.el6_4.4): > > Feb 21 23:10:38 g5se-dea2b1 azMD[1584]: Sending INIT message to partner. 
> Count 21 > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty1) main process (1732) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty2) main process (1734) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty3) main process (1736) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty4) main process (1738) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty5) main process (1740) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty6) main process (1742) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 avahi-daemon[1473]: Got SIGTERM, quitting. > Feb 21 23:10:41 g5se-dea2b1 avahi-daemon[1473]: Leaving mDNS multicast group > on interface eth0.IPv4 with address 172.20.240.124. > Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azIntTrmHandler] Int Trm handler 15 > Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azExitHandler] exit handler > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.66.3 > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.67.1 > 44c794d4381e36ea4f5d51d0dd7fde1d > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib > admin_epoch="0" epoch="66" num_updates="3"/> > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ > <nvpair id="sw-ready-g5se-dea2b1-meta_attributes-target-role" > name="target-role" value="Stopped"/> > Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop > sw-ready-g5se-dea2b1#011(g5se-dea2b1) > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated 
Transition 32: /var/lib/pacemaker/pengine/pe-input-134.bz2 > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 10: stop sw-ready-g5se-dea2b1_stop_0 on g5se-dea2b1 (local) > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation sw-ready-g5se-dea2b1_stop_0 (call=48, rc=0, cib-update=67, > confirmed=true) ok > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 32 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-134.bz2): Complete > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.69.3 > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.70.1 > 216351853e036a12a96b442b30522287 > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib > admin_epoch="0" epoch="69" num_updates="3"/> > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ > <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" > role="Started" node="g5se-dea2b1" score="-INFINITY"/> > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop > ClusterIP#011(g5se-dea2b1) > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated Transition 35: /var/lib/pacemaker/pengine/pe-input-137.bz2 > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 7: stop ClusterIP_stop_0 on 
g5se-dea2b1 (local) > Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13237]: INFO: IP status = ok, IP_CIP= > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation ClusterIP_stop_0 (call=64, rc=0, cib-update=74, confirmed=true) ok > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 35 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-137.bz2): Complete > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.70.2 > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.71.1 > 453ef48657244dc188b444348eb547ed > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- > <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" > role="Started" node="g5se-dea2b1" score="-INFINITY"/> > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <cib epoch="71" > num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" > cib-last-written="Sun Feb 21 23:10:42 2016" update-origin="g5se-dea2b1" > update-client="crm_resource" crm_feature_set="3.0.7" have-quorum="1" > dc-uuid="g5se-dea2b1"/> > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Start > ClusterIP#011(g5se-dea2b1) > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated Transition 36: /var/lib/pacemaker/pengine/pe-input-138.bz2 > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: 
notice: te_rsc_command: Initiating > action 6: start ClusterIP_start_0 on g5se-dea2b1 (local) > Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azExitHandler] exit handler > Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: ip -f inet addr add > 172.20.240.123/24 brd 172.20.240.255 dev eth0 > Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: ip link set eth0 up > Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: > /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p > /var/run/heartbeat/rsctmp/send_arp-172.20.240.123 eth0 172.20.240.123 auto > not_used not_used > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation ClusterIP_start_0 (call=68, rc=0, cib-update=76, confirmed=true) ok > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 7: monitor ClusterIP_monitor_30000 on g5se-dea2b1 (local) > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation ClusterIP_monitor_30000 (call=71, rc=0, cib-update=77, > confirmed=false) ok > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 36 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-138.bz2): Complete > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:43 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.71.3 > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.72.1 > 4e5a3b6259a59f84bcfec6d0f16ad3ba > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib > admin_epoch="0" epoch="71" num_updates="3"/> > Feb 21 
23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ > <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" > role="Started" node="g5se-dea2b1" score="-INFINITY"/> > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop > ClusterIP#011(g5se-dea2b1) > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated Transition 37: /var/lib/pacemaker/pengine/pe-input-139.bz2 > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 7: stop ClusterIP_stop_0 on g5se-dea2b1 (local) > Feb 21 23:10:43 g5se-dea2b1 IPaddr2[13372]: INFO: IP status = ok, IP_CIP= > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation ClusterIP_stop_0 (call=75, rc=0, cib-update=79, confirmed=true) ok > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 37 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-139.bz2): Complete > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 21 23:10:43 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.72.2 > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.73.1 > 93f902fd51a6750b828144d42f8c7a6e > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: -- > <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" > role="Started" node="g5se-dea2b1" score="-INFINITY"/> > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <cib epoch="73" > num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" > cib-last-written="Sun Feb 21 23:10:43 2016" update-origin="g5se-dea2b1" > update-client="crm_resource" 
crm_feature_set="3.0.7" have-quorum="1" > dc-uuid="g5se-dea2b1"/> > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: LogActions: Start > ClusterIP#011(g5se-dea2b1) > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-140.bz2 > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 6: start ClusterIP_start_0 on g5se-dea2b1 (local) > > > > -----Original Message----- > From: users-requ...@clusterlabs.org [mailto:users-requ...@clusterlabs.org] > Sent: Monday, February 22, 2016 11:42 AM > To: users@clusterlabs.org > Subject: Users Digest, Vol 13, Issue 44 > > Send Users mailing list submissions to > users@clusterlabs.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://clusterlabs.org/mailman/listinfo/users > or, via email, send a message with subject or body 'help' to > users-requ...@clusterlabs.org > > You can reach the person managing the list at > users-ow...@clusterlabs.org > > When replying, please edit your Subject line so it is more specific than "Re: > Contents of Users digest..." > > > Today's Topics: > > 1. Re: fencing by node name or by node ID (Ken Gaillot) > 2. 
Re: ClusterIP location constraint reappears after reboot > (Ken Gaillot) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 22 Feb 2016 11:10:57 -0600 > From: Ken Gaillot <kgail...@redhat.com> > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] fencing by node name or by node ID > Message-ID: <56cb4121.7000...@redhat.com> > Content-Type: text/plain; charset=windows-1252 > > On 02/21/2016 06:19 PM, Ferenc Wágner wrote: >> Hi, >> >> Last night a node in our cluster (Corosync 2.3.5, Pacemaker 1.1.14) >> experienced some failure and fell out of the cluster: >> >> Feb 21 22:11:12 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.9:612) was formed. Members left: 167773709 >> Feb 21 22:11:12 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the >> leave message. failed: 167773709 >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node >> vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node >> vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Removing all vhbl07 attributes >> for attrd_peer_change_cb >> Feb 21 22:11:12 vhbl06 cib[8304]: notice: Removing vhbl07/167773709 from >> the membership list >> Feb 21 22:11:12 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773709 >> and/or uname=vhbl07 from the membership cache >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Lost attribute writer vhbl07 >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Removing vhbl07/167773709 from >> the membership list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: >> Node vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Purged 1 peers with >> id=167773709 and/or uname=vhbl07 from the membership cache >> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: Our peer on the DC (vhbl07) is >> dead >> Feb 21 22:11:12 vhbl06
stonith-ng[8305]: notice: Removing vhbl07/167773709 >> from the membership list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with >> id=167773709 and/or uname=vhbl07 from the membership cache >> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: State transition S_NOT_DC -> >> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK >> origin=peer_update_callback ] >> Feb 21 22:11:12 vhbl06 corosync[3603]: [QUORUM] Members[4]: 167773705 >> 167773706 167773707 167773708 >> Feb 21 22:11:12 vhbl06 corosync[3603]: [MAIN ] Completed service >> synchronization, ready to provide service. >> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node >> vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: >> Node vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 kernel: [343490.563365] dlm: closing connection to >> node 167773709 >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: Operation reboot of >> 167773709 by <no-one> for stonith-api.20937@vhbl03.9c470723: No such device >> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: Peer 167773709 was not >> terminated (reboot) by <anyone> for vhbl03: No such device >> (ref=9c470723-d318-4c7e-a705-ce9ee5c7ffe5) by client 
stonith-api.20937 >> Feb 21 22:11:12 vhbl06 dlm_controld[3641]: 343352 tell corosync to remove >> nodeid 167773705 from cluster >> Feb 21 22:11:15 vhbl06 corosync[3603]: [TOTEM ] A processor failed, >> forming new configuration. >> Feb 21 22:11:19 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.10:616) was formed. Members left: 167773705 >> Feb 21 22:11:19 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the >> leave message. failed: 167773705 >> >> However, no fencing agent reported ability to fence the failing node >> (vhbl07), because stonith-ng wasn't looking it up by name, but by >> numeric ID (at least that's what the logs suggest to me), and the >> pcmk_host_list attributes contained strings like vhbl07. >> >> 1. Was it dlm_controld who requested the fencing? >> >> I suspect it because of the "dlm: closing connection to node >> 167773709" kernel message right before the stonith-ng logs. And >> dlm_controld really hasn't got anything to use but the corosync node >> ID. > > Not based on this; dlm would print messages about fencing, with > "dlm_controld.*fence request". > > However it looks like these logs are not from the DC, which will say what > process requested the fencing. It may be DLM or something else. > Also, DLM on any node might initiate fencing, so it's worth looking at all > the nodes' logs around this time. > >> 2. Shouldn't some component translate between node IDs and node names? >> Is this a configuration error in our setup? Should I include both in >> pcmk_host_list? > > Yes, stonithd's create_remote_stonith_op() function will do the translation > if the st_opt_cs_nodeid call option is set in the request XML. If that fails, > you'll see a "Could not expand nodeid" warning in the log. That option is set > by the kick() stonith API used by DLM, so it should happen automatically. > > I'm not sure why it appears not to have worked here; logs from other nodes > might help. Do corosync and pacemaker know the same node names? 
> That would be necessary to get the node name from corosync. > > Have you tested fencing vhbl07 from the command line with stonith_admin to > make sure fencing is configured correctly? > >> 3. After the failed fence, why was 167773705 (vhbl03) removed from the >> cluster? Because it was chosen to execute the fencing operation, but >> failed? > > dlm_controld explicitly requested it. I'm not familiar enough with DLM to > know why. It doesn't sound like a good idea to me. > >> The logs continue like this: >> >> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node >> vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Removing all vhbl03 attributes >> for attrd_peer_change_cb >> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Removing vhbl03/167773705 from >> the membership list >> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Purged 1 peers with >> id=167773705 and/or uname=vhbl03 from the membership cache >> Feb 21 22:11:19 vhbl06 corosync[3603]: [QUORUM] Members[3]: 167773706 >> 167773707 167773708 >> Feb 21 22:11:19 vhbl06 corosync[3603]: [MAIN ] Completed service >> synchronization, ready to provide service. 
>> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node >> vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: State transition S_ELECTION -> >> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED >> origin=election_timeout_popped ] >> Feb 21 22:11:19 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: >> Node vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node >> vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 cib[8304]: notice: Removing vhbl03/167773705 from >> the membership list >> Feb 21 22:11:19 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773705 >> and/or uname=vhbl03 from the membership cache >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: >> Node vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Removing vhbl03/167773705 >> from the membership list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with >> id=167773705 and/or uname=vhbl03 from the membership cache >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 kernel: [343497.392381] dlm: closing connection >> to node 167773705 >> >> 4. 
Why can't I see any action above to fence 167773705 (vhbl03)? > > Only the DC and the node that executes the fence will have those logs. > The other nodes will just have the query results ("can/can not fence") and > the final stonith result. > >> Feb 21 22:11:19 vhbl06 crmd[8309]: warning: FSA: Input I_ELECTION_DC from >> do_election_check() received in state S_INTEGRATION >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Operation reboot of >> 167773709 by <no-one> for stonith-api.17462@vhbl04.0cd1625d: No such device >> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: Peer 167773709 was not >> terminated (reboot) by <anyone> for vhbl04: No such device >> (ref=0cd1625d-a61e-4f94-930d-bb80a10b89da) by client stonith-api.17462 >> Feb 21 22:11:19 vhbl06 dlm_controld[3641]: 343359 tell corosync to remove >> nodeid 167773706 from cluster >> Feb 21 22:11:22 vhbl06 corosync[3603]: [TOTEM ] A processor failed, >> forming new configuration. >> Feb 21 22:11:26 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.11:620) was formed. Members left: 167773706 >> Feb 21 22:11:26 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the >> leave message. failed: 167773706 >> >> Looks like vhbl04 took over the job of fencing vhbl07 from vhbl03, and >> of course failed the exact same way. So it was expelled, too. 
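As suggested earlier in the thread, exercising the fence configuration directly with stonith_admin would show whether device lookup works by node name, by corosync ID, or not at all. A minimal sketch of that check (node name and ID are taken from the logs above; these commands only make sense when run on a live cluster node):

```shell
# Which devices claim they can fence the node, queried by name?
stonith_admin --list vhbl07

# The same query by corosync node ID shows whether ID-to-name
# translation is the step that fails
stonith_admin --list 167773709

# If the name-based query finds a device, request a real reboot
stonith_admin --reboot vhbl07
```

If the first query succeeds and the second finds nothing, that points at the nodeid expansion rather than the fence devices themselves.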
>> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node >> vhbl04[167773706] - state is now lost (was member) >> Feb 21 22:11:26 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node >> vhbl04[167773706] - state is now lost (was member) >> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Removing all vhbl04 attributes >> for attrd_peer_change_cb >> Feb 21 22:11:26 vhbl06 cib[8304]: notice: Removing vhbl04/167773706 from >> the membership list >> Feb 21 22:11:26 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773706 >> and/or uname=vhbl04 from the membership cache >> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Removing vhbl04/167773706 from >> the membership list >> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Purged 1 peers with >> id=167773706 and/or uname=vhbl04 from the membership cache >> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: >> Node vhbl04[167773706] - state is now lost (was member) >> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: Removing vhbl04/167773706 >> from the membership list >> Feb 21 22:11:26 vhbl06 crmd[8309]: warning: No match for shutdown action on >> 167773706 >> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with >> id=167773706 and/or uname=vhbl04 from the membership cache >> Feb 21 22:11:26 vhbl06 crmd[8309]: notice: Stonith/shutdown of vhbl04 not >> matched >> Feb 21 22:11:26 vhbl06 corosync[3603]: [QUORUM] This node is within the >> non-primary component and will NOT provide any services. >> Feb 21 22:11:26 vhbl06 pacemakerd[8261]: notice: Membership 620: quorum >> lost (2) >> Feb 21 22:11:26 vhbl06 crmd[8309]: notice: Membership 620: quorum lost (2) >> Feb 21 22:11:26 vhbl06 corosync[3603]: [QUORUM] Members[2]: 167773707 >> 167773708 >> >> That, finally, was enough to lose quorum and paralyze the cluster. 
>> Later, vhbl07 was rebooted by the hardware watchdog and came back for >> a cold welcome: >> >> Feb 21 22:24:53 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.12:628) was formed. Members joined: 167773709 >> Feb 21 22:24:53 vhbl06 corosync[3603]: [QUORUM] Members[2]: 167773708 >> 167773709 >> Feb 21 22:24:53 vhbl06 corosync[3603]: [MAIN ] Completed service >> synchronization, ready to provide service. >> Feb 21 22:24:53 vhbl06 crmd[8309]: notice: pcmk_quorum_notification: Node >> vhbl07[167773709] - state is now member (was lost) >> Feb 21 22:24:53 vhbl06 pacemakerd[8261]: notice: pcmk_quorum_notification: >> Node vhbl07[167773709] - state is now member (was lost) >> Feb 21 22:24:53 vhbl06 dlm_controld[3641]: 344173 daemon joined >> 167773709 needs fencing Feb 21 22:25:47 vhbl06 dlm_controld[3641]: 344226 >> clvmd wait for quorum >> Feb 21 22:29:26 vhbl06 crmd[8309]: notice: State transition S_IDLE -> >> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED >> origin=crm_timer_popped ] >> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: We do not have quorum - >> fencing and resource management disabled >> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl04 is unclean >> because the node is no longer part of the cluster Feb 21 22:29:27 >> vhbl06 pengine[8308]: warning: Node vhbl04 is unclean Feb 21 22:29:27 >> vhbl06 pengine[8308]: warning: Node vhbl05 is unclean because the >> node is no longer part of the cluster Feb 21 22:29:27 vhbl06 >> pengine[8308]: warning: Node vhbl05 is unclean Feb 21 22:29:27 vhbl06 >> pengine[8308]: warning: Node vhbl07 is unclean because our peer >> process is no longer available Feb 21 22:29:27 vhbl06 pengine[8308]: >> warning: Node vhbl07 is unclean Feb 21 22:29:27 vhbl06 pengine[8308]: >> warning: Node vhbl03 is unclean because vm-niifdc is thought to be >> active there Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Action >> vm-dogwood_stop_0 on vhbl03 is unrunnable (offline) Feb 21 22:29:27 vhbl06 >> 
pengine[8308]: warning: Action vm-niifidp_stop_0 on vhbl03 is unrunnable >> (offline) [...] Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl03 >> is unclean! >> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl04 is unclean! >> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl05 is unclean! >> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: We can fence vhbl07 without >> quorum because they're in our membership >> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Scheduling Node vhbl07 for >> STONITH >> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: Cannot fence unclean nodes >> until quorum is attained (or no-quorum-policy is set to ignore) >> [...] >> Feb 21 22:29:27 vhbl06 crmd[8309]: notice: Executing reboot fencing >> operation (212) on vhbl07 (timeout=60000) >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: Client crmd.8309.09cea2e7 >> wants to fence (reboot) 'vhbl07' with device '(any)' >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: Initiating remote >> operation reboot for vhbl07: 31b2023d-3fc5-419e-8490-91eb81254497 (0) >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can fence >> (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can fence >> (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 
stonith-ng[8305]: notice: fencing-vhbl01 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 dlm_controld[3641]: 344447 daemon remove >> 167773709 already needs fencing Feb 21 22:29:27 vhbl06 dlm_controld[3641]: >> 344447 tell corosync to remove nodeid 167773709 from cluster >> Feb 21 22:29:30 vhbl06 corosync[3603]: [TOTEM ] A processor failed, >> forming new configuration. >> Feb 21 22:29:34 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.12:632) was formed. Members left: 167773709 >> Feb 21 22:29:34 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the >> leave message. failed: 167773709 >> Feb 21 22:29:34 vhbl06 corosync[3603]: [QUORUM] Members[1]: 167773708 >> Feb 21 22:29:34 vhbl06 corosync[3603]: [MAIN ] Completed service >> synchronization, ready to provide service. 
>> Feb 21 22:29:34 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: >> Node vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:29:34 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node >> vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:29:34 vhbl06 kernel: [344592.424938] dlm: closing connection to >> node 167773709 >> Feb 21 22:29:42 vhbl06 stonith-ng[8305]: notice: Operation 'reboot' [5533] >> (call 2 from crmd.8309) for host 'vhbl07' with device 'fencing-vhbl07' >> returned: 0 (OK) >> Feb 21 22:29:42 vhbl06 stonith-ng[8305]: notice: Operation reboot of >> vhbl07 by vhbl06 for crmd.8309@vhbl06.31b2023d: OK >> Feb 21 22:29:42 vhbl06 crmd[8309]: notice: Stonith operation >> 2/212:1:0:d06e9743-b452-4b6a-b3a9-d352a4454269: OK (0) >> Feb 21 22:29:42 vhbl06 crmd[8309]: notice: Peer vhbl07 was terminated >> (reboot) by vhbl06 for vhbl06: OK (ref=31b2023d-3fc5-419e-8490-91eb81254497) >> by client crmd.8309 >> >> That is, fencing by node name worked all right. >> >> I wonder if I understood the issue right and what would be the best >> way to avoid it in the future. Please advise. >> > > > > > ------------------------------ > > Message: 2 > Date: Mon, 22 Feb 2016 11:39:03 -0600 > From: Ken Gaillot <kgail...@redhat.com> > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] ClusterIP location constraint reappears > after reboot > Message-ID: <56cb47b7.3060...@redhat.com> > Content-Type: text/plain; charset=windows-1252 > > On 02/22/2016 07:26 AM, Jeremy Matthews wrote: >> Thank you, Ken Gaillot, for your response. Sorry for the delayed followup, >> but I have looked and looked at the scripts. There are a couple of scripts >> that have a pcs resource ban command, but they are not executed at the time >> of shutdown which is when I've discovered that the constraint is put back >> in. Our application software did not change on the system. We just updated >> pcs and pacemaker (and dependencies). 
I had to rollback pcs because it has >> an issue. >> >> Below is from /var/log/cluster/corosync.log. Any clues here as to why the >> constraint might have been added? In my other system without the pacemaker >> update, there is not the addition of the constraint. > > It might help to see the entire log from the time you issued the reboot > command to when the constraint was added. > > Notice in the cib logs it says "origin=local/crm_resource". That means that > crm_resource was what originally added the constraint (pcs resource ban calls > crm_resource). > > I'd be curious whether this makes a difference: after removing the > constraint, run "pcs cib-upgrade". It shouldn't, but it's the only thing I > can think of to try. > > CIB schema versions change when new features are added that require new CIB > syntax. pcs should automatically run cib-upgrade if you ever use a newer > feature than your current CIB version supports. You don't really need to > cib-upgrade explicitly, but it doesn't hurt, and it will get rid of those > "Transformed the configuration" messages. 
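For pcs versions that predate "pcs cluster cib-upgrade", the schema upgrade described above can be driven with cibadmin directly, which is what pcs wraps. A cautious sketch (take a backup first; run on a cluster node):

```shell
# Save a copy of the current CIB before touching the schema
pcs cluster cib > cib-backup.xml

# Upgrade the configuration to the latest schema
# (the same thing "pcs cluster cib-upgrade" does behind the scenes)
cibadmin --upgrade --force

# The validate-with attribute on the <cib> element should now name
# the newest schema, and the "Transformed the configuration" messages
# should stop appearing
cibadmin --query | grep -o 'validate-with="[^"]*"'
```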
> >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: >> do_state_transition: State transition S_POLICY_ENGINE -> >> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE >> origin=handle_response ] >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_te_invoke: >> Processing graph 9 (ref=pe_calc-dc-1455920543-46) derived from >> /var/lib/pacemaker/pengine/pe-input-642.bz2 >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: >> Transition 9 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, >> Source=/var/lib/pacemaker/pengine/pe-input-642.bz2): Complete >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_log: FSA: >> Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: >> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ >> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: process_pe_message: >> Calculated Transition 9: /var/lib/pacemaker/pengine/pe-input-642.bz2 >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: >> cib_process_request: Forwarding cib_modify operation for section >> constraints to master (origin=local/crm_resource/3) >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: >> Diff: --- 0.291.8 2 >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: >> Diff: +++ 0.292.0 (null) >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: >> + /cib: @epoch=292, @num_updates=0 >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: >> ++ /cib/configuration/constraints: <rsc_location >> id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" >> node="g5se-f3efce" score="-INFINITY"/> >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: >> cib_process_request: Completed cib_modify operation for section >> constraints: OK (rc=0, origin=g5se-f3efce/crm_resource/3, version=0.292.0) >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: >> 
abort_transition_graph: Transition aborted by >> rsc_location.cli-ban-ClusterIP-on-g5se-f3efce 'create': Non-status change >> (cib=0.292.0, source=te_update_diff:383, >> path=/cib/configuration/constraints, 1) >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: >> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ >> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: >> pacemaker-1.2-style configuration is also valid for pacemaker-1.3 >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: >> Transformation upgrade-1.3.xsl successful >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: >> Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: >> Your configuration was internally updated to the latest version >> (pacemaker-2.0) >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: >> On loss of CCM Quorum: Ignore >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: >> Node g5se-f3efce is in standby-mode >> >> I'm not sure what all has to be included my original email and Ken Gaillot's >> response embedded in it below. >> >> Message: 3 >> Date: Thu, 18 Feb 2016 13:37:31 -0600 >> From: Ken Gaillot <kgail...@redhat.com> >> To: users@clusterlabs.org >> Subject: Re: [ClusterLabs] ClusterIP location constraint reappears >> after reboot >> Message-ID: <56c61d7b.9090...@redhat.com> >> Content-Type: text/plain; charset=windows-1252 >> >> On 02/18/2016 01:07 PM, Jeremy Matthews wrote: >>> Hi, >>> >>> We're having an issue with our cluster where after a reboot of our system a >>> location constraint reappears for the ClusterIP. This causes a problem, >>> because we have a daemon that checks the cluster state and waits until the >>> ClusterIP is started before it kicks off our application. 
We didn't have >>> this issue when using an earlier version of pacemaker. Here is the >>> constraint as shown by pcs: >>> >>> [root@g5se-f3efce cib]# pcs constraint Location Constraints: >>> Resource: ClusterIP >>> Disabled on: g5se-f3efce (role: Started) Ordering Constraints: >>> Colocation Constraints: >>> >>> ...and here is our cluster status with the ClusterIP being Stopped: >>> >>> [root@g5se-f3efce cib]# pcs status >>> Cluster name: cl-g5se-f3efce >>> Last updated: Thu Feb 18 11:36:01 2016 Last change: Thu Feb 18 >>> 10:48:33 2016 via crm_resource on g5se-f3efce >>> Stack: cman >>> Current DC: g5se-f3efce - partition with quorum >>> Version: 1.1.11-97629de >>> 1 Nodes configured >>> 4 Resources configured >>> >>> >>> Online: [ g5se-f3efce ] >>> >>> Full list of resources: >>> >>> sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce >>> meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce >>> netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce >>> ClusterIP (ocf::heartbeat:IPaddr2): Stopped >>> >>> >>> The cluster really just has one node at this time. 
>>> >>> I retrieve the constraint ID, remove the constraint, verify that ClusterIP >>> is started, and then reboot: >>> >>> [root@g5se-f3efce cib]# pcs constraint ref ClusterIP >>> Resource: ClusterIP >>> cli-ban-ClusterIP-on-g5se-f3efce >>> [root@g5se-f3efce cib]# pcs constraint remove >>> cli-ban-ClusterIP-on-g5se-f3efce >>> >>> [root@g5se-f3efce cib]# pcs status >>> Cluster name: cl-g5se-f3efce >>> Last updated: Thu Feb 18 11:45:09 2016 Last change: Thu Feb 18 >>> 11:44:53 2016 via crm_resource on g5se-f3efce >>> Stack: cman >>> Current DC: g5se-f3efce - partition with quorum >>> Version: 1.1.11-97629de >>> 1 Nodes configured >>> 4 Resources configured >>> >>> >>> Online: [ g5se-f3efce ] >>> >>> Full list of resources: >>> >>> sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce >>> meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce >>> netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce >>> ClusterIP (ocf::heartbeat:IPaddr2): Started g5se-f3efce >>> >>> >>> [root@g5se-f3efce cib]# reboot >>> >>> ....after reboot, log in, and the constraint is back and ClusterIP has not >>> started. >>> >>> >>> I have noticed in /var/lib/pacemaker/cib that the cib-x.raw files get >>> created when there are changes to the cib (cib.xml). 
After a reboot, I see >>> the constraint being added in a diff between .raw files: >>> >>> [root@g5se-f3efce cib]# diff cib-7.raw cib-8.raw >>> 1c1 >>> < <cib epoch="239" num_updates="0" admin_epoch="0" >>> validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:44:53 >>> 2016" update-origin="g5se-f3efce" update-client="crm_resource" >>> crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce"> >>> --- >>>> <cib epoch="240" num_updates="0" admin_epoch="0" >>>> validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:46:49 >>>> 2016" update-origin="g5se-f3efce" update-client="crm_resource" >>>> crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce"> >>> 50c50,52 >>> < <constraints/> >>> --- >>>> <constraints> >>>> <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" >>>> role="Started" node="g5se-f3efce" score="-INFINITY"/> >>>> </constraints> >>> >>> >>> I have also looked in /var/log/cluster/corosync.log and seen logs where it >>> seems the cib is getting updated. I'm not sure if the constraint is being >>> put back in at shutdown or at start up. I just don't understand why it's >>> being put back in. I don't think our daemon code or other scripts are doing >>> this, but it is something I could verify. >> >> I would look at any scripts running around that time first. Constraints that >> start with "cli-" were created by one of the CLI tools, so something must be >> calling it. The most likely candidates are pcs resource move/ban or >> crm_resource -M/--move/-B/--ban. 
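Since constraints created by the CLI tools all carry the "cli-" prefix mentioned above, scanning a saved copy of the CIB is a quick way to spot them. A small sketch; the XML here is a cut-down stand-in for a real export ("pcs cluster cib > cib.xml" on a cluster node), using the constraint from this thread:

```shell
# Stand-in for a real CIB export; on a live node you would instead run:
#   pcs cluster cib > cib.xml
cat > cib.xml <<'EOF'
<cib>
  <configuration>
    <constraints>
      <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP"
                    role="Started" node="g5se-f3efce" score="-INFINITY"/>
    </constraints>
  </configuration>
</cib>
EOF

# Constraints created by pcs resource ban/move or crm_resource -B/-M
# carry the "cli-" prefix; list their IDs
grep -o 'id="cli-[^"]*"' cib.xml | sed 's/^id="//; s/"$//'
```

On a live cluster, "pcs resource clear ClusterIP" removes any such ban/move constraints for the resource in one step, which is less error-prone than deleting them by ID.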
>> >>> ******************************** >>> >>> From "yum info pacemaker", my current version is: >>> >>> Name : pacemaker >>> Arch : x86_64 >>> Version : 1.1.12 >>> Release : 8.el6_7.2 >>> >>> My earlier version was: >>> >>> Name : pacemaker >>> Arch : x86_64 >>> Version : 1.1.10 >>> Release : 1.el6_4.4 >>> >>> I'm still using an earlier version pcs, because the new one seems to have >>> issues with python: >>> >>> Name : pcs >>> Arch : noarch >>> Version : 0.9.90 >>> Release : 1.0.1.el6.centos >>> >>> ******************************* >>> >>> If anyone has ideas on the cause or thoughts on this, anything would be >>> greatly appreciated. >>> >>> Thanks! >>> >>> >>> >>> Jeremy Matthews >> >> -----Original Message----- >> From: users-requ...@clusterlabs.org >> [mailto:users-requ...@clusterlabs.org] >> Sent: Friday, February 19, 2016 2:21 AM >> To: users@clusterlabs.org >> Subject: Users Digest, Vol 13, Issue 35 >> >> Send Users mailing list submissions to >> users@clusterlabs.org >> >> To subscribe or unsubscribe via the World Wide Web, visit >> http://clusterlabs.org/mailman/listinfo/users >> or, via email, send a message with subject or body 'help' to >> users-requ...@clusterlabs.org >> >> You can reach the person managing the list at >> users-ow...@clusterlabs.org >> >> When replying, please edit your Subject line so it is more specific than >> "Re: Contents of Users digest..." >> >> >> Today's Topics: >> >> 1. Re: Too quick node reboot leads to failed corosync assert on >> other node(s) (Michal Koutn?) >> 2. ClusterIP location constraint reappears after reboot >> (Jeremy Matthews) >> 3. Re: ClusterIP location constraint reappears after reboot >> (Ken Gaillot) >> 4. Re: Too quick node reboot leads to failed corosync assert on >> other node(s) (Jan Friesse) >> >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Thu, 18 Feb 2016 17:32:48 +0100 >> From: Michal Koutn? 
<mkou...@suse.com> >> To: users@clusterlabs.org >> Subject: Re: [ClusterLabs] Too quick node reboot leads to failed >> corosync assert on other node(s) >> Message-ID: <56c5f230.6020...@suse.com> >> Content-Type: text/plain; charset="windows-1252" >> >> On 02/18/2016 10:40 AM, Christine Caulfield wrote: >>> I definitely remember looking into this, or something very like it, >>> ages ago. I can't find anything in the commit logs for either >>> corosync or cman that looks relevant though. If you're seeing it on >>> recent builds then it's obviously still a problem anyway and we ought to >>> look into it! >> Thanks for you replies. >> >> So far this happened only once and we've done only "post mortem", alas no >> available reproducer. If I have time, I'll try to reproduce it locally and >> check whether it exists in the current version. >> >> Michal >> >> -------------- next part -------------- A non-text attachment was >> scrubbed... >> Name: signature.asc >> Type: application/pgp-signature >> Size: 819 bytes >> Desc: OpenPGP digital signature >> URL: >> <http://clusterlabs.org/pipermail/users/attachments/20160218/97908c9d/ >> attachment-0001.sig> >> >> ------------------------------ >> >> Message: 2 >> Date: Thu, 18 Feb 2016 19:07:19 +0000 >> From: Jeremy Matthews <jeremy.matth...@genband.com> >> To: "users@clusterlabs.org" <users@clusterlabs.org> >> Subject: [ClusterLabs] ClusterIP location constraint reappears after >> reboot >> Message-ID: >> <ba3fced1d982a94aa64964f08b104956012d760...@gbplmail01.genband.com> >> Content-Type: text/plain; charset="windows-1252" >> >> Hi, >> >> We're having an issue with our cluster where after a reboot of our system a >> location constraint reappears for the ClusterIP. This causes a problem, >> because we have a daemon that checks the cluster state and waits until the >> ClusterIP is started before it kicks off our application. We didn't have >> this issue when using an earlier version of pacemaker. 
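[The "waits until the ClusterIP is started" gate can be as simple as grepping the status output. A sketch with an invented function name, not the poster's actual daemon code:]

```shell
# Sketch of a wait-for-ClusterIP gate (invented helper, not the real daemon).
# Reads "pcs status"-style output on stdin; succeeds once a line like
#   ClusterIP (ocf::heartbeat:IPaddr2): Started <node>
# is present.
clusterip_started() {
  grep -q 'ClusterIP.*Started'
}

# Typical polling loop on the real system (assumes pcs is on the PATH):
#   until pcs status | clusterip_started; do sleep 5; done
```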
Here is the constraint as shown by pcs:
>>
>> [root@g5se-f3efce cib]# pcs constraint
>> Location Constraints:
>>   Resource: ClusterIP
>>     Disabled on: g5se-f3efce (role: Started)
>> Ordering Constraints:
>> Colocation Constraints:
>>
>> ...and here is our cluster status with the ClusterIP being Stopped:
>>
>> [root@g5se-f3efce cib]# pcs status
>> Cluster name: cl-g5se-f3efce
>> Last updated: Thu Feb 18 11:36:01 2016
>> Last change: Thu Feb 18 10:48:33 2016 via crm_resource on g5se-f3efce
>> Stack: cman
>> Current DC: g5se-f3efce - partition with quorum
>> Version: 1.1.11-97629de
>> 1 Nodes configured
>> 4 Resources configured
>>
>> Online: [ g5se-f3efce ]
>>
>> Full list of resources:
>>
>>  sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce
>>  meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce
>>  netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce
>>  ClusterIP (ocf::heartbeat:IPaddr2): Stopped
>>
>> The cluster really just has one node at this time.
>>
>> I retrieve the constraint ID, remove the constraint, verify that ClusterIP is started, and then reboot:
>>
>> [root@g5se-f3efce cib]# pcs constraint ref ClusterIP
>> Resource: ClusterIP
>>   cli-ban-ClusterIP-on-g5se-f3efce
>> [root@g5se-f3efce cib]# pcs constraint remove cli-ban-ClusterIP-on-g5se-f3efce
>>
>> [root@g5se-f3efce cib]# pcs status
>> Cluster name: cl-g5se-f3efce
>> Last updated: Thu Feb 18 11:45:09 2016
>> Last change: Thu Feb 18 11:44:53 2016 via crm_resource on g5se-f3efce
>> Stack: cman
>> Current DC: g5se-f3efce - partition with quorum
>> Version: 1.1.11-97629de
>> 1 Nodes configured
>> 4 Resources configured
>>
>> Online: [ g5se-f3efce ]
>>
>> Full list of resources:
>>
>>  sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce
>>  meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce
>>  netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce
>>  ClusterIP (ocf::heartbeat:IPaddr2): Started g5se-f3efce
>>
>> [root@g5se-f3efce cib]# reboot
>>
>> ....after reboot, log in, and the constraint is back and ClusterIP has not started.
>>
>> I have noticed in /var/lib/pacemaker/cib that the cib-x.raw files get created when there are changes to the cib (cib.xml). After a reboot, I see the constraint being added in a diff between .raw files:
>>
>> [root@g5se-f3efce cib]# diff cib-7.raw cib-8.raw
>> 1c1
>> < <cib epoch="239" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:44:53 2016" update-origin="g5se-f3efce" update-client="crm_resource" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce">
>> ---
>>> <cib epoch="240" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:46:49 2016" update-origin="g5se-f3efce" update-client="crm_resource" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce">
>> 50c50,52
>> < <constraints/>
>> ---
>>> <constraints>
>>>   <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" node="g5se-f3efce" score="-INFINITY"/>
>>> </constraints>
>>
>> I have also looked in /var/log/cluster/corosync.log and seen logs where it seems the cib is getting updated. I'm not sure if the constraint is being put back in at shutdown or at startup. I just don't understand why it's being put back in. I don't think our daemon code or other scripts are doing this, but it is something I could verify.
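[The manual diff of cib-7.raw and cib-8.raw can be automated. A sketch, with an invented helper name, that reports the first numbered CIB archive containing the ban constraint; the default directory follows the listing above:]

```shell
# Sketch (invented helper): scan the numbered CIB archives for the first
# file containing the cli-ban constraint, to pin down when it was re-added.
# Note: the shell glob sorts lexically, so with more than nine archives
# cib-10.raw comes before cib-2.raw; pipe through `sort -V` or compare
# mtimes if that matters.
find_first_ban() {
  dir=${1:-/var/lib/pacemaker/cib}
  for f in "$dir"/cib-*.raw; do
    [ -e "$f" ] || continue
    if grep -q 'cli-ban-ClusterIP' "$f"; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}
```

[Comparing that file's mtime against the reboot time would answer the shutdown-vs-startup question raised above.]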
>>
>> ********************************
>>
>> From "yum info pacemaker", my current version is:
>>
>> Name    : pacemaker
>> Arch    : x86_64
>> Version : 1.1.12
>> Release : 8.el6_7.2
>>
>> My earlier version was:
>>
>> Name    : pacemaker
>> Arch    : x86_64
>> Version : 1.1.10
>> Release : 1.el6_4.4
>>
>> I'm still using an earlier version of pcs, because the new one seems to have issues with python:
>>
>> Name    : pcs
>> Arch    : noarch
>> Version : 0.9.90
>> Release : 1.0.1.el6.centos
>>
>> *******************************
>>
>> If anyone has ideas on the cause or thoughts on this, anything would be greatly appreciated.
>>
>> Thanks!
>>
>> Jeremy Matthews
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Thu, 18 Feb 2016 13:37:31 -0600
>> From: Ken Gaillot <kgail...@redhat.com>
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] ClusterIP location constraint reappears after reboot
>>
>> On 02/18/2016 01:07 PM, Jeremy Matthews wrote:
>>> Hi,
>>>
>>> We're having an issue with our cluster where, after a reboot of our system, a location constraint reappears for the ClusterIP. This causes a problem, because we have a daemon that checks the cluster state and waits until the ClusterIP is started before it kicks off our application. We didn't have this issue when using an earlier version of pacemaker.
>>> [snip -- the remainder of the quoted message is identical to Message 2 above]
>>
>> I would look at any scripts running around that time first. Constraints that start with "cli-" were created by one of the CLI tools, so something must be calling it. The most likely candidates are "pcs resource move/ban" or "crm_resource -M/--move/-B/--ban".
>>> [snip]
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Fri, 19 Feb 2016 09:18:22 +0100
>> From: Jan Friesse <jfrie...@redhat.com>
>> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] Too quick node reboot leads to failed corosync assert on other node(s)
>>
>> Michal Koutný napsal(a):
>>> On 02/18/2016 10:40 AM, Christine Caulfield wrote:
>>>> I definitely remember looking into this, or something very like it, ages ago. I can't find anything in the commit logs for either corosync or cman that looks relevant though. If you're seeing it on recent builds then it's obviously still a problem anyway and we ought to look into it!
>>>
>>> Thanks for your replies.
>>>
>>> So far this has happened only once and we've done only a "post mortem"; alas, no reproducer is available. If I have time, I'll try to reproduce it
>>
>> Ok. Actually I was trying to reproduce this and was really not successful (current master).
Steps I've used:
>> - 2 nodes, token set to 30 sec
>> - execute cpgbench on node2
>> - pause node1 corosync (ctrl+z), kill node1 corosync (kill -9 %1)
>> - wait until corosync on node2 moves into "entering GATHER state from..."
>> - execute corosync on node1
>>
>> Basically, during recovery the new node's trans list was never sent (and/or was ignored by node2).
>>
>> I'm going to try testing v1.4.7, but it's also possible that the bug is fixed by other commits (my favorites are cfbb021e130337603fe5b545d1e377296ecb92ea, 4ee84c51fa73c4ec7cbee922111a140a3aaf75df, f135b680967aaef1d466f40170c75ae3e470e147).
>>
>> Regards,
>> Honza
>>
>>> locally and check whether it exists in the current version.
>>>
>>> Michal

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
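[The "pause, then kill" part of Honza's reproduction steps boils down to SIGSTOP followed by SIGKILL. A sketch of just the signal mechanics, demonstrated on a placeholder `sleep` process rather than a real corosync daemon:]

```shell
# Signal sequence from the reproduction steps, on a stand-in process.
# Ctrl+Z in a shell sends SIGTSTP; SIGSTOP has the same stopping effect and
# cannot be caught. kill -9 is SIGKILL.
sleep 300 &
pid=$!
kill -STOP "$pid"                             # "pause node1 corosync (ctrl+z)"
sleep 1                                       # let the kernel mark it stopped
state=$(ps -o stat= -p "$pid" | tr -d ' ')    # stopped processes report "T"
kill -9 "$pid"                                # "kill node1 corosync (kill -9 %1)"
wait "$pid" 2>/dev/null || true               # reap; status 137 means SIGKILL
echo "state while paused: $state"
```

[The 30-second token in the steps above is what gives node2 time to enter the GATHER state before corosync is restarted on node1.]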