On 02/22/2016 05:23 PM, Jeremy Matthews wrote:
> Thanks for the quick response again, and pardon the delay in responding. A
> colleague of mine and I have been trying some different things today.
>
> From the reboot on Friday, further below are the logs from corosync.log,
> from the time of the reboot command to the constraint being added.
>
> I am not able to perform a "pcs cluster cib-upgrade". The version of pcs
> that I have does not have that option (just "cib [filename]" and "cib-push
> <filename>"). My versions at the time of these logs were:
I'm curious whether you were able to solve your issue.

Regarding cib-upgrade: you can use the "cibadmin --upgrade" command instead,
which is what pcs runs behind the scenes. For a better-safe-than-sorry
how-to, see:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_upgrading_the_configuration

> [root@g5se-f3efce Packages]# pcs --version
> 0.9.90
> [root@g5se-f3efce Packages]# pacemakerd --version
> Pacemaker 1.1.11
> Written by Andrew Beekhof
>
> I think you're right in that we had a script banning the ClusterIP. It is
> called from a message daemon that we created, which acts as middleware
> between the cluster software and our application. The daemon has an exit
> handler that calls a script which runs:
>
>     pcs resource ban ClusterIP $host    # where $host is the result of host=`hostname`
>
> ...because we normally try to push the cluster IP to the other side (though
> in this case we just have one node), but right after that the script calls:
>
>     pcs resource clear ClusterIP
>
> ...but for some reason it doesn't seem to result in the constraint being
> removed (see even further below, where I show a /var/log/messages snippet
> with both the constraint addition and removal; that was with an earlier
> version of pacemaker, 1.1.10-1.el6_4.4). I guess with the earlier pcs or
> pacemaker version these logs went to /var/log/messages rather than to
> corosync.log as they do today.
>
> I am in a bit of a conundrum: if I upgrade pcs to 0.9.149 (retrieved and
> "make install"'ed from github.com, because 0.9.139 had a pcs issue with
> one-node clusters), which does have the cib-upgrade option, then manually
> removing the ClusterIP constraint causes a problem for our message daemon
> in that it thinks neither side of the cluster is active; something to look
> at on our end. So it seems the removal of the constraint affects our daemon
> with the new pcs.
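As an aside, one way to see whether the "clear" actually removes the ban is
to list the constraints between the two calls. A sketch of that check, using
the resource name from your script (this has to run against a live cluster,
so treat it as illustration rather than a tested recipe):

```shell
host=$(hostname)

# Ban the resource from this node; pcs adds a location constraint
# named cli-ban-ClusterIP-on-<host> with score -INFINITY
pcs resource ban ClusterIP "$host"

# The ban should now show up in the constraint list
pcs constraint

# Remove the ban again
pcs resource clear ClusterIP

# If cli-ban-ClusterIP-on-<host> still appears here, the clear
# did not take effect
pcs constraint
```

If the constraint reappears afterwards, something else (for example the exit
handler running a second time, as your later logs suggest) is re-adding it.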
> For the time being, I've rolled back pcs to the above 0.9.90 version.
>
> One other thing to mention: the timing of pacemaker's start may have been
> delayed by what I found was a change to its init-script header (made by
> either our daemon or our application installation script) from "90 1" to
> "70 20". So in /etc/rc3.d there is S70pacemaker rather than S90pacemaker. I
> am not a Linux expert by any means; I guess that may affect startup, but
> I'm not sure about shutdown.
>
> Corosync logs from the time the reboot was issued to the constraint being
> added:
>
> Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: > attrd_trigger_update: Sending flush op to all hosts for: standby (true) > Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: > attrd_perform_update: Sent update 24: standby=true > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/attrd/24) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.2 2 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.3 (null) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=3 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > ++ > /cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']: > <nvpair id="status-g5se-f3efce-standby" name="standby" value="true"/> > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: > abort_transition_graph: Transition aborted by > status-g5se-f3efce-standby, standby=true: Transient attribute change (create > cib=0.291.3, source=te_update_diff:391, > path=/cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce'], > 1) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: do_state_transition: > State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC >
cause=C_FSA_INTERNAL origin=abort_transition_graph ] > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/attrd/24, version=0.291.3) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: update_validation: > pacemaker-1.2-style configuration is also valid for pacemaker-1.3 > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: update_validation: > Transformation upgrade-1.3.xsl successful > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: update_validation: > Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: cli_config_update: > Your configuration was internally updated to the latest version > (pacemaker-2.0) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: unpack_config: > On loss of CCM Quorum: Ignore > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: unpack_status: > Node g5se-f3efce is in standby-mode > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: > determine_online_status: Node g5se-f3efce is standby > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: > sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: > meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: > netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_print: > ClusterIP (ocf::heartbeat:IPaddr2): Started g5se-f3efce > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: > Resource sw-ready-g5se-f3efce cannot run anywhere > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: > Resource meta-data cannot run anywhere > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: > Resource netmon cannot run anywhere > Feb 19 15:22:22 [1998] g5se-f3efce pengine: info: native_color: > 
Resource ClusterIP cannot run anywhere > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop > sw-ready-g5se-f3efce (g5se-f3efce) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop > meta-data (g5se-f3efce) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop > netmon (g5se-f3efce) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: LogActions: Stop > ClusterIP (g5se-f3efce) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_state_transition: > State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ > input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_te_invoke: > Processing graph 8 (ref=pe_calc-dc-1455920542-41) derived from > /var/lib/pacemaker/pengine/pe-input-641.bz2 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: > Initiating action 8: stop sw-ready-g5se-f3efce_stop_0 on g5se-f3efce (local) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: > cancel_recurring_action: Cancelling operation > sw-ready-g5se-f3efce_monitor_10000 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: > Performing key=8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed > op=sw-ready-g5se-f3efce_stop_0 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: > executing - rsc:sw-ready-g5se-f3efce action:stop call_id:31 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: > Initiating action 9: stop meta-data_stop_0 on g5se-f3efce (local) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: > cancel_recurring_action: Cancelling operation meta-data_monitor_60000 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: > Performing key=9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed > op=meta-data_stop_0 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: > executing - rsc:meta-data action:stop call_id:33 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: > Initiating action 10: stop netmon_stop_0 on 
g5se-f3efce (local) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: > cancel_recurring_action: Cancelling operation netmon_monitor_10000 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: > Performing key=10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed op=netmon_stop_0 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: > executing - rsc:netmon action:stop call_id:35 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: te_rsc_command: > Initiating action 11: stop ClusterIP_stop_0 on g5se-f3efce (local) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: > cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: do_lrm_rsc_op: > Performing key=11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed > op=ClusterIP_stop_0 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_execute: > executing - rsc:ClusterIP action:stop call_id:37 > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: > Operation sw-ready-g5se-f3efce_monitor_10000: Cancelled (node=g5se-f3efce, > call=29, confirmed=true) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: > Operation meta-data_monitor_60000: Cancelled (node=g5se-f3efce, call=21, > confirmed=true) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: > Operation netmon_monitor_10000: Cancelled (node=g5se-f3efce, call=23, > confirmed=true) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: process_lrm_event: > Operation ClusterIP_monitor_30000: Cancelled (node=g5se-f3efce, call=25, > confirmed=true) > Feb 19 15:22:22 [1998] g5se-f3efce pengine: notice: process_pe_message: > Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-641.bz2 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: > finished - rsc:sw-ready-g5se-f3efce action:stop call_id:31 pid:6013 > exit-code:0 exec-time:56ms queue-time:0ms > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: > Operation sw-ready-g5se-f3efce_stop_0: ok 
(node=g5se-f3efce, call=31, rc=0, > cib-update=72, confirmed=true) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/72) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.3 2 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.4 (null) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=4 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + > /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='sw-ready-g5se-f3efce']/lrm_rsc_op[@id='sw-ready-g5se-f3efce_last_0']: > @operation_key=sw-ready-g5se-f3efce_stop_0, @operation=stop, > @transition-key=8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @transition-magic=0:0;8:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @call-id=31, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=56 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/crmd/72, version=0.291.4) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: match_graph_event: > Action sw-ready-g5se-f3efce_stop_0 (8) confirmed on g5se-f3efce (rc=0) > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: > finished - rsc:meta-data action:stop call_id:33 pid:6014 exit-code:0 > exec-time:72ms queue-time:0ms > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: > Operation meta-data_stop_0: ok (node=g5se-f3efce, call=33, rc=0, > cib-update=73, confirmed=true) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/73) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.4 2 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.5 (null) > Feb 19 
15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=5 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + > /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='meta-data']/lrm_rsc_op[@id='meta-data_last_0']: > @operation_key=meta-data_stop_0, @operation=stop, > @crm-debug-origin=do_update_resource, > @transition-key=9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @transition-magic=0:0;9:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @call-id=33, @last-run=1455920542, @last-rc-change=1455920542, @exec-time= > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/crmd/73, version=0.291.5) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: info: match_graph_event: > Action meta-data_stop_0 (9) confirmed on g5se-f3efce (rc=0) > Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: > attrd_trigger_update: Sending flush op to all hosts for: ethmonitor-eth0 > (<null>) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_delete operation for section status to master > (origin=local/attrd/26) > Feb 19 15:22:22 [1997] g5se-f3efce attrd: notice: > attrd_perform_update: Sent delete 26: node=g5se-f3efce, > attr=ethmonitor-eth0, id=<n/a>, set=(null), section=status > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.5 2 > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.6 (null) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > -- > /cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']/nvpair[@id='status-g5se-f3efce-ethmonitor-eth0'] > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=6 > Feb 19 15:22:22 [1996] g5se-f3efce lrmd: info: log_finished: > finished - rsc:netmon action:stop call_id:35 
pid:6015 exit-code:0 > exec-time:99ms queue-time:0ms > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_delete operation for section status: OK (rc=0, > origin=g5se-f3efce/attrd/26, version=0.291.6) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: > abort_transition_graph: Transition aborted by deletion of > nvpair[@id='status-g5se-f3efce-ethmonitor-eth0']: Transient attribute change > (cib=0.291.6, source=te_update_diff:391, > path=/cib/status/node_state[@id='g5se-f3efce']/transient_attributes[@id='g5se-f3efce']/instance_attributes[@id='status-g5se-f3efce']/nvpair[@id='status-g5se-f3efce-ethmonitor-eth0'], > 0) > Feb 19 15:22:22 [1999] g5se-f3efce crmd: notice: process_lrm_event: > Operation netmon_stop_0: ok (node=g5se-f3efce, call=35, rc=0, > cib-update=74, confirmed=true) > Feb 19 15:22:22 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/74) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.6 2 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.7 (null) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=7 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > + > /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='netmon']/lrm_rsc_op[@id='netmon_last_0']: > @operation_key=netmon_stop_0, @operation=stop, > @crm-debug-origin=do_update_resource, > @transition-key=10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @transition-magic=0:0;10:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @call-id=35, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=99 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/crmd/74, version=0.291.7) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: 
match_graph_event: > Action netmon_stop_0 (10) confirmed on g5se-f3efce (rc=0) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_delete operation for section constraints to master > (origin=local/crm_resource/3) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_delete operation for section constraints: OK (rc=0, > origin=g5se-f3efce/crm_resource/3, version=0.291.7) > IPaddr2[6016]: 2016/02/19_15:22:23 INFO: IP status = ok, IP_CIP= > Feb 19 15:22:23 [1996] g5se-f3efce lrmd: info: log_finished: > finished - rsc:ClusterIP action:stop call_id:37 pid:6016 exit-code:0 > exec-time:127ms queue-time:0ms > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: process_lrm_event: > Operation ClusterIP_stop_0: ok (node=g5se-f3efce, call=37, rc=0, > cib-update=75, confirmed=true) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/75) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.7 2 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.291.8 (null) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > + /cib: @num_updates=8 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > + > /cib/status/node_state[@id='g5se-f3efce']/lrm[@id='g5se-f3efce']/lrm_resources/lrm_resource[@id='ClusterIP']/lrm_rsc_op[@id='ClusterIP_last_0']: > @operation_key=ClusterIP_stop_0, @operation=stop, > @transition-key=11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @transition-magic=0:0;11:8:0:b7b85b39-a745-4cd7-abc4-059a684da6ed, > @call-id=37, @last-run=1455920542, @last-rc-change=1455920542, @exec-time=127 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=g5se-f3efce/crmd/75, version=0.291.8) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: match_graph_event: > 
Action ClusterIP_stop_0 (11) confirmed on g5se-f3efce (rc=0) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: > Transition 8 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-641.bz2): Stopped > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_state_transition: > State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: > pacemaker-1.2-style configuration is also valid for pacemaker-1.3 > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: > Transformation upgrade-1.3.xsl successful > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: > Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: > Your configuration was internally updated to the latest version > (pacemaker-2.0) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: > On loss of CCM Quorum: Ignore > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: > Node g5se-f3efce is in standby-mode > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: > determine_online_status: Node g5se-f3efce is standby > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: > sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Stopped > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: > meta-data (ocf::pacemaker:GBmon): Stopped > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: > netmon (ocf::heartbeat:ethmonitor): Stopped > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_print: > ClusterIP (ocf::heartbeat:IPaddr2): Stopped > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: > Resource sw-ready-g5se-f3efce cannot run anywhere > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: > Resource meta-data cannot run anywhere > Feb 
19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: > Resource netmon cannot run anywhere > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: native_color: > Resource ClusterIP cannot run anywhere > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave > sw-ready-g5se-f3efce (Stopped) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave > meta-data (Stopped) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave > netmon (Stopped) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: LogActions: Leave > ClusterIP (Stopped) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_state_transition: > State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ > input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_te_invoke: > Processing graph 9 (ref=pe_calc-dc-1455920543-46) derived from > /var/lib/pacemaker/pengine/pe-input-642.bz2 > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: > Transition 9 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-642.bz2): Complete > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_log: FSA: > Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: do_state_transition: > State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: process_pe_message: > Calculated Transition 9: /var/lib/pacemaker/pengine/pe-input-642.bz2 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Forwarding cib_modify operation for section constraints to master > (origin=local/crm_resource/3) > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: --- 0.291.8 2 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > Diff: +++ 0.292.0 (null) > Feb 19 15:22:23 [1994] 
g5se-f3efce cib: info: cib_perform_op: > + /cib: @epoch=292, @num_updates=0 > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: > ++ /cib/configuration/constraints: <rsc_location > id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" > node="g5se-f3efce" score="-INFINITY"/> > Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_process_request: > Completed cib_modify operation for section constraints: OK (rc=0, > origin=g5se-f3efce/crm_resource/3, version=0.292.0) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: > abort_transition_graph: Transition aborted by > rsc_location.cli-ban-ClusterIP-on-g5se-f3efce 'create': Non-status change > (cib=0.292.0, source=te_update_diff:383, path=/cib/configuration/constraints, > 1) > Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: do_state_transition: > State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC > cause=C_FSA_INTERNAL origin=abort_transition_graph ] > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: > pacemaker-1.2-style configuration is also valid for pacemaker-1.3 > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: > Transformation upgrade-1.3.xsl successful > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: > Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: > Your configuration was internally updated to the latest version > (pacemaker-2.0) > Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: > On loss of CCM Quorum: Ignore > Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: > Node g5se-f3efce is in standby-mode > > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > /var/log/messages snippet showing at the bottom addition and removal of > constraint (this is with pcs 0.9.90 and pacemakerd 1.1.10-1.el6_4.4): > > Feb 21 23:10:38 g5se-dea2b1 azMD[1584]: Sending INIT message to partner. 
> Count 21 > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty1) main process (1732) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty2) main process (1734) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty3) main process (1736) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty4) main process (1738) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty5) main process (1740) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 init: tty (/dev/tty6) main process (1742) killed > by TERM signal > Feb 21 23:10:41 g5se-dea2b1 avahi-daemon[1473]: Got SIGTERM, quitting. > Feb 21 23:10:41 g5se-dea2b1 avahi-daemon[1473]: Leaving mDNS multicast group > on interface eth0.IPv4 with address 172.20.240.124. > Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azIntTrmHandler] Int Trm handler 15 > Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azExitHandler] exit handler > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.66.3 > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.67.1 > 44c794d4381e36ea4f5d51d0dd7fde1d > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib > admin_epoch="0" epoch="66" num_updates="3"/> > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ > <nvpair id="sw-ready-g5se-dea2b1-meta_attributes-target-role" > name="target-role" value="Stopped"/> > Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop > sw-ready-g5se-dea2b1#011(g5se-dea2b1) > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated 
Transition 32: /var/lib/pacemaker/pengine/pe-input-134.bz2 > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 10: stop sw-ready-g5se-dea2b1_stop_0 on g5se-dea2b1 (local) > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation sw-ready-g5se-dea2b1_stop_0 (call=48, rc=0, cib-update=67, > confirmed=true) ok > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 32 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-134.bz2): Complete > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.69.3 > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.70.1 > 216351853e036a12a96b442b30522287 > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib > admin_epoch="0" epoch="69" num_updates="3"/> > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ > <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" > role="Started" node="g5se-dea2b1" score="-INFINITY"/> > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop > ClusterIP#011(g5se-dea2b1) > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated Transition 35: /var/lib/pacemaker/pengine/pe-input-137.bz2 > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 7: stop ClusterIP_stop_0 on 
g5se-dea2b1 (local) > Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13237]: INFO: IP status = ok, IP_CIP= > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation ClusterIP_stop_0 (call=64, rc=0, cib-update=74, confirmed=true) ok > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 35 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-137.bz2): Complete > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:42 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.70.2 > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.71.1 > 453ef48657244dc188b444348eb547ed > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: -- > <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" > role="Started" node="g5se-dea2b1" score="-INFINITY"/> > Feb 21 23:10:42 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <cib epoch="71" > num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" > cib-last-written="Sun Feb 21 23:10:42 2016" update-origin="g5se-dea2b1" > update-client="crm_resource" crm_feature_set="3.0.7" have-quorum="1" > dc-uuid="g5se-dea2b1"/> > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: LogActions: Start > ClusterIP#011(g5se-dea2b1) > Feb 21 23:10:42 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated Transition 36: /var/lib/pacemaker/pengine/pe-input-138.bz2 > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: 
notice: te_rsc_command: Initiating > action 6: start ClusterIP_start_0 on g5se-dea2b1 (local) > Feb 21 23:10:42 g5se-dea2b1 azMD[1584]: [azExitHandler] exit handler > Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: ip -f inet addr add > 172.20.240.123/24 brd 172.20.240.255 dev eth0 > Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: ip link set eth0 up > Feb 21 23:10:42 g5se-dea2b1 IPaddr2[13282]: INFO: > /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p > /var/run/heartbeat/rsctmp/send_arp-172.20.240.123 eth0 172.20.240.123 auto > not_used not_used > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation ClusterIP_start_0 (call=68, rc=0, cib-update=76, confirmed=true) ok > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 7: monitor ClusterIP_monitor_30000 on g5se-dea2b1 (local) > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation ClusterIP_monitor_30000 (call=71, rc=0, cib-update=77, > confirmed=false) ok > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 36 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-138.bz2): Complete > Feb 21 23:10:42 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:43 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.71.3 > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.72.1 > 4e5a3b6259a59f84bcfec6d0f16ad3ba > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: -- <cib > admin_epoch="0" epoch="71" num_updates="3"/> > Feb 21 
23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ > <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" > role="Started" node="g5se-dea2b1" score="-INFINITY"/> > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: LogActions: Stop > ClusterIP#011(g5se-dea2b1) > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated Transition 37: /var/lib/pacemaker/pengine/pe-input-139.bz2 > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 7: stop ClusterIP_stop_0 on g5se-dea2b1 (local) > Feb 21 23:10:43 g5se-dea2b1 IPaddr2[13372]: INFO: IP status = ok, IP_CIP= > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: process_lrm_event: LRM > operation ClusterIP_stop_0 (call=75, rc=0, cib-update=79, confirmed=true) ok > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: run_graph: Transition 37 > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-139.bz2): Complete > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > Feb 21 23:10:43 g5se-dea2b1 stonith-ng[1558]: notice: unpack_config: On > loss of CCM Quorum: Ignore > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: --- 0.72.2 > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: Diff: +++ 0.73.1 > 93f902fd51a6750b828144d42f8c7a6e > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: -- > <rsc_location id="cli-ban-ClusterIP-on-g5se-dea2b1" rsc="ClusterIP" > role="Started" node="g5se-dea2b1" score="-INFINITY"/> > Feb 21 23:10:43 g5se-dea2b1 cib[1557]: notice: cib:diff: ++ <cib epoch="73" > num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" > cib-last-written="Sun Feb 21 23:10:43 2016" update-origin="g5se-dea2b1" > update-client="crm_resource" 
crm_feature_set="3.0.7" have-quorum="1" > dc-uuid="g5se-dea2b1"/> > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: unpack_config: On loss > of CCM Quorum: Ignore > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: LogActions: Start > ClusterIP#011(g5se-dea2b1) > Feb 21 23:10:43 g5se-dea2b1 pengine[1561]: notice: process_pe_message: > Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-140.bz2 > Feb 21 23:10:43 g5se-dea2b1 crmd[1562]: notice: te_rsc_command: Initiating > action 6: start ClusterIP_start_0 on g5se-dea2b1 (local) > > > > -----Original Message----- > From: users-requ...@clusterlabs.org [mailto:users-requ...@clusterlabs.org] > Sent: Monday, February 22, 2016 11:42 AM > To: users@clusterlabs.org > Subject: Users Digest, Vol 13, Issue 44 > > Send Users mailing list submissions to > users@clusterlabs.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://clusterlabs.org/mailman/listinfo/users > or, via email, send a message with subject or body 'help' to > users-requ...@clusterlabs.org > > You can reach the person managing the list at > users-ow...@clusterlabs.org > > When replying, please edit your Subject line so it is more specific than "Re: > Contents of Users digest..." > > > Today's Topics: > > 1. Re: fencing by node name or by node ID (Ken Gaillot) > 2. 
Re: ClusterIP location constraint reappears after reboot > (Ken Gaillot) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 22 Feb 2016 11:10:57 -0600 > From: Ken Gaillot <kgail...@redhat.com> > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] fencing by node name or by node ID > Message-ID: <56cb4121.7000...@redhat.com> > Content-Type: text/plain; charset=windows-1252 > > On 02/21/2016 06:19 PM, Ferenc Wágner wrote: >> Hi, >> >> Last night a node in our cluster (Corosync 2.3.5, Pacemaker 1.1.14) >> experienced some failure and fell out of the cluster: >> >> Feb 21 22:11:12 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.9:612) was formed. Members left: 167773709 >> Feb 21 22:11:12 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the >> leave message. failed: 167773709 >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node >> vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node >> vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Removing all vhbl07 attributes >> for attrd_peer_change_cb >> Feb 21 22:11:12 vhbl06 cib[8304]: notice: Removing vhbl07/167773709 from >> the membership list >> Feb 21 22:11:12 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773709 >> and/or uname=vhbl07 from the membership cache >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Lost attribute writer vhbl07 >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Removing vhbl07/167773709 from >> the membership list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: >> Node vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 attrd[8307]: notice: Purged 1 peers with >> id=167773709 and/or uname=vhbl07 from the membership cache >> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: Our peer on the DC (vhbl07) is >> dead >> Feb 21 22:11:12 vhbl06
stonith-ng[8305]: notice: Removing vhbl07/167773709 >> from the membership list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with >> id=167773709 and/or uname=vhbl07 from the membership cache >> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: State transition S_NOT_DC -> >> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK >> origin=peer_update_callback ] >> Feb 21 22:11:12 vhbl06 corosync[3603]: [QUORUM] Members[4]: 167773705 >> 167773706 167773707 167773708 >> Feb 21 22:11:12 vhbl06 corosync[3603]: [MAIN ] Completed service >> synchronization, ready to provide service. >> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node >> vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: >> Node vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:11:12 vhbl06 kernel: [343490.563365] dlm: closing connection to >> node 167773709 >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:12 vhbl06 stonith-ng[8305]: notice: Operation reboot of >> 167773709 by <no-one> for stonith-api.20937@vhbl03.9c470723: No such device >> Feb 21 22:11:12 vhbl06 crmd[8309]: notice: Peer 167773709 was not >> terminated (reboot) by <anyone> for vhbl03: No such device >> (ref=9c470723-d318-4c7e-a705-ce9ee5c7ffe5) by client 
stonith-api.20937 >> Feb 21 22:11:12 vhbl06 dlm_controld[3641]: 343352 tell corosync to remove >> nodeid 167773705 from cluster >> Feb 21 22:11:15 vhbl06 corosync[3603]: [TOTEM ] A processor failed, >> forming new configuration. >> Feb 21 22:11:19 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.10:616) was formed. Members left: 167773705 >> Feb 21 22:11:19 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the >> leave message. failed: 167773705 >> >> However, no fencing agent reported ability to fence the failing node >> (vhbl07), because stonith-ng wasn't looking it up by name, but by >> numeric ID (at least that's what the logs suggest to me), and the >> pcmk_host_list attributes contained strings like vhbl07. >> >> 1. Was it dlm_controld who requested the fencing? >> >> I suspect it because of the "dlm: closing connection to node >> 167773709" kernel message right before the stonith-ng logs. And >> dlm_controld really hasn't got anything to use but the corosync node >> ID. > > Not based on this; dlm would print messages about fencing, with > "dlm_controld.*fence request". > > However it looks like these logs are not from the DC, which will say what > process requested the fencing. It may be DLM or something else. > Also, DLM on any node might initiate fencing, so it's worth looking at all > the nodes' logs around this time. > >> 2. Shouldn't some component translate between node IDs and node names? >> Is this a configuration error in our setup? Should I include both in >> pcmk_host_list? > > Yes, stonithd's create_remote_stonith_op() function will do the translation > if the st_opt_cs_nodeid call option is set in the request XML. If that fails, > you'll see a "Could not expand nodeid" warning in the log. That option is set > by the kick() stonith API used by DLM, so it should happen automatically. > > I'm not sure why it appears not to have worked here; logs from other nodes > might help. Do corosync and pacemaker know the same node names? 
> That would be necessary to get the node name from corosync. > > Have you tested fencing vhbl07 from the command line with stonith_admin to > make sure fencing is configured correctly? > >> 3. After the failed fence, why was 167773705 (vhbl03) removed from the >> cluster? Because it was chosen to execute the fencing operation, but >> failed? > > dlm_controld explicitly requested it. I'm not familiar enough with DLM to > know why. It doesn't sound like a good idea to me. > >> The logs continue like this: >> >> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node >> vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Removing all vhbl03 attributes >> for attrd_peer_change_cb >> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Removing vhbl03/167773705 from >> the membership list >> Feb 21 22:11:19 vhbl06 attrd[8307]: notice: Purged 1 peers with >> id=167773705 and/or uname=vhbl03 from the membership cache >> Feb 21 22:11:19 vhbl06 corosync[3603]: [QUORUM] Members[3]: 167773706 >> 167773707 167773708 >> Feb 21 22:11:19 vhbl06 corosync[3603]: [MAIN ] Completed service >> synchronization, ready to provide service. 
>> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node >> vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: State transition S_ELECTION -> >> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED >> origin=election_timeout_popped ] >> Feb 21 22:11:19 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: >> Node vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node >> vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 cib[8304]: notice: Removing vhbl03/167773705 from >> the membership list >> Feb 21 22:11:19 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773705 >> and/or uname=vhbl03 from the membership cache >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: >> Node vhbl03[167773705] - state is now lost (was member) >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Removing vhbl03/167773705 >> from the membership list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with >> id=167773705 and/or uname=vhbl03 from the membership cache >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not >> fence (reboot) 167773709: static-list >> Feb 21 22:11:19 vhbl06 kernel: [343497.392381] dlm: closing connection >> to node 167773705 >> >> 4. 
Why can't I see any action above to fence 167773705 (vhbl03)? > > Only the DC and the node that executes the fence will have those logs. > The other nodes will just have the query results ("can/can not fence") and > the final stonith result. > >> Feb 21 22:11:19 vhbl06 crmd[8309]: warning: FSA: Input I_ELECTION_DC from >> do_election_check() received in state S_INTEGRATION >> Feb 21 22:11:19 vhbl06 stonith-ng[8305]: notice: Operation reboot of >> 167773709 by <no-one> for stonith-api.17462@vhbl04.0cd1625d: No such device >> Feb 21 22:11:19 vhbl06 crmd[8309]: notice: Peer 167773709 was not >> terminated (reboot) by <anyone> for vhbl04: No such device >> (ref=0cd1625d-a61e-4f94-930d-bb80a10b89da) by client stonith-api.17462 >> Feb 21 22:11:19 vhbl06 dlm_controld[3641]: 343359 tell corosync to remove >> nodeid 167773706 from cluster >> Feb 21 22:11:22 vhbl06 corosync[3603]: [TOTEM ] A processor failed, >> forming new configuration. >> Feb 21 22:11:26 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.11:620) was formed. Members left: 167773706 >> Feb 21 22:11:26 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the >> leave message. failed: 167773706 >> >> Looks like vhbl04 took over the job of fencing vhbl07 from vhbl03, and >> of course failed the exact same way. So it was expelled, too. 
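As suggested earlier in the thread, exercising the fence configuration directly with stonith_admin would show whether device lookup works by node name, by corosync ID, or not at all. A minimal sketch of that check (node name and ID are taken from the logs above; these commands only make sense when run on a live cluster node):

```shell
# Which devices claim they can fence the node, queried by name?
stonith_admin --list vhbl07

# The same query by corosync node ID shows whether ID-to-name
# translation is the step that fails
stonith_admin --list 167773709

# If the name-based query finds a device, request a real reboot
stonith_admin --reboot vhbl07
```

If the first query succeeds and the second finds nothing, that points at the nodeid expansion rather than the fence devices themselves.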
>> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: crm_update_peer_proc: Node >> vhbl04[167773706] - state is now lost (was member) >> Feb 21 22:11:26 vhbl06 cib[8304]: notice: crm_update_peer_proc: Node >> vhbl04[167773706] - state is now lost (was member) >> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Removing all vhbl04 attributes >> for attrd_peer_change_cb >> Feb 21 22:11:26 vhbl06 cib[8304]: notice: Removing vhbl04/167773706 from >> the membership list >> Feb 21 22:11:26 vhbl06 cib[8304]: notice: Purged 1 peers with id=167773706 >> and/or uname=vhbl04 from the membership cache >> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Removing vhbl04/167773706 from >> the membership list >> Feb 21 22:11:26 vhbl06 attrd[8307]: notice: Purged 1 peers with >> id=167773706 and/or uname=vhbl04 from the membership cache >> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: crm_update_peer_proc: >> Node vhbl04[167773706] - state is now lost (was member) >> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: Removing vhbl04/167773706 >> from the membership list >> Feb 21 22:11:26 vhbl06 crmd[8309]: warning: No match for shutdown action on >> 167773706 >> Feb 21 22:11:26 vhbl06 stonith-ng[8305]: notice: Purged 1 peers with >> id=167773706 and/or uname=vhbl04 from the membership cache >> Feb 21 22:11:26 vhbl06 crmd[8309]: notice: Stonith/shutdown of vhbl04 not >> matched >> Feb 21 22:11:26 vhbl06 corosync[3603]: [QUORUM] This node is within the >> non-primary component and will NOT provide any services. >> Feb 21 22:11:26 vhbl06 pacemakerd[8261]: notice: Membership 620: quorum >> lost (2) >> Feb 21 22:11:26 vhbl06 crmd[8309]: notice: Membership 620: quorum lost (2) >> Feb 21 22:11:26 vhbl06 corosync[3603]: [QUORUM] Members[2]: 167773707 >> 167773708 >> >> That, finally, was enough to lose quorum and paralyze the cluster. 
>> Later, vhbl07 was rebooted by the hardware watchdog and came back for >> a cold welcome: >> >> Feb 21 22:24:53 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.12:628) was formed. Members joined: 167773709 >> Feb 21 22:24:53 vhbl06 corosync[3603]: [QUORUM] Members[2]: 167773708 >> 167773709 >> Feb 21 22:24:53 vhbl06 corosync[3603]: [MAIN ] Completed service >> synchronization, ready to provide service. >> Feb 21 22:24:53 vhbl06 crmd[8309]: notice: pcmk_quorum_notification: Node >> vhbl07[167773709] - state is now member (was lost) >> Feb 21 22:24:53 vhbl06 pacemakerd[8261]: notice: pcmk_quorum_notification: >> Node vhbl07[167773709] - state is now member (was lost) >> Feb 21 22:24:53 vhbl06 dlm_controld[3641]: 344173 daemon joined >> 167773709 needs fencing Feb 21 22:25:47 vhbl06 dlm_controld[3641]: 344226 >> clvmd wait for quorum >> Feb 21 22:29:26 vhbl06 crmd[8309]: notice: State transition S_IDLE -> >> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED >> origin=crm_timer_popped ] >> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: We do not have quorum - >> fencing and resource management disabled >> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl04 is unclean >> because the node is no longer part of the cluster Feb 21 22:29:27 >> vhbl06 pengine[8308]: warning: Node vhbl04 is unclean Feb 21 22:29:27 >> vhbl06 pengine[8308]: warning: Node vhbl05 is unclean because the >> node is no longer part of the cluster Feb 21 22:29:27 vhbl06 >> pengine[8308]: warning: Node vhbl05 is unclean Feb 21 22:29:27 vhbl06 >> pengine[8308]: warning: Node vhbl07 is unclean because our peer >> process is no longer available Feb 21 22:29:27 vhbl06 pengine[8308]: >> warning: Node vhbl07 is unclean Feb 21 22:29:27 vhbl06 pengine[8308]: >> warning: Node vhbl03 is unclean because vm-niifdc is thought to be >> active there Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Action >> vm-dogwood_stop_0 on vhbl03 is unrunnable (offline) Feb 21 22:29:27 vhbl06 >> 
pengine[8308]: warning: Action vm-niifidp_stop_0 on vhbl03 is unrunnable >> (offline) [...] Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl03 >> is unclean! >> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl04 is unclean! >> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Node vhbl05 is unclean! >> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: We can fence vhbl07 without >> quorum because they're in our membership >> Feb 21 22:29:27 vhbl06 pengine[8308]: warning: Scheduling Node vhbl07 for >> STONITH >> Feb 21 22:29:27 vhbl06 pengine[8308]: notice: Cannot fence unclean nodes >> until quorum is attained (or no-quorum-policy is set to ignore) >> [...] >> Feb 21 22:29:27 vhbl06 crmd[8309]: notice: Executing reboot fencing >> operation (212) on vhbl07 (timeout=60000) >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: Client crmd.8309.09cea2e7 >> wants to fence (reboot) 'vhbl07' with device '(any)' >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: Initiating remote >> operation reboot for vhbl07: 31b2023d-3fc5-419e-8490-91eb81254497 (0) >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can fence >> (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl01 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl05 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl07 can fence >> (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 
stonith-ng[8305]: notice: fencing-vhbl01 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl02 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl03 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 stonith-ng[8305]: notice: fencing-vhbl04 can not >> fence (reboot) vhbl07: static-list >> Feb 21 22:29:27 vhbl06 dlm_controld[3641]: 344447 daemon remove >> 167773709 already needs fencing Feb 21 22:29:27 vhbl06 dlm_controld[3641]: >> 344447 tell corosync to remove nodeid 167773709 from cluster >> Feb 21 22:29:30 vhbl06 corosync[3603]: [TOTEM ] A processor failed, >> forming new configuration. >> Feb 21 22:29:34 vhbl06 corosync[3603]: [TOTEM ] A new membership >> (10.0.6.12:632) was formed. Members left: 167773709 >> Feb 21 22:29:34 vhbl06 corosync[3603]: [TOTEM ] Failed to receive the >> leave message. failed: 167773709 >> Feb 21 22:29:34 vhbl06 corosync[3603]: [QUORUM] Members[1]: 167773708 >> Feb 21 22:29:34 vhbl06 corosync[3603]: [MAIN ] Completed service >> synchronization, ready to provide service. 
>> Feb 21 22:29:34 vhbl06 pacemakerd[8261]: notice: crm_reap_unseen_nodes: >> Node vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:29:34 vhbl06 crmd[8309]: notice: crm_reap_unseen_nodes: Node >> vhbl07[167773709] - state is now lost (was member) >> Feb 21 22:29:34 vhbl06 kernel: [344592.424938] dlm: closing connection to >> node 167773709 >> Feb 21 22:29:42 vhbl06 stonith-ng[8305]: notice: Operation 'reboot' [5533] >> (call 2 from crmd.8309) for host 'vhbl07' with device 'fencing-vhbl07' >> returned: 0 (OK) >> Feb 21 22:29:42 vhbl06 stonith-ng[8305]: notice: Operation reboot of >> vhbl07 by vhbl06 for crmd.8309@vhbl06.31b2023d: OK >> Feb 21 22:29:42 vhbl06 crmd[8309]: notice: Stonith operation >> 2/212:1:0:d06e9743-b452-4b6a-b3a9-d352a4454269: OK (0) >> Feb 21 22:29:42 vhbl06 crmd[8309]: notice: Peer vhbl07 was terminated >> (reboot) by vhbl06 for vhbl06: OK (ref=31b2023d-3fc5-419e-8490-91eb81254497) >> by client crmd.8309 >> >> That is, fencing by node name worked all right. >> >> I wonder if I understood the issue right and what would be the best >> way to avoid it in the future. Please advise. >> > > > > > ------------------------------ > > Message: 2 > Date: Mon, 22 Feb 2016 11:39:03 -0600 > From: Ken Gaillot <kgail...@redhat.com> > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] ClusterIP location constraint reappears > after reboot > Message-ID: <56cb47b7.3060...@redhat.com> > Content-Type: text/plain; charset=windows-1252 > > On 02/22/2016 07:26 AM, Jeremy Matthews wrote: >> Thank you, Ken Gaillot, for your response. Sorry for the delayed followup, >> but I have looked and looked at the scripts. There are a couple of scripts >> that have a pcs resource ban command, but they are not executed at the time >> of shutdown which is when I've discovered that the constraint is put back >> in. Our application software did not change on the system. We just updated >> pcs and pacemaker (and dependencies). 
I had to rollback pcs because it has >> an issue. >> >> Below is from /var/log/cluster/corosync.log. Any clues here as to why the >> constraint might have been added? In my other system without the pacemaker >> update, there is not the addition of the constraint. > > It might help to see the entire log from the time you issued the reboot > command to when the constraint was added. > > Notice in the cib logs it says "origin=local/crm_resource". That means that > crm_resource was what originally added the constraint (pcs resource ban calls > crm_resource). > > I'd be curious whether this makes a difference: after removing the > constraint, run "pcs cib-upgrade". It shouldn't, but it's the only thing I > can think of to try. > > CIB schema versions change when new features are added that require new CIB > syntax. pcs should automatically run cib-upgrade if you ever use a newer > feature than your current CIB version supports. You don't really need to > cib-upgrade explicitly, but it doesn't hurt, and it will get rid of those > "Transformed the configuration" messages. 
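For pcs versions that predate "pcs cluster cib-upgrade", the schema upgrade described above can be driven with cibadmin directly, which is what pcs wraps. A cautious sketch (take a backup first; run on a cluster node):

```shell
# Save a copy of the current CIB before touching the schema
pcs cluster cib > cib-backup.xml

# Upgrade the configuration to the latest schema
# (the same thing "pcs cluster cib-upgrade" does behind the scenes)
cibadmin --upgrade --force

# The validate-with attribute on the <cib> element should now name
# the newest schema, and the "Transformed the configuration" messages
# should stop appearing
cibadmin --query | grep -o 'validate-with="[^"]*"'
```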
> >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: >> do_state_transition: State transition S_POLICY_ENGINE -> >> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE >> origin=handle_response ] >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_te_invoke: >> Processing graph 9 (ref=pe_calc-dc-1455920543-46) derived from >> /var/lib/pacemaker/pengine/pe-input-642.bz2 >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: run_graph: >> Transition 9 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, >> Source=/var/lib/pacemaker/pengine/pe-input-642.bz2): Complete >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: do_log: FSA: >> Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: >> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ >> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: process_pe_message: >> Calculated Transition 9: /var/lib/pacemaker/pengine/pe-input-642.bz2 >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: >> cib_process_request: Forwarding cib_modify operation for section >> constraints to master (origin=local/crm_resource/3) >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: >> Diff: --- 0.291.8 2 >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: >> Diff: +++ 0.292.0 (null) >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: >> + /cib: @epoch=292, @num_updates=0 >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: cib_perform_op: >> ++ /cib/configuration/constraints: <rsc_location >> id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" >> node="g5se-f3efce" score="-INFINITY"/> >> Feb 19 15:22:23 [1994] g5se-f3efce cib: info: >> cib_process_request: Completed cib_modify operation for section >> constraints: OK (rc=0, origin=g5se-f3efce/crm_resource/3, version=0.292.0) >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: info: >> 
abort_transition_graph: Transition aborted by >> rsc_location.cli-ban-ClusterIP-on-g5se-f3efce 'create': Non-status change >> (cib=0.292.0, source=te_update_diff:383, >> path=/cib/configuration/constraints, 1) >> Feb 19 15:22:23 [1999] g5se-f3efce crmd: notice: >> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ >> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: update_validation: >> pacemaker-1.2-style configuration is also valid for pacemaker-1.3 >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: >> Transformation upgrade-1.3.xsl successful >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: update_validation: >> Transformed the configuration from pacemaker-1.2 to pacemaker-2.0 >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: cli_config_update: >> Your configuration was internally updated to the latest version >> (pacemaker-2.0) >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: notice: unpack_config: >> On loss of CCM Quorum: Ignore >> Feb 19 15:22:23 [1998] g5se-f3efce pengine: info: unpack_status: >> Node g5se-f3efce is in standby-mode >> >> I'm not sure what all has to be included my original email and Ken Gaillot's >> response embedded in it below. >> >> Message: 3 >> Date: Thu, 18 Feb 2016 13:37:31 -0600 >> From: Ken Gaillot <kgail...@redhat.com> >> To: users@clusterlabs.org >> Subject: Re: [ClusterLabs] ClusterIP location constraint reappears >> after reboot >> Message-ID: <56c61d7b.9090...@redhat.com> >> Content-Type: text/plain; charset=windows-1252 >> >> On 02/18/2016 01:07 PM, Jeremy Matthews wrote: >>> Hi, >>> >>> We're having an issue with our cluster where after a reboot of our system a >>> location constraint reappears for the ClusterIP. This causes a problem, >>> because we have a daemon that checks the cluster state and waits until the >>> ClusterIP is started before it kicks off our application. 
We didn't have >>> this issue when using an earlier version of pacemaker. Here is the >>> constraint as shown by pcs: >>> >>> [root@g5se-f3efce cib]# pcs constraint Location Constraints: >>> Resource: ClusterIP >>> Disabled on: g5se-f3efce (role: Started) Ordering Constraints: >>> Colocation Constraints: >>> >>> ...and here is our cluster status with the ClusterIP being Stopped: >>> >>> [root@g5se-f3efce cib]# pcs status >>> Cluster name: cl-g5se-f3efce >>> Last updated: Thu Feb 18 11:36:01 2016 Last change: Thu Feb 18 >>> 10:48:33 2016 via crm_resource on g5se-f3efce >>> Stack: cman >>> Current DC: g5se-f3efce - partition with quorum >>> Version: 1.1.11-97629de >>> 1 Nodes configured >>> 4 Resources configured >>> >>> >>> Online: [ g5se-f3efce ] >>> >>> Full list of resources: >>> >>> sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce >>> meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce >>> netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce >>> ClusterIP (ocf::heartbeat:IPaddr2): Stopped >>> >>> >>> The cluster really just has one node at this time. 
>>> >>> I retrieve the constraint ID, remove the constraint, verify that ClusterIP >>> is started, and then reboot: >>> >>> [root@g5se-f3efce cib]# pcs constraint ref ClusterIP >>> Resource: ClusterIP >>> cli-ban-ClusterIP-on-g5se-f3efce >>> [root@g5se-f3efce cib]# pcs constraint remove >>> cli-ban-ClusterIP-on-g5se-f3efce >>> >>> [root@g5se-f3efce cib]# pcs status >>> Cluster name: cl-g5se-f3efce >>> Last updated: Thu Feb 18 11:45:09 2016 Last change: Thu Feb 18 >>> 11:44:53 2016 via crm_resource on g5se-f3efce >>> Stack: cman >>> Current DC: g5se-f3efce - partition with quorum >>> Version: 1.1.11-97629de >>> 1 Nodes configured >>> 4 Resources configured >>> >>> >>> Online: [ g5se-f3efce ] >>> >>> Full list of resources: >>> >>> sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce >>> meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce >>> netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce >>> ClusterIP (ocf::heartbeat:IPaddr2): Started g5se-f3efce >>> >>> >>> [root@g5se-f3efce cib]# reboot >>> >>> ....after reboot, log in, and the constraint is back and ClusterIP has not >>> started. >>> >>> >>> I have noticed in /var/lib/pacemaker/cib that the cib-x.raw files get >>> created when there are changes to the cib (cib.xml). 
After a reboot, I see >>> the constraint being added in a diff between .raw files: >>> >>> [root@g5se-f3efce cib]# diff cib-7.raw cib-8.raw >>> 1c1 >>> < <cib epoch="239" num_updates="0" admin_epoch="0" >>> validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:44:53 >>> 2016" update-origin="g5se-f3efce" update-client="crm_resource" >>> crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce"> >>> --- >>>> <cib epoch="240" num_updates="0" admin_epoch="0" >>>> validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:46:49 >>>> 2016" update-origin="g5se-f3efce" update-client="crm_resource" >>>> crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce"> >>> 50c50,52 >>> < <constraints/> >>> --- >>>> <constraints> >>>> <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" >>>> role="Started" node="g5se-f3efce" score="-INFINITY"/> >>>> </constraints> >>> >>> >>> I have also looked in /var/log/cluster/corosync.log and seen logs where it >>> seems the cib is getting updated. I'm not sure if the constraint is being >>> put back in at shutdown or at start up. I just don't understand why it's >>> being put back in. I don't think our daemon code or other scripts are doing >>> this, but it is something I could verify. >> >> I would look at any scripts running around that time first. Constraints that >> start with "cli-" were created by one of the CLI tools, so something must be >> calling it. The most likely candidates are pcs resource move/ban or >> crm_resource -M/--move/-B/--ban. 
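Since constraints created by the CLI tools all carry the "cli-" prefix mentioned above, scanning a saved copy of the CIB is a quick way to spot them. A small sketch; the XML here is a cut-down stand-in for a real export ("pcs cluster cib > cib.xml" on a cluster node), using the constraint from this thread:

```shell
# Stand-in for a real CIB export; on a live node you would instead run:
#   pcs cluster cib > cib.xml
cat > cib.xml <<'EOF'
<cib>
  <configuration>
    <constraints>
      <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP"
                    role="Started" node="g5se-f3efce" score="-INFINITY"/>
    </constraints>
  </configuration>
</cib>
EOF

# Constraints created by pcs resource ban/move or crm_resource -B/-M
# carry the "cli-" prefix; list their IDs
grep -o 'id="cli-[^"]*"' cib.xml | sed 's/^id="//; s/"$//'
```

On a live cluster, "pcs resource clear ClusterIP" removes any such ban/move constraints for the resource in one step, which is less error-prone than deleting them by ID.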
>> >>> ******************************** >>> >>> From "yum info pacemaker", my current version is: >>> >>> Name : pacemaker >>> Arch : x86_64 >>> Version : 1.1.12 >>> Release : 8.el6_7.2 >>> >>> My earlier version was: >>> >>> Name : pacemaker >>> Arch : x86_64 >>> Version : 1.1.10 >>> Release : 1.el6_4.4 >>> >>> I'm still using an earlier version pcs, because the new one seems to have >>> issues with python: >>> >>> Name : pcs >>> Arch : noarch >>> Version : 0.9.90 >>> Release : 1.0.1.el6.centos >>> >>> ******************************* >>> >>> If anyone has ideas on the cause or thoughts on this, anything would be >>> greatly appreciated. >>> >>> Thanks! >>> >>> >>> >>> Jeremy Matthews >> >> -----Original Message----- >> From: users-requ...@clusterlabs.org >> [mailto:users-requ...@clusterlabs.org] >> Sent: Friday, February 19, 2016 2:21 AM >> To: users@clusterlabs.org >> Subject: Users Digest, Vol 13, Issue 35 >> >> Send Users mailing list submissions to >> users@clusterlabs.org >> >> To subscribe or unsubscribe via the World Wide Web, visit >> http://clusterlabs.org/mailman/listinfo/users >> or, via email, send a message with subject or body 'help' to >> users-requ...@clusterlabs.org >> >> You can reach the person managing the list at >> users-ow...@clusterlabs.org >> >> When replying, please edit your Subject line so it is more specific than >> "Re: Contents of Users digest..." >> >> >> Today's Topics: >> >> 1. Re: Too quick node reboot leads to failed corosync assert on >> other node(s) (Michal Koutn?) >> 2. ClusterIP location constraint reappears after reboot >> (Jeremy Matthews) >> 3. Re: ClusterIP location constraint reappears after reboot >> (Ken Gaillot) >> 4. Re: Too quick node reboot leads to failed corosync assert on >> other node(s) (Jan Friesse) >> >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Thu, 18 Feb 2016 17:32:48 +0100 >> From: Michal Koutn? 
<mkou...@suse.com> >> To: users@clusterlabs.org >> Subject: Re: [ClusterLabs] Too quick node reboot leads to failed >> corosync assert on other node(s) >> Message-ID: <56c5f230.6020...@suse.com> >> Content-Type: text/plain; charset="windows-1252" >> >> On 02/18/2016 10:40 AM, Christine Caulfield wrote: >>> I definitely remember looking into this, or something very like it, >>> ages ago. I can't find anything in the commit logs for either >>> corosync or cman that looks relevant though. If you're seeing it on >>> recent builds then it's obviously still a problem anyway and we ought to >>> look into it! >> Thanks for you replies. >> >> So far this happened only once and we've done only "post mortem", alas no >> available reproducer. If I have time, I'll try to reproduce it locally and >> check whether it exists in the current version. >> >> Michal >> >> -------------- next part -------------- A non-text attachment was >> scrubbed... >> Name: signature.asc >> Type: application/pgp-signature >> Size: 819 bytes >> Desc: OpenPGP digital signature >> URL: >> <http://clusterlabs.org/pipermail/users/attachments/20160218/97908c9d/ >> attachment-0001.sig> >> >> ------------------------------ >> >> Message: 2 >> Date: Thu, 18 Feb 2016 19:07:19 +0000 >> From: Jeremy Matthews <jeremy.matth...@genband.com> >> To: "users@clusterlabs.org" <users@clusterlabs.org> >> Subject: [ClusterLabs] ClusterIP location constraint reappears after >> reboot >> Message-ID: >> <ba3fced1d982a94aa64964f08b104956012d760...@gbplmail01.genband.com> >> Content-Type: text/plain; charset="windows-1252" >> >> Hi, >> >> We're having an issue with our cluster where after a reboot of our system a >> location constraint reappears for the ClusterIP. This causes a problem, >> because we have a daemon that checks the cluster state and waits until the >> ClusterIP is started before it kicks off our application. We didn't have >> this issue when using an earlier version of pacemaker. 
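[The "waits until the ClusterIP is started" gate can be as simple as grepping the status output. A sketch with an invented function name, not the poster's actual daemon code:]

```shell
# Sketch of a wait-for-ClusterIP gate (invented helper, not the real daemon).
# Reads "pcs status"-style output on stdin; succeeds once a line like
#   ClusterIP (ocf::heartbeat:IPaddr2): Started <node>
# is present.
clusterip_started() {
  grep -q 'ClusterIP.*Started'
}

# Typical polling loop on the real system (assumes pcs is on the PATH):
#   until pcs status | clusterip_started; do sleep 5; done
```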
Here is the constraint as shown by pcs:
>>
>> [root@g5se-f3efce cib]# pcs constraint
>> Location Constraints:
>>   Resource: ClusterIP
>>     Disabled on: g5se-f3efce (role: Started)
>> Ordering Constraints:
>> Colocation Constraints:
>>
>> ...and here is our cluster status with the ClusterIP being Stopped:
>>
>> [root@g5se-f3efce cib]# pcs status
>> Cluster name: cl-g5se-f3efce
>> Last updated: Thu Feb 18 11:36:01 2016
>> Last change: Thu Feb 18 10:48:33 2016 via crm_resource on g5se-f3efce
>> Stack: cman
>> Current DC: g5se-f3efce - partition with quorum
>> Version: 1.1.11-97629de
>> 1 Nodes configured
>> 4 Resources configured
>>
>> Online: [ g5se-f3efce ]
>>
>> Full list of resources:
>>
>>  sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce
>>  meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce
>>  netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce
>>  ClusterIP (ocf::heartbeat:IPaddr2): Stopped
>>
>> The cluster really just has one node at this time.
>>
>> I retrieve the constraint ID, remove the constraint, verify that ClusterIP is started, and then reboot:
>>
>> [root@g5se-f3efce cib]# pcs constraint ref ClusterIP
>> Resource: ClusterIP
>>   cli-ban-ClusterIP-on-g5se-f3efce
>> [root@g5se-f3efce cib]# pcs constraint remove cli-ban-ClusterIP-on-g5se-f3efce
>>
>> [root@g5se-f3efce cib]# pcs status
>> Cluster name: cl-g5se-f3efce
>> Last updated: Thu Feb 18 11:45:09 2016
>> Last change: Thu Feb 18 11:44:53 2016 via crm_resource on g5se-f3efce
>> Stack: cman
>> Current DC: g5se-f3efce - partition with quorum
>> Version: 1.1.11-97629de
>> 1 Nodes configured
>> 4 Resources configured
>>
>> Online: [ g5se-f3efce ]
>>
>> Full list of resources:
>>
>>  sw-ready-g5se-f3efce (ocf::pacemaker:GBmon): Started g5se-f3efce
>>  meta-data (ocf::pacemaker:GBmon): Started g5se-f3efce
>>  netmon (ocf::heartbeat:ethmonitor): Started g5se-f3efce
>>  ClusterIP (ocf::heartbeat:IPaddr2): Started g5se-f3efce
>>
>> [root@g5se-f3efce cib]# reboot
>>
>> ....after reboot, log in, and the constraint is back and ClusterIP has not started.
>>
>> I have noticed in /var/lib/pacemaker/cib that the cib-x.raw files get created when there are changes to the cib (cib.xml). After a reboot, I see the constraint being added in a diff between .raw files:
>>
>> [root@g5se-f3efce cib]# diff cib-7.raw cib-8.raw
>> 1c1
>> < <cib epoch="239" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:44:53 2016" update-origin="g5se-f3efce" update-client="crm_resource" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce">
>> ---
>>> <cib epoch="240" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Feb 18 11:46:49 2016" update-origin="g5se-f3efce" update-client="crm_resource" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="g5se-f3efce">
>> 50c50,52
>> < <constraints/>
>> ---
>>> <constraints>
>>>   <rsc_location id="cli-ban-ClusterIP-on-g5se-f3efce" rsc="ClusterIP" role="Started" node="g5se-f3efce" score="-INFINITY"/>
>>> </constraints>
>>
>> I have also looked in /var/log/cluster/corosync.log and seen logs where it seems the cib is getting updated. I'm not sure if the constraint is being put back in at shutdown or at startup. I just don't understand why it's being put back in. I don't think our daemon code or other scripts are doing this, but it is something I could verify.
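[The manual diff of cib-7.raw and cib-8.raw can be automated. A sketch, with an invented helper name, that reports the first numbered CIB archive containing the ban constraint; the default directory follows the listing above:]

```shell
# Sketch (invented helper): scan the numbered CIB archives for the first
# file containing the cli-ban constraint, to pin down when it was re-added.
# Note: the shell glob sorts lexically, so with more than nine archives
# cib-10.raw comes before cib-2.raw; pipe through `sort -V` or compare
# mtimes if that matters.
find_first_ban() {
  dir=${1:-/var/lib/pacemaker/cib}
  for f in "$dir"/cib-*.raw; do
    [ -e "$f" ] || continue
    if grep -q 'cli-ban-ClusterIP' "$f"; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}
```

[Comparing that file's mtime against the reboot time would answer the shutdown-vs-startup question raised above.]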
>>
>> ********************************
>>
>> From "yum info pacemaker", my current version is:
>>
>> Name    : pacemaker
>> Arch    : x86_64
>> Version : 1.1.12
>> Release : 8.el6_7.2
>>
>> My earlier version was:
>>
>> Name    : pacemaker
>> Arch    : x86_64
>> Version : 1.1.10
>> Release : 1.el6_4.4
>>
>> I'm still using an earlier version of pcs, because the new one seems to have issues with python:
>>
>> Name    : pcs
>> Arch    : noarch
>> Version : 0.9.90
>> Release : 1.0.1.el6.centos
>>
>> *******************************
>>
>> If anyone has ideas on the cause or thoughts on this, anything would be greatly appreciated.
>>
>> Thanks!
>>
>> Jeremy Matthews
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Thu, 18 Feb 2016 13:37:31 -0600
>> From: Ken Gaillot <kgail...@redhat.com>
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] ClusterIP location constraint reappears after reboot
>>
>> On 02/18/2016 01:07 PM, Jeremy Matthews wrote:
>>> Hi,
>>>
>>> We're having an issue with our cluster where, after a reboot of our system, a location constraint reappears for the ClusterIP. This causes a problem, because we have a daemon that checks the cluster state and waits until the ClusterIP is started before it kicks off our application. We didn't have this issue when using an earlier version of pacemaker.
>>> [snip -- the remainder of the quoted message is identical to Message 2 above]
>>
>> I would look at any scripts running around that time first. Constraints that start with "cli-" were created by one of the CLI tools, so something must be calling it. The most likely candidates are "pcs resource move/ban" or "crm_resource -M/--move/-B/--ban".
>>> [snip]
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Fri, 19 Feb 2016 09:18:22 +0100
>> From: Jan Friesse <jfrie...@redhat.com>
>> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
>> Subject: Re: [ClusterLabs] Too quick node reboot leads to failed corosync assert on other node(s)
>>
>> Michal Koutný napsal(a):
>>> On 02/18/2016 10:40 AM, Christine Caulfield wrote:
>>>> I definitely remember looking into this, or something very like it, ages ago. I can't find anything in the commit logs for either corosync or cman that looks relevant though. If you're seeing it on recent builds then it's obviously still a problem anyway and we ought to look into it!
>>>
>>> Thanks for your replies.
>>>
>>> So far this has happened only once and we've done only a "post mortem"; alas, no reproducer is available. If I have time, I'll try to reproduce it
>>
>> Ok. Actually I was trying to reproduce this and was really not successful (current master).
Steps I've used:
>> - 2 nodes, token set to 30 sec
>> - execute cpgbench on node2
>> - pause node1 corosync (ctrl+z), kill node1 corosync (kill -9 %1)
>> - wait until corosync on node2 moves into "entering GATHER state from..."
>> - execute corosync on node1
>>
>> Basically, during recovery the new node's trans list was never sent (and/or was ignored by node2).
>>
>> I'm going to try testing v1.4.7, but it's also possible that the bug is fixed by other commits (my favorites are cfbb021e130337603fe5b545d1e377296ecb92ea, 4ee84c51fa73c4ec7cbee922111a140a3aaf75df, f135b680967aaef1d466f40170c75ae3e470e147).
>>
>> Regards,
>> Honza
>>
>>> locally and check whether it exists in the current version.
>>>
>>> Michal

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
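[The "pause, then kill" part of Honza's reproduction steps boils down to SIGSTOP followed by SIGKILL. A sketch of just the signal mechanics, demonstrated on a placeholder `sleep` process rather than a real corosync daemon:]

```shell
# Signal sequence from the reproduction steps, on a stand-in process.
# Ctrl+Z in a shell sends SIGTSTP; SIGSTOP has the same stopping effect and
# cannot be caught. kill -9 is SIGKILL.
sleep 300 &
pid=$!
kill -STOP "$pid"                             # "pause node1 corosync (ctrl+z)"
sleep 1                                       # let the kernel mark it stopped
state=$(ps -o stat= -p "$pid" | tr -d ' ')    # stopped processes report "T"
kill -9 "$pid"                                # "kill node1 corosync (kill -9 %1)"
wait "$pid" 2>/dev/null || true               # reap; status 137 means SIGKILL
echo "state while paused: $state"
```

[The 30-second token in the steps above is what gives node2 time to enter the GATHER state before corosync is restarted on node1.]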