Re: [ClusterLabs] Cannot stop cluster due to order constraint

2017-09-17 Thread Leon Steffens

>> 
>> pcs constraint order start main1 then stop backup1 kind=Serialize
> 
> I think you want kind=Optional here. "Optional" means that if both
> actions are needed in the same transition, perform them in this order,
> otherwise it doesn't limit anything. "Serialize" means the start and
> stop can happen in either order, but not simultaneously, and backup1
> can't stop unless main1 is starting.


Thanks Ken!  Looking at it now, it seems so obvious - not sure why we didn’t 
even consider that.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Cannot stop cluster due to order constraint

2017-09-15 Thread Ken Gaillot
On Fri, 2017-09-08 at 15:31 +1000, Leon Steffens wrote:
> Hi all,
> 
> We are running Pacemaker 1.1.15 under Centos 6.9, and have a simple
> 3-node cluster with 6 sets of "main" and "backup" resources (just
> Dummy ones):
> 
> main1
> backup1
> main2
> backup2
> etc.
> 
> We have the following co-location constraint between main1 and
> backup1 (-200 because we don't want them to be on the same node, but
> under some circumstances they can end up on the same node)
> 
> pcs constraint colocation add backup1 with main1 -200
> 
> We also have the following order constraint between main1 and
> backup1.  This caters for the scenario where they end up on the same
> node - we want to make sure that "main" gets started before "backup"
> gets stopped, and started somewhere else (because of co-location
> score):
> 
> pcs constraint order start main1 then stop backup1 kind=Serialize

I think you want kind=Optional here. "Optional" means that if both
actions are needed in the same transition, perform them in this order,
otherwise it doesn't limit anything. "Serialize" means the start and
stop can happen in either order, but not simultaneously, and backup1
can't stop unless main1 is starting.
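
In other words, keeping everything else the same, the constraint would become:

pcs constraint order start main1 then stop backup1 kind=Optional

(and likewise for the other main/backup pairs).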

> [...]

[ClusterLabs] Cannot stop cluster due to order constraint

2017-09-07 Thread Leon Steffens
Hi all,

We are running Pacemaker 1.1.15 under Centos 6.9, and have a simple 3-node
cluster with 6 sets of "main" and "backup" resources (just Dummy ones):

main1
backup1
main2
backup2
etc.

We have the following co-location constraint between main1 and backup1
(-200 because we don't want them to be on the same node, but under some
circumstances they can end up on the same node)

pcs constraint colocation add backup1 with main1 -200
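
For reference, this should land in the CIB as roughly the following (the id here is illustrative; pcs generates its own):

<rsc_colocation id="colocation-backup1-main1" rsc="backup1" with-rsc="main1" score="-200"/>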

We also have the following order constraint between main1 and backup1.
This caters for the scenario where they end up on the same node - we want
to make sure that "main" gets started before "backup" gets stopped, and
started somewhere else (because of co-location score):

pcs constraint order start main1 then stop backup1 kind=Serialize
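
As with the colocation rule, the CIB entry for this should look roughly like (id again illustrative):

<rsc_order id="order-main1-backup1" first="main1" first-action="start" then="backup1" then-action="stop" kind="Serialize"/>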


When the cluster is started, everything works fine:

main1   (ocf::heartbeat:Dummy): Started straddie1
main2   (ocf::heartbeat:Dummy): Started straddie2
main3   (ocf::heartbeat:Dummy): Started straddie3
main4   (ocf::heartbeat:Dummy): Started straddie1
main5   (ocf::heartbeat:Dummy): Started straddie2
main6   (ocf::heartbeat:Dummy): Started straddie3
backup1 (ocf::heartbeat:Dummy): Started straddie2
backup2 (ocf::heartbeat:Dummy): Started straddie1
backup3 (ocf::heartbeat:Dummy): Started straddie1
backup4 (ocf::heartbeat:Dummy): Started straddie2
backup5 (ocf::heartbeat:Dummy): Started straddie1
backup6 (ocf::heartbeat:Dummy): Started straddie2

When we do a "pcs cluster stop --all", things do not go so well.  pcs
cluster stop hangs and the cluster state is as follows:

main1   (ocf::heartbeat:Dummy): Stopped
main2   (ocf::heartbeat:Dummy): Stopped
main3   (ocf::heartbeat:Dummy): Stopped
main4   (ocf::heartbeat:Dummy): Stopped
main5   (ocf::heartbeat:Dummy): Stopped
main6   (ocf::heartbeat:Dummy): Stopped
backup1 (ocf::heartbeat:Dummy): Started straddie2
backup2 (ocf::heartbeat:Dummy): Started straddie1
backup3 (ocf::heartbeat:Dummy): Started straddie1
backup4 (ocf::heartbeat:Dummy): Started straddie2
backup5 (ocf::heartbeat:Dummy): Started straddie1
backup6 (ocf::heartbeat:Dummy): Started straddie2

The corosync.log clearly shows why this is happening.  It looks like
Pacemaker wants to stop the backup resources, but the order constraint
states that the "main" resources should be started first.  At this stage
the "main" resources have already been stopped, and because the cluster is
shutting down, the "main" resources cannot be started, and we are stuck:


Sep 08 15:15:07 [23862] straddie3 crmd: info: match_graph_event: Action main1_stop_0 (14) confirmed on straddie1 (rc=0)
Sep 08 15:15:07 [23862] straddie3 crmd: warning: run_graph: Transition 48 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-496.bz2): Terminated
Sep 08 15:15:07 [23862] straddie3 crmd: warning: te_graph_trigger: Transition failed: terminated
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_graph: Graph 48 with 16 actions: batch-limit=0 jobs, network-delay=6ms
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: [Action 14]: Completed rsc op main1_stop_0 on straddie1 (priority: 0, waiting: none)
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: [Action 15]: Completed rsc op main4_stop_0 on straddie1 (priority: 0, waiting: none)
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: [Action 16]: Pending rsc op backup2_stop_0 on straddie1 (priority: 0, waiting: none)
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: * [Input 31]: Unresolved dependency rsc op main2_start_0
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: [Action 17]: Pending rsc op backup3_stop_0 on straddie1 (priority: 0, waiting: none)
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: * [Input 32]: Unresolved dependency rsc op main3_start_0
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: [Action 18]: Pending rsc op backup5_stop_0 on straddie1 (priority: 0, waiting: none)
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: * [Input 34]: Unresolved dependency rsc op main5_start_0
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: [Action 19]: Completed rsc op main2_stop_0 on straddie2 (priority: 0, waiting: none)
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: [Action 20]: Completed rsc op main5_stop_0 on straddie2 (priority: 0, waiting: none)
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: [Action 21]: Pending rsc op backup1_stop_0 on straddie2 (priority: 0, waiting: none)
Sep 08 15:15:07 [23862] straddie3 crmd: notice: print_synapse: * [Input 30]: Unresolved dependency rsc op mai
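
(A note for anyone debugging a similar hang: the pe-input file named in the log is kept on disk, so the scheduler's decision can be replayed offline, e.g.:

crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-496.bz2

This prints the planned actions and their blocked dependencies without touching the live cluster.)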