On 06/10/2017 05:53 PM, Dan Ragle wrote:
>
> On 5/25/2017 5:33 PM, Ken Gaillot wrote:
>> On 05/24/2017 12:27 PM, Dan Ragle wrote:
>>> I suspect this has been asked before and apologize if so, a google
>>> search didn't seem to find anything that was helpful to me ...
>>>
>>> I'm setting up an active/active two-node cluster and am having an issue
>>> where one of my two defined clusterIPs will not return to the other node
>>> after it (the other node) has been recovered.
>>>
>>> I'm running on CentOS 7.3. My resource setups look like this:
>>>
>>> # cibadmin -Q|grep dc-version
>>>     <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>>>             value="1.1.15-11.el7_3.4-e174ec8"/>
>>>
>>> # pcs resource show PublicIP-clone
>>>  Clone: PublicIP-clone
>>>   Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>>>   Resource: PublicIP (class=ocf provider=heartbeat type=IPaddr2)
>>>    Attributes: ip=75.144.71.38 cidr_netmask=24 nic=bond0
>>>    Meta Attrs: resource-stickiness=0
>>>    Operations: start interval=0s timeout=20s (PublicIP-start-interval-0s)
>>>                stop interval=0s timeout=20s (PublicIP-stop-interval-0s)
>>>                monitor interval=30s (PublicIP-monitor-interval-30s)
>>>
>>> # pcs resource show PrivateIP-clone
>>>  Clone: PrivateIP-clone
>>>   Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>>>   Resource: PrivateIP (class=ocf provider=heartbeat type=IPaddr2)
>>>    Attributes: ip=192.168.1.3 nic=bond1 cidr_netmask=24
>>>    Meta Attrs: resource-stickiness=0
>>>    Operations: start interval=0s timeout=20s (PrivateIP-start-interval-0s)
>>>                stop interval=0s timeout=20s (PrivateIP-stop-interval-0s)
>>>                monitor interval=10s timeout=20s (PrivateIP-monitor-interval-10s)
>>>
>>> # pcs constraint --full | grep -i publicip
>>>   start WEB-clone then start PublicIP-clone (kind:Mandatory)
>>>   (id:order-WEB-clone-PublicIP-clone-mandatory)
>>> # pcs constraint --full | grep -i privateip
>>>   start WEB-clone then start PrivateIP-clone (kind:Mandatory)
>>>   (id:order-WEB-clone-PrivateIP-clone-mandatory)
>>
>> FYI These constraints cover ordering only. If you also want to be sure
>> that the IPs only start on a node where the web service is functional,
>> then you also need colocation constraints.
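For reference, a rough sketch of what such colocation constraints could look like with pcs (resource names taken from the configuration above; the syntax follows the pcs 0.9-era form used elsewhere in this thread, so verify it against your version):

    pcs constraint colocation add PublicIP-clone with WEB-clone INFINITY
    pcs constraint colocation add PrivateIP-clone with WEB-clone INFINITY

With a mandatory colocation like this, an IP instance can only be placed on a node where WEB-clone is active; the existing order constraints then only control the start/stop sequence.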
>>
>>> When I first create the resources, they split across the two nodes as
>>> expected/desired:
>>>
>>>  Clone Set: PublicIP-clone [PublicIP] (unique)
>>>      PublicIP:0   (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      PublicIP:1   (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>>>      PrivateIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      PrivateIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>  Clone Set: WEB-clone [WEB]
>>>      Started: [ node1-pcs node2-pcs ]
>>>
>>> I then put the second node in standby:
>>>
>>> # pcs node standby node2-pcs
>>>
>>> And the IPs both jump to node1 as expected:
>>>
>>>  Clone Set: PublicIP-clone [PublicIP] (unique)
>>>      PublicIP:0   (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      PublicIP:1   (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>  Clone Set: WEB-clone [WEB]
>>>      Started: [ node1-pcs ]
>>>      Stopped: [ node2-pcs ]
>>>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>>>      PrivateIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      PrivateIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>
>>> Then unstandby the second node:
>>>
>>> # pcs node unstandby node2-pcs
>>>
>>> The publicIP goes back, but the private does not:
>>>
>>>  Clone Set: PublicIP-clone [PublicIP] (unique)
>>>      PublicIP:0   (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      PublicIP:1   (ocf::heartbeat:IPaddr2):  Started node2-pcs
>>>  Clone Set: WEB-clone [WEB]
>>>      Started: [ node1-pcs node2-pcs ]
>>>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>>>      PrivateIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>      PrivateIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>>>
>>> Anybody see what I'm doing wrong? I'm not seeing anything in the logs to
>>> indicate that it tries node2 and then fails; but I'm fairly new to the
>>> software so it's possible I'm not looking in the right place.
>>
>> The pcs status would show any failed actions, and anything important in
>> the logs would start with "error:" or "warning:".
>>
>> At any given time, one of the nodes is the DC, meaning it schedules
>> actions for the whole cluster. That node will have more "pengine:"
>> messages in its logs at the time. You can check those logs to see what
>> decisions were made, as well as a "saving inputs" message to get the
>> cluster state that was used to make those decisions. There is a
>> crm_simulate tool that you can run on that file to get more information.
>>
>> By default, pacemaker will try to balance the number of resources
>> running on each node, so I'm not sure why in this case node1 has four
>> resources and node2 has two. crm_simulate might help explain it.
>>
>> However, there's nothing here telling pacemaker that the instances of
>> PrivateIP should run on different nodes when possible. With your
>> existing constraints, pacemaker would be equally happy to run both
>> PublicIP instances on one node and both PrivateIP instances on the
>> other node.
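As an illustration of the crm_simulate suggestion above (the pe-input file name below is made up; use whichever file the DC's "saving inputs" log message points at, typically under /var/lib/pacemaker/pengine/ on CentOS 7):

    crm_simulate -s -L

shows the allocation scores pacemaker computed against the live CIB, and

    crm_simulate -s -x /var/lib/pacemaker/pengine/pe-input-123.bz2

replays a saved policy-engine input, so you can see the scores and placement decisions that were made at that point in time.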
> Thanks for your reply. Finally getting back to this.
>
> Looking back at my config and my notes I realized I'm guilty of not
> giving you enough information. There was indeed an additional pair of
> resources that I didn't list in my original output that I didn't think
> were relevant to the issue--my bad. Reading what you wrote made me
> realize that it does appear as though pacemaker is simply trying to
> balance the overall load of *all* the available resources.
>
> But I'm still confused as to how one would definitively correct the
> issue. I tried this full reduction this morning. Starting from an
> empty two-node cluster (no resources, no constraints):
>
> [root@node1 clustertest]# pcs status
> Cluster name: MyCluster
> Stack: corosync
> Current DC: NONE
> Last updated: Sat Jun 10 10:58:46 2017          Last change: Sat Jun 10 10:40:23 2017 by root via cibadmin on node1-pcs
>
> 2 nodes and 0 resources configured
>
> OFFLINE: [ node1-pcs node2-pcs ]
>
> No resources
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>
> [root@node1 clustertest]# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=1.2.3.4 nic=bond0 cidr_netmask=24
> [root@node1 clustertest]# pcs resource meta ClusterIP resource-stickiness=0
> [root@node1 clustertest]# pcs resource clone ClusterIP clone-max=2 clone-node-max=2 globally-unique=true interleave=true
> [root@node1 clustertest]# pcs resource create Test1 systemd:vtest1
> [root@node1 clustertest]# pcs resource create Test2 systemd:vtest2
> [root@node1 clustertest]# pcs constraint location Test1 prefers node1-pcs=INFINITY
> [root@node1 clustertest]# pcs constraint location Test2 prefers node1-pcs=INFINITY
>
> [root@node1 clustertest]# pcs node standby node1-pcs
> [root@node1 clustertest]# pcs status
> Cluster name: MyCluster
> Stack: corosync
> Current DC: node1-pcs (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> Last updated: Sat Jun 10 11:01:07 2017          Last change: Sat Jun 10 11:00:59 2017 by root via crm_attribute on node1-pcs
>
> 2 nodes and 4 resources configured
>
> Node node1-pcs: standby
> Online: [ node2-pcs ]
>
> Full list of resources:
>
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>  Test1  (systemd:vtest1):  Started node2-pcs
>  Test2  (systemd:vtest2):  Started node2-pcs
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>
> [root@node1 clustertest]# pcs node unstandby node1-pcs
> [root@node1 clustertest]# pcs status resources
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>  Test1  (systemd:vtest1):  Started node1-pcs
>  Test2  (systemd:vtest2):  Started node1-pcs
>
> [root@node1 clustertest]# pcs node standby node2-pcs
> [root@node1 clustertest]# pcs status resources
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>  Test1  (systemd:vtest1):  Started node1-pcs
>  Test2  (systemd:vtest2):  Started node1-pcs
>
> [root@node1 clustertest]# pcs node unstandby node2-pcs
> [root@node1 clustertest]# pcs status resources
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>  Test1  (systemd:vtest1):  Started node1-pcs
>  Test2  (systemd:vtest2):  Started node1-pcs
>
> [root@node1 clustertest]# pcs resource delete ClusterIP
> Attempting to stop: ClusterIP...Stopped
> [root@node1 clustertest]# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=1.2.3.4 nic=bond0 cidr_netmask=24
> [root@node1 clustertest]# pcs resource meta ClusterIP resource-stickiness=0
> [root@node1 clustertest]# pcs resource clone ClusterIP clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>
> [root@node1 clustertest]# pcs status resources
>  Test1  (systemd:vtest1):  Started node1-pcs
>  Test2  (systemd:vtest2):  Started node1-pcs
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>
> [root@node1 clustertest]# pcs node standby node1-pcs
> [root@node1 clustertest]# pcs status resources
>  Test1  (systemd:vtest1):  Started node2-pcs
>  Test2  (systemd:vtest2):  Started node2-pcs
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>
> [root@node1 clustertest]# pcs node unstandby node1-pcs
> [root@node1 clustertest]# pcs status resources
>  Test1  (systemd:vtest1):  Started node1-pcs
>  Test2  (systemd:vtest2):  Started node1-pcs
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>
> [root@node1 clustertest]# pcs node standby node2-pcs
> [root@node1 clustertest]# pcs status resources
>  Test1  (systemd:vtest1):  Started node1-pcs
>  Test2  (systemd:vtest2):  Started node1-pcs
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>
> [root@node1 clustertest]# pcs node unstandby node2-pcs
> [root@node1 clustertest]# pcs status resources
>  Test1  (systemd:vtest1):  Started node1-pcs
>  Test2  (systemd:vtest2):  Started node1-pcs
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node1-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>
> [root@node1 clustertest]# pcs node standby node1-pcs
> [root@node1 clustertest]# pcs status resources
>  Test1  (systemd:vtest1):  Started node2-pcs
>  Test2  (systemd:vtest2):  Started node2-pcs
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>
> [root@node1 clustertest]# pcs node unstandby node1-pcs
> [root@node1 clustertest]# pcs status resources
>  Test1  (systemd:vtest1):  Started node1-pcs
>  Test2  (systemd:vtest2):  Started node1-pcs
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>      ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started node2-pcs
>
> So in the initial configuration, it works as expected: putting the
> nodes in standby one at a time (I waited at least 5 seconds between
> each standby/unstandby operation) and then restoring them shows the
> ClusterIP instances bouncing back and forth as expected. But after
> deleting the ClusterIP resource and recreating it exactly as it
> originally was, the clones initially both stay on one node (the one
> the test resources are not on). Putting the node the extra resources
> are on into standby and then restoring it, the IPs stay on the other
> node. Putting the node the extra resources are *not* on into standby
> and then restoring that node allows the IPs to split once again.
>
> I also did the test above with full pcs status displays after each
> standby/unstandby; there were no errors displayed at any step.
>
> So I guess my bottom-line question is: how does one tell Pacemaker
> that the individual legs of globally unique clones should *always* be
> spread across the available nodes whenever possible, regardless of the
> number of processes on any one of the nodes?

You configured 'clone-node-max=2'. Set that to '1' and the maximum
number of clone instances per node will be '1' - if that is what you
intended ...

Regards,
Klaus
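For what it's worth, a sketch of that change with pcs (the exact update command may differ between pcs versions; this follows the 0.9-era syntax used elsewhere in this thread):

    pcs resource meta ClusterIP-clone clone-node-max=1

Note the trade-off: with clone-node-max=1 only one of the two ClusterIP instances can run while a single node is available, and for a globally-unique IPaddr2 clone that generally means the stopped instance's share of the traffic goes unanswered until the other node returns.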
> For kicks I did try:
>
> pcs constraint location ClusterIP:0 prefers node1-pcs=INFINITY
>
> but it responded with an error about an invalid character (:).
>
> Thanks,
>
> Dan
>
>> I think you could probably get what you want by putting an optional
>> (<INFINITY) colocation preference between PrivateIP and PublicIP. The
>> only way pacemaker could satisfy that would be to run one of each on
>> each node.
>>
>>> Also, I noticed when putting a node in standby the main NIC appears to
>>> be interrupted momentarily (long enough for my SSH session, which is
>>> connected via the permanent IP on the NIC and not the clusterIP, to be
>>> dropped). Is there any way to avoid this? I was thinking that the
>>> cluster operations would only affect the clusterIP and not the other
>>> IPs being served on that NIC.
>>
>> Nothing in the cluster should cause that behavior. Check all the system
>> logs around the time to see if anything unusual is reported.
>>
>>> Thanks!
>>>
>>> Dan

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org