On 02/13/2016 08:09 PM, Frank D. Engel, Jr. wrote:
Hi,
I'm new to the software, and to the list - just started experimenting
with trying to get a cluster working using CentOS 7 and the pcs utility,
and I've made some progress, but I can't quite figure out why I'm seeing
this one behavior - hoping someone can help, might be something simple I
haven't picked up on yet.
I have three nodes configured (running under VirtualBox) with shared
storage using GFS2 - that much seems to be working ok.
I have a service called "WebSite" representing the Apache configuration,
and I cloned that to create "WebSite-clone", which I would expect to run
instances of on all three nodes.
However, if I leave "globally-unique" off, it will only run on one node,
whereas if I turn it on, it will run on two, but never on all three. I've
tried a number of things to get this working. I did verify that I can
manually start and stop Apache on all three nodes and it works on any of
them that way.
You don't want globally-unique=true; that's for cases where you want to
be able to run multiple instances of the service on the same machine if
necessary, because each clone handles different requests.
Currently my status looks like this (with globally-unique set to false;
"cluster-data" is my GFS2 filesystem):
Cluster name: lincl
Last updated: Sat Feb 13 20:58:26 2016
Last change: Sat Feb 13 20:45:08 2016 by root via crm_resource on lincl2-hb
Stack: corosync
Current DC: lincl2-hb (version 1.1.13-10.el7-44eb2dd) - partition with
quorum
3 nodes and 13 resources configured
Online: [ lincl0-hb lincl1-hb lincl2-hb ]
Full list of resources:
kdump (stonith:fence_kdump): Started lincl0-hb
Clone Set: dlm-clone [dlm]
Started: [ lincl0-hb lincl1-hb lincl2-hb ]
Master/Slave Set: cluster-data-clone [cluster-data]
Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
Clone Set: ClusterIP-clone [ClusterIP] (unique)
ClusterIP:0 (ocf::heartbeat:IPaddr2): Started lincl2-hb
ClusterIP:1 (ocf::heartbeat:IPaddr2): Started lincl0-hb
ClusterIP:2 (ocf::heartbeat:IPaddr2): Started lincl1-hb
Clone Set: WebSite-clone [WebSite]
Started: [ lincl0-hb ]
Stopped: [ lincl1-hb lincl2-hb ]
The above says that the cluster successfully started a WebSite instance
on lincl0-hb, but it is for some reason prevented from doing so on the
other two nodes.
Failed Actions:
* WebSite:0_start_0 on lincl2-hb 'unknown error' (1): call=142,
status=Timed Out, exitreason='Failed to access httpd status page.',
last-rc-change='Sat Feb 13 19:55:45 2016', queued=0ms, exec=120004ms
This gives a good bit of info:
* The "start" action on the "WebSite" resource failed on node lincl2-hb.
* The failure was a timeout. The start action did not return in the
configured (or default) time.
* The reason given by the apache resource agent was "Failed to access
httpd status page".
* WebSite:2_start_0 on lincl2-hb 'unknown error' (1): call=130,
status=Timed Out, exitreason='none',
last-rc-change='Sat Feb 13 19:33:49 2016', queued=0ms, exec=40003ms
* WebSite:1_monitor_60000 on lincl0-hb 'unknown error' (1): call=101,
status=complete, exitreason='Failed to access httpd status page.',
last-rc-change='Sat Feb 13 19:53:53 2016', queued=0ms, exec=0ms
* WebSite:0_monitor_60000 on lincl0-hb 'not running' (7): call=77,
status=complete, exitreason='none',
last-rc-change='Sat Feb 13 19:34:48 2016', queued=0ms, exec=0ms
* WebSite:2_start_0 on lincl1-hb 'unknown error' (1): call=41,
status=Timed Out, exitreason='none',
last-rc-change='Sat Feb 13 19:53:41 2016', queued=1ms, exec=120004ms
PCSD Status:
lincl0-hb: Online
lincl1-hb: Online
lincl2-hb: Online
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
I'm not sure how to further troubleshoot those "Failed Actions" or how
to clear them from the display?
Pacemaker relies on what the resource agent tells it, so when the
resource agent fails, you'll have to look at that rather than pacemaker
itself. Often, agents will print more detailed messages to the system
log. Otherwise, just verifying the resource configuration and so forth
is a good idea.
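As a rough sketch of what to look for in the log: on CentOS 7 the system
log is normally /var/log/messages, and OCF agents usually tag their
messages with the agent name and resource id, e.g. "apache(WebSite)".
Both of those are assumptions about this setup rather than something
taken from the status output, and "agent_log_lines" is a made-up helper
name:

```shell
# Print only the log lines written by the apache resource agent for the
# given resource id. Reads the log on stdin.
# Assumption: agent messages are tagged "apache(<resource id>...".
agent_log_lines() {
    grep -F "apache($1"
}
```

Usage would be something like "agent_log_lines WebSite < /var/log/messages"
on the node where the start failed.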
In this case, the big hint is the status page. The apache resource agent
relies on the /server-status URL to verify that apache is running.
Double-check that apache's configuration is identical on all nodes,
particularly the /server-status configuration.
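For reference, a minimal sketch of a status handler that would answer
the /server-status URL. This assumes Apache 2.4 (which is what CentOS 7
ships) with mod_status loaded; where you put it (for example a file such
as /etc/httpd/conf.d/status.conf) is up to you, as long as the same
fragment is present on every node:

```apache
# Serve the status page at /server-status, reachable only from the
# local machine (the resource agent checks it via localhost).
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
```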
Once you've addressed the root cause of a failed action, you can clear
it from the display with "pcs resource cleanup" -- see "man pcs" for the
options it takes.
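For the resource in this thread that might look like the following. The
wrapper function name is only for illustration; the two pcs subcommands
it calls ("resource cleanup" and "status") are the real ones:

```shell
# Clear the recorded failures for one resource, then show the cluster
# status so you can see whether the resource starts cleanly this time.
# clear_failed_actions is an illustrative helper, not a pcs command.
clear_failed_actions() {
    resource=$1
    pcs resource cleanup "$resource" && pcs status
}
```

e.g. "clear_failed_actions WebSite-clone", run as root on any one node.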
Another good idea is (with the cluster stopped) to ensure you can start
apache manually on each node and see the server-status URL from that
node (using curl or wget or whatever).
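A rough way to script that check is sketched below. The function name
"check_status_url" is made up for this example, and the default URL is
simply the statusurl value from the WebSite resource; the curl options
used (--silent, --fail, --max-time, --output) are standard ones:

```shell
# Fetch the status URL and report whether the request succeeded.
# check_status_url is an illustrative helper, not part of pcs or pacemaker.
check_status_url() {
    url=${1:-http://localhost/server-status}
    # --fail makes curl return non-zero on HTTP errors such as 403 or 404.
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
        echo "OK: $url"
    else
        echo "FAILED: $url"
        return 1
    fi
}
```

Run it with no argument on each of the three nodes after starting httpd
by hand; a FAILED result on a node points at that node's apache
configuration rather than at the cluster.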
Configuration of the WebSite-clone looks like:
[root@lincl2 /]# pcs resource show WebSite-clone
Clone: WebSite-clone
Meta Attrs: globally-unique=false clone-node-max=1 clone-max=3
interleave=true
Resource: WebSite (class=ocf provider=heartbeat type=apache)
Attributes: configfile=/etc/httpd/conf/httpd.conf
statusurl=http://localhost/server-status
Operations: start interval=0s timeout=120s (WebSite-start-interval-0s)
stop interval=0s timeout=60s (WebSite-stop-interval-0s)
monitor interval=1min (WebSite-monitor-interval-1min)
Now I change globally-unique to true, and this happens:
[root@lincl2 /]# pcs resource update WebSite-clone globally-unique=true
[root@lincl2 /]# pcs resource
Clone Set: dlm-clone [dlm]
Started: [ lincl0-hb lincl1-hb lincl2-hb ]
Master/Slave Set: cluster-data-clone [cluster-data]
Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
Clone Set: ClusterIP-clone [ClusterIP] (unique)
ClusterIP:0 (ocf::heartbeat:IPaddr2): Started lincl2-hb
ClusterIP:1 (ocf::heartbeat:IPaddr2): Started lincl0-hb
ClusterIP:2 (ocf::heartbeat:IPaddr2): Started lincl1-hb
Clone Set: WebSite-clone [WebSite] (unique)
WebSite:0 (ocf::heartbeat:apache): Started lincl0-hb
WebSite:1 (ocf::heartbeat:apache): Started lincl1-hb
WebSite:2 (ocf::heartbeat:apache): Stopped
Constraints are set up as follows:
[root@lincl2 /]# pcs constraint
Location Constraints:
Ordering Constraints:
start dlm-clone then start cluster-data-clone (kind:Mandatory)
start ClusterIP-clone then start WebSite-clone (kind:Mandatory)
start cluster-data-clone then start WebSite-clone (kind:Mandatory)
Colocation Constraints:
cluster-data-clone with dlm-clone (score:INFINITY)
WebSite-clone with ClusterIP-clone (score:INFINITY)
WebSite-clone with cluster-data-clone (score:INFINITY)
As far as I can tell, there is no activity in the Apache log files from
pcs trying to start it and it failing or taking too long - it seems that
it never gets far enough for Apache itself to be trying to start.
Can someone give me ideas on how to further troubleshoot this? Ideally
I'd like it running one instance on each available node.