Re: [ClusterLabs] Clone Issue

Vladislav Bogdanov Mon, 15 Feb 2016 04:54:07 -0800

15.02.2016 15:18, Frank D. Engel, Jr. wrote:

Good tip on the status url - I did go ahead and make that update.


This mostly depends on how the resolver is configured on a given node.

Look at your /etc/hosts - I'd bet you have two localhost records there - one for IPv6 and one for IPv4. On the other hand, your apache is probably configured to listen only on IPv4 addresses, so the resource agent cannot connect to the IPv6 loopback.
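
One way to check this on a node is to list every address that the hosts file maps to localhost. The sketch below assumes the conventional layout (address first, names after it); localhost_addrs is an illustrative helper name, not part of any package.

```shell
#!/bin/sh
# List every address that a hosts file maps to the name "localhost".
# Usage: localhost_addrs [hosts-file]   (defaults to /etc/hosts)
localhost_addrs() {
    awk '
        { sub(/#.*/, "") }                  # strip comments
        {
            # fields after the address are host names and aliases
            for (i = 2; i <= NF; i++)
                if ($i == "localhost") { print $1; break }
        }
    ' "${1:-/etc/hosts}"
}
```

If this prints both 127.0.0.1 and ::1, comparing "curl -4 http://localhost/server-status" with "curl -6 http://localhost/server-status" should show whether only the IPv6 loopback is being refused.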



I'm not sure that I agree with the IP relying on Apache, though.

There could be multiple services hanging off the same IP addresses; if
the IPs depend on one of those services, then would not stopping that
one service for maintenance also impact all of the others by stopping
the IP resource, when they otherwise could have continued to function?

My experience shows that in most cases this should be avoided as much as
possible. Do you really want the IP to reside on a node where the service
fails to start, thus losing client connections?



On 2/15/2016 01:47, Vladislav Bogdanov wrote:

"Frank D. Engel, Jr." <fde...@fjrhome.net> wrote:

I tried working with a few of these suggestions but the issue doesn't
seem to be there.  All of them were configured the same way for the
status page.

Try to replace localhost with 127.0.0.1 in the status url param.
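
With pcs, that change might look like the following sketch. The resource name WebSite and the /server-status path are taken from the configuration quoted further down; the wrapper function name is only illustrative.

```shell
#!/bin/sh
# Point the apache agent's status check at the IPv4 loopback explicitly,
# so that name resolution of "localhost" no longer matters.
# Assumption: the primitive is named WebSite, as in the quoted configuration.
use_ipv4_statusurl() {
    pcs resource update "${1:-WebSite}" statusurl="http://127.0.0.1/server-status"
}
```

Calling use_ipv4_statusurl on any one cluster node should be enough, and "pcs resource show WebSite" can be used afterwards to confirm the new value.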

After rebooting all of the nodes, two of the ClusterIP resources wound
up on the same node, and "relocate run ClusterIP-clone" would not

Unfortunately, with the default placement strategy, the cluster spreads
resources equally over all the nodes. You can play with utilization
placement, assigning some attribute on all nodes to the number of
globally-unique clone instances, and adding utilization param
that_attribute=1 to ClusterIP.
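
A rough sketch of that setup with pcs follows. The attribute name ip_slots is made up for illustration, the choice of placement-strategy=balanced (rather than utilization or minimal) is an assumption, and the "pcs node utilization" and "pcs resource utilization" subcommands may not be present in older pcs releases, so check "man pcs" first.

```shell
#!/bin/sh
# Hypothetical utilization attribute; the name is arbitrary but must match
# between the nodes (capacity) and the resource (requirement).
ATTR="ip_slots"

# Usage: spread_cluster_ip <instances> <node>...
spread_cluster_ip() {
    instances="$1"
    shift
    # Assumption: "balanced" is the strategy wanted here; the thread only
    # says "utilization placement".
    pcs property set placement-strategy=balanced
    for node in "$@"; do
        # Capacity per node = number of globally-unique clone instances.
        pcs node utilization "$node" "$ATTR=$instances"
    done
    # Each ClusterIP instance consumes one unit of that capacity.
    pcs resource utilization ClusterIP "$ATTR=1"
}
```

For the three-node cluster in this thread the call would be "spread_cluster_ip 3 lincl0-hb lincl1-hb lincl2-hb".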

I raised this issue quite long ago, but it is not solved yet.

Last, you probably want to change your ClusterIP-related constraints,
so that its instances are allocated together with the running apache
instance, not vice versa.
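
In pcs terms, reversing the dependency could look roughly like this. The constraint ids are not shown in the thread, so they are passed in as arguments ("pcs constraint --full" lists them); flipping the ordering along with the colocation is an assumption here, not something the thread spells out.

```shell
#!/bin/sh
# Usage: ip_follows_apache <old-colocation-id> <old-order-id>
# Both ids are placeholders; look them up with "pcs constraint --full".
ip_follows_apache() {
    # Drop "WebSite-clone with ClusterIP-clone" and the matching ordering.
    pcs constraint remove "$1" "$2"
    # Place IP instances only where an apache instance is running.
    pcs constraint colocation add ClusterIP-clone with WebSite-clone INFINITY
    # Assumption: the IP should come up only after apache has started.
    pcs constraint order start WebSite-clone then start ClusterIP-clone
}
```

The trade-off raised elsewhere in this thread still applies: with this arrangement, stopping apache on a node also moves or stops the IP instance there.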


Best,
Vladislav

resolve this.  I ended up taking the node with the duplicate out of the
cluster (pcs cluster stop) and then adding it back in - this allowed
that to run, and for some reason, the web site is on all three nodes
now.

So far the cluster behavior seems a bit flaky; maybe it is something odd
in the configuration, but while I can understand how two of the IP
resources would wind up on the same node initially, I'm not sure why I
would need to take a node out of the cluster like that to fix it?

In some cases I've needed to reboot the nodes multiple times to get the
cluster to start behaving again after reboots of nodes for other
reasons; rebooting one of the three nodes sometimes causes the
cluster-data-clone (file system) to restart or even just be completely
lost on all of the nodes, and I've had to reboot a few times to get it
back.  I could understand that with two nodes down (and it should
effectively take the filesystem down in that case), but with just one
going down that seems to be a problem.

Still experimenting and exploring.


Thank you!



On 2/14/2016 10:23, Ken Gaillot wrote:

On 02/13/2016 08:09 PM, Frank D. Engel, Jr. wrote:

Hi,

I'm new to the software, and with the list - just started experimenting
with trying to get a cluster working using CentOS 7 and the pcs utility,
and I've made some progress, but I can't quite figure out why I'm seeing
this one behavior - hoping someone can help, might be something simple I
haven't picked up on yet.

I have three nodes configured (running under VirtualBox) with shared
storage using GFS2 - that much seems to be working ok.

I have a service called "WebSite" representing the Apache configuration,
and I cloned that to create "WebSite-clone", which I would expect to run
instances of on all three nodes.

However, if I leave "globally-unique" off, it will only run on one node,
where if I turn it on, it will run on two, but never on all three.  I've
tried a number of things to get this working.  I did verify that I can
manually start and stop Apache on all three nodes and it works on any of
them that way.

You don't want globally-unique=true; that's for cases where you want to
be able to run multiple instances of the service on the same machine if
necessary, because each clone handles different requests.

Currently my status looks like this (with globally-unique set to false;
"cluster-data" is my GFS2 filesystem):

Cluster name: lincl
Last updated: Sat Feb 13 20:58:26 2016        Last change: Sat Feb
20:45:08 2016 by root via crm_resource on lincl2-hb
Stack: corosync
Current DC: lincl2-hb (version 1.1.13-10.el7-44eb2dd) - partition with quorum
3 nodes and 13 resources configured

Online: [ lincl0-hb lincl1-hb lincl2-hb ]

Full list of resources:

   kdump    (stonith:fence_kdump):    Started lincl0-hb
   Clone Set: dlm-clone [dlm]
       Started: [ lincl0-hb lincl1-hb lincl2-hb ]
   Master/Slave Set: cluster-data-clone [cluster-data]
       Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
   Clone Set: ClusterIP-clone [ClusterIP] (unique)
       ClusterIP:0    (ocf::heartbeat:IPaddr2):    Started lincl2-hb
       ClusterIP:1    (ocf::heartbeat:IPaddr2):    Started lincl0-hb
       ClusterIP:2    (ocf::heartbeat:IPaddr2):    Started lincl1-hb
   Clone Set: WebSite-clone [WebSite]
       Started: [ lincl0-hb ]
       Stopped: [ lincl1-hb lincl2-hb ]

The above says that the cluster successfully started a WebSite instance
on lincl0-hb, but it is for some reason prevented from doing so on the
other two nodes.

Failed Actions:
* WebSite:0_start_0 on lincl2-hb 'unknown error' (1): call=142,
status=Timed Out, exitreason='Failed to access httpd status page.',
      last-rc-change='Sat Feb 13 19:55:45 2016', queued=0ms, exec=120004ms

This gives a good bit of info:

* The "start" action on the "WebSite" resource failed on node lincl2-hb.

* The failure was a timeout. The start action did not return in the
configured (or default) time.

* The reason given by the apache resource agent was "Failed to access
httpd status page".

* WebSite:2_start_0 on lincl2-hb 'unknown error' (1): call=130,
status=Timed Out, exitreason='none',
      last-rc-change='Sat Feb 13 19:33:49 2016', queued=0ms, exec=40003ms

* WebSite:1_monitor_60000 on lincl0-hb 'unknown error' (1): call=101,
status=complete, exitreason='Failed to access httpd status page.',
      last-rc-change='Sat Feb 13 19:53:53 2016', queued=0ms, exec=0ms
* WebSite:0_monitor_60000 on lincl0-hb 'not running' (7): call=77,
status=complete, exitreason='none',
      last-rc-change='Sat Feb 13 19:34:48 2016', queued=0ms, exec=0ms
* WebSite:2_start_0 on lincl1-hb 'unknown error' (1): call=41,
status=Timed Out, exitreason='none',
      last-rc-change='Sat Feb 13 19:53:41 2016', queued=1ms, exec=120004ms


PCSD Status:
    lincl0-hb: Online
    lincl1-hb: Online
    lincl2-hb: Online

Daemon Status:
    corosync: active/enabled
    pacemaker: active/enabled
    pcsd: active/enabled



I'm not sure how to further troubleshoot those "Failed Actions" or how
to clear them from the display?

Pacemaker relies on what the resource agent tells it, so when the
resource agent fails, you'll have to look at that rather than pacemaker
itself. Often, agents will print more detailed messages to the system
log. Otherwise, just verifying the resource configuration and so forth
is a good idea.

In this case, the big hint is the status page. The apache resource agent
relies on the /server-status URL to verify that apache is running.
Double-check that apache's configuration is identical on all nodes,
particularly the /server-status configuration.

Once you've addressed the root cause of a failed action, you can clear
it from the display with "pcs resource cleanup" -- see "man pcs" for the
options it takes.

Another good idea is (with the cluster stopped) to ensure you can start
apache manually on each node and see the server-status URL from that
node (using curl or wget or whatever).
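
Those two steps, checking the status page by hand and then clearing the recorded failure, could be scripted along these lines. The function names are illustrative, and the default URL assumes the /server-status path from the resource configuration with the IPv4 loopback address substituted for localhost.

```shell
#!/bin/sh
# Succeeds only if the status page answers without an HTTP error status.
# Usage: check_status_url [url]
check_status_url() {
    curl --silent --show-error --fail --max-time 10 --output /dev/null \
        "${1:-http://127.0.0.1/server-status}"
}

# Clear old failures for the clone once the root cause is fixed.
# Usage: clear_failed_actions [resource-id]
clear_failed_actions() {
    pcs resource cleanup "${1:-WebSite-clone}"
}
```

Running "check_status_url && echo ok" on each node after starting apache by hand gives a quick pass/fail before handing control back to the cluster.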

Configuration of the WebSite-clone looks like:

[root@lincl2 /]# pcs resource show WebSite-clone
   Clone: WebSite-clone
    Meta Attrs: globally-unique=false clone-node-max=1 clone-max=3 interleave=true
    Resource: WebSite (class=ocf provider=heartbeat type=apache)
     Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status
     Operations: start interval=0s timeout=120s (WebSite-start-interval-0s)
                 stop interval=0s timeout=60s (WebSite-stop-interval-0s)
                 monitor interval=1min (WebSite-monitor-interval-1min)



Now I change globally-unique to true, and this happens:

[root@lincl2 /]# pcs resource update WebSite-clone globally-unique=true

[root@lincl2 /]# pcs resource
   Clone Set: dlm-clone [dlm]
       Started: [ lincl0-hb lincl1-hb lincl2-hb ]
   Master/Slave Set: cluster-data-clone [cluster-data]
       Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
   Clone Set: ClusterIP-clone [ClusterIP] (unique)
       ClusterIP:0    (ocf::heartbeat:IPaddr2):    Started lincl2-hb
       ClusterIP:1    (ocf::heartbeat:IPaddr2):    Started lincl0-hb
       ClusterIP:2    (ocf::heartbeat:IPaddr2):    Started lincl1-hb
   Clone Set: WebSite-clone [WebSite] (unique)
       WebSite:0    (ocf::heartbeat:apache):    Started lincl0-hb
       WebSite:1    (ocf::heartbeat:apache):    Started lincl1-hb
       WebSite:2    (ocf::heartbeat:apache):    Stopped


Constraints are set up as follows:

[root@lincl2 /]# pcs constraint
Location Constraints:
Ordering Constraints:
    start dlm-clone then start cluster-data-clone (kind:Mandatory)
    start ClusterIP-clone then start WebSite-clone (kind:Mandatory)
    start cluster-data-clone then start WebSite-clone (kind:Mandatory)

Colocation Constraints:
    cluster-data-clone with dlm-clone (score:INFINITY)
    WebSite-clone with ClusterIP-clone (score:INFINITY)
    WebSite-clone with cluster-data-clone (score:INFINITY)





As far as I can tell, there is no activity in the Apache log files from
pcs trying to start it and it failing or taking too long - it seems that
it never gets far enough for Apache itself to be trying to start.


Can someone give me ideas on how to further troubleshoot this?  Ideally
I'd like it running one instance on each available node.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: http://bugs.clusterlabs.org


