Re: [ClusterLabs] Clone Issue
I don't have a DRBD volume configured. The tutorials I was going through while trying to configure this cluster did show a DRBD setup, so I may have confused the configuration somehow... I will do some further reading on this and see if I can improve it; I just started reading some things that lead me to believe some of the reliability problems I'm seeing may be related to how that is configured, so it is certainly a target for improvement. Thank you for pointing that out!

On 2/15/2016 03:30, Ferenc Wágner wrote:
> "Frank D. Engel, Jr." writes:
>
>> Currently my status looks like this (with globally-unique set to false;
>> "cluster-data" is my GFS2 filesystem):
>>
>> Master/Slave Set: cluster-data-clone [cluster-data]
>>     Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
>
> I'm not too much into this, but isn't cluster-data your DRBD? GFS
> shouldn't be a Master/Slave set but a plain clone as far as I know.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
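[For anyone following along: Ferenc's point above — that a GFS2 mount should be an ordinary clone rather than a Master/Slave set — would look roughly like this with pcs. This is only a sketch: the device path and mount point are placeholders, not values from this thread.]

```shell
# Sketch: mount a GFS2 filesystem as an ordinary (anonymous) clone.
# /dev/vg0/cluster and /mnt/cluster are hypothetical - substitute your own.
pcs resource create cluster-data ocf:heartbeat:Filesystem \
    device="/dev/vg0/cluster" directory="/mnt/cluster" fstype="gfs2"
pcs resource clone cluster-data interleave=true

# GFS2 needs dlm running on every node, so start dlm first and keep the
# filesystem clone together with it.
pcs constraint order start dlm-clone then cluster-data-clone
pcs constraint colocation add cluster-data-clone with dlm-clone INFINITY
```

A Master/Slave (promotable) set is normally only needed for DRBD itself; the filesystem layered on top of it is just cloned.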
Re: [ClusterLabs] Clone Issue
I tried working with a few of these suggestions, but the issue doesn't seem to be there. All of the nodes were configured the same way for the status page.

After rebooting all of the nodes, two of the ClusterIP resources wound up on the same node, and "relocate run ClusterIP-clone" would not resolve this. I ended up taking the node with the duplicate out of the cluster (pcs cluster stop) and then adding it back in - this allowed that to run, and for some reason the web site is on all three nodes now.

So far the cluster behavior seems a bit flaky; maybe it is something odd in the configuration. While I can understand how two of the IP resources would wind up on the same node initially, I'm not sure why I would need to take a node out of the cluster like that to fix it. In some cases I've needed to reboot the nodes multiple times to get the cluster to start behaving again after reboots of nodes for other reasons: rebooting one of the three nodes sometimes causes the cluster-data-clone (file system) to restart, or even be completely lost on all of the nodes, and I've had to reboot a few times to get it back. I could understand that with two nodes down (and it should effectively take the filesystem down in that case), but with just one node going down that seems to be a problem.

Still experimenting and exploring. Thank you!

On 2/14/2016 10:23, Ken Gaillot wrote:
> On 02/13/2016 08:09 PM, Frank D. Engel, Jr. wrote:
>> Hi,
>>
>> I'm new to the software and to the list - just started experimenting
>> with trying to get a cluster working using CentOS 7 and the pcs utility,
>> and I've made some progress, but I can't quite figure out why I'm seeing
>> this one behavior - hoping someone can help; it might be something simple
>> I haven't picked up on yet.
>>
>> I have three nodes configured (running under VirtualBox) with shared
>> storage using GFS2 - that much seems to be working ok.
>>
>> I have a service called "WebSite" representing the Apache configuration,
>> and I cloned that to create "WebSite-clone", which I would expect to run
>> instances of on all three nodes.
>>
>> However, if I leave "globally-unique" off, it will only run on one node,
>> whereas if I turn it on, it will run on two, but never on all three. I've
>> tried a number of things to get this working. I did verify that I can
>> manually start and stop Apache on all three nodes and it works on any of
>> them that way.
>
> You don't want globally-unique=true; that's for cases where you want to
> be able to run multiple instances of the service on the same machine if
> necessary, because each clone handles different requests.
>
>> Currently my status looks like this (with globally-unique set to false;
>> "cluster-data" is my GFS2 filesystem):
>>
>> Cluster name: lincl
>> Last updated: Sat Feb 13 20:58:26 2016    Last change: Sat Feb 13
>> 20:45:08 2016 by root via crm_resource on lincl2-hb
>> Stack: corosync
>> Current DC: lincl2-hb (version 1.1.13-10.el7-44eb2dd) - partition with
>> quorum
>> 3 nodes and 13 resources configured
>>
>> Online: [ lincl0-hb lincl1-hb lincl2-hb ]
>>
>> Full list of resources:
>>
>> kdump (stonith:fence_kdump): Started lincl0-hb
>> Clone Set: dlm-clone [dlm]
>>     Started: [ lincl0-hb lincl1-hb lincl2-hb ]
>> Master/Slave Set: cluster-data-clone [cluster-data]
>>     Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
>> Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started lincl2-hb
>>     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started lincl0-hb
>>     ClusterIP:2 (ocf::heartbeat:IPaddr2): Started lincl1-hb
>> Clone Set: WebSite-clone [WebSite]
>>     Started: [ lincl0-hb ]
>>     Stopped: [ lincl1-hb lincl2-hb ]
>
> The above says that the cluster successfully started a WebSite instance
> on lincl0-hb, but it is for some reason prevented from doing so on the
> other two nodes.
>
>> Failed Actions:
>> * WebSite:0_start_0 on lincl2-hb 'unknown error' (1): call=142,
>>   status=Timed Out, exitreason='Failed to access httpd status page.',
>>   last-rc-change='Sat Feb 13 19:55:45 2016', queued=0ms, exec=120004ms
>
> This gives a good bit of info:
> * The "start" action on the "WebSite" resource failed on node lincl2-hb.
> * The failure was a timeout. The start action did not return in the
>   configured (or default) time.
> * The reason given by the apache resource agent was "Failed to access
>   httpd status page".
>
>> * WebSite:2_start_0 on lincl2-hb 'unknown error' (1): call=130,
>>   status=Timed Out, exitreason='none',
>>   last-rc-change='Sat Feb 13 19:33:49 2016', queued=0ms, exec=40003ms
>> * WebSite:1_monitor_6 on lincl0-hb 'unknown error' (1): call=101,
>>   status=complete, exitreason='Failed to access httpd status page.',
>>   last-rc-change='Sat Feb 13 19:53:53 2016', queued=0ms, exec=0ms
>> * WebSite:0_monitor_6 on lincl0-hb 'not running' (7): call=77,
>>   status=complete, exitreason='none',
>>   last-rc-change='Sat Feb 13 19:34:48 2016', queued=0ms, exec=0ms
>> * WebSite:2_start_0 on lincl1-hb 'unknown error' (1): call=41,
>>   status=Timed Out, exitreason='none',
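[As an aside on the globally-unique discussion quoted above: an ordinary anonymous clone (globally-unique=false, the default) is what runs one copy of a service per node. A sketch of checking and setting this with pcs — the resource name is the one from this thread; the meta values shown are assumptions about the intended three-node setup:]

```shell
# Show the clone's current configuration and meta attributes
pcs resource show WebSite-clone

# An anonymous clone: at most one instance per node, three in total
pcs resource update WebSite-clone meta globally-unique=false \
    clone-max=3 clone-node-max=1
```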
Re: [ClusterLabs] Clone Issue
On 02/13/2016 08:09 PM, Frank D. Engel, Jr. wrote:
> Hi,
>
> I'm new to the software and to the list - just started experimenting
> with trying to get a cluster working using CentOS 7 and the pcs utility,
> and I've made some progress, but I can't quite figure out why I'm seeing
> this one behavior - hoping someone can help; it might be something simple
> I haven't picked up on yet.
>
> I have three nodes configured (running under VirtualBox) with shared
> storage using GFS2 - that much seems to be working ok.
>
> I have a service called "WebSite" representing the Apache configuration,
> and I cloned that to create "WebSite-clone", which I would expect to run
> instances of on all three nodes.
>
> However, if I leave "globally-unique" off, it will only run on one node,
> whereas if I turn it on, it will run on two, but never on all three. I've
> tried a number of things to get this working. I did verify that I can
> manually start and stop Apache on all three nodes and it works on any of
> them that way.

You don't want globally-unique=true; that's for cases where you want to be able to run multiple instances of the service on the same machine if necessary, because each clone handles different requests.

> Currently my status looks like this (with globally-unique set to false;
> "cluster-data" is my GFS2 filesystem):
>
> Cluster name: lincl
> Last updated: Sat Feb 13 20:58:26 2016    Last change: Sat Feb 13
> 20:45:08 2016 by root via crm_resource on lincl2-hb
> Stack: corosync
> Current DC: lincl2-hb (version 1.1.13-10.el7-44eb2dd) - partition with
> quorum
> 3 nodes and 13 resources configured
>
> Online: [ lincl0-hb lincl1-hb lincl2-hb ]
>
> Full list of resources:
>
> kdump (stonith:fence_kdump): Started lincl0-hb
> Clone Set: dlm-clone [dlm]
>     Started: [ lincl0-hb lincl1-hb lincl2-hb ]
> Master/Slave Set: cluster-data-clone [cluster-data]
>     Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
> Clone Set: ClusterIP-clone [ClusterIP] (unique)
>     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started lincl2-hb
>     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started lincl0-hb
>     ClusterIP:2 (ocf::heartbeat:IPaddr2): Started lincl1-hb
> Clone Set: WebSite-clone [WebSite]
>     Started: [ lincl0-hb ]
>     Stopped: [ lincl1-hb lincl2-hb ]

The above says that the cluster successfully started a WebSite instance on lincl0-hb, but it is for some reason prevented from doing so on the other two nodes.

> Failed Actions:
> * WebSite:0_start_0 on lincl2-hb 'unknown error' (1): call=142,
>   status=Timed Out, exitreason='Failed to access httpd status page.',
>   last-rc-change='Sat Feb 13 19:55:45 2016', queued=0ms, exec=120004ms

This gives a good bit of info:
* The "start" action on the "WebSite" resource failed on node lincl2-hb.
* The failure was a timeout. The start action did not return in the configured (or default) time.
* The reason given by the apache resource agent was "Failed to access httpd status page".

> * WebSite:2_start_0 on lincl2-hb 'unknown error' (1): call=130,
>   status=Timed Out, exitreason='none',
>   last-rc-change='Sat Feb 13 19:33:49 2016', queued=0ms, exec=40003ms
> * WebSite:1_monitor_6 on lincl0-hb 'unknown error' (1): call=101,
>   status=complete, exitreason='Failed to access httpd status page.',
>   last-rc-change='Sat Feb 13 19:53:53 2016', queued=0ms, exec=0ms
> * WebSite:0_monitor_6 on lincl0-hb 'not running' (7): call=77,
>   status=complete, exitreason='none',
>   last-rc-change='Sat Feb 13 19:34:48 2016', queued=0ms, exec=0ms
> * WebSite:2_start_0 on lincl1-hb 'unknown error' (1): call=41,
>   status=Timed Out, exitreason='none',
>   last-rc-change='Sat Feb 13 19:53:41 2016', queued=1ms, exec=120004ms
>
>
> PCSD Status:
>   lincl0-hb: Online
>   lincl1-hb: Online
>   lincl2-hb: Online
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
>
> I'm not sure how to further troubleshoot those "Failed Actions" or how
> to clear them from the display?

Pacemaker relies on what the resource agent tells it, so when the resource agent fails, you'll have to look at that rather than pacemaker itself. Often, agents will print more detailed messages to the system log. Otherwise, just verifying the resource configuration and so forth is a good idea.

In this case, the big hint is the status page. The apache resource agent relies on the /server-status URL to verify that apache is running. Double-check that apache's configuration is identical on all nodes, particularly the /server-status configuration.

Once you've addressed the root cause of a failed action, you can clear it from the display with "pcs resource cleanup" -- see "man pcs" for the options it takes.

Another good idea is (with the cluster stopped) to ensure you can start apache manually on each node and see the server-status URL from that node (using curl or wget or whatever).

>
> Configuration of the WebSite-clone
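[The last two suggestions above can be turned into a quick per-node checklist; a sketch, assuming the stock /server-status location that the apache agent checks by default — if statusurl was customized in the resource, test that URL instead:]

```shell
# On each node, with the cluster stopped: start apache by hand and make
# sure the status handler the resource agent polls actually answers.
systemctl start httpd
curl -s http://localhost/server-status
systemctl stop httpd

# Once the root cause is fixed, clear the failure history so the
# "Failed Actions" entries disappear from the status display.
pcs resource cleanup WebSite
```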