Re: [ClusterLabs] Triggered assert at xml.c:594

2016-02-14 Thread Patrick Zwahlen
Replying to myself,

This seems to be related to the latest drbd RA (8.9.4+).
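To check which version of the DRBD RA is actually installed, something like
this should work (a sketch; the package names vary by repository and are an
assumption here):

    # Which package provides the DRBD resource agent
    rpm -q drbd-utils drbd84-utils
    # The agent itself usually lives under the linbit OCF provider directory
    ls -l /usr/lib/ocf/resource.d/linbit/drbd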

Still, is it something we should worry about?

Regards!

-Original Message-
From: Patrick Zwahlen [mailto:p...@navixia.com] 
Sent: samedi, 13 février 2016 15:47
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: [ClusterLabs] Triggered assert at xml.c:594

Hi,

Short:
I'm getting asserts in my logs and wonder if I should worry.

Long:
I'm running a lab on CentOS 7.2:
  pacemaker-1.1.13-10.el7.x86_64
  corosync-2.3.4-7.el7_2.1.x86_64

Since my latest "yum update", I see the following errors in my logs:

Feb 13 15:22:54 san1.local crmd[1896]:error: pcmkRegisterNode: Triggered 
assert at xml.c:594 : node->type == XML_ELEMENT_NODE

I see these logs during cluster start and when moving DRBD resources from one 
node to the other. Everything seems to work, though.

Strange thing is, I didn't update any pacemaker-related RPMs:

Feb 13 14:35:02 Updated: 1:openssl-libs-1.0.1e-51.el7_2.2.x86_64
Feb 13 14:35:02 Updated: openssh-6.6.1p1-23.el7_2.x86_64
Feb 13 14:35:02 Updated: 32:bind-license-9.9.4-29.el7_2.2.noarch
Feb 13 14:35:02 Updated: 32:bind-libs-9.9.4-29.el7_2.2.x86_64
Feb 13 14:35:02 Updated: nss-3.19.1-19.el7_2.x86_64
Feb 13 14:35:02 Updated: nss-sysinit-3.19.1-19.el7_2.x86_64
Feb 13 14:35:03 Updated: 1:grub2-tools-2.02-0.34.el7.centos.x86_64
Feb 13 14:35:03 Updated: 1:grub2-2.02-0.34.el7.centos.x86_64
Feb 13 14:35:03 Updated: nss-tools-3.19.1-19.el7_2.x86_64
Feb 13 14:35:03 Updated: 32:bind-utils-9.9.4-29.el7_2.2.x86_64
Feb 13 14:35:04 Updated: 32:bind-libs-lite-9.9.4-29.el7_2.2.x86_64
Feb 13 14:35:04 Updated: openssh-clients-6.6.1p1-23.el7_2.x86_64
Feb 13 14:35:04 Updated: openssh-server-6.6.1p1-23.el7_2.x86_64
Feb 13 14:35:04 Updated: 1:openssl-1.0.1e-51.el7_2.2.x86_64
Feb 13 14:35:04 Updated: ntpdate-4.2.6p5-22.el7.centos.1.x86_64
Feb 13 14:35:04 Updated: python-perf-3.10.0-327.4.5.el7.x86_64
Feb 13 14:35:04 Updated: gnutls-3.3.8-14.el7_2.x86_64
Feb 13 14:35:10 Updated: tzdata-2016a-1.el7.noarch
Feb 13 14:35:10 Installed: kernel-3.10.0-327.4.5.el7.x86_64

Could the new kernel be the reason for those asserts?

Thanks for your input on this one. - Patrick -


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



[ClusterLabs] Pacemaker issue when ethernet interface is pulled down

2016-02-14 Thread Debabrata Pani
Hi,
We ran into some problems when we pulled down the ethernet interface using
“ifconfig eth0 down”.

Our cluster has the following configuration and resources

  *   Two network interfaces: eth0 and lo (loopback)
  *   3 nodes, with one node put in maintenance mode
  *   no-quorum-policy=stop (set as sketched below)
  *   stonith-enabled=false
  *   Postgresql Master/Slave
  *   VIP master and VIP replication IPs
  *   VIPs will run on the node where the Postgresql Master is running
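For reference, those two cluster properties would be set roughly like this
with pcs (a sketch; crmsh users would use the equivalent "crm configure
property" commands):

    pcs property set no-quorum-policy=stop
    pcs property set stonith-enabled=false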

Two test cases that we executed are as follows

  *   Introduce delay in the ethernet interface of the postgresql PRIMARY node
 (Command: tc qdisc add dev eth0 root netem delay 8000ms)
  *   `ifconfig eth0 down` on the postgresql PRIMARY node
  *   We expected that both these test cases test for network problems in the
cluster (the commands to revert each injection are sketched below)
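Each injection can be reverted as follows (a minimal sketch, assuming eth0 as
above):

    # Remove the netem delay added in the first test
    tc qdisc del dev eth0 root
    # Bring the interface back up after the second test
    ifconfig eth0 up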

In the first case (ethernet interface delay)

  *   Cluster is divided into “partition WITH quorum” and “partition WITHOUT 
quorum”
  *   Partition WITHOUT quorum shuts down all the services
  *   Partition WITH quorum takes over as Postgresql PRIMARY and VIPs
  *   Everything as expected. Wow!

In the second case (ethernet interface down)

  *   We see lots of errors like the following on the node:
 *   Feb 12 14:09:48 corosync [MAIN  ] Totem is unable to form a cluster 
because of an operating system or network fault. The most common cause of this 
message is that the local firewall is configured improperly.
 *   Feb 12 14:09:49 corosync [MAIN  ] Totem is unable to form a cluster 
because of an operating system or network fault. The most common cause of this 
message is that the local firewall is configured improperly.
 *   Feb 12 14:09:51 corosync [MAIN  ] Totem is unable to form a cluster 
because of an operating system or network fault. The most common cause of this 
message is that the local firewall is configured improperly.
  *   But `crm_mon -Afr` (from the node whose eth0 is down) always shows
the cluster to be fully formed.
 *   It shows all the nodes as UP
 *   It shows itself as the one running the postgresql PRIMARY (as was the
case before the ethernet interface went down)
  *   `crm_mon -Afr` on the OTHER nodes shows a different story
 *   They show the other node as down
 *   One of the other two nodes takes over the postgresql PRIMARY
  *   This leads to a split-brain situation, which was gracefully avoided in the
test case where only delay was introduced into the interface

Questions:

  *   Is it a known issue with pacemaker when the ethernet interface is pulled
down?
  *   Is this an incorrect way of testing the cluster? There is some information
regarding this in the following thread:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/59738

Regards,
Deba



Re: [ClusterLabs] Pacemaker issue when ethernet interface is pulled down

2016-02-14 Thread emmanuel segura
Use fencing, and after you have configured fencing, use iptables to test
your cluster; with iptables you can block ports 5404 and 5405.
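For example, a minimal sketch assuming the default corosync UDP ports (run on
the node you want to isolate, then delete the rules to heal the split):

    # Simulate a network split by dropping totem traffic in both directions
    iptables -A INPUT  -p udp --dport 5404:5405 -j DROP
    iptables -A OUTPUT -p udp --dport 5404:5405 -j DROP
    # Heal the split by removing the same rules
    iptables -D INPUT  -p udp --dport 5404:5405 -j DROP
    iptables -D OUTPUT -p udp --dport 5404:5405 -j DROP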

2016-02-14 14:09 GMT+01:00 Debabrata Pani :
> [...]
>



-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^



Re: [ClusterLabs] Pacemaker issue when ethernet interface is pulled down

2016-02-14 Thread Debabrata Pani
Hi Emmanuel,

Thank you for the suggestion.
If I understand correctly, fencing can be configured to shut down the node
on which the ethernet interface has gone down.
That appears to be a correct suggestion, but I still have a few queries.

Queries:
* Is the test case “put down the ethernet interface” not a valid one?
* Why is the node unable to detect that it is cut off from the cluster and
shut the services down as per the “no-quorum-policy” configuration?


Regards,
Debabrata

On 14/02/16 19:31, "emmanuel segura"  wrote:

> [...]




Re: [ClusterLabs] Clone Issue

2016-02-14 Thread Ken Gaillot
On 02/13/2016 08:09 PM, Frank D. Engel, Jr. wrote:
> Hi,
> 
> I'm new to the software and to the list - I just started experimenting
> with trying to get a cluster working using CentOS 7 and the pcs utility,
> and I've made some progress, but I can't quite figure out why I'm seeing
> this one behavior - hoping someone can help, might be something simple I
> haven't picked up on yet.
> 
> I have three nodes configured (running under VirtualBox) with shared
> storage using GFS2 - that much seems to be working ok.
> 
> I have a service called "WebSite" representing the Apache configuration,
> and I cloned that to create "WebSite-clone", which I would expect to run
> instances of on all three nodes.
> 
> However, if I leave "globally-unique" off, it will only run on one node,
> where if I turn it on, it will run on two, but never on all three.  I've
> tried a number of things to get this working.  I did verify that I can
> manually start and stop Apache on all three nodes and it works on any of
> them that way.

You don't want globally-unique=true; that's for cases where you want to
be able to run multiple instances of the service on the same machine if
necessary, because each clone handles different requests.
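For example, with pcs the meta attribute can be switched back off on the
clone (a sketch, using the resource name from the status output below):

    pcs resource meta WebSite-clone globally-unique=false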

> Currently my status looks like this (with globally-unique set to false;
> "cluster-data" is my GFS2 filesystem):
> 
> Cluster name: lincl
> Last updated: Sat Feb 13 20:58:26 2016Last change: Sat Feb 13
> 20:45:08 2016 by root via crm_resource on lincl2-hb
> Stack: corosync
> Current DC: lincl2-hb (version 1.1.13-10.el7-44eb2dd) - partition with
> quorum
> 3 nodes and 13 resources configured
> 
> Online: [ lincl0-hb lincl1-hb lincl2-hb ]
> 
> Full list of resources:
> 
>  kdump(stonith:fence_kdump):Started lincl0-hb
>  Clone Set: dlm-clone [dlm]
>  Started: [ lincl0-hb lincl1-hb lincl2-hb ]
>  Master/Slave Set: cluster-data-clone [cluster-data]
>  Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>  ClusterIP:0(ocf::heartbeat:IPaddr2):Started lincl2-hb
>  ClusterIP:1(ocf::heartbeat:IPaddr2):Started lincl0-hb
>  ClusterIP:2(ocf::heartbeat:IPaddr2):Started lincl1-hb
>  Clone Set: WebSite-clone [WebSite]
>  Started: [ lincl0-hb ]
>  Stopped: [ lincl1-hb lincl2-hb ]

The above says that the cluster successfully started a WebSite instance
on lincl0-hb, but it is for some reason prevented from doing so on the
other two nodes.

> Failed Actions:
> * WebSite:0_start_0 on lincl2-hb 'unknown error' (1): call=142,
> status=Timed Out, exitreason='Failed to access httpd status page.',
> last-rc-change='Sat Feb 13 19:55:45 2016', queued=0ms, exec=120004ms

This gives a good bit of info:

* The "start" action on the "WebSite" resource failed no node lincl2-hb.

* The failure was a timeout. The start action did not return in the
configured (or default) time.

* The reason given by the apache resource agent was "Failed to access
httpd status page".

> * WebSite:2_start_0 on lincl2-hb 'unknown error' (1): call=130,
> status=Timed Out, exitreason='none',
> last-rc-change='Sat Feb 13 19:33:49 2016', queued=0ms, exec=40003ms
> * WebSite:1_monitor_6 on lincl0-hb 'unknown error' (1): call=101,
> status=complete, exitreason='Failed to access httpd status page.',
> last-rc-change='Sat Feb 13 19:53:53 2016', queued=0ms, exec=0ms
> * WebSite:0_monitor_6 on lincl0-hb 'not running' (7): call=77,
> status=complete, exitreason='none',
> last-rc-change='Sat Feb 13 19:34:48 2016', queued=0ms, exec=0ms
> * WebSite:2_start_0 on lincl1-hb 'unknown error' (1): call=41,
> status=Timed Out, exitreason='none',
> last-rc-change='Sat Feb 13 19:53:41 2016', queued=1ms, exec=120004ms
> 
> 
> PCSD Status:
>   lincl0-hb: Online
>   lincl1-hb: Online
>   lincl2-hb: Online
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> 
> 
> I'm not sure how to further troubleshoot those "Failed Actions" or how
> to clear them from the display?

Pacemaker relies on what the resource agent tells it, so when the
resource agent fails, you'll have to look at that rather than pacemaker
itself. Often, agents will print more detailed messages to the system
log. Otherwise, just verifying the resource configuration and so forth
is a good idea.

In this case, the big hint is the status page. The apache resource agent
relies on the /server-status URL to verify that apache is running.
Double-check that apache's configuration is identical on all nodes,
particularly the /server-status configuration.

Once you've addressed the root cause of a failed action, you can clear
it from the display with "pcs resource cleanup" -- see "man pcs" for the
options it takes.
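For example, to clear the failures shown above (a sketch, using the resource
name from the status output):

    pcs resource cleanup WebSite-clone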

Another good idea is (with the cluster stopped) to ensure you can start
apache manually on each node and see the server-status URL from that
node (using curl or wget or whatever).
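A quick manual check on each node might look like this (a sketch; it assumes
the agent is left at its default status URL of http://localhost/server-status):

    # With the cluster stopped on this node, start Apache by hand...
    apachectl start
    # ...then probe the same URL the resource agent checks
    curl http://localhost/server-status
    apachectl stop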

> 
> Configuration of the WebSite-clone look

Re: [ClusterLabs] Pacemaker issue when ethernet interface is pulled down

2016-02-14 Thread Digimer
On 14/02/16 09:48 AM, Debabrata Pani wrote:
> Hi Emmanuel,
> 
> Thank you for the suggestion.
> If I am getting it right, Fencing can be configured to shutdown the node
> on which the ethernet interface has gone down.
> And that appears to be a correct suggestion.
> But I have a few queries still.

Fencing works regardless of why communication with a node is lost (eth
down, hung, caught on fire...). Think of it this way: "Fencing puts a node
that has entered an unknown state into a known state" (usually 'off').

> Queries:
> * Is the test case “put down the ethernet interface” not a valid one?

Corosync reacts oddly to that. It's better to use an iptables rule to
block traffic (or crash the node with something like 'echo c >
/proc/sysrq-trigger').

> * Why is the node unable to detect that it is cut off from the cluster and
> shut the services down as per the “no-quorum-policy” configuration?

In HA, you have to assume that a lost node could be doing anything. You
can't expect it to be operating predictably (as is truly the case in the
real world... imagine bad RAM and what that does to a system). If a
system stops responding, you need an external mechanism to remove it
(IPMI, cut the power via a switched PDU, etc.).
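As an illustration, an IPMI-based fence device could be configured roughly
like this (a sketch; the address, credentials, and node name below are
placeholders, not values from this thread):

    # Define a fence device that can power-cycle node1 via its IPMI/BMC
    pcs stonith create fence-node1 fence_ipmilan \
        ipaddr=10.0.0.11 login=admin passwd=secret \
        pcmk_host_list=node1 op monitor interval=60s
    # Fencing only takes effect once stonith is enabled again
    pcs property set stonith-enabled=true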

> [...]

Re: [ClusterLabs] Clone Issue

2016-02-14 Thread Frank D. Engel, Jr.
I tried working with a few of these suggestions but the issue doesn't 
seem to be there.  All of them were configured the same way for the 
status page.


After rebooting all of the nodes, two of the ClusterIP resources wound 
up on the same node, and "relocate run ClusterIP-clone" would not 
resolve this.  I ended up taking the node with the duplicate out of the 
cluster (pcs cluster stop) and then adding it back in - this allowed 
that to run, and for some reason, the web site is on all three nodes now.


So far the cluster behavior seems a bit flaky; maybe it is something odd 
in the configuration, but while I can understand how two of the IP 
resources would wind up on the same node initially, I'm not sure why I 
would need to take a node out of the cluster like that to fix it?


In some cases I've needed to reboot the nodes multiple times to get the 
cluster to start behaving again after reboots of nodes for other 
reasons; rebooting one of the three nodes sometimes causes the 
cluster-data-clone (file system) to restart or even just be completely 
lost on all of the nodes, and I've had to reboot a few times to get it 
back.  I could understand that with two nodes down (and it should 
effectively take the filesystem down in that case), but with just one 
going down that seems to be a problem.


Still experimenting and exploring.


Thank you!



On 2/14/2016 10:23, Ken Gaillot wrote:

[...]

Re: [ClusterLabs] Clone Issue

2016-02-14 Thread Vladislav Bogdanov
"Frank D. Engel, Jr."  wrote:
>I tried working with a few of these suggestions but the issue doesn't 
>seem to be there.  All of them were configured the same way for the 
>status page.

Try to replace localhost with 127.0.0.1 in the status url param.
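With pcs that could look like this (a sketch; the stock ocf:heartbeat:apache
agent names the parameter statusurl):

    pcs resource update WebSite statusurl="http://127.0.0.1/server-status"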

>
>After rebooting all of the nodes, two of the ClusterIP resources wound 
>up on the same node, and "relocate run ClusterIP-clone" would not 

Unfortunately, with the default placement strategy, the cluster spreads
resources equally over all the nodes. You can play with utilization-based
placement: assign some attribute on every node to the number of
globally-unique clone instances, and add the utilization param
that_attribute=1 to ClusterIP.
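A sketch of that approach with a recent pcs (the attribute name "cap" is
arbitrary, and the node names are taken from the status output earlier in
the thread):

    # Only the utilization placement strategy honours these attributes
    pcs property set placement-strategy=utilization
    # Give each node capacity for exactly one clone instance
    pcs node utilization lincl0-hb cap=1
    pcs node utilization lincl1-hb cap=1
    pcs node utilization lincl2-hb cap=1
    # Each ClusterIP instance consumes one unit
    pcs resource utilization ClusterIP cap=1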

I raised this issue quite long ago, but it is not solved yet.

Last, you probably want to change your ClusterIP-related constraints, so its
instances are allocated together with the running apache instances, not
vice versa.


Best,
Vladislav

> [...]