Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-31 Thread Seth Reid
We are only using one mount, and that mount has nothing on it currently.


I have fixed the problem. Our OS is Ubuntu 16.04 LTS (Xenial). I added the
17.04 (Zesty) repo to get a newer version of Corosync. I upgraded
Corosync, which upgraded a long list of other related packages (Pacemaker
and gfs2 among them). My fencing now works properly. If a node loses
network connection to the cluster, only that node is fenced. Presumably, the
problem was a bug in one of the packages in the Xenial repo that has been
fixed in the versions in Zesty.
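
Roughly, the steps boil down to something like the following (a sketch, not
the exact commands I ran; pulling a newer release's packages onto Xenial
carries the usual mixed-release risks, so adjust to taste):

  # add the Zesty archive as an extra source (assumes the standard archive layout)
  echo "deb http://archive.ubuntu.com/ubuntu zesty main universe" | sudo tee /etc/apt/sources.list.d/zesty.list
  sudo apt-get update
  # pull corosync and friends from Zesty; these are the stock Ubuntu package names
  sudo apt-get install -t zesty corosync pacemaker gfs2-utils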

---
Seth Reid



On Fri, Mar 31, 2017 at 10:31 AM, Bob Peterson  wrote:

> - Original Message -
> | I can confirm that doing an ifdown is not the source of my corosync
> | issues.
> | My cluster is in another state, so I can't pull a cable, but I can down a
> | port on a switch. That had the exact same effect as doing an ifdown. Two
> | machines got fenced when it should have only been one.
> |
> | ---
> | Seth Reid
>
>
> Hi Seth,
>
> I don't know if your problem is the same thing I'm looking at BUT:
> I'm currently working on a fix to the GFS2 file system for a
> similar problem. The scenario is something like this:
>
> 1. Node X goes down for some reason.
> 2. Node X gets fenced by one of the other nodes.
> 3. As part of the recovery, GFS2 on all the other nodes have to
>replay the journals for all the file systems mounted on X.
> 4. GFS2 journal replay hogs the CPU, which causes corosync to be
>starved for CPU on some node (say node Y).
> 5. Since corosync on node Y was starved for CPU, it doesn't respond
>in time to the other nodes (say node Z).
> 6. Thus, node Z fences node Y.
>
> In my case, the solution is to fix GFS2 so that it does some
> "cond_resched()" (conditional schedule) statements to allow corosync
> (and dlm) to get some work done. Thus, corosync isn't starved for
> CPU and does its work, and therefore, it doesn't get fenced.
>
> I don't know if that's what is happening in your case.
> Do you have a lot of GFS2 mount points that would need recovery
> when the first fence event occurs?
> In my case, I can recreate the problem by having 60 GFS2 mount points.
>
> Hopefully I'll be sending a GFS2 patch to the cluster-devel
> mailing list for this problem soon.
>
> In testing my fix, I've periodically experienced some weirdness
> and other unexplained fencing, so maybe there's a second problem
> lurking (or maybe there's just something weird in the experimental
> kernel I'm using as a base). Hopefully testing will prove whether
> my fix to GFS2 recovery is enough or if there's another problem.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-31 Thread Bob Peterson
- Original Message -
| I can confirm that doing an ifdown is not the source of my corosync issues.
| My cluster is in another state, so I can't pull a cable, but I can down a
| port on a switch. That had the exact same effect as doing an ifdown. Two
| machines got fenced when it should have only been one.
| 
| ---
| Seth Reid
| System Operations Engineer
| Vendini, Inc.
| 415.349.7736
| sr...@vendini.com
| www.vendini.com

Hi Seth,

I don't know if your problem is the same thing I'm looking at BUT:
I'm currently working on a fix to the GFS2 file system for a
similar problem. The scenario is something like this:

1. Node X goes down for some reason.
2. Node X gets fenced by one of the other nodes.
3. As part of the recovery, GFS2 on all the other nodes have to
   replay the journals for all the file systems mounted on X.
4. GFS2 journal replay hogs the CPU, which causes corosync to be
   starved for CPU on some node (say node Y).
5. Since corosync on node Y was starved for CPU, it doesn't respond
   in time to the other nodes (say node Z).
6. Thus, node Z fences node Y.

In my case, the solution is to fix GFS2 so that it does some
"cond_resched()" (conditional schedule) statements to allow corosync
(and dlm) to get some work done. Thus, corosync isn't starved for
CPU and does its work, and therefore, it doesn't get fenced.

I don't know if that's what is happening in your case.
Do you have a lot of GFS2 mount points that would need recovery
when the first fence event occurs?
In my case, I can recreate the problem by having 60 GFS2 mount points.
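
If you want a quick read on both counts, something like this should do (a
sketch; pidstat comes from the sysstat package):

  # how many GFS2 file systems this node has mounted
  mount -t gfs2 | wc -l
  # watch corosync's CPU usage, e.g. while journal recovery is running
  pidstat -u -p $(pidof corosync) 1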

Hopefully I'll be sending a GFS2 patch to the cluster-devel
mailing list for this problem soon.

In testing my fix, I've periodically experienced some weirdness
and other unexplained fencing, so maybe there's a second problem
lurking (or maybe there's just something weird in the experimental
kernel I'm using as a base). Hopefully testing will prove whether
my fix to GFS2 recovery is enough or if there's another problem.

Regards,

Bob Peterson
Red Hat File Systems

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-31 Thread Seth Reid
I can confirm that doing an ifdown is not the source of my corosync issues.
My cluster is in another state, so I can't pull a cable, but I can down a
port on a switch. That had the exact same effect as doing an ifdown. Two
machines got fenced when it should have only been one.

---
Seth Reid
System Operations Engineer
Vendini, Inc.
415.349.7736
sr...@vendini.com
www.vendini.com


On Fri, Mar 31, 2017 at 4:12 AM, Dejan Muhamedagic 
wrote:

> Hi,
>
> On Fri, Mar 31, 2017 at 02:39:02AM -0400, Digimer wrote:
> > On 31/03/17 02:32 AM, Jan Friesse wrote:
> > >> The original message has the logs from nodes 1 and 3. Node 2, the one
> > >> that
> > >> got fenced in this test, doesn't really show much. Here are the logs
> from
> > >> it:
> > >>
> > >> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
> > >> 192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
> > >> active_time=3253 secs
> > >> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
> > >> fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
> > >> dropped=0, active_time=3253 secs
> > >> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor
> failed,
> > >> forming new configuration.
> > >> Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed,
> > >> forming
> > >> new configuration.
> > >> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network
> > >> interface
> > >> is down.
> > >
> > > This is the problem. Corosync handles ifdown really badly. If this was not
> > > intentional, it may have been caused by NetworkManager. In that case, please
> > > install the equivalent of the NetworkManager-config-server package (it's
> > > actually one file called 00-server.conf, so you can extract it from, for
> > > example, the Fedora package
> > > https://www.rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/n/NetworkManager-config-server-1.8.0-0.1.fc27.noarch.html)
> >
> > ifdown'ing corosync's interface happens a lot, intentionally or
> > otherwise.
>
> I'm not sure, but I think that it can happen only intentionally,
> i.e. through a human intervention. If there's another problem
> with the interface it doesn't disappear from the system.
>
> Thanks,
>
> Dejan
>
> > I think it is reasonable to expect corosync to handle this
> > properly. How hard would it be to make corosync resilient to this fault
> > case?
> >
> > --
> > Digimer
> > Papers and Projects: https://alteeve.com/w/
> > "I am, somehow, less interested in the weight and convolutions of
> > Einstein’s brain than in the near certainty that people of equal talent
> > have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-31 Thread Dejan Muhamedagic
Hi,

On Fri, Mar 31, 2017 at 02:39:02AM -0400, Digimer wrote:
> On 31/03/17 02:32 AM, Jan Friesse wrote:
> >> The original message has the logs from nodes 1 and 3. Node 2, the one
> >> that
> >> got fenced in this test, doesn't really show much. Here are the logs from
> >> it:
> >>
> >> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
> >> 192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
> >> active_time=3253 secs
> >> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
> >> fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
> >> dropped=0, active_time=3253 secs
> >> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor failed,
> >> forming new configuration.
> >> Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed,
> >> forming
> >> new configuration.
> >> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network
> >> interface
> >> is down.
> > 
> > This is the problem. Corosync handles ifdown really badly. If this was not
> > intentional, it may have been caused by NetworkManager. In that case, please
> > install the equivalent of the NetworkManager-config-server package (it's
> > actually one file called 00-server.conf, so you can extract it from, for example, the
> > Fedora package
> > https://www.rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/n/NetworkManager-config-server-1.8.0-0.1.fc27.noarch.html)
> 
> ifdown'ing corosync's interface happens a lot, intentionally or
> otherwise.

I'm not sure, but I think that it can happen only intentionally,
i.e. through a human intervention. If there's another problem
with the interface it doesn't disappear from the system.

Thanks,

Dejan

> I think it is reasonable to expect corosync to handle this
> properly. How hard would it be to make corosync resilient to this fault
> case?
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-31 Thread Digimer
On 31/03/17 02:32 AM, Jan Friesse wrote:
>> The original message has the logs from nodes 1 and 3. Node 2, the one
>> that
>> got fenced in this test, doesn't really show much. Here are the logs from
>> it:
>>
>> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
>> 192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
>> active_time=3253 secs
>> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
>> fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
>> dropped=0, active_time=3253 secs
>> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor failed,
>> forming new configuration.
>> Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed,
>> forming
>> new configuration.
>> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network
>> interface
>> is down.
> 
> This is the problem. Corosync handles ifdown really badly. If this was not
> intentional, it may have been caused by NetworkManager. In that case, please
> install the equivalent of the NetworkManager-config-server package (it's
> actually one file called 00-server.conf, so you can extract it from, for example, the
> Fedora package
> https://www.rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/n/NetworkManager-config-server-1.8.0-0.1.fc27.noarch.html)

ifdown'ing corosync's interface happens a lot, intentionally or
otherwise. I think it is reasonable to expect corosync to handle this
properly. How hard would it be to make corosync resilient to this fault
case?

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-31 Thread Jan Friesse

The original message has the logs from nodes 1 and 3. Node 2, the one that
got fenced in this test, doesn't really show much. Here are the logs from
it:

Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
active_time=3253 secs
Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
dropped=0, active_time=3253 secs
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor failed,
forming new configuration.
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed, forming
new configuration.
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network interface
is down.


This is the problem. Corosync handles ifdown really badly. If this was not
intentional, it may have been caused by NetworkManager. In that case, please
install the equivalent of the NetworkManager-config-server package (it's
actually one file called 00-server.conf, so you can extract it from, for
example, the Fedora package
https://www.rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/n/NetworkManager-config-server-1.8.0-0.1.fc27.noarch.html)
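
(For reference, my understanding is that the file amounts to little more than
the two settings below; a sketch you could drop into
/etc/NetworkManager/conf.d/00-server.conf, assuming your NetworkManager reads
that directory:)

  [main]
  # don't auto-create connections, and don't react to carrier loss on server NICs
  no-auto-default=*
  ignore-carrier=*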


Regards,
  Honza


Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.13}
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.14}
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.15}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] The network interface is
down.
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.13}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.14}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.15}

---
Seth Reid



On Wed, Mar 29, 2017 at 7:17 AM, Bob Peterson  wrote:


- Original Message -
| I will try to install updated packages from ubuntu 16.10 or newer. It
can't
| get worse than not working.
|
| Can you think of any logs that might help? I've enabled debug on corosync
| log, but it really doesn't show anything else other than corosync
exiting.
| Any diagnostic tools you can recommend?
|
| ---
| Seth Reid


Hi Seth,

Can you post the pertinent messages from the consoles of all nodes in the
cluster? Hopefully you were monitoring them.

Regards,

Bob Peterson
Red Hat File Systems

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-30 Thread Seth Reid
The original message has the logs from nodes 1 and 3. Node 2, the one that
got fenced in this test, doesn't really show much. Here are the logs from
it:

Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
active_time=3253 secs
Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
dropped=0, active_time=3253 secs
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor failed,
forming new configuration.
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed, forming
new configuration.
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network interface
is down.
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.13}
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.14}
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.15}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] The network interface is
down.
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.13}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.14}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.15}

---
Seth Reid



On Wed, Mar 29, 2017 at 7:17 AM, Bob Peterson  wrote:

> - Original Message -
> | I will try to install updated packages from ubuntu 16.10 or newer. It
> can't
> | get worse than not working.
> |
> | Can you think of any logs that might help? I've enabled debug on corosync
> | log, but it really doesn't show anything else other than corosync
> exiting.
> | Any diagnostic tools you can recommend?
> |
> | ---
> | Seth Reid
>
>
> Hi Seth,
>
> Can you post the pertinent messages from the consoles of all nodes in the
> cluster? Hopefully you were monitoring them.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-29 Thread Christine Caulfield
On 24/03/17 20:44, Seth Reid wrote:
> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> production yet because I'm having a problem during fencing. When I
> disable the network interface of any one machine,


If you mean by using ifdown or similar then ... don't do that. A proper
test would be to either physically pull the cable or to set up some
iptables rules to block traffic.
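
For example, on the node you want to isolate, something along these lines (a
sketch assuming corosync's default port 5405; adjust for your ring
configuration):

  iptables -A INPUT  -p udp --dport 5405 -j DROP
  iptables -A OUTPUT -p udp --dport 5405 -j DROP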

Just taking the interface down causes corosync to do odd things.

Chrissie


> the disabled machine
> is properly fenced, leaving me, briefly, with a two node cluster. A
> second node is then fenced off immediately, and the remaining node
> appears to try to fence itself off. This leaves two nodes with
> corosync/pacemaker stopped, and the remaining machine still in the
> cluster but showing an offline node and an UNCLEAN node. What can be
> causing this behavior?
> 
> Each machine has a dedicated network interface for the cluster, and
> there is a vlan on the switch devoted to just these interfaces.
> In the following, I disabled the interface on node id 2 (b014). Node 1
> (b013) is fenced as well. Node 3 (b015) is still up.
> 
> Logs from b013:
> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> /dev/null && debian-sa1 1 1)
> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> failed, forming new configuration.
> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> forming new configuration.
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership
> (192.168.100.13:576 ) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive
> the leave message. failed: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> (192.168.100.13:576 ) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
> leave message. failed: 2
> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
> the membership list
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with
> id=2 and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
> nodedown time 1490387717 fence_all dlm_stonith
> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to
> node 2
> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
> 2/(null): 0 in progress, 0 completed
> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
> kicked: reboot
> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to
> node 3
> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to
> node 1
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
> lockspaces
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
> code=exited, status=255/n/a
> Mar 24 16:35:18 b013 cib[2220]:error: Connection to the CPG API
> failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered failed
> state.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to cib_rw failed
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
> 'exit-code'.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to
> cib_rw[0x560754147990] closed (I/O condition=17)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Main process exited,
> code=exited, status=107/n/a
> Mar 24 16:35:18 b013 pacemakerd[2187]:error: Connection to the CPG
> API failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Unit entered failed

Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-29 Thread Bob Peterson
- Original Message -
| I will try to install updated packages from ubuntu 16.10 or newer. It can't
| get worse than not working.
| 
| Can you think of any logs that might help? I've enabled debug on corosync
| log, but it really doesn't show anything else other than corosync exiting.
| Any diagnostic tools you can recommend?
| 
| ---
| Seth Reid
| System Operations Engineer
| Vendini, Inc.
| 415.349.7736
| sr...@vendini.com
| www.vendini.com

Hi Seth,

Can you post the pertinent messages from the consoles of all nodes in the
cluster? Hopefully you were monitoring them.

Regards,

Bob Peterson
Red Hat File Systems

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-29 Thread Dejan Muhamedagic
On Fri, Mar 24, 2017 at 01:44:44PM -0700, Seth Reid wrote:
> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> production yet because I'm having a problem during fencing. When I disable
> the network interface of any one machine, the disabled machine is properly
> fenced, leaving me, briefly, with a two node cluster. A second node is then
> fenced off immediately, and the remaining node appears to try to fence
> itself off. This leaves two nodes with corosync/pacemaker stopped, and the
> remaining machine still in the cluster but showing an offline node and an
> UNCLEAN node. What can be causing this behavior?

Man, you've got a very suicidal cluster. Is it depressed? Did you
try psychotherapy?

Otherwise, it looks like corosync crashed. Maybe look for core
dumps. Also, there should be another log in
/var/log/pacemaker.log (or similar) with lower severity messages.
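
Something like this is usually enough to check (a sketch; coredumpctl is only
available where systemd-coredump is in use, and the log path varies by
distribution):

  # any recorded corosync crashes?
  coredumpctl list corosync
  # lower-severity pacemaker messages around the incident
  grep -iE 'error|crit' /var/log/pacemaker.log | tail -n 50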

Thanks,

Dejan

> Each machine has a dedicated network interface for the cluster, and there
> is a vlan on the switch devoted to just these interfaces.
> In the following, I disabled the interface on node id 2 (b014). Node 1
> (b013) is fenced as well. Node 3 (b015) is still up.
> 
> Logs from b013:
> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> /dev/null && debian-sa1 1 1)
> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor failed,
> forming new configuration.
> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed, forming
> new configuration.
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership (
> 192.168.100.13:576) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive the
> leave message. failed: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership (
> 192.168.100.13:576) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the leave
> message. failed: 2
> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2 and/or
> uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2 and/or
> uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
> the membership list
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
> nodedown time 1490387717 fence_all dlm_stonith
> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to node
> 2
> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
> 2/(null): 0 in progress, 0 completed
> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
> kicked: reboot
> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to node
> 3
> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to node
> 1
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
> lockspaces
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
> code=exited, status=255/n/a
> Mar 24 16:35:18 b013 cib[2220]:error: Connection to the CPG API failed:
> Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered failed
> state.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to cib_rw failed
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
> 'exit-code'.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to
> cib_rw[0x560754147990] closed (I/O condition=17)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Main process exited,
> code=exited, status=107/n/a
> Mar 24 16:35:18 b013 pacemakerd[2187]:error: Connection to the CPG API
> failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Unit entered failed
> state.

Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-28 Thread Seth Reid
I will try to install updated packages from ubuntu 16.10 or newer. It can't
get worse than not working.

Can you think of any logs that might help? I've enabled debug on corosync
log, but it really doesn't show anything else other than corosync exiting.
Any diagnostic tools you can recommend?

---
Seth Reid
System Operations Engineer
Vendini, Inc.
415.349.7736
sr...@vendini.com
www.vendini.com


On Mon, Mar 27, 2017 at 3:10 PM, Ken Gaillot  wrote:

> On 03/27/2017 03:54 PM, Seth Reid wrote:
> >
> >
> >
> > On Fri, Mar 24, 2017 at 2:10 PM, Ken Gaillot  > > wrote:
> >
> > On 03/24/2017 03:52 PM, Digimer wrote:
> > > On 24/03/17 04:44 PM, Seth Reid wrote:
> > >> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> > >> production yet because I'm having a problem during fencing. When I
> > >> disable the network interface of any one machine, the disabled machine
> > >> is properly fenced, leaving me, briefly, with a two node cluster. A
> > >> second node is then fenced off immediately, and the remaining node
> > >> appears to try to fence itself off. This leaves two nodes with
> > >> corosync/pacemaker stopped, and the remaining machine still in the
> > >> cluster but showing an offline node and an UNCLEAN node. What can be
> > >> causing this behavior?
> > >
> > > It looks like the fence attempt failed, leaving the cluster hung.
> When
> > > you say all nodes were fenced, did all nodes actually reboot? Or
> did the
> > > two surviving nodes just lock up? If the latter, then that is the
> proper
> > > response to a failed fence (DLM stays blocked).
> >
> > See comments inline ...
> >
> > >
> > >> Each machine has a dedicated network interface for the cluster,
> and
> > >> there is a vlan on the switch devoted to just these interfaces.
> > >> In the following, I disabled the interface on node id 2 (b014).
> > Node 1
> > >> (b013) is fenced as well. Node 3 (b015) is still up.
> > >>
> > >> Logs from b013:
> > >> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v
> debian-sa1 >
> > >> /dev/null && debian-sa1 1 1)
> > >> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> > >> failed, forming new configuration.
> > >> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> > >> forming new configuration.
> > >> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new
> > membership
> > >> (192.168.100.13:576 
> > ) was formed. Members left: 2
> > >> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to
> > receive
> > >> the leave message. failed: 2
> > >> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> > >> (192.168.100.13:576 
> > ) was formed. Members left: 2
> > >> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive
> the
> > >> leave message. failed: 2
> > >> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc:
> > Node
> > >> b014-cl[2] - state is now lost (was member)
> > >> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc:
> Node
> > >> b014-cl[2] - state is now lost (was member)
> > >> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from
> the
> > >> membership list
> > >> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> > >> and/or uname=b014-cl from the membership cache
> > >> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice:
> > crm_reap_unseen_nodes:
> > >> Node b014-cl[2] - state is now lost (was member)
> > >> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2
> > from the
> > >> membership list
> > >> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with
> id=2
> > >> and/or uname=b014-cl from the membership cache
> > >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice:
> > crm_update_peer_proc:
> > >> Node b014-cl[2] - state is now lost (was member)
> > >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing
> > b014-cl/2 from
> > >> the membership list
> > >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers
> with
> > >> id=2 and/or uname=b014-cl from the membership cache
> > >> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid
> > 19223
> > >> nodedown time 1490387717 fence_all dlm_stonith
> > >> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing
> > connection to
> > >> node 2
> > >> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes:
> > Node
> > >> b014-cl[2] - state is now lost (was member)
> > >> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0
> > 

Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-27 Thread Ken Gaillot
On 03/27/2017 03:54 PM, Seth Reid wrote:
> 
> 
> 
> On Fri, Mar 24, 2017 at 2:10 PM, Ken Gaillot  > wrote:
> 
> On 03/24/2017 03:52 PM, Digimer wrote:
> > On 24/03/17 04:44 PM, Seth Reid wrote:
> >> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> >> production yet because I'm having a problem during fencing. When I
> >> disable the network interface of any one machine, the disabled machine
> >> is properly fenced, leaving me, briefly, with a two node cluster. A
> >> second node is then fenced off immediately, and the remaining node
> >> appears to try to fence itself off. This leaves two nodes with
> >> corosync/pacemaker stopped, and the remaining machine still in the
> >> cluster but showing an offline node and an UNCLEAN node. What can be
> >> causing this behavior?
> >
> > It looks like the fence attempt failed, leaving the cluster hung. When
> > you say all nodes were fenced, did all nodes actually reboot? Or did the
> > two surviving nodes just lock up? If the latter, then that is the proper
> > response to a failed fence (DLM stays blocked).
> 
> See comments inline ...
> 
> >
> >> Each machine has a dedicated network interface for the cluster, and
> >> there is a vlan on the switch devoted to just these interfaces.
> >> In the following, I disabled the interface on node id 2 (b014).
> Node 1
> >> (b013) is fenced as well. Node 3 (b015) is still up.
> >>
> >> Logs from b013:
> >> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> >> /dev/null && debian-sa1 1 1)
> >> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> >> failed, forming new configuration.
> >> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> >> forming new configuration.
> >> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new
> membership
> >> (192.168.100.13:576 
> ) was formed. Members left: 2
> >> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to
> receive
> >> the leave message. failed: 2
> >> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> >> (192.168.100.13:576 
> ) was formed. Members left: 2
> >> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
> >> leave message. failed: 2
> >> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc:
> Node
> >> b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> >> b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> >> membership list
> >> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> >> and/or uname=b014-cl from the membership cache
> >> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice:
> crm_reap_unseen_nodes:
> >> Node b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2
> from the
> >> membership list
> >> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
> >> and/or uname=b014-cl from the membership cache
> >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice:
> crm_update_peer_proc:
> >> Node b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing
> b014-cl/2 from
> >> the membership list
> >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with
> >> id=2 and/or uname=b014-cl from the membership cache
> >> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid
> 19223
> >> nodedown time 1490387717 fence_all dlm_stonith
> >> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing
> connection to
> >> node 2
> >> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes:
> Node
> >> b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0
> entries for
> >> 2/(null): 0 in progress, 0 completed
> >> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
> >> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
> >> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node
> 2/(null)
> >> kicked: reboot
> 
> It looks like the fencing of b014-cl is reported as successful above ...
> 
> >> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to
> >> node 3
> >> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to
> >> node 1
> >> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace 
> 

Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-27 Thread Seth Reid
On Fri, Mar 24, 2017 at 2:10 PM, Ken Gaillot  wrote:

> On 03/24/2017 03:52 PM, Digimer wrote:
> > On 24/03/17 04:44 PM, Seth Reid wrote:
> >> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> >> production yet because I'm having a problem during fencing. When I
> >> disable the network interface of any one machine, the disabled machine
> >> is properly fenced, leaving me, briefly, with a two node cluster. A
> >> second node is then fenced off immediately, and the remaining node
> >> appears to try to fence itself off. This leaves two nodes with
> >> corosync/pacemaker stopped, and the remaining machine still in the
> >> cluster but showing an offline node and an UNCLEAN node. What can be
> >> causing this behavior?
> >
> > It looks like the fence attempt failed, leaving the cluster hung. When
> > you say all nodes were fenced, did all nodes actually reboot? Or did the
> > two surviving nodes just lock up? If the latter, then that is the proper
> > response to a failed fence (DLM stays blocked).
>
> See comments inline ...
>
> >
> >> Each machine has a dedicated network interface for the cluster, and
> >> there is a vlan on the switch devoted to just these interfaces.
> >> In the following, I disabled the interface on node id 2 (b014). Node 1
> >> (b013) is fenced as well. Node 3 (b015) is still up.
> >>
> >> Logs from b013:
> >> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> >> /dev/null && debian-sa1 1 1)
> >> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> >> failed, forming new configuration.
> >> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> >> forming new configuration.
> >> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership
> >> (192.168.100.13:576 ) was formed. Members
> left: 2
> >> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive
> >> the leave message. failed: 2
> >> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> >> (192.168.100.13:576 ) was formed. Members
> left: 2
> >> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
> >> leave message. failed: 2
> >> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> >> b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> >> b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> >> membership list
> >> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> >> and/or uname=b014-cl from the membership cache
> >> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> >> Node b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> >> membership list
> >> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
> >> and/or uname=b014-cl from the membership cache
> >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc:
> >> Node b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
> >> the membership list
> >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with
> >> id=2 and/or uname=b014-cl from the membership cache
> >> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
> >> nodedown time 1490387717 fence_all dlm_stonith
> >> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to
> >> node 2
> >> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
> >> b014-cl[2] - state is now lost (was member)
> >> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
> >> 2/(null): 0 in progress, 0 completed
> >> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
> >> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
> >> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
> >> kicked: reboot
>
> It looks like the fencing of b014-cl is reported as successful above ...
>
> >> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to
> >> node 3
> >> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to
> >> node 1
> >> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace
> share_data
> >> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
> >> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
> >> lockspaces
> >> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
> >> code=exited, status=255/n/a
>
> ... but then DLM and corosync exit on this node. Pacemaker can only
> exit, and the node gets fenced.
>
> What does your fencing configuration look like?
>

This is the command I used. b013-cl, for example, is a hosts file entry so
that the cluster

Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-24 Thread Seth Reid
> On 24/03/17 04:44 PM, Seth Reid wrote:
> > I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> > production yet because I'm having a problem during fencing. When I
> > disable the network interface of any one machine, the disabled machine
> > is properly fenced, leaving me, briefly, with a two node cluster. A
> > second node is then fenced off immediately, and the remaining node
> > appears to try to fence itself off. This leaves two nodes with
> > corosync/pacemaker stopped, and the remaining machine still in the
> > cluster but showing an offline node and an UNCLEAN node. What can be
> > causing this behavior?
>
> It looks like the fence attempt failed, leaving the cluster hung. When
> you say all nodes were fenced, did all nodes actually reboot? Or did the
> two surviving nodes just lock up? If the latter, then that is the proper
> response to a failed fence (DLM stays blocked).
>

The action is "off", so we aren't rebooting. The logs do still say reboot,
though. In terms of actual fencing, only node 2 gets fenced, in that its
keys get removed from the shared volume. Node 1's keys don't get removed, so
that is the failed fence. Node 2's fence succeeds.

Of the remaining nodes, node 1 is offline in that corosync and pacemaker
are no longer running, so it can't access cluster resources. Node 3 shows
node 1 as online but in an UNCLEAN state. Neither node 1 nor node 3 can write
to the cluster, but node 3 still has corosync and pacemaker running.
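
You can check which nodes still have keys registered on the shared device with
something like this (a sketch, assuming sg_persist from sg3-utils and the
device name used below):

  sg_persist --in --read-keys --device=/dev/mapper/share-data
  sg_persist --in --read-reservation --device=/dev/mapper/share-data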

Here are the commands I used to build the cluster. I meant to put these in
the original post.

(single machine)$> pcs property set no-quorum-policy=freeze
(single machine)$> pcs property set stonith-enabled=true
(single machine)$> pcs property set symmetric-cluster=true
(single machine)$> pcs cluster enable --all
(single machine)$> pcs stonith create fence_wh fence_scsi
debug="/var/log/cluster/fence-debug.log" vgs_path="/sbin/vgs"
sg_persist_path="/usr/bin/sg_persist" sg_turs_path="/usr/bin/sg_turs"
pcmk_reboot_action="off" pcmk_host_list="b013-cl b014-cl b015-cl"
pcmk_monitor_action="metadata" meta provides="unfencing" --force
(single machine)$> pcs resource create dlm ocf:pacemaker:controld op
monitor interval=30s on-fail=fence clone interleave=true ordered=true
(single machine)$> pcs resource create clvmd ocf:heartbeat:clvm op monitor
interval=30s on-fail=fence clone interleave=true ordered=true
(single machine)$> pcs constraint order start dlm-clone then clvmd-clone
(single machine)$> pcs constraint colocation add clvmd-clone with dlm-clone
(single machine)$> mkfs.gfs2 -p lock_dlm -t webhosts:share_data -j 3
/dev/mapper/share-data
(single machine)$> pcs resource create gfs2share Filesystem
device="/dev/mapper/share-data" directory="/share" fstype="gfs2"
options="noatime,nodiratime" op monitor interval=10s on-fail=fence clone
interleave=true
(single machine)$> pcs constraint order start clvmd-clone then
gfs2share-clone
(single machine)$> pcs constraint colocation add gfs2share-clone with
clvmd-clone
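
To exercise the fence path directly (instead of downing an interface),
something like this should work from any surviving node (a sketch; the node
name has to match pcmk_host_list):

(single machine)$> pcs stonith fence b014-cl
(single machine)$> pcs status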


>
> > Each machine has a dedicated network interface for the cluster, and
> > there is a vlan on the switch devoted to just these interfaces.
> > In the following, I disabled the interface on node id 2 (b014). Node 1
> > (b013) is fenced as well. Node 3 (b015) is still up.
> >
> > Logs from b013:
> > Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> > /dev/null && debian-sa1 1 1)
> > Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> > failed, forming new configuration.
> > Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> > forming new configuration.
> > Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership
> > (192.168.100.13:576 ) was formed. Members
> left: 2
> > Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive
> > the leave message. failed: 2
> > Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> > (192.168.100.13:576 ) was formed. Members
> left: 2
> > Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
> > leave message. failed: 2
> > Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> > b014-cl[2] - state is now lost (was member)
> > Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> > b014-cl[2] - state is now lost (was member)
> > Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> > membership list
> > Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> > and/or uname=b014-cl from the membership cache
> > Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> > Node b014-cl[2] - state is now lost (was member)
> > Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> > membership list
> > Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
> > and/or uname=b014-cl from the membership cache
> > Mar 24 16:35:17 b013 

Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-24 Thread Digimer
On 24/03/17 04:44 PM, Seth Reid wrote:
> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> production yet because I'm having a problem during fencing. When I
> disable the network interface of any one machine, the disabled machine
> is properly fenced, leaving me, briefly, with a two node cluster. A
> second node is then fenced off immediately, and the remaining node
> appears to try to fence itself off. This leaves two nodes with
> corosync/pacemaker stopped, and the remaining machine still in the
> cluster but showing an offline node and an UNCLEAN node. What can be
> causing this behavior?

It looks like the fence attempt failed, leaving the cluster hung. When
you say all nodes were fenced, did all nodes actually reboot? Or did the
two surviving nodes just lock up? If the latter, then that is the proper
response to a failed fence (DLM stays blocked).

> Each machine has a dedicated network interface for the cluster, and
> there is a vlan on the switch devoted to just these interfaces.
> In the following, I disabled the interface on node id 2 (b014). Node 1
> (b013) is fenced as well. Node 3 (b015) is still up.
> 
> Logs from b013:
> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> /dev/null && debian-sa1 1 1)
> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> failed, forming new configuration.
> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> forming new configuration.
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership
> (192.168.100.13:576 ) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive
> the leave message. failed: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> (192.168.100.13:576 ) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
> leave message. failed: 2
> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
> the membership list
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with
> id=2 and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
> nodedown time 1490387717 fence_all dlm_stonith
> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to
> node 2
> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
> 2/(null): 0 in progress, 0 completed
> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
> kicked: reboot
> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to
> node 3
> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to
> node 1
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
> lockspaces
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
> code=exited, status=255/n/a
> Mar 24 16:35:18 b013 cib[2220]:error: Connection to the CPG API
> failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered failed
> state.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to cib_rw failed
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
> 'exit-code'.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to
> cib_rw[0x560754147990] closed (I/O condition=17)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Main process exited,
> code=exited, status=107/n/a
> Mar 24 16:35:18 b013 pacemakerd[2187]:error: Connection to the CPG
> API failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Unit