Hi Jan Friesse,
I studied the issue mentioned in the github url.
It looks like the crash I am talking about is slightly different from the
one mentioned in the original issue. Maybe they are related, but I would
like to highlight my setup first for clarity.
Three-node cluster, with one node in maintenance mode to prevent any
scheduling of resources.
=====
Stack: classic openais (with plugin)
^^ I'm pretty sure you don't want to use plugin based pcmk.
Current DC: vm2cen66.mobileum.com - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured, 3 expected votes
6 Resources configured
Node vm3cent66.mobileum.com: maintenance
Online: [ vm1cen66.mobileum.com vm2cen66.mobileum.com ]
====
I log in to vm1cen66 and do `ifdown eth0`.
On vm1cen66, I don't see any change in the crm_mon -Afr output.
It remains the same, as shown below:
====
Stack: classic openais (with plugin)
Current DC: vm2cen66.mobileum.com - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured, 3 expected votes
6 Resources configured
Node vm3cent66.mobileum.com: maintenance
Online: [ vm1cen66.mobileum.com vm2cen66.mobileum.com ]
===
But if we log in to the other nodes, like vm2cen66 and vm3cent66, we can
correctly see that the node vm1cen66 is offline.
That is expected.
But if we look into the corosync.log of vm1cen66, we see the following:
===
Mar 28 14:55:09 corosync [MAIN ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
pgsql(TestPostgresql)[28203]: 2016/03/28_14:55:10 INFO: Master does not
exist.
pgsql(TestPostgresql)[28203]: 2016/03/28_14:55:10 WARNING: My data is
out-of-date. status=DISCONNECT
Mar 28 14:55:11 corosync [MAIN ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
Mar 28 14:55:12 corosync [MAIN ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
======
This is a result of the ifdown. Just don't do that.
What exact version of corosync are you using?
The pgsql resource (the postgresql resource agent) is running on this
particular node. I did a pgrep of the process and found it running. I am
not attaching the logs for now.
The "crash" happens when the ethernet interface is brought back up.
vm1cen66 is unable to reconnect to the cluster because corosync has
crashed, taking some of the pacemaker processes along with it.
crm_mon also stops working (it was working previously, before the
interface was brought up).
I have to restart the corosync and pacemaker services to make it work
again.
That's why I keep saying don't do ifdown.
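If you need to simulate a network failure for testing, blocking the
corosync traffic with iptables is the safer way. A minimal sketch,
assuming the default mcastport 5405 (check your corosync.conf for the
port actually in use):
===
# block corosync traffic instead of taking the interface down
iptables -A INPUT -p udp --dport 5405 -j DROP
iptables -A OUTPUT -p udp --dport 5405 -j DROP
# and to "heal" the network again, delete the same rules
iptables -D INPUT -p udp --dport 5405 -j DROP
iptables -D OUTPUT -p udp --dport 5405 -j DROP
===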
The main observation is that the node where the ethernet interface is
down does not really "get" it. It assumes that the other nodes are still
online, although the logs do say that the interface is down.
Queries/Observations:
1- node vm1cen66 should realise that the other nodes are offline
That would be correct behavior, yes.
2- From the discussion in the github issue, it seems that in case of an
ethernet failure we want it to run as a single-node setup. Is that so?
Not exactly. It should behave as if all the other nodes had gone down.
2a. If that is the case, will it honour no-quorum-policy=ignore and stop
processes?
2b. Or will it assume that it is a single-node cluster and decide
accordingly?
3- After taking the interface down, if we grep for the corosync port in
the netstat output, we see that the corosync process has now bound to the
loopback interface. Previously it was bound to the IP on eth0.
Is this expected? As per the discussion it should be so. But the crash
did not happen immediately; it happens when we bring the ethernet
interface back up.
This is expected.
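For reference, you can watch the binding change like this (just a
sketch; 5405 is the default corosync mcastport, adjust if yours differs):
===
# run as root so the owning process name is shown
netstat -anup | grep corosync
# before the ifdown the socket shows the eth0 address (e.g. <eth0-ip>:5405)
# after the ifdown it shows 127.0.0.1:5405
===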
If corosync did crash, why were we still observing the messages in
corosync.log?
4- Is it possible to prevent the corosync crash that we witnessed when
the ethernet interface is brought back up?
Nope. Just don't do ifdown.
5- Will preventing the corosync crash really matter? Because the node
vm1cen66 has now converted into a single-node cluster? Or will it
automatically re-bind to eth0 when the interface is brought up?
(Could not verify because of the crash.)
It rebinds to eth0, sends wrong information to the other nodes and
totally destroys the membership. Again, just don't do ifdown.
6- What about the split-brain situation due to pacemaker not shutting
down the services on that single node?
In a master-slave configuration this causes some confusion as to which
instance should be made the master after the node joins back.
As per the suggestion from the group, we need to configure stonith for
it. Configuring stonith seems to be the topmost priority in pacemaker
clusters.
It's not exactly the topmost priority, but it's an easy way to solve many
problems.
But as far as I gather, we need specialised hardware for this?
I believe there are also SW-based stonith agents (even though not that
reliable, so not exactly recommended). Also, most servers have at
least IPMI.
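For example, with IPMI a fencing resource could look roughly like this.
This is only a sketch using the crm shell and the fence_ipmilan agent;
the address, credentials and host list below are placeholders, and the
exact parameter names can differ between fence-agents versions:
===
crm configure primitive fence-vm1 stonith:fence_ipmilan \
    params ipaddr="<ipmi-address-of-vm1>" login="<user>" passwd="<password>" \
           pcmk_host_list="vm1cen66.mobileum.com" \
    op monitor interval="60s"
crm configure property stonith-enabled=true
===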
And one last recommendation: don't do ifdown.
Regards,
Honza
Regards,
Debabrata Pani
On 03/03/16 13:46, "Jan Friesse" <jfrie...@redhat.com> wrote:
Hi,
In our deployment, due to some requirement, we need to do a:
service network restart
What is the exact reason for doing the network restart?
Due to this, corosync crashes and the associated pacemaker processes
crash as well.
As per the last comment on this issue,
-------
Corosync reacts oddly to that. It's better to use an iptables rule to
block traffic (or crash the node with something like 'echo c >
/proc/sysrq-trigge
--------
But other network services, like Postgres, do not crash due to this
network service restart:
I can log in to psql and issue queries without any problem.
In view of this, I would like to understand whether it is possible to
prevent the corosync (and the corresponding Pacemaker) crash, since
Postgres somehow survives this restart.
Any pointer to socket-level details for this behaviour will help me
understand (and explain to the stakeholders) the problem better.
https://github.com/corosync/corosync/pull/32 should help.
Regards,
Honza
Regards,
Debabrata Pani
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org