On 21/12/15 04:32 PM, Ludovic Zammit wrote:
> Hello,
>
> I'm running a CentOS 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
> Every day at 11 PM a snapshot job saves both servers.
> The snapshotting process seems to cause a loss of connectivity between
> the two nodes, which results in the cluster partitioning and pacemaker
> starting services on both nodes.
You should have stonith enabled, configured and tested.
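If you manage the cluster with pcs, that looks something like the
sketch below. The agent, address and credentials are placeholders,
not a recommendation; pick a fence agent that can actually power off
a Hyper-V guest in your environment:

    # Define one fence device per node. fence_ipmilan is only an
    # illustration; substitute whatever agent fits your setup.
    pcs stonith create fence-node1 fence_ipmilan \
        pcmk_host_list="node1" ipaddr="10.200.0.10" \
        login="admin" passwd="secret" op monitor interval=60s

    # Make sure pacemaker actually uses fencing
    pcs property set stonith-enabled=true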
> Then once the snapshotting is done, the two halves of the cluster are
> able to see each other again and pacemaker chooses one on which to run
> the services.
> Unfortunately that means that our DRBD partition has been mounted on
> both, so it now goes into "split-brain" mode.
Hook DRBD's fencing into pacemaker's with the crm-{un,}fence-peer.sh
{un,}fence handlers and set fencing to 'resource-and-stonith'. This will
prevent split-brains, regardless of the root cause.
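In drbd.conf terms that looks roughly like this (a sketch, assuming
DRBD 8.x and a resource named "r0"; the handler paths are where
drbd-utils normally installs the scripts, so verify them on your
systems):

    resource r0 {
        disk {
            # On loss of connection, freeze I/O and call fence-peer
            fencing resource-and-stonith;
        }
        handlers {
            # Adds a constraint to the CIB so the peer can't be
            # promoted while the nodes are disconnected
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # Removes that constraint once the resync completes
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }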
> When I was running corosync 1.4, I used to adjust the "token" variable
> in the configuration file so that both nodes would wait longer before
> detecting a loss of the other.
>
> Now that I have upgraded to corosync 2 (2.3.5 to be more precise) the
> problem is back with a vengeance.
This is not a supported configuration on EL6, so I'm not surprised
that you're seeing issues. In any case, fix stonith first and
foremost. Then sort out the reason for corosync blocking. Note that
"Process pause detected" means the corosync process itself wasn't
scheduled for that long, which is consistent with the whole VM being
frozen during the snapshot; no token value can bridge a pause of the
process itself.
> I have tried the configuration below, with a very high totem value,
> and that resulted in the following errors (I have since reverted that
> change):
>
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464199 ms, flushing membership messages.
>
>
> What can I do to prevent the cluster splitting apart during those
> nightly snapshots?
> How do I manually set a long totem timeout without breaking everything else?
Snapshots are generally a poor way to handle backups. The images
created are point-in-time copies, *without* whatever was in RAM, so
restoring one is effectively like recovering from a sudden power
loss. *Usually* OK, but if something goes so wrong that you need to
recover from backup, "usually" isn't good enough. So before anything
else, I would reconsider the snapshots entirely.
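To answer the totem question anyway: the token timeout is still set
in the totem section of corosync.conf, same as under 1.4. A sketch
with illustrative values only (tune to your environment):

    totem {
        # Milliseconds without the token before declaring it lost
        token: 10000
        # Token retransmit attempts before declaring loss
        token_retransmits_before_loss_const: 10
        # Must exceed token; corosync defaults it to 1.2 * token
        consensus: 12000
    }

Corosync derives the other timers from token by default, so in most
cases changing token alone is enough.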
> ======================================================================
>
> Software version:
> 2.6.32-573.7.1.el6.x86_64
>
> corosync-2.3.5-1.el6.x86_64
> corosynclib-2.3.5-1.el6.x86_64
>
> pacemaker-cluster-libs-1.1.13-1.el6.x86_64
> pacemaker-cli-1.1.13-1.el6.x86_64
>
> kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
> microsoft-hyper-v-4.0.11-20150728.x86_64
>
> Configuration:
>
> totem {
> version: 2
>
> crypto_cipher: none
> crypto_hash: none
> clear_node_high_bit: yes
> cluster_name: cluster
> transport: udpu
> token: 150000
>
> interface {
> ringnumber: 0
> bindnetaddr: 10.200.0.2
> mcastport: 5405
> ttl: 1
> }
> }
>
> nodelist {
> node {
> ring0_addr: 10.200.0.2
> }
>
> node {
> ring0_addr: 10.200.0.3
> }
> }
>
> logging {
> fileline: on
> to_stderr: no
> to_logfile: yes
> logfile: /var/log/cluster/corosync.log
> to_syslog: yes
> debug: off
> timestamp: on
> logger_subsys {
> subsys: QUORUM
> debug: off
> }
> }
>
>
> quorum {
> provider: corosync_votequorum
> two_node: 1
> }
>
>
>
> Thank you for your help,
> —
>
> Ludovic Zammit
>
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss