On 21/12/15 04:32 PM, Ludovic Zammit wrote:
> Hello,
>
> I'm running a CentOS 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
> Every day at 11 PM a snapshot job saves both servers.
> The snapshotting process seems to cause a loss of connectivity between
> the two nodes, which results in the cluster partitioning and pacemaker
> starting services on both nodes.
You should have stonith enabled, configured and tested.
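If you manage the cluster with pcs, that looks something like the
sketch below. The agent, address and credentials are placeholders,
not a recommendation; pick a fence agent that can actually power off
a Hyper-V guest in your environment:

    # Define one fence device per node. fence_ipmilan is only an
    # illustration; substitute whatever agent fits your setup.
    pcs stonith create fence-node1 fence_ipmilan \
        pcmk_host_list="node1" ipaddr="10.200.0.10" \
        login="admin" passwd="secret" op monitor interval=60s

    # Make sure pacemaker actually uses fencing
    pcs property set stonith-enabled=true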
> Then once the snapshotting is done, the two halves of the cluster are
> able to see each other again and pacemaker chooses one on which to run
> the services.
> Unfortunately that means that our DRBD partition has been mounted on
> both, so it now goes into "split-brain" mode.
Hook DRBD's fencing into pacemaker's with the crm-{un,}fence-peer.sh
{un,}fence handlers and set fencing to 'resource-and-stonith'. This will
prevent split-brains, regardless of the root cause.
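In drbd.conf terms that looks roughly like this (a sketch, assuming
DRBD 8.x and a resource named "r0"; the handler paths are where
drbd-utils normally installs the scripts, so verify them on your
systems):

    resource r0 {
        disk {
            # On loss of connection, freeze I/O and call fence-peer
            fencing resource-and-stonith;
        }
        handlers {
            # Adds a constraint to the CIB so the peer can't be
            # promoted while the nodes are disconnected
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # Removes that constraint once the resync completes
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }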
> When I was running corosync 1.4, I used to adjust the "token" variable
> in the configuration file so that both nodes would wait longer before
> detecting a loss of the other.
>
> Now that I have upgraded to corosync 2 (2.3.5 to be more precise) the
> problem is back with a vengeance.
This is not a supported configuration on EL6, so I'm not surprised
that you're seeing issues. In any case, fix stonith first and
foremost. Then sort out the reason for corosync blocking. Note that
"Process pause detected" means the corosync process itself wasn't
scheduled for that long, which is consistent with the whole VM being
frozen during the snapshot; no token value can bridge a pause of the
process itself.
> I have tried the configuration below, with a very high totem value,
> and that resulted in the following errors (I have since reverted that
> change):
>
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464199 ms, flushing membership messages.
>
>
> What can I do to prevent the cluster splitting apart during those
> nightly snapshots?
> How do I manually set a long totem timeout without breaking everything else?
Snapshots are generally a poor way to handle backups. The images
created are point-in-time copies, *without* whatever was in RAM, so
restoring one is effectively like recovering from a sudden power
loss. *Usually* OK, but if something goes so wrong that you need to
recover from backup, "usually" isn't good enough. So before anything
else, I would reconsider the snapshots entirely.
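To answer the totem question anyway: the token timeout is still set
in the totem section of corosync.conf, same as under 1.4. A sketch
with illustrative values only (tune to your environment):

    totem {
        # Milliseconds without the token before declaring it lost
        token: 10000
        # Token retransmit attempts before declaring loss
        token_retransmits_before_loss_const: 10
        # Must exceed token; corosync defaults it to 1.2 * token
        consensus: 12000
    }

Corosync derives the other timers from token by default, so in most
cases changing token alone is enough.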
> ======================================================================
>
> Software version:
> 2.6.32-573.7.1.el6.x86_64
>
> corosync-2.3.5-1.el6.x86_64
> corosynclib-2.3.5-1.el6.x86_64
>
> pacemaker-cluster-libs-1.1.13-1.el6.x86_64
> pacemaker-cli-1.1.13-1.el6.x86_64
>
> kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
> microsoft-hyper-v-4.0.11-20150728.x86_64
>
> Configuration:
>
> totem {
> version: 2
>
> crypto_cipher: none
> crypto_hash: none
> clear_node_high_bit: yes
> cluster_name: cluster
> transport: udpu
> token: 150000
>
> interface {
> ringnumber: 0
> bindnetaddr: 10.200.0.2
> mcastport: 5405
> ttl: 1
> }
> }
>
> nodelist {
> node {
> ring0_addr: 10.200.0.2
> }
>
> node {
> ring0_addr: 10.200.0.3
> }
> }
>
> logging {
> fileline: on
> to_stderr: no
> to_logfile: yes
> logfile: /var/log/cluster/corosync.log
> to_syslog: yes
> debug: off
> timestamp: on
> logger_subsys {
> subsys: QUORUM
> debug: off
> }
> }
>
>
> quorum {
> provider: corosync_votequorum
> two_node: 1
> }
>
>
>
> Thank you for your help,
> —
>
> Ludovic Zammit
>
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss