Carlos,

Increasing the corosync timeouts and the 'monitor' action timeouts in Pacemaker might help, but do you have a separate, dedicated network connection for corosync? It is better to connect your servers directly with a crossover cable (to be independent of switches and other network infrastructure) and use that connection for intra-cluster communication.

Best regards,
Alex

07.02.2013 03:07, Andrew Beekhof:
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] CLM CONFIGURATION CHANGE
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] New Configuration:
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Left:
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Joined:

This appears to be the (almost) root of your problem.
The load is starving corosync of CPU (or possibly network bandwidth)
and it can no longer talk to its peer.
Corosync then informs Pacemaker, which initiates recovery.

I'd start by tuning some of the timeout values in corosync.conf.
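
As a rough illustration of what that tuning could look like, here is a sketch of the totem section of corosync.conf. The numbers are assumptions chosen for the example, not values taken from this thread or recommended for this cluster; they would need to be tested under the load that triggers the membership change.

```
totem {
    version: 2

    # How long (in milliseconds) to wait for the token before
    # declaring it lost. The value here is an illustrative assumption.
    token: 10000

    # How many token retransmits are attempted before a new
    # configuration is formed. Also an illustrative assumption.
    token_retransmits_before_loss_const: 10

    # consensus is left unset on purpose: corosync then derives it
    # as 1.2 * token. If set explicitly it must be at least that.
}
```

The same values should be used on every node, and corosync has to be restarted for them to take effect. A longer token timeout makes the cluster more tolerant of load spikes, at the cost of slower detection of a node that has really failed.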


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
