Carlos,
Increasing the corosync timeouts and the 'monitor' action timeouts in Pacemaker might help, but do you have a separate, dedicated network connection for corosync? It is better to connect your servers directly with a crossover cable (to be independent of switches and other network infrastructure) and use this connection for intra-cluster communication.

Best regards,
Alex

07.02.2013 03:07, Andrew Beekhof:

Feb 6 04:31:47 diana corosync[2902]: [CLM ] CLM CONFIGURATION CHANGE
Feb 6 04:31:47 diana corosync[2902]: [CLM ] New Configuration:
Feb 6 04:31:47 diana corosync[2902]: [CLM ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
Feb 6 04:31:47 diana corosync[2902]: [CLM ] Members Left:
Feb 6 04:31:47 diana corosync[2902]: [CLM ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
Feb 6 04:31:47 diana corosync[2902]: [CLM ] Members Joined:

This appears to be the (almost) root of your problem. The load is starving corosync of CPU (or possibly network bandwidth), so it can no longer talk to its peer. Corosync then informs Pacemaker, which initiates recovery. I'd start by tuning some of your timeout values in corosync.conf.
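
For illustration only, here is a rough sketch of what raising the timeouts in the totem section of corosync.conf could look like. The option names (token, token_retransmits_before_loss_const, consensus) are standard totem settings, but the values are assumptions chosen as an example, not figures taken from this thread. They would need to be tested against the actual load on these nodes and applied identically on both of them.

```
totem {
    # ... keep the existing settings (version, interface, rrp_mode, ...) ...

    # Time in ms without receiving the token before it is declared lost.
    # Example value (assumption); raise it if load delays corosync.
    token: 5000

    # Number of token retransmits attempted before the token is
    # considered lost. Example value (assumption).
    token_retransmits_before_loss_const: 10

    # Time in ms to wait for consensus before starting a new membership
    # round. Must be at least 1.2 * token.
    consensus: 6000
}
```

Longer timeouts make the cluster more tolerant of load spikes, at the cost of slower detection of a node that has really failed.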
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org