On Thu, Dec 9, 2010 at 2:11 PM, Alain.Moulle <alain.mou...@bull.net> wrote:
> Hi,
>
> Thanks.
> So I have a robustness problem with Pacemaker/corosync. Tell me
> whether this looks normal, or whether I am missing something:

Perfectly valid testcase, unacceptable result.

Perhaps try with stonith-enabled=false so we can get the logs?
Actually a better alternative would be to leave stonith enabled but
use rsyslog to log to a non-cluster machine.
It shouldn't be too hard to configure.
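
A rough sketch of the forwarding rule, in case it helps. The host name
loghost.example.com and the output file name are placeholders I made
up, and this assumes the legacy rsyslog selector syntax plus a receiver
that already accepts syslog over TCP on port 514:

```shell
#!/bin/sh
# Hypothetical values: replace with your own non-cluster log host.
LOGHOST="${LOGHOST:-loghost.example.com}"
OUT="${OUT:-./remote-forward.conf}"

# Generate the fragment locally; nothing under /etc is touched here.
cat > "$OUT" <<EOF
# Forward all messages to a non-cluster host.
# "@@" means TCP; a single "@" would mean UDP.
*.* @@${LOGHOST}:514
EOF

echo "wrote $OUT"
```

The generated file would then go wherever your rsyslog.conf picks up
extra configuration (often /etc/rsyslog.d/, but check your own setup),
followed by an rsyslog restart on each cluster node.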

Best guess: corosync communications are being saturated, and that is
being escalated into a node failure.

> For example, on two nodes, I wrote a script that configures 60
> resources (resnames) using the ocf pacemaker Dummy script
> (30 with an INFINITY location on node1 and 30 with an INFINITY
> location on node2). My test is a loop of [ 60 "crm resource start
> <resname>", sleep 60, 60 "crm resource stop <resname>", sleep 60 ]
> and so on ...
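
For anyone wanting to reproduce this, the loop described above can be
sketched roughly as follows. This is not Alain's actual script: the
resource names resname1..resnameN, the CRM override and the function
names are assumptions made for illustration, and the resources are
assumed to be configured already.

```shell
#!/bin/sh
# run_phase ACTION COUNT: run "crm resource ACTION resnameI" for I in 1..COUNT.
# CRM is a hypothetical override so the loop can be dry-run with CRM=echo.
run_phase() {
    i=1
    while [ "$i" -le "$2" ]; do
        ${CRM:-crm} resource "$1" "resname$i" || {
            echo "ERROR on $1 resname$i"
            return 1
        }
        i=$((i + 1))
    done
}

# run_loops LOOPS COUNT PAUSE: start all, sleep, stop all, sleep, repeat.
# Stops at the first failing crm call.
run_loops() {
    n=1
    while [ "$n" -le "$1" ]; do
        echo "START LOOP : Number $n"
        run_phase start "$2" || return 1
        sleep "$3"
        echo "STOP LOOP : Number $n"
        run_phase stop "$2" || return 1
        sleep "$3"
        n=$((n + 1))
    done
}

# Real run (as described in the report): run_loops 10 60 60
# Dry run without a cluster:             CRM=echo run_loops 1 2 0
```

Lowering the resource count or lengthening the pause in run_loops is a
cheap way to check whether the failure depends on how fast the
operations are issued.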
>
> After 2 or 3 successful loops, pacemaker systematically fails on one
> node (meaning crm_mon no longer connects) and this node is fenced by
> the other one. I have tried several times and it happens every time.
>
> On the surviving node, my test script displays:
> STOP LOOP : Number 3
> Stop resname1
> *Call cib_replace failed (-41): Remote node did not respond*
> <null>
> ERROR on Stop resname1
>
> When the fenced node has rebooted and I look in syslog, these are the
> only lines before the first Linux boot line:
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [4557]: debug:
> rsc:resname22:335: monitor
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [12818]: debug:
> perform_ra_op: resetting scheduler class to SCHED_OTHER
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [4557]: debug:
> rsc:resname20:334: monitor
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [4557]: debug:
> rsc:resname18:333: monitor
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [12819]: debug:
> perform_ra_op: resetting scheduler class to SCHED_OTHER
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [12820]: debug:
> perform_ra_op: resetting scheduler class to SCHED_OTHER
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug Dummy DEBUG:
> resname22 monitor : 0
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug Dummy DEBUG:
> resname18 monitor : 0
> 1291896079 2010 Dec  9 13:01:19 node2 daemon debug Dummy DEBUG:
> resname20 monitor : 0
> 1291896080 2010 Dec  9 13:01:20 node2 daemon debug lrmd [4557]: debug:
> rsc:resname60:373: monitor
> 1291896080 2010 Dec  9 13:01:20 node2 daemon debug lrmd [12839]: debug:
> perform_ra_op: resetting scheduler class to SCHED_OTHER
> 1291896080 2010 Dec  9 13:01:20 node2 daemon debug Dummy DEBUG:
> resname60 monitor : 0
> 1291896082 2010 Dec  9 13:01:22 node2 daemon err corosync   [TOTEM ]
> FAILED TO RECEIVE
> 1291896082 2010 Dec  9 13:01:22 node2 daemon debug corosync   [TOTEM ]
> entering GATHER state from 6.
> 1291896328 2010 Dec  9 13:05:28 node2 syslog notice syslog-ng syslog-ng
> starting up; version='3.0.3'
> 1291896328 2010 Dec  9 13:05:28 node2 kern info kernel Initializing
> cgroup subsys cpuset
> 1291896328 2010 Dec  9 13:05:28 node2 kern info kernel Initializing
> cgroup subsys cpu
> 1291896328 2010 Dec  9 13:05:28 node2 kern notice kernel Linux version
> 2.6.32-30.el6. .... etc.
>
> Releases :
> corosync-1.2.3-21.el6.x86_64
> pacemaker-1.1.2-2.el6.x86_64
>
> Alain
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
