On Thu, Dec 9, 2010 at 2:11 PM, Alain.Moulle <alain.mou...@bull.net> wrote:
> Hi,
>
> Thanks.
> So I have a robustness problem with Pacemaker/corosync ... you'll tell me
> if it seems normal or not, or if I am missing something:
Perfectly valid testcase, unacceptable result. Perhaps try with
stonith-enabled=false so we can get the logs? Actually, a better
alternative would be to leave stonith enabled but use rsyslog to log to
a non-cluster machine. It shouldn't be too hard to configure. Best
guess: the corosync comms are being saturated, which is being escalated
into a node failure.

> For example, on two nodes, I have a script which configures 60
> resources with the ocf:pacemaker:Dummy agent
> (30 with an INFINITY location on node1 and 30 with an INFINITY
> location on node2), and my test consists of a loop of
> [ 60 "crm resource start <resname>", sleep 60,
>   60 "crm resource stop <resname>", sleep 60 ] and so on ...
>
> After 2 or 3 successful loops, pacemaker systematically fails on one
> node (meaning crm_mon does not connect anymore) and this node is
> fenced by the other one. I have tried several times and it is
> systematic.
>
> On the alive node, my test script displays:
>
> STOP LOOP : Number 3
> Stop resname1
> *Call cib_replace failed (-41): Remote node did not respond*
> <null>
> ERROR on Stop resname1
>
> and when the fenced node is rebooted and I look in syslog, I only have
> these lines before the first boot Linux line:
>
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [4557]: debug: rsc:resname22:335: monitor
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [12818]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [4557]: debug: rsc:resname20:334: monitor
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [4557]: debug: rsc:resname18:333: monitor
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [12819]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [12820]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug Dummy DEBUG: resname22 monitor : 0
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug Dummy DEBUG: resname18 monitor : 0
> 1291896079 2010 Dec 9 13:01:19 node2 daemon debug Dummy DEBUG: resname20 monitor : 0
> 1291896080 2010 Dec 9 13:01:20 node2 daemon debug lrmd [4557]: debug: rsc:resname60:373: monitor
> 1291896080 2010 Dec 9 13:01:20 node2 daemon debug lrmd [12839]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
> 1291896080 2010 Dec 9 13:01:20 node2 daemon debug Dummy DEBUG: resname60 monitor : 0
> 1291896082 2010 Dec 9 13:01:22 node2 daemon err corosync [TOTEM ] FAILED TO RECEIVE
> 1291896082 2010 Dec 9 13:01:22 node2 daemon debug corosync [TOTEM ] entering GATHER state from 6.
> 1291896328 2010 Dec 9 13:05:28 node2 syslog notice syslog-ng syslog-ng starting up; version='3.0.3'
> 1291896328 2010 Dec 9 13:05:28 node2 kern info kernel Initializing cgroup subsys cpuset
> 1291896328 2010 Dec 9 13:05:28 node2 kern info kernel Initializing cgroup subsys cpu
> 1291896328 2010 Dec 9 13:05:28 node2 kern notice kernel Linux version 2.6.32-30.el6. .... etc.
>
> Releases:
> corosync-1.2.3-21.el6.x86_64
> pacemaker-1.1.2-2.el6.x86_64
>
> Alain
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
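For anyone wanting to reproduce, the reported setup could be sketched
roughly as below. This is a hypothetical reconstruction from the
description in the thread, not the reporter's actual script; the
resource names (resname1..resname60), node names (node1/node2), and the
30/30 split are assumptions, and it obviously only runs against a live
Pacemaker cluster with the crm shell installed.

```
#!/bin/sh
# Hypothetical sketch of the reported test, assuming resname1..resname30
# are pinned to node1 and resname31..resname60 to node2.

# One-time setup: 60 Dummy resources with INFINITY location constraints.
for i in $(seq 1 60); do
    if [ "$i" -le 30 ]; then node=node1; else node=node2; fi
    crm configure primitive "resname$i" ocf:pacemaker:Dummy \
        op monitor interval=10s
    crm configure location "loc-resname$i" "resname$i" inf: "$node"
done

# The start/stop loop as described: start all 60, wait a minute,
# stop all 60, wait a minute, repeat.
loop=1
while :; do
    echo "START LOOP : Number $loop"
    for i in $(seq 1 60); do
        crm resource start "resname$i" || echo "ERROR on Start resname$i"
    done
    sleep 60
    echo "STOP LOOP : Number $loop"
    for i in $(seq 1 60); do
        crm resource stop "resname$i" || echo "ERROR on Stop resname$i"
    done
    sleep 60
    loop=$((loop + 1))
done
```

Note that each crm invocation here touches the CIB, so 60 back-to-back
calls generate a burst of cluster traffic, which is consistent with the
comms-saturation guess above.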
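The rsyslog suggestion above amounts to something like the following
config fragment. This is a minimal sketch: `loghost` is a placeholder
for whatever non-cluster machine you pick, and it uses the legacy
rsyslog directive syntax of that era.

```
# On each cluster node, in /etc/rsyslog.conf:
# forward everything to the remote log host over UDP
# (use @@loghost:514 for TCP instead of @loghost:514)
*.*    @loghost:514

# On the receiving (non-cluster) machine, in /etc/rsyslog.conf:
# load the UDP input module and listen on port 514
$ModLoad imudp
$UDPServerRun 514
```

With that in place the logs survive the fencing, since they land on a
machine that never gets shot.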