Hi,

We tested a complicated node-failure scenario.
An "Election Timeout" error occurred during the test.

 * Pacemaker: pacemaker-1.0.9.1
 * heartbeat-3.0.3-2.3.el5
 * cluster-glue: cluster-glue-1.0.6-1.6.el5
 * resource-agents-1.0.3-1.0.dev.b7a3b1973ba7

We tested it with the following procedure.

Step1) Start all nodes.
Step2) On the cgl49 node, we cause a monitor error of prmApPostgreSQLDB1.
Step3) The cgl49 node is fenced (STONITH) by the cgl54 node.
Step4) At the same time as Step3, we kill the master process on the cgl54 node.
Step5) The cgl54 node reboots.
Step6) STONITH is executed on the cgl49 node.
Step7) The cgl53 node is promoted to DC.
Step8) STONITH of the cgl49 node is attempted again. However, because only the cgl54 node can STONITH the cgl49 node, the STONITH operation times out and is retried in a loop.

============
Last updated: Mon Aug 30 14:40:58 2010
Stack: Heartbeat
Current DC: cgl53 (a07bcfc0-7aee-4382-9a2b-711b9c93e7e9) - partition WITHOUT quorum
Version: 1.0.9-74392a28b7f3 stable-1.0 tip
4 Nodes configured, unknown expected votes
16 Resources configured.
============

Node cgl49 (979c05ea-442b-4f53-9ba7-6cb7e82f30ac): UNCLEAN (offline)
Node cgl54 (9bea1025-3cbe-481f-830d-a24dfc7f0374): UNCLEAN (offline)
Online: [ cgl50 cgl53 ]

Step9) When the cgl54 node comes back, a DC election is performed, but an error occurs here.

 * cgl50 node

crmd: [32110]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
crmd: [32110]: info: update_dc: Unset DC cgl53
(snip)
cgl50 crmd: [32110]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!

 * cgl53 node

crmd: [1325]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
cgl53 crmd: [1325]: info: update_dc: Unset DC cgl53
(snip)
crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
(snip)
crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
(snip)
crmd: [1325]: info: crmd_ha_msg_filter: Another DC detected: cgl50 (op=join_offer)

Step10) The cgl53 node enters the "Pending" state. It returns to the "online" state after the waiting STONITH operation times out.

Why did the "Election Timeout" occur?
Why did the cgl53 node enter the "Pending" state?

This may be a problem in ccm.
Also, the same problem may already have been reported.

 * Because the log file is large, I registered the same contents in Bugzilla.
 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2502

Best Regards,
Hideo Yamauchi.