On Wed, Oct 12, 2011 at 7:10 PM, Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>> Andrew Beekhof <and...@beekhof.net> wrote on 12.10.2011 at 04:42 in
> >>> message <CAEDLWG3p+=myur8a45cm4hfprmclvbekxbheiqkovhy++dk...@mail.gmail.com>:
>> On Thu, Sep 29, 2011 at 6:09 PM, Ulrich Windl
>> <ulrich.wi...@rz.uni-regensburg.de> wrote:
>> > Hello!
>> >
>> > I'm examining a case where both nodes of a two node cluster were
>> > fenced at the same time. The cluster is running SLES11 SP1 with a
>> > corosync 1.4.1 Update to make the rrp stable. I found strange messages:
>> >
>> > 08:15:25 h02 cib: [10993]: WARN: cib_process_replace: Replacement 0.952.21 not applied to 0.952.23: current num_updates is greater than the replacement
>> > 08:15:25 h02 cib: [10993]: WARN: cib_diff_notify: Update (client: crmd, call:13834): -1.-1.-1 -> 0.952.21 (Update was older than existing configuration)
>> > 08:15:25 h02 crmd: [10997]: WARN: finalize_sync_callback: Sync from h06 resulted in an error: Update was older than existing configuration
>> > 08:15:25 h02 crmd: [10997]: WARN: do_log: FSA: Input I_ELECTION_DC from finalize_sync_callback() received in state S_FINALIZE_JOIN
>>
>> Was there a cluster partition at this time?
>
> Hi!
>
> Yes, I had shut down corosync on both nodes for a corosync update. Naturally
> the node that terminates last has the latest CIB I guess. Unfortunately you
> cannot always start up that node first, and even if, the second node will
> have an obsolete CIB. If you start the wrong node first, that node's CIB may
> be later (by version number) than the one that was more current (by content).
> How does pacemaker handle these situations?

It continues with the highest version and saves the older one to disk.

>
> Most cluster software has to handle these problems, but most do with less
> confusing noise in the logs.
>
> Specifically, when is a version considered to be "-1"?

When it contains no version information.

>
>> Looks like one got further ahead than the other, but since we
>> regenerate the resource state after an election there is no harm here.
>
> I hoped so ;-)

Versions are x.y.z. The .z part indicates only status updates, nothing
that wouldn't have been regenerated anyway.

>
> [...]
>> > 08:23:02 h06 crmd: [10847]: debug: crm_compare_age: Loose: 18 vs 268 (seconds)
>> > 08:23:02 h06 crmd: [10847]: debug: do_election_count_vote: Election 5 (owner: h02) lost: vote from h02 (Uptime)
>>
>> The colon is important. h06 lost the election because of the vote.
>> There was no "lost vote".
>>
>> > 08:23:02 h06 crmd: [10847]: info: update_dc: Unset DC h02
>> > 08:23:03 h06 crmd: [10847]: debug: do_cl_join_finalize_respond: join-6: Join complete. Sending local LRM status to h02
>> > 08:23:04 h06 crmd: [10847]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
>> > 08:24:01 h06 crmd: [10847]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
>> >
>> > Around at that time I also had this strange message:
>> > h02:~ # crm_resource -C -r prm_ocfs_fs_samba:0 -N h06
>> > Cleaning up prm_ocfs_fs_samba:0 on h06
>> > Waiting for 2 replies from the CRMd.
>> >
>> > No messages received in 60 seconds.. aborting
>> >
>> > Does anybody have an idea what could be wrong? I think the network was ok.
>
> I'd like to have an explanation for this as well.

If a DC was being elected, there would be no-one to answer this query.

>
> Thanks for explaining, anyway.
>
> Regards,
> Ulrich
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
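
For readers following the version discussion above, here is a rough sketch of how an x.y.z comparison of this kind could work. It is not Pacemaker's code. The function names are made up for illustration, and the reading of the three fields as admin_epoch, epoch and num_updates is an assumption (the thread itself only names num_updates and says that .z tracks status updates). A version with negative fields, such as the "-1.-1.-1" in the log, is treated as "no version information".

```python
from typing import Optional, Tuple

# Assumed field order: (admin_epoch, epoch, num_updates).
Version = Tuple[int, int, int]


def parse_version(text: str) -> Optional[Version]:
    """Parse 'x.y.z'; return None when it carries no version information."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected x.y.z, got {text!r}")
    x, y, z = (int(p) for p in parts)
    if x < 0 or y < 0 or z < 0:
        # e.g. "-1.-1.-1": treated here as "no version information"
        return None
    return (x, y, z)


def replacement_is_newer(current: str, replacement: str) -> bool:
    """True if the replacement should win over the current version."""
    cur = parse_version(current)
    rep = parse_version(replacement)
    if rep is None:
        return False  # nothing to compare against, keep what we have
    if cur is None:
        return True
    # Tuples compare field by field, left to right.
    return rep > cur


def only_status_differs(a: str, b: str) -> bool:
    """True if two versions differ only in the last (.z) field."""
    va, vb = parse_version(a), parse_version(b)
    if va is None or vb is None:
        return False
    return va[:2] == vb[:2] and va[2] != vb[2]


if __name__ == "__main__":
    # The case from the log: 0.952.21 offered while 0.952.23 is current.
    print(replacement_is_newer("0.952.23", "0.952.21"))
    print(only_status_differs("0.952.23", "0.952.21"))
```

Under these assumptions the replacement 0.952.21 is rejected against 0.952.23, and the two versions differ only in the .z field, which matches the explanation that nothing was lost that would not have been regenerated after the election.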