On Wed, Oct 12, 2011 at 7:10 PM, Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>>>> Andrew Beekhof <and...@beekhof.net> wrote on 12.10.2011 at 04:42 in
>>>> message
> <CAEDLWG3p+=myur8a45cm4hfprmclvbekxbheiqkovhy++dk...@mail.gmail.com>:
>> On Thu, Sep 29, 2011 at 6:09 PM, Ulrich Windl
>> <ulrich.wi...@rz.uni-regensburg.de> wrote:
>> > Hello!
>> >
>> > I'm examining a case where both nodes of a two-node cluster were fenced at
>> > the same time. The cluster is running SLES11 SP1 with a corosync 1.4.1
>> > update to make the RRP stable. I found strange messages:
>> >
>> > 08:15:25 h02 cib: [10993]: WARN: cib_process_replace: Replacement 0.952.21 not applied to 0.952.23: current num_updates is greater than the replacement
>> > 08:15:25 h02 cib: [10993]: WARN: cib_diff_notify: Update (client: crmd, call:13834): -1.-1.-1 -> 0.952.21 (Update was older than existing configuration)
>> > 08:15:25 h02 crmd: [10997]: WARN: finalize_sync_callback: Sync from h06 resulted in an error: Update was older than existing configuration
>> > 08:15:25 h02 crmd: [10997]: WARN: do_log: FSA: Input I_ELECTION_DC from finalize_sync_callback() received in state S_FINALIZE_JOIN
>>
>> Was there a cluster partition at this time?
>
> Hi!
>
> Yes, I had shut down corosync on both nodes for a corosync update. I guess
> the node that terminates last naturally has the latest CIB. Unfortunately,
> you cannot always start that node first, and even if you can, the second
> node will have an obsolete CIB. If you start the wrong node first, that
> node's CIB may end up later (by version number) than the one that was more
> current (by content). How does pacemaker handle these situations?

It continues with the highest version and saves the older one to disk.

>
> Most cluster software has to handle these problems, but most do so with less
> confusing noise in the logs.
>
> Specifically, when is a version considered to be "-1"?

When it contains no version information.

>
>> Looks like one got further ahead than the other, but since we
>> regenerate the resource state after an election there is no harm here.
>
> I hoped so ;-)

Versions have the form x.y.z.

A difference in .z alone indicates only status updates, nothing that
wouldn't have been regenerated anyway.
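
To make the comparison concrete, here is a minimal Python sketch. It is not
Pacemaker's actual code: it assumes the three fields are compared left to
right as integers, and the helper names (parse_version, is_older,
differs_only_in_status) are made up for illustration. The versions are the
ones from the cib_process_replace and cib_diff_notify warnings quoted above.

```python
def parse_version(text):
    """Split an 'x.y.z' version string into a tuple of three ints."""
    x, y, z = (int(part) for part in text.split("."))
    return (x, y, z)


def is_older(replacement, current):
    """True if the replacement version sorts before the current one.

    Assumption: fields are compared left to right, which is what
    Python's tuple comparison does.
    """
    return parse_version(replacement) < parse_version(current)


def differs_only_in_status(a, b):
    """True if two versions share x and y and differ only in .z."""
    va, vb = parse_version(a), parse_version(b)
    return va[:2] == vb[:2] and va[2] != vb[2]


# Versions taken from the cib_process_replace warning in the log.
print(is_older("0.952.21", "0.952.23"))
print(differs_only_in_status("0.952.21", "0.952.23"))
# A CIB without version information is logged as -1.-1.-1; under the
# assumption above it sorts before any real version.
print(is_older("-1.-1.-1", "0.952.21"))
```

Under these assumptions all three calls print True: the replacement was
older, the two versions differed only in .z, and a CIB with no version
information loses to any versioned one.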

>
> [...]
>> > 08:23:02 h06 crmd: [10847]: debug: crm_compare_age: Loose: 18 vs 268 (seconds)
>> > 08:23:02 h06 crmd: [10847]: debug: do_election_count_vote: Election 5 (owner: h02) lost: vote from h02 (Uptime)
>>
>> The colon is important.  h06 lost the election because of the vote.
>> There was no "lost vote".
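
One way to read the two log lines is sketched below in Python. This is an
illustration only, not the crmd implementation: it assumes the first number
in the crm_compare_age line is h06's own uptime and the second is h02's, and
that the node with the longer uptime wins. The function name compare_uptime
is hypothetical.

```python
def compare_uptime(ours, theirs):
    """Return 'win', 'lose' or 'tie' from the local node's point of view.

    Assumption: the node that has been up longer wins the comparison.
    """
    if ours > theirs:
        return "win"
    if ours < theirs:
        return "lose"
    return "tie"


# Numbers from the crm_compare_age line logged on h06: 18 vs 268 seconds.
print(compare_uptime(18, 268))
```

With those numbers the local node (h06) loses, which matches "Election 5
(owner: h02) lost: vote from h02 (Uptime)".
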
>>
>> > 08:23:02 h06 crmd: [10847]: info: update_dc: Unset DC h02
>> > 08:23:03 h06 crmd: [10847]: debug: do_cl_join_finalize_respond: join-6: Join complete. Sending local LRM status to h02
>> > 08:23:04 h06 crmd: [10847]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
>> > 08:24:01 h06 crmd: [10847]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
>> >
>> > Around that time I also had this strange message:
>> > h02:~ # crm_resource -C -r prm_ocfs_fs_samba:0 -N h06
>> > Cleaning up prm_ocfs_fs_samba:0 on h06
>> > Waiting for 2 replies from the CRMd.
>> >
>> > No messages received in 60 seconds.. aborting
>> >
>> > Does anybody have an idea what could be wrong? I think the network was ok.
>
> I'd like to have an explanation for this as well.

If a DC was being elected, there would be no-one to answer this query.

>
> Thanks for explaining, anyway.
>
> Regards,
> Ulrich
>
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
