>>> Lars Marowsky-Bree <l...@suse.com> wrote on 25.11.2013 at 18:20 in message
<20131125172059.gw10...@suse.de>:
> On 2013-11-25T17:48:25, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> 
> Hi Ulrich,
> 
>> Probably reason:
>> cib: [12226]: ERROR: cib_perform_op: Discarding update with feature set
>> '3.0.7' greater than our own '3.0.6'
>> 
>> Is it required to update the whole cluster at once?
> 
> It shouldn't be, and we tested that for sure. Rolling upgrades should be
> possible.

(My former boss once said: give him a program, and he'll find a bug if there
is one...)

It seems to be my fate to be affected by bugs others aren't. Anyway: I've seen
a situation in the 5-node cluster where 4 nodes were up, crm_mon said
"DC: none", and every node was "UNCLEAN". The cluster wasn't able to get out of
this state for at least 15 minutes. When I returned this morning, most nodes
had rebooted, but there still was no DC.
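
Roughly what I was looking at (assuming the usual pacemaker command line
tools; crmadmin -D is just another way to ask for the DC):

  crm_mon -1     # one-shot cluster status; this is where "DC: none" showed up
  crmadmin -D    # ask the local crmd which node it believes is the DC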

Another thing I've noticed: one of our nodes has defective hardware and is
down. That was fine all the time with SLES11 SP2, but SP3 now tried to fence
the node and ran into a fencing timeout:

stonith-ng: [12244]: ERROR: remote_op_done: Operation reboot of o3 by <no-one>
for o1[4c91cc39-1fad-4a2e-9c06-32f6786e5baf]: Operation timed out
stonith-ng: [12244]: info: call_remote_stonith: No remaining peers capable of
terminating o3
crmd: [12248]: notice: tengine_stonith_notify: Peer o3 was not terminated
(reboot) by <anyone> o1: Operation timed out
(ref=ca661ef1-1e1b-46b1-bdd9-31c760bc2a79)

Isn't the logic that after a fencing operation timeout the node is considered
to be OFFLINE?

My node currently has the state "UNCLEAN (offline)".

How do I make an offline node clean? ;-)
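
The only workaround I can think of (not sure it's the intended one; "o3" is
just my dead node) would be to acknowledge the fencing by hand:

  # tell stonith-ng that the node really is safely down
  stonith_admin --confirm=o3
  # or the same via the crm shell
  crm node clearstate o3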

> 
> What doesn't work is making changes, or at any point in time having only
> new version of the cluster up and then trying to rejoin an old one.

Do you consider resource cleanups (crm_resource -C ...) to be a change? These
are essential for keeping the cluster happy, especially if you missed some
ordering constraints.
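
For reference, the kind of cleanup I mean (resource and node names are only
placeholders, and on older builds the node option may be -H/--host-uname
instead of -N/--node):

  # forget the failure history of a resource on one node and re-probe it
  crm_resource --cleanup --resource my_rsc --node o1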

> 
> It's hard to say what happened without knowing more details about your
> update procedure.

As it's only the test cluster, I'll finish upgrading all the nodes and then
have a look at how it works. Stay tuned.
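
Once everything is on SP3 I'll also compare the CIB feature sets of the nodes;
as far as I remember the value can be read straight from the CIB header, e.g.:

  # the <cib .../> element carries crm_feature_set and validate-with
  cibadmin -Q | head -n 1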

Regards,
Ulrich

> 
> 
> 
> Regards,
>     Lars
> 
> -- 
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> 
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org 
> http://lists.linux-ha.org/mailman/listinfo/linux-ha 
> See also: http://linux-ha.org/ReportingProblems 


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
