>>> Lars Marowsky-Bree <l...@suse.com> wrote on 25.11.2013 at 18:20 in message
<20131125172059.gw10...@suse.de>:
> On 2013-11-25T17:48:25, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> Hi Ulrich,
>
>> Probable reason:
>> cib: [12226]: ERROR: cib_perform_op: Discarding update with feature set
>> '3.0.7' greater than our own '3.0.6'
>>
>> Is it required to update the whole cluster at once?
>
> It shouldn't be, and we tested that for sure. Rolling upgrades should be
> possible.
(My former boss once said: "Give him a program, and he'll find a bug if there is one...") It seems to be my fate to be affected by bugs others aren't.

Anyway: I've seen the situation in the 5-node cluster where 4 nodes were up, crm_mon reported "DC: none", and every node was "UNCLEAN". The cluster was unable to get out of this state for at least 15 minutes. When I returned this morning, most nodes had rebooted, but there was still no DC.

Another thing I've noticed: one of our nodes has defective hardware and is down. That was fine all along with SLES11 SP2, but SP3 now tried to fence the node and ran into a fencing timeout:

stonith-ng: [12244]: ERROR: remote_op_done: Operation reboot of o3 by <no-one> for o1[4c91cc39-1fad-4a2e-9c06-32f6786e5baf]: Operation timed out
stonith-ng: [12244]: info: call_remote_stonith: No remaining peers capable of terminating o3
crmd: [12248]: notice: tengine_stonith_notify: Peer o3 was not terminated (reboot) by <anyone> o1: Operation timed out (ref=ca661ef1-1e1b-46b1-bdd9-31c760bc2a79)

Isn't the logic that after a fencing operation timeout the node is considered to be OFFLINE? My node currently has the state "UNCLEAN (offline)". How do I make an offline node clean? ;-)

> What doesn't work is making changes, or at any point in time having only
> the new version of the cluster up and then trying to rejoin an old one.

Do you consider resource cleanups (crm_resource -C ...) a change? These are essential for keeping the cluster happy, especially if you missed some ordering constraints.

> It's hard to say what happened without knowing more details about your
> update procedure.

As it's only the test cluster, I'll finish upgrading all the nodes and then have a look at how it works. Stay tuned.

Regards,
Ulrich

> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes."
> -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
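[For the archive, a sketch of the commands touched on in this thread. This assumes the Pacemaker/crmsh stack as shipped with SLES11 SP3; the node name "o3" is taken from the logs above, and "myresource" is a placeholder resource name.]

```shell
# Clear stale failure records and re-probe a resource -- the cleanup
# operation mentioned above ("myresource" is a hypothetical name):
crm_resource -C -r myresource

# If the fence target is known to be down (e.g. powered off with broken
# hardware, as with o3 above), manually confirm the fencing so stonith-ng
# treats the node as cleanly fenced instead of UNCLEAN:
stonith_admin --confirm o3

# crmsh offers an equivalent shortcut for clearing a node's state:
crm node clearstate o3
```

Note that manually confirming a fence is only safe when you are certain the node really is down; telling the cluster a live node is dead invites data corruption.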