Hi! Some updates:
I've completed upgrading 4 of 5 nodes to SP3; the remaining node won't boot due to a hardware problem. I saw that I don't have an SBD device any more (it's stopped). Unfortunately I could not start it (crm resource start prm_stonith_sbd). I guess that's because the cluster won't start resources until the UNCLEAN node has been fenced. The dog bites its tail, it seems...

Likewise:

# stonith_admin -F rkdvmso3
Command failed: No such device

A manual OFF via the sbd command line also failed:

crmd[13716]: notice: tengine_stonith_notify: Peer o3 was not terminated (off) by o5 for o5: No such device (ref=2a8b6889-9b08-4e99-9a12-4404fa1232b1) by client stonith_admin.23725

However sbd had success:

sbd: [24996]: info: off successfully delivered to o3
sbd: [24994]: info: Message successfully delivered.

"crm(live)node# clearstate o3" did not help either. The cluster is refusing to work:

cib: [12243]: info: cib_process_diff: Diff 0.620.155 -> 0.621.1 not applied to 0.617.0: current "epoch" is less than required

I wonder: does the "old" down node prevent the cluster from working? Obviously I cannot upgrade the defective node.

It seems "crm(live)node# delete o3" fixed the problem of the cluster being stuck. Unfortunately the cluster is now sending the same updates very fast:

cib[2821]: warning: cib_process_replace: Replacement 0.617.18 from o2 not applied to 0.622.8: current epoch is greater than the replacement

Node "o2" has the latest software running, but still says:

cib: [12243]: ERROR: cib_perform_op: Discarding update with feature set '3.0.7' greater than our own '3.0.6'
crmd: [12248]: info: update_dc: Set DC to o2 (3.0.6)
cib: [12243]: WARN: cib_diff_notify: Update (client: crmd, call:479): -1.-1.-1 -> 0.622.8 (The action/feature is not supported)

While trying to restart the cluster stack on o2, the other nodes complained:

cib[17864]: warning: cib_process_replace: Replacement 0.617.21 from o2 not applied to 0.622.8: current epoch is greater than the replacement

The effective thing to do was to kill crmd on o2:

o2:~ # kill 12248

Causing:

crmd[2826]: warning: reap_dead_nodes: Our DC node (o2) left the cluster
sbd: [7005]: info: Writing reset to node slot o2

Unfortunately, despite the fact that o2 was shot, the cluster got a STONITH timeout and retried the STONITH!

stonith-ng[2822]: notice: remote_op_done: Operation reboot of o2 by o4 for crmd.17791@o4.d9f4760b: Timer expired

When o2 returned from the reset, the old game started again:

cib[2821]: warning: cib_process_replace: Replacement 0.617.6 from o2 not applied to 0.622.55: current epoch is greater than the replacement

I had expected that o2 would update the config from the other nodes that had quorum! Could this (on the DC) be the reason?

o4 stonith-ng[17787]: error: crm_abort: call_remote_stonith: Triggered assert at remote.c:973 : op->state < st_done
stonith-ng[17787]: notice: remote_op_timeout: Action reboot (97a0476a-7f1d-4986-ba68-0f0d88aeb764) for o2 (crmd.17791) timed out

Even after shutting down the whole cluster node by node and restarting it, the STONITH operations were re-issued. Could it be a conflict in timing parameters? However, the same parameters worked in SP2...
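In case it helps anyone following along, this is the recovery sequence I have in mind for the next attempt (untested; the stonith_admin option and the CIB path are from memory and may differ on SP3, so please correct me).

First, acknowledge the fence of the dead node manually, i.e. tell the cluster that rkdvmso3 (o3) really is down:

# stonith_admin --confirm rkdvmso3

Then, on o2, with the cluster stack stopped, throw away the stale local CIB so that o2 resyncs the configuration from the current DC when it rejoins:

o2:~ # rcopenais stop
o2:~ # rm /var/lib/heartbeat/crm/cib*   # path from memory; newer builds keep it in /var/lib/pacemaker/cib/
o2:~ # rcopenais start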
Regards,
Ulrich

>>> Lars Marowsky-Bree <l...@suse.com> wrote on 26.11.2013 at 10:19 in message
<20131126091933.gi10...@suse.de>:
> On 2013-11-26T09:32:50, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> Another thing I've noticed: One of our nodes has defective hardware and is
>> down. It was OK all the time with SLES11 SP2, but SP3 now tried to fence the
>> node and got a fencing timeout:
>
> Hmmm.
>
>> Isn't the logic that after a fencing operation timeout the node is considered
>> to be OFFLINE?
>
> No. A timeout is not a successful fence. The problem is why you're
> getting the timeout in the first place now.
>
>> My node currently has the state "UNCLEAN (offline)".
>>
>> How do I make an offline node clean? ;-)
>
> You can run stonith_admin and manually ack the fence, that should work.
>
>> > What doesn't work is making changes, or at any point in time having only
>> > new version of the cluster up and then trying to rejoin an old one.
>> Do you consider resource cleanups (crm_resource -C ...) as a change? These are
>> essential in keeping the cluster happy, especially if you missed some ordering
>> constraints.
>
> No, as long as an old version is still around, that will be the DC and
> the internal upgrade shouldn't happen.
>
> I meant changes to the configuration that actually use new features. And
> as soon as no more nodes running the old version are online, it'll
> convert upwards too ...
>
> But yes, we're already working on a new maintenance update for SP3 too.
>
>
> Regards,
>     Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems