Hi!

Some updates:

I've completed upgrading 4 of 5 nodes to SP3; the remaining node won't boot
due to a hardware problem.

I saw that I don't have an SBD device any more (the resource is stopped).
Unfortunately I could not start it (crm resource start prm_stonith_sbd).
I guess that is because the cluster won't start resources until the
UNCLEAN node has been fenced. The dog bites its tail, it seems...
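
As a side note, the SBD device itself can be inspected directly, independently
of the cluster (here <sbd-device> stands for the actual device path):
# sbd -d <sbd-device> list
# sbd -d <sbd-device> dump
In my understanding that at least tells whether the slots and the on-disk
timeouts are still intact.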

Likewise:
# stonith_admin -F rkdvmso3
Command failed: No such device
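
"No such device" presumably means stonith-ng had no fencing device registered
at that moment (because the sbd resource was stopped). If I read the
stonith_admin man page correctly, this can be verified with:
# stonith_admin -L            (list registered fencing devices)
# stonith_admin -l rkdvmso3   (list devices able to fence that node)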

A manual OFF via the sbd command line also did not satisfy the cluster:
crmd[13716]:   notice: tengine_stonith_notify: Peer o3 was not terminated
(off) by o5 for o5: No such device (ref=2a8b6889-9b08-4e99-9a12-4404fa1232b1)
by client stonith_admin.23725
However sbd itself reported success:
sbd: [24996]: info: off successfully delivered to o3
sbd: [24994]: info: Message successfully delivered.
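
Lars's suggestion below, to manually acknowledge the fence, is probably the
clean way out here; if I understand stonith_admin correctly that would be:
# stonith_admin --confirm o3
i.e. telling the cluster that the node is known to be safely down.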

"crm(live)node# clearstate o3" did not help either.

The cluster is refusing to work:
cib: [12243]: info: cib_process_diff: Diff 0.620.155 -> 0.621.1 not applied to
0.617.0: current "epoch" is less than required
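
(For reference: the three numbers are the CIB version tuple
admin_epoch.epoch.num_updates; the tuple a node currently has can be seen in
the first line of:
# cibadmin -Q | head -1
)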

I wonder: Does the "old" down node prevent the cluster from working? Obviously
I cannot upgrade the defective node.

It seems "crm(live)node# delete o3" fixed the problem of the cluster being
stuck.

Unfortunately the cluster then kept resending the same update at a very high rate:
cib[2821]:  warning: cib_process_replace: Replacement 0.617.18 from o2 not
applied to 0.622.8: current epoch is greater than the replacement

Node "o2" has the latest software running, but still says:
cib: [12243]: ERROR: cib_perform_op: Discarding update with feature set
'3.0.7' greater than our own '3.0.6'
crmd: [12248]: info: update_dc: Set DC to o2 (3.0.6)
cib: [12243]: WARN: cib_diff_notify: Update (client: crmd, call:479): -1.-1.-1
-> 0.622.8 (The action/feature is not supported)
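
If I remember correctly, feature set 3.0.6 is what the SP2 pacemaker reports,
so I suspect the daemons running on o2 were still the old binaries even though
the new packages were installed. Which version the cluster as a whole has
agreed on can be checked via the dc-version property:
# cibadmin -Q -o crm_config | grep dc-version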

While trying to restart the cluster stack on o2, the other nodes complained:
cib[17864]:  warning: cib_process_replace: Replacement 0.617.21 from o2 not
applied to 0.622.8: current epoch is greater than the replacement
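
If a rejoining node keeps pushing a stale 0.617.* CIB, the usual trick, as far
as I know, is to make it forget its local copy so that it fetches the current
one from the DC when joining; on SLES 11 the CIB files should live under
/var/lib/heartbeat/crm (please double-check the path before deleting anything):
o2:~ # rcopenais stop
o2:~ # rm /var/lib/heartbeat/crm/cib*
o2:~ # rcopenais start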

The effective thing to do was to kill crmd on o2:
o2:~ # kill 12248

Causing:
crmd[2826]:  warning: reap_dead_nodes: Our DC node (o2) left the cluster
sbd: [7005]: info: Writing reset to node slot o2

Unfortunately, despite the fact that o2 had been shot, the cluster got a
stonith timeout and retried the stonith!
stonith-ng[2822]:   notice: remote_op_done: Operation reboot of o2 by o4 for
crmd.17791@o4.d9f4760b: Timer expired
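
(The fencing history for the node, and a manual confirmation that it really is
down, should be available via stonith_admin, if I read the manual correctly:
# stonith_admin -H o2
# stonith_admin -C o2
)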

When o2 returned from the reset, the old game started again:
cib[2821]:  warning: cib_process_replace: Replacement 0.617.6 from o2 not
applied to 0.622.55: current epoch is greater than the replacement

I had expected that o2 would update its config from the other nodes, which had
quorum!

Could this (on the DC) be the reason?
o4 stonith-ng[17787]:    error: crm_abort: call_remote_stonith: Triggered
assert at remote.c:973 : op->state < st_done
stonith-ng[17787]:   notice: remote_op_timeout: Action reboot
(97a0476a-7f1d-4986-ba68-0f0d88aeb764) for o2 (crmd.17791) timed out

Even after shutting down the whole cluster node by node and restarting it, the
stonith operations were re-issued.

Could it be a conflict in timing parameters? However, the same parameters
worked in SP2...
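
One thing worth comparing is the sbd on-disk msgwait timeout versus the
cluster's stonith-timeout; as far as I know, stonith-timeout must be
comfortably larger than msgwait, or fencing operations time out even though
sbd delivers the message (<sbd-device> again stands for the device path):
# sbd -d <sbd-device> dump | grep -i msgwait
# crm configure show | grep stonith-timeout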

Regards,
Ulrich

>>> Lars Marowsky-Bree <l...@suse.com> wrote on 26.11.2013 at 10:19 in message
<20131126091933.gi10...@suse.de>:
> On 2013-11-26T09:32:50, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>
> wrote:
> 
>> Another thing I've noticed: One of our nodes has defective hardware and is
>> down. It was OK all the time with SLES11 SP2, but SP3 now tried to fence the
>> node and got a fencing timeout:
> 
> Hmmm.
> 
>> Isn't the logic that after a fencing operation timeout the node is
>> considered to be OFFLINE?
> 
> No. A timeout is not a successful fence. The problem is why you're
> getting the timeout in the first place now.
> 
>> My node currently has the state "UNCLEAN (offline)".
>> 
>> How do I make an offline node clean? ;-)
> 
> You can run stonith_admin and manually ack the fence, that should work.
> 
>> > What doesn't work is making changes, or at any point in time having only
>> > new version of the cluster up and then trying to rejoin an old one.
>> Do you consider resource cleanups (crm_resource -C ...) as a change? These
>> are essential in keeping the cluster happy, especially if you missed some
>> ordering constraints.
> 
> No, as long as an old version is still around, that will be the DC and
> the internal upgrade shouldn't happen.
> 
> I meant changes to the configuration that actually use new features. And
> as soon as no more nodes running the old version are online, it'll
> convert upwards too ...
> 
> But yes, we're already working on a new maintenance update for SP3 too.
> 
> 
> Regards,
>     Lars
> 
> -- 
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> 


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
