On Tue, 2014-04-08 at 17:29 -0400, Digimer wrote: 
> Looks like your fencing (stonith) failed.

Where?  If I'm reading the logs correctly, it looks like stonith worked.
Here's the stonith:

Apr  8 09:53:21 lotus-4vm6 stonith-ng[2492]:   notice: log_operation: Operation 
'reboot' [3306] (call 2 from crmd.2496) for host 'lotus-4vm5' with device 
'st-fencing' returned: 0 (OK)
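
For what it's worth, the same result can be cross-checked from the
surviving node.  Assuming the stonith_admin shipped with this Pacemaker
build supports the history option, something like the following should
list the reboot above:

  # show fencing operations recorded for lotus-4vm5
  stonith_admin --history lotus-4vm5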

and then corosync reporting that the node left the cluster:

Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: 
lost: lotus-4vm5 3176140298

Yes?  Or am I misunderstanding that message?

Doesn't the following also indicate that the vm5 node did actually
get stonithed?

Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node lotus-4vm5 was not seen in the previous 
transition
Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: update_member: Node 
3176140298/lotus-4vm5 is now: lost

crmd and cib also seem to notice that the node has gone away, don't
they:

Apr  8 09:53:26 lotus-4vm6 cib[2491]:   notice: plugin_handle_membership: 
Membership 20: quorum lost
Apr  8 09:53:26 lotus-4vm6 cib[2491]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost (was 
member)
Apr  8 09:53:26 lotus-4vm6 crmd[2496]:   notice: plugin_handle_membership: 
Membership 20: quorum lost
Apr  8 09:53:26 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost (was 
member)

And then the node comes back:

Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 24: memb=1, new=0, lost=0
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: 
memb: lotus-4vm6 3192917514
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 24: memb=2, new=1, lost=0
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: update_member: Node 
3176140298/lotus-4vm5 is now: member
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: 
NEW:  lotus-4vm5 3176140298
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: 
MEMB: lotus-4vm5 3176140298

And now crmd realizes the node is back:

Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member 
(was lost)

As does cib:

Apr  8 09:54:04 lotus-4vm6 cib[2491]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member 
(was lost)

And stonith-ng and crmd report a successful reboot:

Apr  8 09:54:04 lotus-4vm6 stonith-ng[2492]:   notice: remote_op_done: 
Operation reboot of lotus-4vm5 by lotus-4vm6 for 
crmd.2496-ZBdUr1hrI04s+xCAc1R/N1ez/nohh...@public.gmane.org: OK
Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: tengine_stonith_callback: 
Stonith operation 2/13:0:0:f325afae-64b0-4812-a897-70556ab1e806: OK (0)
Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: tengine_stonith_notify: Peer 
lotus-4vm5 was terminated (reboot) by lotus-4vm6 for lotus-4vm6: OK 
(ref=ae82b411-b07a-4235-be55-5a30a00b323b) by client crmd.2496

But all of a sudden, crmd reports the node is "lost" again?

Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: 
send_stonith_update: Node lotus-4vm5[3176140298] - state is now lost (was 
member)

But why?

It's not surprising that we get these messages (below) if crmd thinks
the node was suddenly "lost" (when, according to vm5's own log, it was
not):

Apr  8 09:54:11 lotus-4vm6 crmd[2496]:  warning: crmd_cs_dispatch: Recieving 
messages from a node we think is dead: lotus-4vm5[-1118826998]
Apr  8 09:54:31 lotus-4vm6 crmd[2496]:   notice: do_election_count_vote: 
Election 2 (current: 2, owner: lotus-4vm5): Processed vote from lotus-4vm5 
(Peer is not part of our cluster)

So I think the question is: why did crmd suddenly believe the node to be
"lost" when the membership layer gave no indication that it was?

b.
