> On 27 Nov 2014, at 8:17 pm, Christine Caulfield <ccaul...@redhat.com> wrote:
>
> On 25/11/14 19:55, Daniel Dehennin wrote:
>> Christine Caulfield <ccaul...@redhat.com> writes:
>>
>>> It seems to me that fencing is failing for some reason, though I can't
>>> tell from the logs exactly why, so you might have to investigate your
>>> setup for IPMI to see just what is happening (I'm no IPMI expert,
>>> sorry).
>>
>> Thanks for looking, but IPMI stonith is actually working. On every node
>> I tested,
>>
>>     stonith_admin --reboot <node>
>>
>> works.
>>
>>> The log files tell me this though:
>>>
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result 1084811079 pid 7358 result 1 exit status
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 no actor
>>>
>>> Showing an exit status of '1' from dlm_stonith - the result should be 0 if
>>> fencing completed successfully.
>>
>> But 1084811080 is nebula3, and in its logs I see:
>>
>> Nov 25 10:56:33 nebula3 stonith-ng[6232]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: static-list
>> [...]
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: log_operation: Operation 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 : Source ID 20 was not found when attempting to remove it
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 : Source ID 21 was not found when attempting to remove it
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: remote_op_done: Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
>> Nov 25 10:56:34 nebula3 crmd[6236]: notice: tengine_stonith_notify: Peer nebula2 was terminated (reboot) by nebula1 for nebula1: OK (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
>> Nov 25 10:56:34 nebula3 crmd[6236]: notice: crm_update_peer_state: tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))
>>
>> Which means to me that stonith-ng managed to fence the node and notified
>> its success.
>>
>> How could the “returned: 0 (OK)” become “receive 1”?
>>
>> A logic issue somewhere between stonith-ng and dlm_controld?
>>
>
> It could be; I don't know enough about pacemaker to be able to comment on
> that, sorry. The 'no actor' message from dlm_controld worries me though.
This was fixed a few months ago:

+ David Vossel (9 months ago) 054fedf: Fix: stonith_api_time_helper now returns when the most recent fencing operation completed (origin/pr/444)
+ Andrew Beekhof (9 months ago) d9921e5: Fix: Fencing: Pass the correct options when looking up the history by node name
+ Andrew Beekhof (9 months ago) b0a8876: Log: Fencing: Send details of stonith_api_time() and stonith_api_kick() to syslog

It doesn't seem that Ubuntu has these patches.
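For context on why dlm_controld can log "result 1 exit status" while stonith-ng reports "0 (OK)": the dlm fencing agent decides its own exit status by asking stonith-ng, through the stonith_api_kick_helper() and stonith_api_time_helper() wrappers named in the commits above, whether a fencing operation for the failed node has completed since the node went down. If that history lookup never reflects the completed reboot, the agent exits non-zero and dlm_controld records that exit status as the fence result. Below is a minimal sketch of that decision logic; it assumes the wrapper signatures from that era's crm/stonith-ng.h and a made-up nodeid/nodedown-time command line, so treat it as illustrative rather than the actual dlm_stonith source.

/*
 * Sketch only (not the real dlm_stonith source).  Assumes the
 * stonith_api_kick_helper() / stonith_api_time_helper() wrappers from
 * crm/stonith-ng.h; argument handling, timeout value and structure are
 * illustrative assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <time.h>

#include <crm/stonith-ng.h>   /* assumed: declares the *_helper() wrappers */

int main(int argc, char **argv)
{
    uint32_t nodeid;
    time_t fail_time;   /* when dlm_controld saw the node go down */
    time_t fenced_at;
    int rc;

    if (argc < 3) {
        fprintf(stderr, "usage: %s <nodeid> <nodedown-time>\n", argv[0]);
        return 1;
    }
    nodeid = (uint32_t) strtoul(argv[1], NULL, 10);
    fail_time = (time_t) strtoul(argv[2], NULL, 10);

    /* If stonith-ng's history already shows a fencing operation for this
     * node that completed after it failed, there is nothing left to do. */
    fenced_at = stonith_api_time_helper(nodeid, 0);
    if (fenced_at >= fail_time) {
        return 0;
    }

    /* Otherwise ask stonith-ng to fence ("kick") the node. */
    rc = stonith_api_kick_helper(nodeid, 300, 1);
    if (rc != 0) {
        fprintf(stderr, "kick failed for nodeid %u: %d\n", (unsigned) nodeid, rc);
        return 1;   /* dlm_controld logs this as "result 1 exit status" */
    }

    /* Wait for the fencing history to confirm completion.  If the time
     * lookup never reflects the completed reboot, this never succeeds and
     * the agent ends up exiting non-zero even though stonith-ng reported
     * "0 (OK)". */
    for (;;) {
        fenced_at = stonith_api_time_helper(nodeid, 0);
        if (fenced_at >= fail_time) {
            return 0;
        }
        sleep(1);
    }
}

If the node was already fenced by pacemaker, as in the logs above, everything hinges on stonith_api_time_helper() returning the completion time of that most recent operation, which is what 054fedf and d9921e5 address.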