> On 27 Nov 2014, at 8:17 pm, Christine Caulfield <ccaul...@redhat.com> wrote:
> 
> On 25/11/14 19:55, Daniel Dehennin wrote:
>> Christine Caulfield <ccaul...@redhat.com> writes:
>> 
>>> It seems to me that fencing is failing for some reason, though I can't
>>> tell from the logs exactly why, so you might have to investigate your
>>> setup for IPMI to see just what is happening (I'm no IPMI expert,
>>> sorry).
>> 
>> Thanks for looking, but IPMI stonith does work: on every node I tested,
>> 
>>     stonith_admin --reboot <node>
>> 
>> succeeds.
>> 
>>> The log files tell me this, though:
>>> 
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
>>> 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result
>>> 1084811079 pid 7358 result 1 exit status
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status
>>> 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
>>> 1084811079 no actor
>>> 
>>> 
>>> Showing a status code '1' from dlm_stonith - the result should be 0 if
>>> fencing completed successfully.
>> 
>> But 1084811080 is nebula3 and in its logs I see:
>> 
>> Nov 25 10:56:33 nebula3 stonith-ng[6232]:   notice: 
>> can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: 
>> static-list
>> [...]
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: log_operation: Operation 
>> 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 
>> 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]:    error: crm_abort: 
>> crm_glib_handler: Forked child 7376 to record non-fatal assert at 
>> logging.c:63 : Source ID 20 was not found when attempting to remove it
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]:    error: crm_abort: 
>> crm_glib_handler: Forked child 7377 to record non-fatal assert at 
>> logging.c:63 : Source ID 21 was not found when attempting to remove it
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: remote_op_done: 
>> Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
>> Nov 25 10:56:34 nebula3 crmd[6236]:   notice: tengine_stonith_notify: Peer 
>> nebula2 was terminated (reboot) by nebula1 for nebula1: OK 
>> (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
>> Nov 25 10:56:34 nebula3 crmd[6236]:   notice: crm_update_peer_state: 
>> tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))
>> 
>> Which tells me that stonith-ng managed to fence the node and notified
>> its success.
>> 
>> How could the “returned: 0 (OK)” become “receive 1”?
>> 
>> A logic issue somewhere between stonith-ng and dlm_controld?
>> 
> 
> It could be; I don't know enough about pacemaker to comment on that,
> sorry. The 'no actor' message from dlm_controld worries me, though.

This was fixed a few months ago:

+ David Vossel (9 months ago) 054fedf: Fix: stonith_api_time_helper now returns 
when the most recent fencing operation completed  (origin/pr/444)
+ Andrew Beekhof (9 months ago) d9921e5: Fix: Fencing: Pass the correct options 
when looking up the history by node name 
+ Andrew Beekhof (9 months ago) b0a8876: Log: Fencing: Send details of 
stonith_api_time() and stonith_api_kick() to syslog 

It doesn't seem that Ubuntu has these patches.
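
For context, dlm_controld does not talk to stonith-ng directly: as the log
above shows, it forks the dlm_stonith helper (pid 7358) and records that
helper's exit status as the fence result. Roughly, the helper's job is to ask
pacemaker's fencing history whether the victim was fenced after it went down.
The sketch below is only an illustration of that logic, built on the
stonith_api_kick_helper()/stonith_api_time_helper() wrappers from pacemaker's
<crm/stonith-ng.h>; the nodeid and nodedown time are hard-coded from the log
above, whereas the real helper gets them from dlm_controld:

    /* Sketch of a dlm_stonith-style check (not the actual dlm code).
     * Exit 0 if pacemaker's fencing history shows the node was fenced
     * at or after the time it went down, exit 1 otherwise -- that exit
     * status is what dlm_controld logs as "fence result ... result 1".
     *
     * Needs the pacemaker development headers; the *_helper wrappers
     * load libstonithd at run time.
     */
    #include <stdint.h>
    #include <stdbool.h>
    #include <unistd.h>
    #include <time.h>
    #include <crm/stonith-ng.h>

    int main(void)
    {
        /* Values taken from the dlm_controld log above, for illustration */
        uint32_t nodeid    = 1084811079;  /* corosync nodeid of the victim */
        time_t   fail_time = 1416909392;  /* "nodedown time"               */
        int i;

        /* Already fenced since it failed?  Then there is nothing to do. */
        if (stonith_api_time_helper(nodeid, false) >= fail_time)
            return 0;

        /* Ask stonith-ng to fence it (timeout in seconds, off=false -> reboot) */
        if (stonith_api_kick_helper(nodeid, 300, false) < 0)
            return 1;

        /* Wait for the fencing history to report a completed operation
         * that is recent enough. */
        for (i = 0; i < 300; i++) {
            if (stonith_api_time_helper(nodeid, false) >= fail_time)
                return 0;
            sleep(1);
        }
        return 1;
    }

With the broken history lookup that d9921e5 and 054fedf fix,
stonith_api_time_helper() never reports the completed reboot, so a helper
like this exits 1 even though stonith-ng itself logged "returned: 0 (OK)" --
which is exactly the mismatch Daniel is seeing.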