Re: [Pacemaker] Pacemaker fencing and DLM/cLVM
Andrew Beekhof writes:

> This was fixed a few months ago:
>
>   + David Vossel (9 months ago) 054fedf: Fix: stonith_api_time_helper now
>     returns when the most recent fencing operation completed (origin/pr/444)
>   + Andrew Beekhof (9 months ago) d9921e5: Fix: Fencing: Pass the correct
>     options when looking up the history by node name
>   + Andrew Beekhof (9 months ago) b0a8876: Log: Fencing: Send details of
>     stonith_api_time() and stonith_api_kick() to syslog
>
> It doesn't seem Ubuntu has these patches

Thanks, I just opened a bug report[1].

Footnotes:
[1] https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1397278

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
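For anyone else affected who wants to add their exact versions to the Launchpad report, the installed packages can be checked with standard Ubuntu tooling (the dlm package name is an assumption, it may differ between releases):

    apt-cache policy pacemaker corosync
    dpkg -s pacemaker | grep -i '^version'
    dpkg -l | grep -i dlm    # find the package that actually ships dlm_controld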
Re: [Pacemaker] Pacemaker fencing and DLM/cLVM
> On 27 Nov 2014, at 8:17 pm, Christine Caulfield wrote:
>
> On 25/11/14 19:55, Daniel Dehennin wrote:
>> Christine Caulfield writes:
>>
>>> It seems to me that fencing is failing for some reason, though I can't
>>> tell from the logs exactly why, so you might have to investigate your
>>> setup for IPMI to see just what is happening (I'm no IPMI expert,
>>> sorry).
>>
>> Thanks for looking, but actually IPMI stonith is working, for all nodes
>> I tested:
>>
>>     stonith_admin --reboot
>>
>> And it works.
>>
>>> The log files tell me this though:
>>>
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result 1084811079 pid 7358 result 1 exit status
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 no actor
>>>
>>> Showing a status code '1' from dlm_stonith - the result should be 0 if
>>> fencing completed successfully.
>>
>> But 1084811080 is nebula3 and in its logs I see:
>>
>> Nov 25 10:56:33 nebula3 stonith-ng[6232]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: static-list
>> [...]
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: log_operation: Operation 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 : Source ID 20 was not found when attempting to remove it
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 : Source ID 21 was not found when attempting to remove it
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: remote_op_done: Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
>> Nov 25 10:56:34 nebula3 crmd[6236]: notice: tengine_stonith_notify: Peer nebula2 was terminated (reboot) by nebula1 for nebula1: OK (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
>> Nov 25 10:56:34 nebula3 crmd[6236]: notice: crm_update_peer_state: tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))
>>
>> Which means to me that stonith-ng managed to fence the node and notified
>> its success.
>>
>> How could the “returned: 0 (OK)” become “receive 1”?
>>
>> A logic issue somewhere between stonith-ng and dlm_controld?
>
> It could be, I don't know enough about pacemaker to be able to comment on
> that, sorry. The 'no actor' message from dlm_controld worries me though.

This was fixed a few months ago:

  + David Vossel (9 months ago) 054fedf: Fix: stonith_api_time_helper now
    returns when the most recent fencing operation completed (origin/pr/444)
  + Andrew Beekhof (9 months ago) d9921e5: Fix: Fencing: Pass the correct
    options when looking up the history by node name
  + Andrew Beekhof (9 months ago) b0a8876: Log: Fencing: Send details of
    stonith_api_time() and stonith_api_kick() to syslog

It doesn't seem Ubuntu has these patches
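For anyone wanting to check whether a given build already carries these fixes, the commit hashes listed above can be looked up in the upstream tree. A rough sketch (the GitHub URL is only my assumption of where the current mirror lives):

    git clone https://github.com/ClusterLabs/pacemaker.git
    cd pacemaker
    git log --oneline --all | grep -E '054fedf|d9921e5|b0a8876'
    git describe --contains 054fedf   # first release tag containing the stonith_api_time_helper fix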
Re: [Pacemaker] Pacemaker fencing and DLM/cLVM
On 25/11/14 19:55, Daniel Dehennin wrote:
> Christine Caulfield writes:
>
>> It seems to me that fencing is failing for some reason, though I can't
>> tell from the logs exactly why, so you might have to investigate your
>> setup for IPMI to see just what is happening (I'm no IPMI expert,
>> sorry).
>
> Thanks for looking, but actually IPMI stonith is working, for all nodes
> I tested:
>
>     stonith_admin --reboot
>
> And it works.
>
>> The log files tell me this though:
>>
>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result 1084811079 pid 7358 result 1 exit status
>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 no actor
>>
>> Showing a status code '1' from dlm_stonith - the result should be 0 if
>> fencing completed successfully.
>
> But 1084811080 is nebula3 and in its logs I see:
>
> Nov 25 10:56:33 nebula3 stonith-ng[6232]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: static-list
> [...]
> Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: log_operation: Operation 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
> Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 : Source ID 20 was not found when attempting to remove it
> Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 : Source ID 21 was not found when attempting to remove it
> Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: remote_op_done: Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
> Nov 25 10:56:34 nebula3 crmd[6236]: notice: tengine_stonith_notify: Peer nebula2 was terminated (reboot) by nebula1 for nebula1: OK (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
> Nov 25 10:56:34 nebula3 crmd[6236]: notice: crm_update_peer_state: tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))
>
> Which means to me that stonith-ng managed to fence the node and notified
> its success.
>
> How could the “returned: 0 (OK)” become “receive 1”?
>
> A logic issue somewhere between stonith-ng and dlm_controld?

It could be, I don't know enough about pacemaker to be able to comment on
that, sorry. The 'no actor' message from dlm_controld worries me though.

Chrissie
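When stonith-ng and dlm_controld disagree like this, it can also help to look at what dlm_controld itself has recorded for the affected lockspaces. A couple of read-only commands (output formats vary a little between dlm versions):

    dlm_tool ls                     # a lockspace still waiting on a fence shows it here
    dlm_tool dump | grep -i fence   # dlm_controld's debug buffer, including the fence requests/results above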
Re: [Pacemaker] Pacemaker fencing and DLM/cLVM
Christine Caulfield writes:

> It seems to me that fencing is failing for some reason, though I can't
> tell from the logs exactly why, so you might have to investigate your
> setup for IPMI to see just what is happening (I'm no IPMI expert,
> sorry).

Thanks for looking, but actually IPMI stonith is working, for all nodes
I tested:

    stonith_admin --reboot

And it works.

> The log files tell me this though:
>
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result 1084811079 pid 7358 result 1 exit status
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 no actor
>
> Showing a status code '1' from dlm_stonith - the result should be 0 if
> fencing completed successfully.

But 1084811080 is nebula3 and in its logs I see:

Nov 25 10:56:33 nebula3 stonith-ng[6232]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: static-list
[...]
Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: log_operation: Operation 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 : Source ID 20 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 : Source ID 21 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: remote_op_done: Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
Nov 25 10:56:34 nebula3 crmd[6236]: notice: tengine_stonith_notify: Peer nebula2 was terminated (reboot) by nebula1 for nebula1: OK (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
Nov 25 10:56:34 nebula3 crmd[6236]: notice: crm_update_peer_state: tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))

Which means to me that stonith-ng managed to fence the node and notified
its success.

How could the “returned: 0 (OK)” become “receive 1”?

A logic issue somewhere between stonith-ng and dlm_controld?

Thanks.

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
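One way to compare the two views after the event is to ask stonith-ng for its own record of the operation and put it next to dlm_controld's messages. A sketch, with the node name taken from the logs above (the syslog path is an assumption, adjust to your logging setup):

    stonith_admin --history nebula2
    grep 'dlm_controld' /var/log/syslog | grep 'fence'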
Re: [Pacemaker] Pacemaker fencing and DLM/cLVM
On 25/11/14 10:45, Daniel Dehennin wrote:
> Daniel Dehennin writes:
>
>> I'm using Ubuntu 14.04:
>>
>> - corosync 2.3.3-1ubuntu1
>> - pacemaker 1.1.10+git20130802-1ubuntu2.1
>>
>> I thought everything was integrated in such a configuration.
>
> Here is some more information:
>
> - the pacemaker configuration
> - the log of the DC nebula1 with marks for each step
> - the log of nebula2 with marks for each step
> - the log of nebula3 with marks for each step
> - the output of “dlm_tool ls” and “dlm_tool status” before/during/after nebula2 fencing
>
> The steps are:
>
> 1. All nodes up, cluster down
> 2. Start corosync on all nodes
> 3. Start pacemaker on all nodes
> 4. Start resource ONE-Storage-Clone (dlm, cLVM, VG, GFS2)
> 5. Crash nebula2
> 6. Start corosync on nebula2 after reboot
> 7. Start pacemaker on nebula2 after reboot
>
> Does someone understand why DLM did not get the ACK of the fencing
> automatically from stonith?
>
> Why does ONE-Storage-Clone not manage to start on nebula2 after fencing?

It seems to me that fencing is failing for some reason, though I can't
tell from the logs exactly why, so you might have to investigate your
setup for IPMI to see just what is happening (I'm no IPMI expert,
sorry).

The log files tell me this though:

Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result 1084811079 pid 7358 result 1 exit status
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 no actor

Showing a status code '1' from dlm_stonith - the result should be 0 if
fencing completed successfully.

Chrissie
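Since the open question is whether the IPMI side really completes, it may be worth exercising the BMCs outside the cluster stack as well. A sketch with placeholder addresses and credentials; flag names can differ between fence-agents versions, so treat this as an outline rather than exact syntax:

    ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status
    fence_ipmilan -a <bmc-address> -l <user> -p <password> -P -o status   # -P selects lanplus on older fence-agents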
Re: [Pacemaker] Pacemaker fencing and DLM/cLVM
Michael Schwartzkopff writes:

> Yes. You have to tell all the underlying infrastructure to use the fencing of
> pacemaker. I assume that you are working on a RH clone.
>
> See:
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02s03.html

Sorry, this is my fault. I'm using Ubuntu 14.04:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.1

I thought everything was integrated in such a configuration.

Regards.

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
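For reference, the piece that ties DLM to pacemaker's fencing on a corosync 2 stack is dlm_controld itself, normally run through the ocf:pacemaker:controld agent; when a node fails it calls out to dlm_stonith, which in turn asks stonith-ng (that is the “fence_all dlm_stonith” visible in the logs later in this thread). A minimal crmsh sketch of such a resource, with placeholder names and timeouts rather than this cluster's actual values:

    crm configure primitive p-dlm ocf:pacemaker:controld \
          op start timeout=90s interval=0 \
          op stop timeout=100s interval=0 \
          op monitor interval=60s timeout=60s
    crm configure clone cl-dlm p-dlm meta interleave=true
    crm configure property stonith-enabled=true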
Re: [Pacemaker] Pacemaker fencing and DLM/cLVM
On Monday, 24 November 2014, 15:14:26, Daniel Dehennin wrote:
> Hello,
>
> In my pacemaker/corosync cluster it looks like I have some issues with
> fencing ACK on DLM/cLVM.
>
> When a node is fenced, dlm/cLVM are not aware of the fencing results and
> LVM commands hang unless I run “dlm_tool fence_ack ”
>
> Here are some logs around the fencing of nebula1:
>
> Nov 24 09:51:06 nebula3 crmd[6043]: warning: update_failcount: Updating failcount for clvm on nebula1 after failed stop: rc=1 (update=INFINITY, time=1416819066)
> Nov 24 09:51:06 nebula3 pengine[6042]: warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
> Nov 24 09:51:06 nebula3 pengine[6042]: warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
> Nov 24 09:51:06 nebula3 pengine[6042]: warning: stage6: Scheduling Node nebula1 for STONITH
> Nov 24 09:51:06 nebula3 pengine[6042]: notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
> Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Move    Stonith-nebula3-IPMILAN   (Started nebula1 -> nebula2)
> Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop    dlm:0   (nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop    clvm:0  (nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]: warning: process_pe_message: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-warn-1.bz2
> Nov 24 09:51:06 nebula3 pengine[6042]: warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
> Nov 24 09:51:06 nebula3 pengine[6042]: warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
> Nov 24 09:51:06 nebula3 pengine[6042]: warning: stage6: Scheduling Node nebula1 for STONITH
> Nov 24 09:51:06 nebula3 pengine[6042]: notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
> Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Move    Stonith-nebula3-IPMILAN   (Started nebula1 -> nebula2)
> Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop    dlm:0   (nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop    clvm:0  (nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]: warning: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Nov 24 09:51:06 nebula3 crmd[6043]: notice: te_fence_node: Executing reboot fencing operation (79) on nebula1 (timeout=3)
> Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: handle_request: Client crmd.6043.5ec58277 wants to fence (reboot) 'nebula1' with device '(any)'
> Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula1: 50c93bed-e66f-48a5-bd2f-100a9e7ca7a1 (0)
> Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-nebula1-IPMILAN can fence nebula1: static-list
> Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can not fence nebula1: static-list
> Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-ONE-Frontend can not fence nebula1: static-list
> Nov 24 09:51:09 nebula3 corosync[5987]: [TOTEM ] A processor failed, forming new configuration.
> Nov 24 09:51:13 nebula3 corosync[5987]: [TOTEM ] A new membership (192.168.231.71:81200) was formed. Members left: 1084811078
> Nov 24 09:51:13 nebula3 lvm[6311]: confchg callback. 0 joined, 1 left, 2 members
> Nov 24 09:51:13 nebula3 corosync[5987]: [QUORUM] Members[2]: 1084811079 1084811080
> Nov 24 09:51:13 nebula3 corosync[5987]: [MAIN  ] Completed service synchronization, ready to provide service.
> Nov 24 09:51:13 nebula3 pacemakerd[6036]: notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
> Nov 24 09:51:13 nebula3 crmd[6043]: notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
> Nov 24 09:51:13 nebula3 kernel: [  510.140107] dlm: closing connection to node 1084811078
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811079 walltime 1416819073 local 509
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 pid 7142 nodedown time 1416819073 fence_all dlm_stonith
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence result 1084811078 pid 7142 result 1 exit status
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811080 walltime 1416819073 local 509
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 no actor
> Nov 24 09:51:13 nebula3 stonith-ng[6039]: notice: remote_op_done: Operation reboot of nebula1 by nebula2 for crmd.6043@nebula3.50c93bed: OK
> Nov 24 09:51:13 n
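A note on the manual workaround mentioned at the top of the quoted message: when dlm_controld keeps waiting for a fence result that pacemaker has in fact already delivered, the lockspaces can be unblocked by acknowledging the fence by node id. A sketch using the node id that appears in the log lines above, run on a surviving node:

    dlm_tool ls                       # check which lockspace is still waiting on fencing
    dlm_tool fence_ack 1084811078     # node id of the fenced node, taken from the logs above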