Re: [Pacemaker] Pacemaker fencing and DLM/cLVM

2014-11-28 Thread Daniel Dehennin
Andrew Beekhof  writes:

> This was fixed a few months ago:
>
> + David Vossel (9 months ago) 054fedf: Fix: stonith_api_time_helper now 
> returns when the most recent fencing operation completed  (origin/pr/444)
> + Andrew Beekhof (9 months ago) d9921e5: Fix: Fencing: Pass the correct 
> options when looking up the history by node name 
> + Andrew Beekhof (9 months ago) b0a8876: Log: Fencing: Send details of 
> stonith_api_time() and stonith_api_kick() to syslog 
>
> It doesn't seem Ubuntu has these patches

Thanks, I just opened a bug report[1].

Footnotes: 
[1]  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1397278

-- 
Daniel Dehennin
Fetch my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




Re: [Pacemaker] Pacemaker fencing and DLM/cLVM

2014-11-27 Thread Andrew Beekhof

> On 27 Nov 2014, at 8:17 pm, Christine Caulfield  wrote:
> 
> On 25/11/14 19:55, Daniel Dehennin wrote:
>> Christine Caulfield  writes:
>> 
>>> It seems to me that fencing is failing for some reason, though I can't
>>> tell from the logs exactly why, so you might have to investigate your
>>> setup for IPMI to see just what is happening (I'm no IPMI expert,
>>> sorry).
>> 
>> Thanks for looking, but actually IPMI stonith is working, for all nodes
>> I tested:
>> 
>> stonith_admin --reboot 
>> 
>> And it works.
>> 
>>> The log files tell me this though:
>>> 
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
>>> 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result
>>> 1084811079 pid 7358 result 1 exit status
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status
>>> 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
>>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
>>> 1084811079 no actor
>>> 
>>> 
>>> Showing a status code '1' from dlm_stonith - the result should be 0 if
>>> fencing completed successfully.
>> 
>> But 1084811080 is nebula3 and in its logs I see:
>> 
>> Nov 25 10:56:33 nebula3 stonith-ng[6232]:   notice: 
>> can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: 
>> static-list
>> [...]
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: log_operation: Operation 
>> 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 
>> 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]:error: crm_abort: 
>> crm_glib_handler: Forked child 7376 to record non-fatal assert at 
>> logging.c:63 : Source ID 20 was not found when attempting to remove it
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]:error: crm_abort: 
>> crm_glib_handler: Forked child 7377 to record non-fatal assert at 
>> logging.c:63 : Source ID 21 was not found when attempting to remove it
>> Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: remote_op_done: 
>> Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
>> Nov 25 10:56:34 nebula3 crmd[6236]:   notice: tengine_stonith_notify: Peer 
>> nebula2 was terminated (reboot) by nebula1 for nebula1: OK 
>> (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
>> Nov 25 10:56:34 nebula3 crmd[6236]:   notice: crm_update_peer_state: 
>> tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))
>> 
>> Which means to me that stonith-ng managed to fence the node and notified
>> its success.
>> 
>> How could the “returned: 0 (OK)” become “receive 1”?
>> 
>> A logic issue somewhere between stonith-ng and dlm_controld?
>> 
> 
> it could be, I don't know enough about pacemaker to be able to comment on 
> that, sorry. The 'no actors' message from dlm_controld worries me though.

This was fixed a few months ago:

+ David Vossel (9 months ago) 054fedf: Fix: stonith_api_time_helper now returns 
when the most recent fencing operation completed  (origin/pr/444)
+ Andrew Beekhof (9 months ago) d9921e5: Fix: Fencing: Pass the correct options 
when looking up the history by node name 
+ Andrew Beekhof (9 months ago) b0a8876: Log: Fencing: Send details of 
stonith_api_time() and stonith_api_kick() to syslog 

It doesn't seem Ubuntu has these patches
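
For context, the first of those commits describes what the time helper is supposed to report. Below is a minimal sketch of that call, assuming the kick/time helper API declared in pacemaker's <crm/stonith-ng.h> and linked from libstonithd; "nebula2" is only a placeholder node name, not a value taken from this thread.

/* Minimal sketch, not shipped code: ask stonith-ng when the most recent
 * fencing operation against a node completed.  Assumes pacemaker's
 * development headers and libstonithd; 0 means "no corosync node id,
 * match the history by name". */
#include <stdio.h>
#include <stdbool.h>
#include <time.h>
#include <crm/stonith-ng.h>

int main(void)
{
    time_t last = stonith_api_time(0, "nebula2", false);

    if (last > 0) {
        printf("most recent fencing of nebula2 completed at %s", ctime(&last));
    } else {
        printf("no completed fencing operation recorded for nebula2\n");
    }
    return 0;
}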


Re: [Pacemaker] Pacemaker fencing and DLM/cLVM

2014-11-27 Thread Christine Caulfield

On 25/11/14 19:55, Daniel Dehennin wrote:

Christine Caulfield  writes:


It seems to me that fencing is failing for some reason, though I can't
tell from the logs exactly why, so you might have to investigate your
setup for IPMI to see just what is happening (I'm no IPMI expert,
sorry).


Thanks for looking, but actually IPMI stonith is working, for all nodes
I tested:

 stonith_admin --reboot 

And it works.


The log files tell me this though:

Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result
1084811079 pid 7358 result 1 exit status
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status
1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
1084811079 no actor


Showing a status code '1' from dlm_stonith - the result should be 0 if
fencing completed successfully.


But 1084811080 is nebula3 and in its logs I see:

Nov 25 10:56:33 nebula3 stonith-ng[6232]:   notice: can_fence_host_with_device: 
Stonith-nebula2-IPMILAN can fence nebula2: static-list
[...]
Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: log_operation: Operation 
'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 
'Stonith-nebula2-IPMILAN' returned: 0 (OK)
Nov 25 10:56:34 nebula3 stonith-ng[6232]:error: crm_abort: 
crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 
: Source ID 20 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]:error: crm_abort: 
crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 
: Source ID 21 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: remote_op_done: Operation 
reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
Nov 25 10:56:34 nebula3 crmd[6236]:   notice: tengine_stonith_notify: Peer 
nebula2 was terminated (reboot) by nebula1 for nebula1: OK 
(ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
Nov 25 10:56:34 nebula3 crmd[6236]:   notice: crm_update_peer_state: 
tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))

Which means to me that stonith-ng managed to fence the node and notified
its success.

How could the “returned: 0 (OK)” become “receive 1”?

A logic issue somewhere between stonith-ng and dlm_controld?



it could be, I don't know enough about pacemaker to be able to comment 
on that, sorry. The 'no actors' message from dlm_controld worries me 
though.



Chrissie



Re: [Pacemaker] Pacemaker fencing and DLM/cLVM

2014-11-25 Thread Daniel Dehennin
Christine Caulfield  writes:

> It seems to me that fencing is failing for some reason, though I can't
> tell from the logs exactly why, so you might have to investigate your
> setup for IPMI to see just what is happening (I'm no IPMI expert,
> sorry).

Thanks for looking, but actually IPMI stonith is working, for all nodes
I tested:

stonith_admin --reboot 

And it works.

> The log files tell me this though:
>
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
> 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result
> 1084811079 pid 7358 result 1 exit status
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status
> 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
> 1084811079 no actor
>
>
> Showing a status code '1' from dlm_stonith - the result should be 0 if
> fencing completed successfully.

But 1084811080 is nebula3 and in its logs I see:

Nov 25 10:56:33 nebula3 stonith-ng[6232]:   notice: can_fence_host_with_device: 
Stonith-nebula2-IPMILAN can fence nebula2: static-list
[...]
Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: log_operation: Operation 
'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 
'Stonith-nebula2-IPMILAN' returned: 0 (OK)
Nov 25 10:56:34 nebula3 stonith-ng[6232]:error: crm_abort: 
crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 
: Source ID 20 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]:error: crm_abort: 
crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 
: Source ID 21 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: remote_op_done: Operation 
reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
Nov 25 10:56:34 nebula3 crmd[6236]:   notice: tengine_stonith_notify: Peer 
nebula2 was terminated (reboot) by nebula1 for nebula1: OK 
(ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
Nov 25 10:56:34 nebula3 crmd[6236]:   notice: crm_update_peer_state: 
tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))

Which means to me that stonith-ng managed to fence the node and notified
its success.

How could the “returned: 0 (OK)” become “receive 1”?

A logic issue somewhere between stonith-ng and dlm_controld?
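
One hedged way to read this: dlm_controld never sees stonith-ng's "0 (OK)" directly, it only records the exit status of the fence_all helper (dlm_stonith). A helper built along the lines sketched below, using the kick/time calls from pacemaker's <crm/stonith-ng.h>, exits 1 whenever its own fencing-history lookup fails, no matter what stonith-ng logged, which is where the patches Andrew lists elsewhere in the thread come in. This is an illustration only, not the real agent; the node id and timestamp are placeholders.

/* Sketch of a dlm_stonith-style helper, for illustration only (not the real
 * agent).  It exits 0 only if it can confirm via stonith_api_time() that the
 * node was fenced after it went down; if that history lookup misbehaves, the
 * helper exits 1 and dlm_controld logs "result 1 exit status" even though
 * stonith-ng itself reported the reboot as OK. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <crm/stonith-ng.h>

int main(void)
{
    uint32_t nodeid   = 1084811078;  /* placeholder corosync node id */
    time_t   nodedown = 1416819073;  /* placeholder nodedown time from dlm_controld */

    /* Already fenced since it went down?  Nothing to do. */
    if (stonith_api_time(nodeid, NULL, false) > nodedown) {
        return 0;
    }

    /* Ask stonith-ng to reboot the node (false = reboot rather than "off")
     * and allow it up to 120 seconds. */
    if (stonith_api_kick(nodeid, NULL, 120, false) != 0) {
        return 1;
    }

    /* Re-check the history; this exit status is all dlm_controld ever sees. */
    return (stonith_api_time(nodeid, NULL, false) > nodedown) ? 0 : 1;
}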

Thanks.
-- 
Daniel Dehennin
Fetch my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




Re: [Pacemaker] Pacemaker fencing and DLM/cLVM

2014-11-25 Thread Christine Caulfield

On 25/11/14 10:45, Daniel Dehennin wrote:

Daniel Dehennin  writes:


I'm using Ubuntu 14.04:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.1

I thought everything was integrated in such a configuration.


Here is some more information:

- the pacemaker configuration
- the log of the DC nebula1 with marks for each step
- the log of the nebula2 with marks for each step
- the log of the nebula3 with marks for each step
- the output of “dlm_tool ls” and “dlm_tool status” before/during/after
   nebula2 fencing

The steps are:

1. All nodes up, cluster down
2. Start corosync on all nodes
3. Start pacemaker on all nodes
4. Start resource ONE-Storage-Clone (dlm, cLVM, VG, GFS2)
5. Crash nebula2
6. Start corosync on nebula2 after reboot
7. Start pacemaker on nebula2 after reboot

Does anyone understand why DLM did not get the ACK of the fencing
automatically from stonith?

And why does ONE-Storage-Clone not manage to start on nebula2 after fencing?



It seems to me that fencing is failing for some reason, though I can't 
tell from the logs exactly why, so you might have to investigate your 
setup for IPMI to see just what is happening (I'm no IPMI expert, sorry).


The log files tell me this though:

Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 
1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result 1084811079 
pid 7358 result 1 exit status
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status 1084811079 
receive 1 from 1084811080 walltime 1416909392 local 1035
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 
1084811079 no actor



Showing a status code '1' from dlm_stonith - the result should be 0 if 
fencing completed successfully.
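
For what it's worth, that status code is just whatever the helper process hands back when it exits. A generic sketch of how a daemon in dlm_controld's position collects it follows; this is not dlm_controld's actual code, and in reality the helper is also given the victim's node id and nodedown time, omitted here.

/* Generic fork/exec/waitpid illustration: run a fencing helper and record its
 * exit status.  0 means the helper claimed success; anything else becomes the
 * "result 1 exit status" seen in the dlm_controld log lines above. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return 1;
    }
    if (pid == 0) {
        /* Child: exec the helper (arguments/environment omitted in this sketch). */
        execlp("dlm_stonith", "dlm_stonith", (char *) NULL);
        _exit(127);  /* exec failed */
    }

    int status = 0;
    waitpid(pid, &status, 0);
    int result = WIFEXITED(status) ? WEXITSTATUS(status) : 1;
    printf("fence result: %d\n", result);  /* what gets logged as the result */
    return 0;
}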



Chrissie



Re: [Pacemaker] Pacemaker fencing and DLM/cLVM

2014-11-24 Thread Daniel Dehennin
Michael Schwartzkopff  writes:

> Yes. You have to tell all the underlying infrastructure to use the fencing of 
> pacemaker. I assume that you are working on a RH clone.
>
> See: 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02s03.html

Sorry, this is my fault.

I'm using Ubuntu 14.04:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.1

I thought everything was integrated in such a configuration.

Regards.

-- 
Daniel Dehennin
Fetch my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




Re: [Pacemaker] Pacemaker fencing and DLM/cLVM

2014-11-24 Thread Michael Schwartzkopff
On Monday, 24 November 2014, at 15:14:26, Daniel Dehennin wrote:
> Hello,
> 
> In my pacemaker/corosync cluster it looks like I have some issues with
> fencing ACK on DLM/cLVM.
> 
> When a node is fenced, dlm/cLVM are not aware of the fencing results and
> LVM commands hang unless I run “dlm_tool fence_ack ”
> 
> Here are some logs around the fencing of nebula1:
> 
> Nov 24 09:51:06 nebula3 crmd[6043]:  warning: update_failcount: Updating failcount for clvm on nebula1 after failed stop: rc=1 (update=INFINITY, time=1416819066)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: stage6: Scheduling Node nebula1 for STONITH
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Move    Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    dlm:0#011(nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    clvm:0#011(nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: process_pe_message: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-warn-1.bz2
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: stage6: Scheduling Node nebula1 for STONITH
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Move    Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    dlm:0#011(nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    clvm:0#011(nebula1)
> Nov 24 09:51:06 nebula3 pengine[6042]:  warning: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Nov 24 09:51:06 nebula3 crmd[6043]:   notice: te_fence_node: Executing reboot fencing operation (79) on nebula1 (timeout=3)
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: handle_request: Client crmd.6043.5ec58277 wants to fence (reboot) 'nebula1' with device '(any)'
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula1: 50c93bed-e66f-48a5-bd2f-100a9e7ca7a1 (0)
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-nebula1-IPMILAN can fence nebula1: static-list
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can not fence nebula1: static-list
> Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: Stonith-ONE-Frontend can not fence nebula1: static-list
> Nov 24 09:51:09 nebula3 corosync[5987]:   [TOTEM ] A processor failed, forming new configuration.
> Nov 24 09:51:13 nebula3 corosync[5987]:   [TOTEM ] A new membership (192.168.231.71:81200) was formed. Members left: 1084811078
> Nov 24 09:51:13 nebula3 lvm[6311]: confchg callback. 0 joined, 1 left, 2 members
> Nov 24 09:51:13 nebula3 corosync[5987]:   [QUORUM] Members[2]: 1084811079 1084811080
> Nov 24 09:51:13 nebula3 corosync[5987]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Nov 24 09:51:13 nebula3 pacemakerd[6036]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
> Nov 24 09:51:13 nebula3 crmd[6043]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
> Nov 24 09:51:13 nebula3 kernel: [  510.140107] dlm: closing connection to node 1084811078
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811079 walltime 1416819073 local 509
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 pid 7142 nodedown time 1416819073 fence_all dlm_stonith
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence result 1084811078 pid 7142 result 1 exit status
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811080 walltime 1416819073 local 509
> Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 no actor
> Nov 24 09:51:13 nebula3 stonith-ng[6039]:   notice: remote_op_done: Operation reboot of nebula1 by nebula2 for crmd.6043@nebula3.50c93bed: OK
> Nov 24 09:51:13 n