On 07/06/2017 04:48 PM, Ken Gaillot wrote:
> On 07/06/2017 09:26 AM, Klaus Wenninger wrote:
>> On 07/06/2017 04:20 PM, Cesar Hernandez wrote:
>>>> If node2 is getting the notification of its own fencing, it wasn't
>>>> successfully fenced. Successful fencing would render it incapacitated
>>>> (powered down, or at least cut off from the network and any shared
>>>> resources).
>>> Maybe I don't understand you, or maybe you don't understand me... ;)
>>> This is the syslog of the machine, where you can see that the machine
>>> has rebooted successfully, and as I said, it has rebooted successfully
>>> every time:
>> It is not just a question of whether it was rebooted at all.
>> Your fence-agent mustn't return positively until this has definitely
>> happened and the node is down.
>> Otherwise you will see that message, and the node will try to
>> somehow cope with the fact that the rest of the cluster obviously
>> thinks it is down already.
> But the "allegedly fenced" message comes in after the node has rebooted,
> so it would seem that everything was in the proper sequence.
True for this message, but we can't see whether they exchanged anything
in between the fencing-ok and the actual reboot.

> It looks like a bug when the fenced node rejoins quickly enough that it
> is a member again before its fencing confirmation has been sent. I know
> there have been plenty of clusters with nodes that quickly reboot and
> slow fencing devices, so that seems unlikely, but I don't see another
> explanation.

Anyway - maybe putting a delay at the end of the fence-agent might give
some insight (there is a rough sketch of what I mean at the end of this
mail). In this case we have a combination of a quick reboot & a quick
fencing device ... possibly leading to nasty races ...

>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Initializing cgroup subsys cpu
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Linux version 3.16.0-4-amd64 (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.39-1 (2016-12-30)
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd console=ttyS0 console=hvc0
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] NX (Execute Disable) protection: active
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] SMBIOS 2.4 present.
>>>
>>> ...
>>>
>>> Jul 5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
>>>
>>> ...
>>>
>>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync Cluster Engine ('UNKNOWN'): started and ready to provide service.
>>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync built-in features: nss
>>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
>>>
>>> ...
>>> Jul 5 10:41:57 node2 crmd[608]: notice: Defaulting to uname -n for the local classic openais (with plugin) node name
>>> Jul 5 10:41:57 node2 crmd[608]: notice: Membership 4308: quorum acquired
>>> Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node node2[1108352940] - state is now member (was (null))
>>> Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node node11[794540] - state is now member (was (null))
>>> Jul 5 10:41:57 node2 crmd[608]: notice: The local CRM is operational
>>> Jul 5 10:41:57 node2 crmd[608]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
>>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: Watching for stonith topology changes
>>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: Membership 4308: quorum acquired
>>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: plugin_handle_membership: Node node11[794540] - state is now member (was (null))
>>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: On loss of CCM Quorum: Ignore
>>> Jul 5 10:41:58 node2 stonith-ng[604]: notice: Added 'st-fence_propio:0' to the device list (1 active devices)
>>> Jul 5 10:41:59 node2 stonith-ng[604]: notice: Operation reboot of node2 by node11 for crmd.2141@node11.61c3e613: OK
>>> Jul 5 10:41:59 node2 crmd[608]: crit: We were allegedly just fenced by node11 for node11!
>>> Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x228d970, async-conn=0x228d970) left
>>> Jul 5 10:41:59 node2 pacemakerd[597]: warning: The crmd process (608) can no longer be respawned, shutting the cluster down.
>>> Jul 5 10:41:59 node2 pacemakerd[597]: notice: Shutting down Pacemaker
>>> Jul 5 10:41:59 node2 pacemakerd[597]: notice: Stopping pengine: Sent -15 to process 607
>>> Jul 5 10:41:59 node2 pengine[607]: notice: Invoking handler for signal 15: Terminated
>>> Jul 5 10:41:59 node2 pacemakerd[597]: notice: Stopping attrd: Sent -15 to process 606
>>> Jul 5 10:41:59 node2 attrd[606]: notice: Invoking handler for signal 15: Terminated
>>> Jul 5 10:41:59 node2 attrd[606]: notice: Exiting...
>>> Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x2280ef0, async-conn=0x2280ef0) left
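
Coming back to the delay idea above, here is a rough sketch of the kind
of behaviour I have in mind for the "off"/"reboot" part of the agent:
only report OK once the node is verifiably down, and then optionally sit
on the result a little longer. This is only illustrative Python - the
power_off()/is_powered_off() helpers and the extra_wait value are made-up
placeholders for whatever st-fence_propio really calls, not any existing
API:

    #!/usr/bin/env python
    # Rough sketch only, not a drop-in fence agent: the two placeholder
    # helpers below stand in for whatever the real agent calls
    # (cloud API, PDU, IPMI, ...).

    import sys
    import time

    _fake_off_at = None  # used only by the placeholder helpers


    def power_off(node):
        """Placeholder: issue the real power-off/reboot request here."""
        global _fake_off_at
        _fake_off_at = time.time() + 5  # pretend the node needs 5s to go down


    def is_powered_off(node):
        """Placeholder: query the real power state of the node here."""
        return _fake_off_at is not None and time.time() >= _fake_off_at


    def fence_off(node, timeout=120, poll=2, extra_wait=10):
        """Return True only once the node is confirmed to be down."""
        power_off(node)

        deadline = time.time() + timeout
        while time.time() < deadline:
            if is_powered_off(node):
                # The extra delay suggested above: hold back the OK a bit
                # longer, so we can see whether the races go away when the
                # node cannot possibly boot and rejoin before the rest of
                # the cluster has processed the fencing result.
                time.sleep(extra_wait)
                return True
            time.sleep(poll)

        return False  # don't report a success we are not sure about


    if __name__ == "__main__":
        sys.exit(0 if fence_off("node2") else 1)

If the "allegedly fenced" messages disappear with a generous extra_wait,
that would point towards the confirmation/rejoin ordering Ken describes
above.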