On 07/06/2017 04:48 PM, Ken Gaillot wrote:
> On 07/06/2017 09:26 AM, Klaus Wenninger wrote:
>> On 07/06/2017 04:20 PM, Cesar Hernandez wrote:
>>>> If node2 is getting the notification of its own fencing, it wasn't
>>>> successfully fenced. Successful fencing would render it incapacitated
>>>> (powered down, or at least cut off from the network and any shared
>>>> resources).
>>> Maybe I don't understand you, or maybe you don't understand me... ;)
>>> This is the syslog of the machine, where you can see that the machine
>>> has rebooted successfully, and as I said, it has rebooted successfully
>>> every time:
>> It is not just a question of whether it was rebooted at all.
>> Your fence-agent mustn't return positively until this has definitely
>> happened and the node is down.
>> Otherwise you will see that message, and the node will try to
>> somehow cope with the fact that the rest of the cluster obviously
>> thinks it is down already.
> But the "allegedly fenced" message comes in after the node has rebooted,
> so it would seem that everything was in the proper sequence.
True for this message, but we can't see whether they exchanged anything
in between the fencing-ok and the actual reboot.

> It looks like a bug when the fenced node rejoins quickly enough that it
> is a member again before its fencing confirmation has been sent. I know
> there have been plenty of clusters with nodes that quickly reboot and
> slow fencing devices, so that seems unlikely, but I don't see another
> explanation.

Anyway - maybe putting a delay at the end of the fence-agent might give
some insight (there is a rough sketch of what I mean at the end of this
mail). In this case we have a combination of a quick reboot & a quick
fencing device ... possibly leading to nasty races ...

>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Initializing cgroup subsys cpu
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Linux version 3.16.0-4-amd64 (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.39-1 (2016-12-30)
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd console=ttyS0 console=hvc0
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] NX (Execute Disable) protection: active
>>> Jul 5 10:41:54 node2 kernel: [ 0.000000] SMBIOS 2.4 present.
>>>
>>> ...
>>>
>>> Jul 5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
>>>
>>> ...
>>>
>>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync Cluster Engine ('UNKNOWN'): started and ready to provide service.
>>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync built-in features: nss
>>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
>>>
>>> ...
>>> Jul 5 10:41:57 node2 crmd[608]: notice: Defaulting to uname -n for the local classic openais (with plugin) node name
>>> Jul 5 10:41:57 node2 crmd[608]: notice: Membership 4308: quorum acquired
>>> Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node node2[1108352940] - state is now member (was (null))
>>> Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node node11[794540] - state is now member (was (null))
>>> Jul 5 10:41:57 node2 crmd[608]: notice: The local CRM is operational
>>> Jul 5 10:41:57 node2 crmd[608]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
>>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: Watching for stonith topology changes
>>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: Membership 4308: quorum acquired
>>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: plugin_handle_membership: Node node11[794540] - state is now member (was (null))
>>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: On loss of CCM Quorum: Ignore
>>> Jul 5 10:41:58 node2 stonith-ng[604]: notice: Added 'st-fence_propio:0' to the device list (1 active devices)
>>> Jul 5 10:41:59 node2 stonith-ng[604]: notice: Operation reboot of node2 by node11 for crmd.2141@node11.61c3e613: OK
>>> Jul 5 10:41:59 node2 crmd[608]: crit: We were allegedly just fenced by node11 for node11!
>>> Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x228d970, async-conn=0x228d970) left
>>> Jul 5 10:41:59 node2 pacemakerd[597]: warning: The crmd process (608) can no longer be respawned, shutting the cluster down.
>>> Jul 5 10:41:59 node2 pacemakerd[597]: notice: Shutting down Pacemaker
>>> Jul 5 10:41:59 node2 pacemakerd[597]: notice: Stopping pengine: Sent -15 to process 607
>>> Jul 5 10:41:59 node2 pengine[607]: notice: Invoking handler for signal 15: Terminated
>>> Jul 5 10:41:59 node2 pacemakerd[597]: notice: Stopping attrd: Sent -15 to process 606
>>> Jul 5 10:41:59 node2 attrd[606]: notice: Invoking handler for signal 15: Terminated
>>> Jul 5 10:41:59 node2 attrd[606]: notice: Exiting...
>>> Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x2280ef0, async-conn=0x2280ef0) left
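
Coming back to the delay idea above, here is a rough sketch of the kind
of behaviour I have in mind for the "off"/"reboot" part of the agent:
only report OK once the node is verifiably down, and then optionally sit
on the result a little longer. This is only illustrative Python - the
power_off()/is_powered_off() helpers and the extra_wait value are made-up
placeholders for whatever st-fence_propio really calls, not any existing
API:

    #!/usr/bin/env python
    # Rough sketch only, not a drop-in fence agent: the two placeholder
    # helpers below stand in for whatever the real agent calls
    # (cloud API, PDU, IPMI, ...).

    import sys
    import time

    _fake_off_at = None  # used only by the placeholder helpers


    def power_off(node):
        """Placeholder: issue the real power-off/reboot request here."""
        global _fake_off_at
        _fake_off_at = time.time() + 5  # pretend the node needs 5s to go down


    def is_powered_off(node):
        """Placeholder: query the real power state of the node here."""
        return _fake_off_at is not None and time.time() >= _fake_off_at


    def fence_off(node, timeout=120, poll=2, extra_wait=10):
        """Return True only once the node is confirmed to be down."""
        power_off(node)

        deadline = time.time() + timeout
        while time.time() < deadline:
            if is_powered_off(node):
                # The extra delay suggested above: hold back the OK a bit
                # longer, so we can see whether the races go away when the
                # node cannot possibly boot and rejoin before the rest of
                # the cluster has processed the fencing result.
                time.sleep(extra_wait)
                return True
            time.sleep(poll)

        return False  # don't report a success we are not sure about


    if __name__ == "__main__":
        sys.exit(0 if fence_off("node2") else 1)

If the "allegedly fenced" messages disappear with a generous extra_wait,
that would point towards the confirmation/rejoin ordering Ken describes
above.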