Re: [ClusterLabs] Problem with stonith and starting services
> So if this is really the reason it would probably be worth
> finding out what is really happening.

Thanks. Yes, I think this is really the reason. I fixed it one week ago and it hasn't happened again.
Re: [ClusterLabs] Problem with stonith and starting services
On 07/12/2017 05:16 PM, Cesar Hernandez wrote:
>
>> On 6 Jul 2017, at 17:34, Ken Gaillot wrote:
>>
>> On 07/06/2017 10:27 AM, Cesar Hernandez wrote:
>>>> It looks like a bug when the fenced node rejoins quickly enough that it
>>>> is a member again before its fencing confirmation has been sent. I know
>>>> there have been plenty of clusters with nodes that quickly reboot and
>>>> slow fencing devices, so that seems unlikely, but I don't see another
>>>> explanation.
>>>
>>> Could it be caused if node 2 gets rebooted and comes back up before the
>>> stonith script has finished?
>>
>> That *shouldn't* cause any problems, but I'm not sure what's happening
>> in this case.
>
> So, this was the cause of the problem...
> Before the two servers I have now, I did three other cluster installations
> with a different internet hosting provider. With that provider, a machine
> took more than 2 minutes to reboot via the fencing script (slow boot
> process and a slow remote API), so I added a "sleep 90" before the end of
> the script and it always worked perfectly.
>
> Now, with a different provider, I used the same script, just swapping in
> the new provider's remote API. In this case a machine takes approximately
> 10 seconds to do a full reboot, and the API is also faster (just 2 or 3
> seconds to respond), so the machine was up again in less than 20 seconds.
>
> I suppose the problem comes when the rebooted node (node2, for example)
> sees that node1 is still waiting for the fencing script to finish (due to
> the sleep 90), gets confused, and exits Pacemaker.
>
> I changed that sleep 90 to a sleep 5 and it hasn't happened again.

I guess Pacemaker should be able to cope with a situation like that.

You would actually have quite a similar case with sbd fencing (e.g.
fence_sbd): the fence agent writes the poison pill into the disk slot of the
node to be fenced, and usually the victim node reads it out within a second.
But because the guaranteed response times of a shared disk in an enterprise
environment can be really long, the fence agent would still wait 60s or so
to be really sure that the other side has swallowed the pill.

So if this is really the reason it would probably be worth finding out what
is really happening.

Regards,
Klaus

>
> Thanks a lot to everyone for the help
>
> Cheers
> Cesar

--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Openstack Infrastructure
Red Hat
kwenn...@redhat.com
Re: [ClusterLabs] Problem with stonith and starting services
> On 6 Jul 2017, at 17:34, Ken Gaillot wrote:
>
> On 07/06/2017 10:27 AM, Cesar Hernandez wrote:
>>
>>> It looks like a bug when the fenced node rejoins quickly enough that it
>>> is a member again before its fencing confirmation has been sent. I know
>>> there have been plenty of clusters with nodes that quickly reboot and
>>> slow fencing devices, so that seems unlikely, but I don't see another
>>> explanation.
>>
>> Could it be caused if node 2 gets rebooted and comes back up before the
>> stonith script has finished?
>
> That *shouldn't* cause any problems, but I'm not sure what's happening
> in this case.

So, this was the cause of the problem...

Before the two servers I have now, I did three other cluster installations
with a different internet hosting provider. With that provider, a machine
took more than 2 minutes to reboot via the fencing script (slow boot process
and a slow remote API), so I added a "sleep 90" before the end of the script
and it always worked perfectly.

Now, with a different provider, I used the same script, just swapping in the
new provider's remote API. In this case a machine takes approximately 10
seconds to do a full reboot, and the API is also faster (just 2 or 3 seconds
to respond), so the machine was up again in less than 20 seconds.

I suppose the problem comes when the rebooted node (node2, for example) sees
that node1 is still waiting for the fencing script to finish (due to the
sleep 90), gets confused, and exits Pacemaker.

I changed that sleep 90 to a sleep 5 and it hasn't happened again.

Thanks a lot to everyone for the help

Cheers
Cesar
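For readers following along, this is a minimal, hypothetical sketch of what the
tail end of a custom fence agent built this way might look like; the
provider_reboot helper and the target argument are placeholders, not the actual
script from this thread:

#!/bin/bash
# Hypothetical sketch only -- not the actual fence agent from this thread.
# TARGET is the node being fenced; provider_reboot stands in for whatever
# hosting-provider API call triggers the reboot.
TARGET="$1"

provider_reboot "$TARGET" || exit 1   # ask the provider to reboot the node

# Fixed delay so the agent does not report success while the target may still
# be alive; the thread used "sleep 90" with a slow provider and "sleep 5" with
# a fast one. A status check against the provider API would be more robust.
sleep 5

exit 0   # stonithd treats exit 0 as "fencing confirmed"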
Re: [ClusterLabs] Problem with stonith and starting services
>> Could it be caused if node 2 gets rebooted and comes back up before the
>> stonith script has finished?
>
> That *shouldn't* cause any problems, but I'm not sure what's happening
> in this case.

Maybe that is the cause...

My other server installations had a slow stonith device and slow-booting
servers, so I added a "sleep 90" to the fencing script. Now I have a fast
stonith device and fast-booting servers, so node2 comes back up before the
fencing script has finished, because of that sleep. I'm trying it right now
without the sleep, to see whether it happens again or not.

Thanks
Re: [ClusterLabs] Problem with stonith and starting services
On 07/06/2017 04:48 PM, Ken Gaillot wrote: > On 07/06/2017 09:26 AM, Klaus Wenninger wrote: >> On 07/06/2017 04:20 PM, Cesar Hernandez wrote: If node2 is getting the notification of its own fencing, it wasn't successfully fenced. Successful fencing would render it incapacitated (powered down, or at least cut off from the network and any shared resources). >>> Maybe I don't understand you, or maybe you don't understand me... ;) >>> This is the syslog of the machine, where you can see that the machine has >>> rebooted successfully, and as I said, it has been rebooted successfully all >>> the times: >> It is not just a question if it was rebooted at all. >> Your fence-agent mustn't return positively until this definitely >> has happened and the node is down. >> Otherwise you will see that message and the node will try to >> somehow cope with the fact that obviously the rest of the >> cluster thinks that it is down already. > But the "allegedly fenced" message comes in after the node has rebooted, > so it would seem that everything was in the proper sequence. True for this message but we don't see if they didn't exchange anything in between the fencing-ok and the actual reboot. > > It looks like a bug when the fenced node rejoins quickly enough that it > is a member again before its fencing confirmation has been sent. I know > there have been plenty of clusters with nodes that quickly reboot and > slow fencing devices, so that seems unlikely, but I don't see another > explanation. Anyway - maybe putting a delay at the end of the fence-agent might give some insight. In this case we have a combination of quick reboot & quick fencing device ... possibly leading to nasty races ... > >>> Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys >>> cpuset >>> Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpu >>> Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys >>> cpuacct >>> Jul 5 10:41:54 node2 kernel: [0.00] Linux version 3.16.0-4-amd64 >>> (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 >>> SMP Debian 3.16.39-1 (2016-12-30) >>> Jul 5 10:41:54 node2 kernel: [0.00] Command line: >>> BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 >>> root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd >>> console=ttyS0 console=hvc0 >>> Jul 5 10:41:54 node2 kernel: [0.00] e820: BIOS-provided physical >>> RAM map: >>> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >>> 0x-0x0009dfff] usable >>> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >>> 0x0009e000-0x0009] reserved >>> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >>> 0x000e-0x000f] reserved >>> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >>> 0x0010-0x3fff] usable >>> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >>> 0xfc00-0x] reserved >>> Jul 5 10:41:54 node2 kernel: [0.00] NX (Execute Disable) >>> protection: active >>> Jul 5 10:41:54 node2 kernel: [0.00] SMBIOS 2.4 present. >>> >>> ... >>> >>> Jul 5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port >>> 67 >>> >>> ... >>> >>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync Cluster Engine >>> ('UNKNOWN'): started and ready to provide service. >>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync built-in features: >>> nss >>> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Successfully read main >>> configuration file '/etc/corosync/corosync.conf'. >>> >>> ... 
>>> >>> Jul 5 10:41:57 node2 crmd[608]: notice: Defaulting to uname -n for the >>> local classic openais (with plugin) node name >>> Jul 5 10:41:57 node2 crmd[608]: notice: Membership 4308: quorum acquired >>> Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node >>> node2[1108352940] - state is now member (was (null)) >>> Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node >>> node11[794540] - state is now member (was (null)) >>> Jul 5 10:41:57 node2 crmd[608]: notice: The local CRM is operational >>> Jul 5 10:41:57 node2 crmd[608]: notice: State transition S_STARTING -> >>> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ] >>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: Watching for stonith >>> topology changes >>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: Membership 4308: quorum >>> acquired >>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: plugin_handle_membership: >>> Node node11[794540] - state is now member (was (null)) >>> Jul 5 10:41:57 node2 stonith-ng[604]: notice: On loss of CCM Quorum: >>> Ignore >>> Jul 5 10:41:58 node2 stonith-ng[604]: notice: Added 'st-fence_propio:0' >>> to the device list (1 active devices) >>> Jul 5 10:41:59 node2 stonith
Re: [ClusterLabs] Problem with stonith and starting services
On 07/06/2017 09:26 AM, Klaus Wenninger wrote: > On 07/06/2017 04:20 PM, Cesar Hernandez wrote: >>> If node2 is getting the notification of its own fencing, it wasn't >>> successfully fenced. Successful fencing would render it incapacitated >>> (powered down, or at least cut off from the network and any shared >>> resources). >> >> Maybe I don't understand you, or maybe you don't understand me... ;) >> This is the syslog of the machine, where you can see that the machine has >> rebooted successfully, and as I said, it has been rebooted successfully all >> the times: > > It is not just a question if it was rebooted at all. > Your fence-agent mustn't return positively until this definitely > has happened and the node is down. > Otherwise you will see that message and the node will try to > somehow cope with the fact that obviously the rest of the > cluster thinks that it is down already. But the "allegedly fenced" message comes in after the node has rebooted, so it would seem that everything was in the proper sequence. It looks like a bug when the fenced node rejoins quickly enough that it is a member again before its fencing confirmation has been sent. I know there have been plenty of clusters with nodes that quickly reboot and slow fencing devices, so that seems unlikely, but I don't see another explanation. >> Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys >> cpuset >> Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpu >> Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys >> cpuacct >> Jul 5 10:41:54 node2 kernel: [0.00] Linux version 3.16.0-4-amd64 >> (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 >> SMP Debian 3.16.39-1 (2016-12-30) >> Jul 5 10:41:54 node2 kernel: [0.00] Command line: >> BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 >> root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd >> console=ttyS0 console=hvc0 >> Jul 5 10:41:54 node2 kernel: [0.00] e820: BIOS-provided physical >> RAM map: >> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >> 0x-0x0009dfff] usable >> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >> 0x0009e000-0x0009] reserved >> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >> 0x000e-0x000f] reserved >> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >> 0x0010-0x3fff] usable >> Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem >> 0xfc00-0x] reserved >> Jul 5 10:41:54 node2 kernel: [0.00] NX (Execute Disable) >> protection: active >> Jul 5 10:41:54 node2 kernel: [0.00] SMBIOS 2.4 present. >> >> ... >> >> Jul 5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port >> 67 >> >> ... >> >> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync Cluster Engine >> ('UNKNOWN'): started and ready to provide service. >> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync built-in features: >> nss >> Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Successfully read main >> configuration file '/etc/corosync/corosync.conf'. >> >> ... 
>> >> Jul 5 10:41:57 node2 crmd[608]: notice: Defaulting to uname -n for the >> local classic openais (with plugin) node name >> Jul 5 10:41:57 node2 crmd[608]: notice: Membership 4308: quorum acquired >> Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node >> node2[1108352940] - state is now member (was (null)) >> Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node >> node11[794540] - state is now member (was (null)) >> Jul 5 10:41:57 node2 crmd[608]: notice: The local CRM is operational >> Jul 5 10:41:57 node2 crmd[608]: notice: State transition S_STARTING -> >> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ] >> Jul 5 10:41:57 node2 stonith-ng[604]: notice: Watching for stonith >> topology changes >> Jul 5 10:41:57 node2 stonith-ng[604]: notice: Membership 4308: quorum >> acquired >> Jul 5 10:41:57 node2 stonith-ng[604]: notice: plugin_handle_membership: >> Node node11[794540] - state is now member (was (null)) >> Jul 5 10:41:57 node2 stonith-ng[604]: notice: On loss of CCM Quorum: >> Ignore >> Jul 5 10:41:58 node2 stonith-ng[604]: notice: Added 'st-fence_propio:0' >> to the device list (1 active devices) >> Jul 5 10:41:59 node2 stonith-ng[604]: notice: Operation reboot of node2 >> by node11 for crmd.2141@node11.61c3e613: OK >> Jul 5 10:41:59 node2 crmd[608]: crit: We were allegedly just fenced by >> node11 for node11! >> Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client >> crmd (conn=0x228d970, async-conn=0x228d970) left >> Jul 5 10:41:59 node2 pacemakerd[597]: warning: The crmd process (608) can >> no longer be respawned, shutting the cluster dow
Re: [ClusterLabs] Problem with stonith and starting services
On 07/06/2017 04:20 PM, Cesar Hernandez wrote: >> If node2 is getting the notification of its own fencing, it wasn't >> successfully fenced. Successful fencing would render it incapacitated >> (powered down, or at least cut off from the network and any shared >> resources). > > Maybe I don't understand you, or maybe you don't understand me... ;) > This is the syslog of the machine, where you can see that the machine has > rebooted successfully, and as I said, it has been rebooted successfully all > the times: It is not just a question if it was rebooted at all. Your fence-agent mustn't return positively until this definitely has happened and the node is down. Otherwise you will see that message and the node will try to somehow cope with the fact that obviously the rest of the cluster thinks that it is down already. > > Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpuset > Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpu > Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys > cpuacct > Jul 5 10:41:54 node2 kernel: [0.00] Linux version 3.16.0-4-amd64 > (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP > Debian 3.16.39-1 (2016-12-30) > Jul 5 10:41:54 node2 kernel: [0.00] Command line: > BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 > root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd > console=ttyS0 console=hvc0 > Jul 5 10:41:54 node2 kernel: [0.00] e820: BIOS-provided physical RAM > map: > Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem > 0x-0x0009dfff] usable > Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem > 0x0009e000-0x0009] reserved > Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem > 0x000e-0x000f] reserved > Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem > 0x0010-0x3fff] usable > Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem > 0xfc00-0x] reserved > Jul 5 10:41:54 node2 kernel: [0.00] NX (Execute Disable) protection: > active > Jul 5 10:41:54 node2 kernel: [0.00] SMBIOS 2.4 present. > > ... > > Jul 5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67 > > ... > > Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync Cluster Engine > ('UNKNOWN'): started and ready to provide service. > Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync built-in features: > nss > Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Successfully read main > configuration file '/etc/corosync/corosync.conf'. > > ... 
> > Jul 5 10:41:57 node2 crmd[608]: notice: Defaulting to uname -n for the > local classic openais (with plugin) node name > Jul 5 10:41:57 node2 crmd[608]: notice: Membership 4308: quorum acquired > Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node > node2[1108352940] - state is now member (was (null)) > Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node > node11[794540] - state is now member (was (null)) > Jul 5 10:41:57 node2 crmd[608]: notice: The local CRM is operational > Jul 5 10:41:57 node2 crmd[608]: notice: State transition S_STARTING -> > S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ] > Jul 5 10:41:57 node2 stonith-ng[604]: notice: Watching for stonith > topology changes > Jul 5 10:41:57 node2 stonith-ng[604]: notice: Membership 4308: quorum > acquired > Jul 5 10:41:57 node2 stonith-ng[604]: notice: plugin_handle_membership: > Node node11[794540] - state is now member (was (null)) > Jul 5 10:41:57 node2 stonith-ng[604]: notice: On loss of CCM Quorum: Ignore > Jul 5 10:41:58 node2 stonith-ng[604]: notice: Added 'st-fence_propio:0' to > the device list (1 active devices) > Jul 5 10:41:59 node2 stonith-ng[604]: notice: Operation reboot of node2 by > node11 for crmd.2141@node11.61c3e613: OK > Jul 5 10:41:59 node2 crmd[608]: crit: We were allegedly just fenced by > node11 for node11! > Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client > crmd (conn=0x228d970, async-conn=0x228d970) left > Jul 5 10:41:59 node2 pacemakerd[597]: warning: The crmd process (608) can > no longer be respawned, shutting the cluster down. > Jul 5 10:41:59 node2 pacemakerd[597]: notice: Shutting down Pacemaker > Jul 5 10:41:59 node2 pacemakerd[597]: notice: Stopping pengine: Sent -15 > to process 607 > Jul 5 10:41:59 node2 pengine[607]: notice: Invoking handler for signal 15: > Terminated > Jul 5 10:41:59 node2 pacemakerd[597]: notice: Stopping attrd: Sent -15 to > process 606 > Jul 5 10:41:59 node2 attrd[606]: notice: Invoking handler for signal 15: > Terminated > Jul 5 10:41:59 node2 attrd[606]: notice: Exiting... > Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Cl
Re: [ClusterLabs] Problem with stonith and starting services
> > If node2 is getting the notification of its own fencing, it wasn't > successfully fenced. Successful fencing would render it incapacitated > (powered down, or at least cut off from the network and any shared > resources). Maybe I don't understand you, or maybe you don't understand me... ;) This is the syslog of the machine, where you can see that the machine has rebooted successfully, and as I said, it has been rebooted successfully all the times: Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpuset Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpu Jul 5 10:41:54 node2 kernel: [0.00] Initializing cgroup subsys cpuacct Jul 5 10:41:54 node2 kernel: [0.00] Linux version 3.16.0-4-amd64 (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.39-1 (2016-12-30) Jul 5 10:41:54 node2 kernel: [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd console=ttyS0 console=hvc0 Jul 5 10:41:54 node2 kernel: [0.00] e820: BIOS-provided physical RAM map: Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 0x-0x0009dfff] usable Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 0x0009e000-0x0009] reserved Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 0x0010-0x3fff] usable Jul 5 10:41:54 node2 kernel: [0.00] BIOS-e820: [mem 0xfc00-0x] reserved Jul 5 10:41:54 node2 kernel: [0.00] NX (Execute Disable) protection: active Jul 5 10:41:54 node2 kernel: [0.00] SMBIOS 2.4 present. ... Jul 5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67 ... Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync Cluster Engine ('UNKNOWN'): started and ready to provide service. Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Corosync built-in features: nss Jul 5 10:41:54 node2 corosync[585]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'. ... Jul 5 10:41:57 node2 crmd[608]: notice: Defaulting to uname -n for the local classic openais (with plugin) node name Jul 5 10:41:57 node2 crmd[608]: notice: Membership 4308: quorum acquired Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node node2[1108352940] - state is now member (was (null)) Jul 5 10:41:57 node2 crmd[608]: notice: plugin_handle_membership: Node node11[794540] - state is now member (was (null)) Jul 5 10:41:57 node2 crmd[608]: notice: The local CRM is operational Jul 5 10:41:57 node2 crmd[608]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ] Jul 5 10:41:57 node2 stonith-ng[604]: notice: Watching for stonith topology changes Jul 5 10:41:57 node2 stonith-ng[604]: notice: Membership 4308: quorum acquired Jul 5 10:41:57 node2 stonith-ng[604]: notice: plugin_handle_membership: Node node11[794540] - state is now member (was (null)) Jul 5 10:41:57 node2 stonith-ng[604]: notice: On loss of CCM Quorum: Ignore Jul 5 10:41:58 node2 stonith-ng[604]: notice: Added 'st-fence_propio:0' to the device list (1 active devices) Jul 5 10:41:59 node2 stonith-ng[604]: notice: Operation reboot of node2 by node11 for crmd.2141@node11.61c3e613: OK Jul 5 10:41:59 node2 crmd[608]: crit: We were allegedly just fenced by node11 for node11! 
Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x228d970, async-conn=0x228d970) left Jul 5 10:41:59 node2 pacemakerd[597]: warning: The crmd process (608) can no longer be respawned, shutting the cluster down. Jul 5 10:41:59 node2 pacemakerd[597]: notice: Shutting down Pacemaker Jul 5 10:41:59 node2 pacemakerd[597]: notice: Stopping pengine: Sent -15 to process 607 Jul 5 10:41:59 node2 pengine[607]: notice: Invoking handler for signal 15: Terminated Jul 5 10:41:59 node2 pacemakerd[597]: notice: Stopping attrd: Sent -15 to process 606 Jul 5 10:41:59 node2 attrd[606]: notice: Invoking handler for signal 15: Terminated Jul 5 10:41:59 node2 attrd[606]: notice: Exiting... Jul 5 10:41:59 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x2280ef0, async-conn=0x2280ef0) left ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Problem with stonith and starting services
On 07/06/2017 08:54 AM, Cesar Hernandez wrote:
>
>> So, the above log means that node1 decided that node2 needed to be
>> fenced, requested fencing of node2, and received a successful result for
>> the fencing, and yet node2 was not killed.
>>
>> Your fence agent should not return success until node2 has verifiably
>> been stopped. If there is some way to query the AWS API whether node2 is
>> running or not, that would be sufficient (merely checking that the node
>> is not responding to some command such as ping is not sufficient).
>
> Thanks. But node2 has always been successfully fenced... so this is not the
> problem

If node2 is getting the notification of its own fencing, it wasn't
successfully fenced. Successful fencing would render it incapacitated
(powered down, or at least cut off from the network and any shared
resources).
Re: [ClusterLabs] Problem with stonith and starting services
On 07/04/2017 08:28 AM, Cesar Hernandez wrote:
>
>> Agreed, I don't think it's multicast vs unicast.
>>
>> I can't see from this what's going wrong. Possibly node1 is trying to
>> re-fence node2 when it comes back. Check that the fencing resources are
>> configured correctly, and check whether node1 sees the first fencing
>> succeed.
>
> Thanks. I checked the fencing resource and it always returns OK; it's a
> custom script I used on other installations and it always worked.
> I think the clue is the two messages that appear when it fails:
>
> Jul 3 09:07:04 node2 pacemakerd[597]: warning: The crmd process (608) can
> no longer be respawned, shutting the cluster down.
> Jul 3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by
> node1 for node1!
>
> Does anyone know what they are related to? There doesn't seem to be much
> information on the Internet.
>
> Thanks
> Cesar

"We were allegedly just fenced" means that the node just received a
notification from stonithd that another node successfully fenced it.
Clearly, this is a problem, because a node that is truly fenced should be
unable to receive any communications from the cluster. As such, the cluster
services immediately exit and stay down.

So, the above log means that node1 decided that node2 needed to be fenced,
requested fencing of node2, and received a successful result for the
fencing, and yet node2 was not killed.

Your fence agent should not return success until node2 has verifiably been
stopped. If there is some way to query the AWS API whether node2 is running
or not, that would be sufficient (merely checking that the node is not
responding to some command such as ping is not sufficient).
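As an illustration of Ken's point, here is a hedged sketch of a reboot action
that only reports success once the AWS API confirms the instance is down. It
assumes the reboot is implemented as a stop/start (a plain reboot-instances
call does not change the reported instance state), and the instance ID is a
placeholder, not the actual agent from this thread:

#!/bin/bash
# Hedged sketch: implement "reboot" as stop + verified-stop + start, so
# success is only reported once AWS confirms the instance is really down.
# INSTANCE_ID is an illustrative placeholder.
INSTANCE_ID="i-0123456789abcdef0"

# Force-stop the instance (out-of-band, works even if the OS is hung).
aws ec2 stop-instances --instance-ids "$INSTANCE_ID" --force || exit 1

# Poll describe-instances until the state is "stopped"; fails after a timeout.
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID" || exit 1

# The node is verifiably down; bring it back up and report success to stonithd.
aws ec2 start-instances --instance-ids "$INSTANCE_ID" || exit 1
exit 0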
Re: [ClusterLabs] Problem with stonith and starting services
>>> But you definitely shouldn't have a fencing agent that claims to have
>>> fenced a node if it is not sure - rather the other way round if in doubt.
>
> True! Which is why I mentioned it is dangerous.
> But your fencing agent is even more dangerous ;-)

Well... my startup fencing does, first, an "ssh $NODE reboot -f" and then
another reboot using the AWS API. Every time the node is reachable, the ssh
reboot has succeeded; and the rest of the time, the AWS reboot works 100% of
the time. So I think it's not dangerous to always return OK from the fencing
script when I'm 100% (or 99.%) sure that fencing works ;)

Anyway, it's not the cause of this behavior. I'll keep trying... I'm also
thinking of posting this issue on the developers cluster list.

Thanks
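A rough, hypothetical sketch of the two-step approach Cesar describes, with the
in-band ssh reboot first and the provider API as the out-of-band follow-up; the
node argument and instance ID are placeholders, not his actual script:

#!/bin/bash
# Hypothetical sketch of the two-step reboot described above; NODE and
# INSTANCE_ID are placeholders, not values from the thread.
NODE="$1"
INSTANCE_ID="i-0123456789abcdef0"

# 1. Try an in-band reboot first; this works whenever the node still answers ssh.
ssh -o ConnectTimeout=5 "root@$NODE" "reboot -f" || true

# 2. Always follow up with an out-of-band reboot through the provider API, in
#    case the node is hung and never processed the ssh command.
aws ec2 reboot-instances --instance-ids "$INSTANCE_ID" || exit 1

exit 0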
Re: [ClusterLabs] Problem with stonith and starting services
On 07/05/2017 04:50 PM, Cesar Hernandez wrote:
>> Not a good idea, probably - and the reason for what you are experiencing ;-)
>> If you have problems starting the nodes within a certain time window,
>> disabling startup fencing might be an option to consider, although dangerous.
>> But you definitely shouldn't have a fencing agent that claims to have fenced
>> a node if it is not sure - rather the other way round if in doubt.
>
> Thanks. But I think it is not a good idea to disable startup fencing: I have
> shared disks (DRBD) and stonith is very important in this scenario.

True! Which is why I mentioned it is dangerous.
But your fencing agent is even more dangerous ;-)
Re: [ClusterLabs] Problem with stonith and starting services
> Not a good idea, probably - and the reason for what you are experiencing ;-)
> If you have problems starting the nodes within a certain time window,
> disabling startup fencing might be an option to consider, although dangerous.
> But you definitely shouldn't have a fencing agent that claims to have fenced
> a node if it is not sure - rather the other way round if in doubt.

Thanks. But I think it is not a good idea to disable startup fencing: I have
shared disks (DRBD) and stonith is very important in this scenario.
Re: [ClusterLabs] Problem with stonith and starting services
On 07/05/2017 04:22 PM, Cesar Hernandez wrote:
>
>> Are you logging which ones went OK and which failed?
>> Does the script return a negative result if both go wrong?
> The script always returns OK

Not a good idea, probably - and the reason for what you are experiencing ;-)

If you have problems starting the nodes within a certain time window,
disabling startup fencing might be an option to consider, although it is
dangerous. But you definitely shouldn't have a fencing agent that claims to
have fenced a node if it is not sure - rather the other way round if in
doubt.
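For reference only, the option Klaus is referring to is the startup-fencing
cluster property; with crmsh (which this cluster uses) disabling it would look
roughly like the line below. As the rest of the thread notes, this is
dangerous with shared storage such as DRBD, so it is shown purely as an
illustration:

# Illustration only -- dangerous with shared storage (DRBD), see the discussion above.
crm configure property startup-fencing=false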
Re: [ClusterLabs] Problem with stonith and starting services
> Are you logging which ones went OK and which failed?
> Does the script return a negative result if both go wrong?

The script always returns OK.
Re: [ClusterLabs] Problem with stonith and starting services
On 07/05/2017 08:50 AM, Cesar Hernandez wrote:
>> Might be kind of a strange race as well ... but without knowing what the
>> script actually does ...
>
> The script first tries to reboot the node using ssh, something like
> ssh $NODE reboot -f, then runs a remote reboot using the AWS API.

Are you logging which ones went OK and which failed?
Does the script return a negative result if both go wrong?

> Thanks
Re: [ClusterLabs] Problem with stonith and starting services
> Might be kind of a strange race as well ... but without knowing what the
> script actually does ...

The script first tries to reboot the node using ssh, something like
ssh $NODE reboot -f, then runs a remote reboot using the AWS API.

Thanks
Re: [ClusterLabs] Problem with stonith and starting services
On 07/04/2017 04:52 PM, Cesar Hernandez wrote:
>> The first line is the consequence of the second.
>> And the first says that node2 has just seen some fencing resource
>> positively reporting to have fenced it - which is why crmd is exiting
>> in a way that it is not respawned by pacemakerd.
>
> Thanks. But my script has a logfile, I've checked it, and neither of the
> nodes has fenced itself: node1 has always fenced node2, and node2 has
> always fenced node1.

I didn't say it has fenced itself. Actually the log says node1 has fenced
node2 on behalf of node1. But obviously that is not true, as otherwise node2
wouldn't be alive to receive that info.

Might be kind of a strange race as well ... but without knowing what the
script actually does ...

Regards,
Klaus

> Cheers
> Cesar
Re: [ClusterLabs] Problem with stonith and starting services
> The first line is the consequence of the second.
> And the first says that node2 has just seen some fencing resource
> positively reporting to have fenced it - which is why crmd is exiting
> in a way that it is not respawned by pacemakerd.

Thanks. But my script has a logfile, I've checked it, and neither of the
nodes has fenced itself: node1 has always fenced node2, and node2 has always
fenced node1.

Cheers
Cesar
Re: [ClusterLabs] Problem with stonith and starting services
On 07/04/2017 03:28 PM, Cesar Hernandez wrote:
>> Agreed, I don't think it's multicast vs unicast.
>>
>> I can't see from this what's going wrong. Possibly node1 is trying to
>> re-fence node2 when it comes back. Check that the fencing resources are
>> configured correctly, and check whether node1 sees the first fencing
>> succeed.
>
> Thanks. I checked the fencing resource and it always returns OK; it's a
> custom script I used on other installations and it always worked.
> I think the clue is the two messages that appear when it fails:
>
> Jul 3 09:07:04 node2 pacemakerd[597]: warning: The crmd process (608) can
> no longer be respawned, shutting the cluster down.
> Jul 3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by
> node1 for node1!
>
> Does anyone know what they are related to? There doesn't seem to be much
> information on the Internet.

The first line is the consequence of the second. And the first says that
node2 has just seen some fencing resource positively reporting to have
fenced it - which is why crmd is exiting in a way that it is not respawned
by pacemakerd.

As for the reason, I can only guess ... Maybe your script is not checking
which node it should actually fence and just assumes something!? You can
configure which targets a given fencing resource is responsible for using
pcmk_host_map, pcmk_host_check and pcmk_host_list. So it is possible that
the fencing resource was configured differently in the setup where your
script previously worked.

Regards,
Klaus

>
> Thanks
> Cesar
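To make Klaus's suggestion concrete, here is a hedged crmsh-style illustration
of restricting a fencing resource to the node it is allowed to fence; the agent
name and host list are assumptions based on the 'st-fence_propio' device
visible in the logs, not the cluster's real configuration:

# Illustration only -- agent name and host list are assumed, not taken from this cluster.
# Each fencing resource is told explicitly which node(s) it may fence.
crm configure primitive st-fence_propio stonith:fence_propio \
    params pcmk_host_check="static-list" pcmk_host_list="node2" \
    op monitor interval="60s"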
Re: [ClusterLabs] Problem with stonith and starting services
> Agreed, I don't think it's multicast vs unicast.
>
> I can't see from this what's going wrong. Possibly node1 is trying to
> re-fence node2 when it comes back. Check that the fencing resources are
> configured correctly, and check whether node1 sees the first fencing
> succeed.

Thanks. I checked the fencing resource and it always returns OK; it's a
custom script I used on other installations and it always worked.
I think the clue is the two messages that appear when it fails:

Jul 3 09:07:04 node2 pacemakerd[597]: warning: The crmd process (608) can no longer be respawned, shutting the cluster down.
Jul 3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by node1 for node1!

Does anyone know what they are related to? There doesn't seem to be much
information on the Internet.

Thanks
Cesar
Re: [ClusterLabs] Problem with stonith and starting services
On 07/03/2017 02:34 AM, Cesar Hernandez wrote:
> Hi
>
> I have installed a Pacemaker cluster with two nodes. The same type of
> installation has been done many times before and the following error never
> appeared. The situation is the following:
>
> both nodes running cluster services
> stop pacemaker & corosync on node 1
> stop pacemaker & corosync on node 2
> start corosync & pacemaker on node 1
>
> Then node 1 starts, sees node2 down, and fences it, as expected.
> But the problem comes when node 2 is rebooted and starts the cluster
> services: sometimes corosync starts but Pacemaker starts and then stops.
> In these cases the syslog shows the following error:
>
> Jul 3 09:07:04 node2 pacemakerd[597]: warning: The crmd process (608) can
> no longer be respawned, shutting the cluster down.
> Jul 3 09:07:04 node2 pacemakerd[597]: notice: Shutting down Pacemaker
>
> The preceding messages show some warnings that I'm not sure are related
> to the shutdown:
>
> Jul 3 09:07:04 node2 stonith-ng[604]: notice: Operation reboot of node2 by
> node1 for crmd.2413@node1.608d8118: OK
> Jul 3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by
> node1 for node1!
> Jul 3 09:07:04 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit: Client
> crmd (conn=0x1471800, async-conn=0x1471800) left
>
> On node1, all resources become unrunnable and it stays like that forever
> until I manually start the Pacemaker service on node2.
> As I said, the same type of installation has been done on other servers
> and this never happened. The only difference is that in previous
> installations I configured corosync with multicast and now I have
> configured it with unicast (my current network environment doesn't allow
> multicast), but I don't think it's related to this behaviour.

Agreed, I don't think it's multicast vs unicast.

I can't see from this what's going wrong. Possibly node1 is trying to
re-fence node2 when it comes back. Check that the fencing resources are
configured correctly, and check whether node1 sees the first fencing
succeed.

> Cluster software versions:
> corosync-1.4.8
> crmsh-2.1.5
> libqb-0.17.2
> Pacemaker-1.1.14
> resource-agents-3.9.6
>
> Can you help me?
>
> Thanks
>
> Cesar
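One hedged way to act on Ken's suggestion is to exercise and inspect fencing by
hand with the standard Pacemaker tooling; the node name below is the one from
this thread, and the exact options available depend on the Pacemaker version in
use:

# From node1, trigger a manual fence of node2 and see whether it is reported as OK.
stonith_admin --reboot node2

# List the fencing devices the cluster has registered.
stonith_admin --list-registered

# Show recent fencing history for node2 (who requested it, who executed it, result).
stonith_admin --history node2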