Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15
On Tue, 2019-09-03 at 10:09 +0200, Marco Marino wrote:
> Hi, I have a problem with fencing on a two node cluster. It seems
> that randomly the cluster cannot complete the monitor operation for
> fence devices. In the log I see:
> crmd[8206]: error: Result of monitor operation for fence-node2 on
> ld2.mydomain.it: Timed Out
> Attached are:
> - /var/log/messages for node1 (only the important part)
> - /var/log/messages for node2 (only the important part) <-- Problem
>   starts here
> - pcs status
> - pcs stonith show (for both fence devices)
>
> I think it could be a timeout problem, so how can I see the timeout
> value for the monitor operation on stonith devices?
> Can someone help me with this problem?
> Furthermore, how can I fix the state of the fence devices without
> downtime?
>
> Thank you

How to investigate depends on whether this is an occasional monitor
failure, or happens every time the device start is attempted. From the
status you attached, I'm guessing it's at start.

In that case, my next step (since you've already verified that ipmitool
works directly) would be to run the fence agent manually using the same
arguments used in the cluster configuration.

Check the man page for the fence agent, looking at the section for
"Stdin Parameters". These are what's used in the cluster configuration,
so make a note of what values you've configured. Then run the fence
agent like this:

  echo -e "action=status\nPARAMETER=VALUE\nPARAMETER=VALUE\n..." | /path/to/agent

where the PARAMETER=VALUE entries are what you have configured in the
cluster.

If the problem isn't obvious from that, you can try adding a debug_file
parameter.
-- 
Ken Gaillot

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
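[For the archives, a concrete version of the recipe above, filled in with the attribute names from the `pcs stonith show fence-node1` output later in the thread. The password stays elided as in the thread, and the agent path is an assumption based on typical fence-agents packaging on CentOS/RHEL 7:]

```shell
# Build the stdin parameter list the same way stonith-ng feeds it to the
# agent; the values mirror the cluster's fence-node1 attributes.
params='action=status
ipaddr=192.168.254.250
lanplus=1
login=root
passwd=XXX'

# Preview exactly what the agent will read on stdin:
printf '%s\n' "$params"

# Then pipe it to the agent itself (uncomment once passwd is filled in):
# printf '%s\n' "$params" | /usr/sbin/fence_ipmilan
```

If the agent hangs or errors here, outside the cluster, the problem is in the agent/BMC path rather than in Pacemaker's timeouts.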
Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15
Hi, some updates about this?
Thank you

On Wed, 4 Sep 2019, 10:46 Marco Marino wrote:

> First of all, thank you for your support.
> Andrey: sure, I can reach the machines through IPMI.
> Here is a short "log":
>
> # From ld1, trying to contact ld1
> [root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XX sdr elist all
> SEL          | 72h | ns  | 7.1 | No Reading
> Intrusion    | 73h | ok  | 7.1 |
> iDRAC8       | 00h | ok  | 7.1 | Dynamic MC @ 20h
> ...
>
> # From ld1, trying to contact ld2
> [root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P XX sdr elist all
> SEL          | 72h | ns  | 7.1 | No Reading
> Intrusion    | 73h | ok  | 7.1 |
> iDRAC7       | 00h | ok  | 7.1 | Dynamic MC @ 20h
> ...
>
> # From ld2, trying to contact ld1
> [root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P X sdr elist all
> SEL          | 72h | ns  | 7.1 | No Reading
> Intrusion    | 73h | ok  | 7.1 |
> iDRAC8       | 00h | ok  | 7.1 | Dynamic MC @ 20h
> System Board | 00h | ns  | 7.1 | Logical FRU @00h
> ...
>
> # From ld2, trying to contact ld2
> [root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P sdr elist all
> SEL          | 72h | ns  | 7.1 | No Reading
> Intrusion    | 73h | ok  | 7.1 |
> iDRAC7       | 00h | ok  | 7.1 | Dynamic MC @ 20h
> System Board | 00h | ns  | 7.1 | Logical FRU @00h
>
> Jan: Actually the cluster uses /etc/hosts to resolve names:
> 172.16.77.10   ld1.mydomain.it ld1
> 172.16.77.11   ld2.mydomain.it ld2
>
> Furthermore, I'm using IP addresses for the IPMI interfaces in the
> configuration:
> [root@ld1 ~]# pcs stonith show fence-node1
>  Resource: fence-node1 (class=stonith type=fence_ipmilan)
>   Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=X pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
>   Operations: monitor interval=60s (fence-node1-monitor-interval-60s)
>
> Any idea?
> How can I reset the state of the cluster without downtime? Is "pcs
> resource cleanup" enough?
> Thank you,
> Marco
>
> On Wed, 4 Sep 2019 at 10:29, Jan Pokorný wrote:
>
>> On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
>> > On 03.09.2019 11:09, Marco Marino wrote:
>> >> Hi, I have a problem with fencing on a two node cluster. It seems
>> >> that randomly the cluster cannot complete the monitor operation
>> >> for fence devices. In the log I see:
>> >> crmd[8206]: error: Result of monitor operation for fence-node2 on
>> >> ld2.mydomain.it: Timed Out
>> >
>> > Can you actually access the IP addresses of your IPMI ports?
>>
>> [
>> Tangentially, an interesting aspect beyond that, applicable to any
>> non-IP cross-host referential needs and not mentioned anywhere so
>> far, is the risk of DNS resolution (where /etc/hosts falls short)
>> running into trouble (stale records, blocked port, DNS server
>> overload [DNSSEC, etc.], parallel IPv4/IPv6 records that the
>> software cannot handle gracefully, etc.). In any case, a single DNS
>> server would be an undesired SPOF, and it would be unfortunate to be
>> unable to fence a node because of that.
>>
>> I think the most robust approach is to use IP addresses whenever
>> possible, and unambiguous records in /etc/hosts when practical.
>> ]
>>
>> >> Attached are:
>> >> - /var/log/messages for node1 (only the important part)
>> >> - /var/log/messages for node2 (only the important part) <-- Problem
>> >>   starts here
>> >> - pcs status
>> >> - pcs stonith show (for both fence devices)
>> >>
>> >> I think it could be a timeout problem, so how can I see the timeout
>> >> value for the monitor operation on stonith devices?
>> >> Can someone help me with this problem?
>> >> Furthermore, how can I fix the state of the fence devices without
>> >> downtime?
>> >> -- >> Jan (Poki) >> ___ >> Manage your subscription: >> https://lists.clusterlabs.org/mailman/listinfo/users >> >> ClusterLabs home: https://www.clusterlabs.org/ > > ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15
First of all, thank you for your support.
Andrey: sure, I can reach the machines through IPMI.
Here is a short "log":

# From ld1, trying to contact ld1
[root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XX sdr elist all
SEL          | 72h | ns  | 7.1 | No Reading
Intrusion    | 73h | ok  | 7.1 |
iDRAC8       | 00h | ok  | 7.1 | Dynamic MC @ 20h
...

# From ld1, trying to contact ld2
[root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P XX sdr elist all
SEL          | 72h | ns  | 7.1 | No Reading
Intrusion    | 73h | ok  | 7.1 |
iDRAC7       | 00h | ok  | 7.1 | Dynamic MC @ 20h
...

# From ld2, trying to contact ld1
[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P X sdr elist all
SEL          | 72h | ns  | 7.1 | No Reading
Intrusion    | 73h | ok  | 7.1 |
iDRAC8       | 00h | ok  | 7.1 | Dynamic MC @ 20h
System Board | 00h | ns  | 7.1 | Logical FRU @00h
...

# From ld2, trying to contact ld2
[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P sdr elist all
SEL          | 72h | ns  | 7.1 | No Reading
Intrusion    | 73h | ok  | 7.1 |
iDRAC7       | 00h | ok  | 7.1 | Dynamic MC @ 20h
System Board | 00h | ns  | 7.1 | Logical FRU @00h

Jan: Actually the cluster uses /etc/hosts to resolve names:
172.16.77.10   ld1.mydomain.it ld1
172.16.77.11   ld2.mydomain.it ld2

Furthermore, I'm using IP addresses for the IPMI interfaces in the
configuration:

[root@ld1 ~]# pcs stonith show fence-node1
 Resource: fence-node1 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=X pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
  Operations: monitor interval=60s (fence-node1-monitor-interval-60s)

Any idea?
How can I reset the state of the cluster without downtime? Is "pcs
resource cleanup" enough?

Thank you,
Marco

On Wed, 4 Sep 2019 at 10:29, Jan Pokorný wrote:

> On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
> > On 03.09.2019 11:09, Marco Marino wrote:
> >> Hi, I have a problem with fencing on a two node cluster. It seems
> >> that randomly the cluster cannot complete the monitor operation
> >> for fence devices. In the log I see:
> >> crmd[8206]: error: Result of monitor operation for fence-node2 on
> >> ld2.mydomain.it: Timed Out
> >
> > Can you actually access the IP addresses of your IPMI ports?
>
> [
> Tangentially, an interesting aspect beyond that, applicable to any
> non-IP cross-host referential needs and not mentioned anywhere so
> far, is the risk of DNS resolution (where /etc/hosts falls short)
> running into trouble (stale records, blocked port, DNS server
> overload [DNSSEC, etc.], parallel IPv4/IPv6 records that the
> software cannot handle gracefully, etc.). In any case, a single DNS
> server would be an undesired SPOF, and it would be unfortunate to be
> unable to fence a node because of that.
>
> I think the most robust approach is to use IP addresses whenever
> possible, and unambiguous records in /etc/hosts when practical.
> ]
>
> >> Attached are:
> >> - /var/log/messages for node1 (only the important part)
> >> - /var/log/messages for node2 (only the important part) <-- Problem
> >>   starts here
> >> - pcs status
> >> - pcs stonith show (for both fence devices)
> >>
> >> I think it could be a timeout problem, so how can I see the timeout
> >> value for the monitor operation on stonith devices?
> >> Can someone help me with this problem?
> >> Furthermore, how can I fix the state of the fence devices without
> >> downtime?
>
> -- 
> Jan (Poki)
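[On the "pcs resource cleanup" question above: to the best of my knowledge, clearing the failure history is non-disruptive with the pcs shipped on CentOS/RHEL 7; a sketch, to be verified against `man pcs` on the installed version:]

```shell
# Clear the recorded failed actions for the fence devices only; other
# resources are not stopped or restarted, so no downtime is expected:
pcs resource cleanup fence-node1
pcs resource cleanup fence-node2

# Pacemaker will then retry starting the stonith devices; watch with:
crm_mon -1
```

Note that cleanup only resets the state; if the underlying cause (e.g. a slow or unreachable BMC) is still there, the start will simply fail again.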
Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15
On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
> On 03.09.2019 11:09, Marco Marino wrote:
>> Hi, I have a problem with fencing on a two node cluster. It seems
>> that randomly the cluster cannot complete the monitor operation for
>> fence devices. In the log I see:
>> crmd[8206]: error: Result of monitor operation for fence-node2 on
>> ld2.mydomain.it: Timed Out
>
> Can you actually access the IP addresses of your IPMI ports?

[
Tangentially, an interesting aspect beyond that, applicable to any
non-IP cross-host referential needs and not mentioned anywhere so far,
is the risk of DNS resolution (where /etc/hosts falls short) running
into trouble (stale records, blocked port, DNS server overload
[DNSSEC, etc.], parallel IPv4/IPv6 records that the software cannot
handle gracefully, etc.). In any case, a single DNS server would be an
undesired SPOF, and it would be unfortunate to be unable to fence a
node because of that.

I think the most robust approach is to use IP addresses whenever
possible, and unambiguous records in /etc/hosts when practical.
]

>> Attached are:
>> - /var/log/messages for node1 (only the important part)
>> - /var/log/messages for node2 (only the important part) <-- Problem
>>   starts here
>> - pcs status
>> - pcs stonith show (for both fence devices)
>>
>> I think it could be a timeout problem, so how can I see the timeout
>> value for the monitor operation on stonith devices?
>> Can someone help me with this problem?
>> Furthermore, how can I fix the state of the fence devices without
>> downtime?

-- 
Jan (Poki)
Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15
On 03.09.2019 11:09, Marco Marino wrote:
> Hi, I have a problem with fencing on a two node cluster. It seems
> that randomly the cluster cannot complete the monitor operation for
> fence devices. In the log I see:
> crmd[8206]: error: Result of monitor operation for fence-node2 on
> ld2.mydomain.it: Timed Out

Can you actually access the IP addresses of your IPMI ports?

> Attached are:
> - /var/log/messages for node1 (only the important part)
> - /var/log/messages for node2 (only the important part) <-- Problem
>   starts here
> - pcs status
> - pcs stonith show (for both fence devices)
>
> I think it could be a timeout problem, so how can I see the timeout
> value for the monitor operation on stonith devices?
> Can someone help me with this problem?
> Furthermore, how can I fix the state of the fence devices without
> downtime?
>
> Thank you
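[A quick way to answer that question from each node; a sketch using the BMC addresses that appear in the stonith configuration elsewhere in the thread, with the password elided, and `chassis status` as a standard ipmitool subcommand:]

```shell
# From each cluster node, verify the peer's IPMI interface answers over
# lanplus (substitute the real password for XXX):
ipmitool -I lanplus -H 192.168.254.250 -U root -P XXX chassis status
ipmitool -I lanplus -H 192.168.254.251 -U root -P XXX chassis status
# If either command hangs or times out, the fence agent's monitor
# operation will time out the same way.
```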
[ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15
Hi, I have a problem with fencing on a two node cluster. It seems that
randomly the cluster cannot complete the monitor operation for fence
devices. In the log I see:

crmd[8206]: error: Result of monitor operation for fence-node2 on
ld2.mydomain.it: Timed Out

Attached are:
- /var/log/messages for node1 (only the important part)
- /var/log/messages for node2 (only the important part) <-- Problem
  starts here
- pcs status
- pcs stonith show (for both fence devices)

I think it could be a timeout problem, so how can I see the timeout
value for the monitor operation on stonith devices?
Can someone help me with this problem?
Furthermore, how can I fix the state of the fence devices without
downtime?

Thank you

###PCS STATUS###
[root@ld1 ~]# pcs status
Cluster name: ldcluster
Stack: corosync
Current DC: ld1.mydomain.it (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Tue Sep  3 09:37:27 2019
Last change: Thu Jul  4 21:36:07 2019 by root via cibadmin on ld1.mydomain.it

2 nodes configured
10 resources configured

Online: [ ld1.mydomain.it ld2.mydomain.it ]

Full list of resources:

 fence-node1    (stonith:fence_ipmilan):        Stopped
 fence-node2    (stonith:fence_ipmilan):        Stopped
 Master/Slave Set: DrbdResClone [DrbdRes]
     Masters: [ ld1.mydomain.it ]
     Slaves: [ ld2.mydomain.it ]
 HALVM          (ocf::heartbeat:LVM):           Started ld1.mydomain.it
 PgsqlFs        (ocf::heartbeat:Filesystem):    Started ld1.mydomain.it
 PostgresqlD    (systemd:postgresql-9.6.service):       Started ld1.mydomain.it
 LegaldocapiD   (systemd:legaldocapi.service):  Started ld1.mydomain.it
 PublicVIP      (ocf::heartbeat:IPaddr2):       Started ld1.mydomain.it
 DefaultRoute   (ocf::heartbeat:Route):         Started ld1.mydomain.it

Failed Actions:
* fence-node1_start_0 on ld1.mydomain.it 'unknown error' (1): call=221, status=Timed Out, exitreason='',
    last-rc-change='Wed Aug 21 12:49:00 2019', queued=0ms, exec=20006ms
* fence-node2_start_0 on ld1.mydomain.it 'unknown error' (1): call=222, status=Timed Out, exitreason='',
    last-rc-change='Wed Aug 21 12:49:00 2019', queued=1ms, exec=20013ms
* fence-node1_start_0 on ld2.mydomain.it 'unknown error' (1): call=182, status=Timed Out, exitreason='',
    last-rc-change='Wed Aug 21 14:26:09 2019', queued=0ms, exec=20006ms
* fence-node2_start_0 on ld2.mydomain.it 'unknown error' (1): call=176, status=Timed Out, exitreason='',
    last-rc-change='Wed Aug 21 12:48:40 2019', queued=1ms, exec=20008ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@ld1 ~]#

###STONITH SHOW###
[root@ld1 ~]# pcs stonith show fence-node1
 Resource: fence-node1 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=XXX pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
  Operations: monitor interval=60s (fence-node1-monitor-interval-60s)
[root@ld1 ~]# pcs stonith show fence-node2
 Resource: fence-node2 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=192.168.254.251 lanplus=1 login=root passwd= pcmk_host_check=static-list pcmk_host_list=ld2.mydomain.it delay=12
  Operations: monitor interval=60s (fence-node2-monitor-interval-60s)
[root@ld1 ~]#

###NODE 2 /var/log/messages###
Aug 21 12:48:40 ld2 stonith-ng[8202]: notice: Child process 46006 performing action 'monitor' timed out with signal 15
Aug 21 12:48:40 ld2 stonith-ng[8202]: notice: Operation 'monitor' [46006] for device 'fence-node2' returned: -62 (Timer expired)
Aug 21 12:48:40 ld2 crmd[8206]: error: Result of monitor operation for fence-node2 on ld2.mydomain.it: Timed Out
Aug 21 12:48:40 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:48:40 ld2 crmd[8206]: notice: Result of stop operation for fence-node2 on ld2.mydomain.it: 0 (ok)
Aug 21 12:48:40 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:48:40 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:48:40 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:48:59 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 stonith-ng[8202]: notice: Child process 46053 performing action 'monitor' timed out with signal 15
Aug 21 12:49:00 ld2 stonith-ng[8202]: notice: Operation 'monitor' [46053] for device 'fence-node2' returned: -62 (Timer expired)
Aug 21 12:49:00 ld2 crmd[8206]: error: Result of start operation for fence-node2 on ld2.mydomain.it: Timed Out
Aug 21 12:49:00 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 stonith-ng[8202]: notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 crmd[8206]: notic
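[For readers hitting this in the archives: the exec=20006ms figures in the failed actions line up with Pacemaker's stock 20-second operation timeout. A sketch of how one might inspect and raise it, with pcs syntax per the 0.9.x series on CentOS 7 (verify against `man pcs`); the 60s value is illustrative, not a recommendation:]

```shell
# Show the device definition; if no explicit timeout appears under
# Operations, the cluster-wide default (20s) applies:
pcs stonith show fence-node1

# Give the monitor and start operations more headroom:
pcs stonith update fence-node1 op monitor interval=60s timeout=60s op start timeout=60s
pcs stonith update fence-node2 op monitor interval=60s timeout=60s op start timeout=60s
```

Raising the timeout only helps if the BMC eventually answers; if it never does, fix the IPMI path first.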