Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-16 Thread Ken Gaillot
On Tue, 2019-09-03 at 10:09 +0200, Marco Marino wrote:
> Hi, I have a problem with fencing on a two node cluster. It seems
> that the cluster randomly fails to complete the monitor operation for
> the fence devices. In the log I see:
> crmd[8206]:   error: Result of monitor operation for fence-node2 on
> ld2.mydomain.it: Timed Out
> Attached are:
> - /var/log/messages for node1 (only the important part)
> - /var/log/messages for node2 (only the important part) <-- Problem
> starts here
> - pcs status
> - pcs stonith show (for both fence devices)
> 
> I think it could be a timeout problem, so how can I see the timeout
> value for the monitor operation on stonith devices?
> Can someone please help me with this problem?
> Furthermore, how can I fix the state of the fence devices without
> downtime?
> 
> Thank you

How to investigate depends on whether this is an occasional monitor
failure, or happens every time the device start is attempted. From the
status you attached, I'm guessing it's at start.

In that case, my next step (since you've already verified ipmitool
works directly) would be to run the fence agent manually using the same
arguments used in the cluster configuration.

Check the man page for the fence agent, looking at the section for
"Stdin Parameters". These are what's used in the cluster configuration,
so make a note of what values you've configured. Then run the fence
agent like this:

echo -e "action=status\nPARAMETER=VALUE\nPARAMETER=VALUE\n..." | /path/to/agent

where PARAMETER=VALUE entries are what you have configured in the
cluster. If the problem isn't obvious from that, you can try adding a
debug_file parameter.
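[Editor's sketch of the manual check described above, using the fence_ipmilan attributes shown later in this thread; the password is redacted, debug_file is the optional troubleshooting parameter Ken mentions, and /usr/sbin/fence_ipmilan is assumed to be the agent path (typical on RHEL/CentOS, but it may differ):]

```shell
# Build the stdin payload from the cluster-configured attributes
# (taken from "pcs stonith show fence-node2" in this thread);
# each PARAMETER=VALUE line matches the agent's "Stdin Parameters".
payload="action=status
ipaddr=192.168.254.251
lanplus=1
login=root
passwd=REDACTED
debug_file=/tmp/fence-node2-debug.log"

# Show the payload, then feed it to the agent (agent path may vary):
printf '%s\n' "$payload"
# printf '%s\n' "$payload" | /usr/sbin/fence_ipmilan
```

This is equivalent to the `echo -e "action=status\n..."` form above; a here-variable just keeps the parameters readable.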
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-11 Thread Marco Marino
Hi, are there any updates on this?
Thank you


Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-04 Thread Marco Marino
First of all, thank you for your support.
Andrei: sure, I can reach the machines through IPMI.
Here is a short "log":

#From ld1 trying to contact ld1
[root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XX sdr elist all
SEL  | 72h | ns  |  7.1 | No Reading
Intrusion| 73h | ok  |  7.1 |
iDRAC8   | 00h | ok  |  7.1 | Dynamic MC @ 20h
...

#From ld1 trying to contact ld2
ipmitool -I lanplus -H 192.168.254.251 -U root -P XX sdr elist all
SEL  | 72h | ns  |  7.1 | No Reading
Intrusion| 73h | ok  |  7.1 |
iDRAC7   | 00h | ok  |  7.1 | Dynamic MC @ 20h
...


#From ld2 trying to contact ld1:
[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P X sdr elist all
SEL  | 72h | ns  |  7.1 | No Reading
Intrusion| 73h | ok  |  7.1 |
iDRAC8   | 00h | ok  |  7.1 | Dynamic MC @ 20h
System Board | 00h | ns  |  7.1 | Logical FRU @00h
.

#From ld2 trying to contact ld2
[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P  sdr elist all
SEL  | 72h | ns  |  7.1 | No Reading
Intrusion| 73h | ok  |  7.1 |
iDRAC7   | 00h | ok  |  7.1 | Dynamic MC @ 20h
System Board | 00h | ns  |  7.1 | Logical FRU @00h


Jan: actually, the cluster uses /etc/hosts to resolve names:
172.16.77.10   ld1.mydomain.it  ld1
172.16.77.11   ld2.mydomain.it  ld2

Furthermore, I'm using IP addresses for the IPMI interfaces in the configuration:
[root@ld1 ~]# pcs stonith show fence-node1
 Resource: fence-node1 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=X pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
  Operations: monitor interval=60s (fence-node1-monitor-interval-60s)


Any ideas?
How can I reset the state of the cluster without downtime? Is "pcs
resource cleanup" enough?
Thank you,
Marco
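[Editor's note on the cleanup question above: clearing only the recorded failures for the two fence devices should leave the running resources untouched, so no downtime is expected. A sketch, assuming pcs 0.9.x as shipped with this cluster's Pacemaker 1.1.19; exact subcommand syntax varies between pcs versions:]

```shell
[root@ld1 ~]# pcs resource cleanup fence-node1
[root@ld1 ~]# pcs resource cleanup fence-node2
[root@ld1 ~]# pcs status
```

After the cleanup, Pacemaker retries the failed starts, so `pcs status` shows whether the devices come up or time out again.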



Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-04 Thread Jan Pokorný
On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
> 03.09.2019 11:09, Marco Marino wrote:
>> Hi, I have a problem with fencing on a two node cluster. It seems that
>> randomly the cluster cannot complete monitor operation for fence devices.
>> In log I see:
>> crmd[8206]:   error: Result of monitor operation for fence-node2 on
>> ld2.mydomain.it: Timed Out
> 
> Can you actually access IP addresses of your IPMI ports?

[
Tangentially, an interesting aspect beyond that, applicable to any
non-IP cross-host reference and not mentioned anywhere so far, is the
risk of DNS resolution (where /etc/hosts falls short) running into
trouble: stale records, a blocked port, DNS server overload [DNSSEC,
etc.], parallel IPv4/IPv6 records that the software cannot handle
gracefully, and so on.  In any case, a single DNS server would clearly
be an undesired SPOF, and it would be unfortunate to be unable to
fence a node because of that.

I think the most robust approach is to use IP addresses whenever
possible, and unambiguous records in /etc/hosts when practical.
]


-- 
Jan (Poki)



Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-03 Thread Andrei Borzenkov
03.09.2019 11:09, Marco Marino wrote:
> Hi, I have a problem with fencing on a two node cluster. It seems that
> the cluster randomly fails to complete the monitor operation for the
> fence devices. In the log I see:
> crmd[8206]:   error: Result of monitor operation for fence-node2 on
> ld2.mydomain.it: Timed Out

Can you actually access IP addresses of your IPMI ports?


[ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-03 Thread Marco Marino
Hi, I have a problem with fencing on a two node cluster. It seems that
the cluster randomly fails to complete the monitor operation for the
fence devices. In the log I see:
crmd[8206]:   error: Result of monitor operation for fence-node2 on
ld2.mydomain.it: Timed Out
Attached are:
- /var/log/messages for node1 (only the important part)
- /var/log/messages for node2 (only the important part) <-- Problem starts
here
- pcs status
- pcs stonith show (for both fence devices)

I think it could be a timeout problem, so how can I see the timeout value
for the monitor operation on stonith devices?
Can someone please help me with this problem?
Furthermore, how can I fix the state of the fence devices without downtime?

Thank you
###PCS STATUS###

[root@ld1 ~]# pcs status
Cluster name: ldcluster
Stack: corosync
Current DC: ld1.mydomain.it (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Tue Sep  3 09:37:27 2019
Last change: Thu Jul  4 21:36:07 2019 by root via cibadmin on ld1.mydomain.it

2 nodes configured
10 resources configured

Online: [ ld1.mydomain.it ld2.mydomain.it ]

Full list of resources:

 fence-node1(stonith:fence_ipmilan):Stopped
 fence-node2(stonith:fence_ipmilan):Stopped
 Master/Slave Set: DrbdResClone [DrbdRes]
 Masters: [ ld1.mydomain.it ]
 Slaves: [ ld2.mydomain.it ]
 HALVM  (ocf::heartbeat:LVM):   Started ld1.mydomain.it
 PgsqlFs(ocf::heartbeat:Filesystem):Started ld1.mydomain.it
 PostgresqlD(systemd:postgresql-9.6.service):   Started ld1.mydomain.it
 LegaldocapiD   (systemd:legaldocapi.service):  Started ld1.mydomain.it
 PublicVIP  (ocf::heartbeat:IPaddr2):   Started ld1.mydomain.it
 DefaultRoute   (ocf::heartbeat:Route): Started ld1.mydomain.it

Failed Actions:
* fence-node1_start_0 on ld1.mydomain.it 'unknown error' (1): call=221, status=Timed Out, exitreason='',
    last-rc-change='Wed Aug 21 12:49:00 2019', queued=0ms, exec=20006ms
* fence-node2_start_0 on ld1.mydomain.it 'unknown error' (1): call=222, status=Timed Out, exitreason='',
    last-rc-change='Wed Aug 21 12:49:00 2019', queued=1ms, exec=20013ms
* fence-node1_start_0 on ld2.mydomain.it 'unknown error' (1): call=182, status=Timed Out, exitreason='',
    last-rc-change='Wed Aug 21 14:26:09 2019', queued=0ms, exec=20006ms
* fence-node2_start_0 on ld2.mydomain.it 'unknown error' (1): call=176, status=Timed Out, exitreason='',
    last-rc-change='Wed Aug 21 12:48:40 2019', queued=1ms, exec=20008ms


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@ld1 ~]#


###STONITH SHOW###
[root@ld1 ~]# pcs stonith show fence-node1
 Resource: fence-node1 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=XXX pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
  Operations: monitor interval=60s (fence-node1-monitor-interval-60s)
[root@ld1 ~]# pcs stonith show fence-node2
 Resource: fence-node2 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=192.168.254.251 lanplus=1 login=root passwd= pcmk_host_check=static-list pcmk_host_list=ld2.mydomain.it delay=12
  Operations: monitor interval=60s (fence-node2-monitor-interval-60s)
[root@ld1 ~]#
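[Editor's note: neither monitor operation above sets an explicit timeout=, and the exec=20006ms values under "Failed Actions" in the pcs status match Pacemaker's default 20-second operation timeout. A sketch of how to inspect and raise it, assuming pcs 0.9.x syntax and a hypothetical 60s value:]

```shell
[root@ld1 ~]# pcs resource op defaults            # cluster-wide operation defaults
[root@ld1 ~]# pcs stonith show fence-node2        # timeout= would appear next to interval= if set
[root@ld1 ~]# pcs stonith update fence-node2 op monitor interval=60s timeout=60s
```

Raising the timeout only helps if the agent eventually succeeds; if it never does, the manual fence-agent run suggested earlier in the thread is the more useful diagnostic.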


###NODE 2 /var/log/messages###
Aug 21 12:48:40 ld2 stonith-ng[8202]:  notice: Child process 46006 performing action 'monitor' timed out with signal 15
Aug 21 12:48:40 ld2 stonith-ng[8202]:  notice: Operation 'monitor' [46006] for device 'fence-node2' returned: -62 (Timer expired)
Aug 21 12:48:40 ld2 crmd[8206]:   error: Result of monitor operation for fence-node2 on ld2.mydomain.it: Timed Out
Aug 21 12:48:40 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:48:40 ld2 crmd[8206]:  notice: Result of stop operation for fence-node2 on ld2.mydomain.it: 0 (ok)
Aug 21 12:48:40 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:48:40 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:48:40 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:48:59 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 stonith-ng[8202]:  notice: Child process 46053 performing action 'monitor' timed out with signal 15
Aug 21 12:49:00 ld2 stonith-ng[8202]:  notice: Operation 'monitor' [46053] for device 'fence-node2' returned: -62 (Timer expired)
Aug 21 12:49:00 ld2 crmd[8206]:   error: Result of start operation for fence-node2 on ld2.mydomain.it: Timed Out
Aug 21 12:49:00 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 stonith-ng[8202]:  notice: On loss of CCM Quorum: Ignore
Aug 21 12:49:00 ld2 crmd[8206]:  notic