Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-11 Thread Marco Marino
Hi, are there any updates on this?
Thank you

On Wed, Sep 4, 2019, 10:46 Marco Marino  wrote:

> First of all, thank you for your support.
> Andrey: sure, I can reach machines through IPMI.
> Here is a short "log":
>
> #From ld1 trying to contact ld1
> [root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XX sdr elist all
> SEL  | 72h | ns  |  7.1 | No Reading
> Intrusion| 73h | ok  |  7.1 |
> iDRAC8   | 00h | ok  |  7.1 | Dynamic MC @ 20h
> ...
>
> #From ld1 trying to contact ld2
> ipmitool -I lanplus -H 192.168.254.251 -U root -P XX sdr elist all
> SEL  | 72h | ns  |  7.1 | No Reading
> Intrusion| 73h | ok  |  7.1 |
> iDRAC7   | 00h | ok  |  7.1 | Dynamic MC @ 20h
> ...
>
>
> #From ld2 trying to contact ld1:
> [root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P X sdr elist all
> SEL  | 72h | ns  |  7.1 | No Reading
> Intrusion| 73h | ok  |  7.1 |
> iDRAC8   | 00h | ok  |  7.1 | Dynamic MC @ 20h
> System Board | 00h | ns  |  7.1 | Logical FRU @00h
> .
>
> #From ld2 trying to contact ld2
> [root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P  sdr elist all
> SEL  | 72h | ns  |  7.1 | No Reading
> Intrusion| 73h | ok  |  7.1 |
> iDRAC7   | 00h | ok  |  7.1 | Dynamic MC @ 20h
> System Board | 00h | ns  |  7.1 | Logical FRU @00h
> 
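(Side note, not from the original thread: as far as I know, the monitor action of fence_ipmilan boils down to a power-status query, so a closer reproduction of what stonith-ng periodically runs against each BMC would be something like the following, reusing the addresses above; xxx stands for the real password.)

ipmitool -I lanplus -H 192.168.254.250 -U root -P xxx chassis power status
ipmitool -I lanplus -H 192.168.254.251 -U root -P xxx chassis power status

If these occasionally hang or take several seconds to answer, the monitor operation's timeout can be exceeded, which would fit the random monitor timeouts described in this thread.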
>
> Jan: Actually, the cluster uses /etc/hosts to resolve names:
> 172.16.77.10ld1.mydomain.it  ld1
> 172.16.77.11ld2.mydomain.it  ld2
>
> Furthermore, I'm using IP addresses for the IPMI interfaces in the
> configuration:
> [root@ld1 ~]# pcs stonith show fence-node1
>  Resource: fence-node1 (class=stonith type=fence_ipmilan)
>   Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=X
> pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
>   Operations: monitor interval=60s (fence-node1-monitor-interval-60s)
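(Not from the original thread, but relevant to the timeout question quoted further down: the monitor operation above carries no explicit timeout, so the cluster-wide default applies, typically 20 seconds unless changed via op defaults. A minimal sketch for making it explicit and more generous, assuming stock pcs tooling; the exact subcommand varies a bit between pcs versions:)

pcs stonith update fence-node1 op monitor interval=60s timeout=60s
# or, on versions where stonith devices are updated as plain resources:
pcs resource update fence-node1 op monitor interval=60s timeout=60s
# verify the resulting operation settings
pcs stonith show fence-node1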
>
>
> Any idea?
> How can I reset the state of the cluster without downtime? Is "pcs resource cleanup" enough?
> Thank you,
> Marco
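
(A hedged note on the cleanup question above, not from the thread: clearing the failure history of just the fence devices should not restart any other resources, so it should be possible without downtime. Something along these lines, assuming standard pcs/pacemaker tooling:)

pcs resource cleanup fence-node1
pcs resource cleanup fence-node2
# equivalently, at the pacemaker level:
crm_resource --cleanup --resource fence-node1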
>
>
> On Wed, Sep 4, 2019 at 10:29 Jan Pokorný  wrote:
>
>> On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
>> > On 03.09.2019 11:09, Marco Marino wrote:
>> >> Hi, I have a problem with fencing on a two-node cluster. It seems that
>> >> the cluster randomly cannot complete the monitor operation for the fence devices.
>> >> In the log I see:
>> >> crmd[8206]:   error: Result of monitor operation for fence-node2 on
>> >> ld2.mydomain.it: Timed Out
>> >
>> > Can you actually access IP addresses of your IPMI ports?
>>
>> [
>> Tangentially, an aspect beyond that, applicable to any non-IP cross-host
>> reference and not mentioned anywhere so far, is the risk of DNS
>> resolution (where /etc/hosts falls short) running into trouble: stale
>> records, a blocked port, DNS server overload (DNSSEC, etc.), or parallel
>> IPv4/IPv6 records that the software cannot handle gracefully.  In any
>> case, a single DNS server would be an undesirable SPOF, and it would be
>> unfortunate to be unable to fence a node because of it.
>>
>> I think the most robust approach is to use IP addresses whenever
>> possible, and unambiguous records in /etc/hosts where practical.
>> ]
>>
>> >> Attached are:
>> >> - /var/log/messages for node1 (only the important part)
>> >> - /var/log/messages for node2 (only the important part) <-- the problem starts here
>> >> - pcs status
>> >> - pcs stonith show (for both fence devices)
>> >>
>> >> I think it could be a timeout problem, so how can I see the timeout
>> >> value for the monitor operation on stonith devices?
>> >> Can someone please help me with this problem?
>> >> Furthermore, how can I fix the state of the fence devices without downtime?
>>
>> --
>> Jan (Poki)

Re: [ClusterLabs] op stop timeout update causes monitor op to fail?

2019-09-11 Thread Ken Gaillot
On Tue, 2019-09-10 at 09:54 +0200, Dennis Jacobfeuerborn wrote:
> Hi,
> I just updated the timeout for the stop operation on an NFS cluster, and
> while the timeout was updated, the status suddenly showed this:
> 
> Failed Actions:
> * nfsserver_monitor_1 on nfs1aqs1 'unknown error' (1): call=41,
> status=Timed Out, exitreason='none',
> last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms

Are you sure it wasn't already showing that? The timestamp of that
error is Aug 13, while the logs show the timeout update happening Sep
10.

Old errors will keep showing up in status until you manually clean them
up (with crm_resource --cleanup or a higher-level tool's equivalent), or
until any configured failure-timeout is reached.
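
For example, as a sketch only (resource and node names taken from the status output above, not from Ken's message):

# clear the stale failure record for this resource on the affected node
crm_resource --cleanup --resource nfsserver --node nfs1aqs1
# or the pcs equivalent
pcs resource cleanup nfsserver
# optionally, let such failures expire automatically after an hour
pcs resource update nfsserver meta failure-timeout=1h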

In any case, the log excerpt shows that nothing went wrong during the
time it covers. There were no actions scheduled in that transition in
response to the timeout change (which is as expected).

> 
> The command used:
> pcs resource update nfsserver op stop timeout=30s
> 
> I can't imagine that this is expected to happen. Is there another way to
> update the timeout that doesn't cause this?
> 
> I attached the log of the transition.
> 
> Regards,
>   Dennis
-- 
Ken Gaillot 
