On 4/3/19 6:10 PM, Rohit Saini wrote:
Hi Ondrej,
Please find my reply below:

1.
*Stonith configuration:*
[root@orana ~]# pcs config
  Resource: fence-uc-orana (class=stonith type=fence_ilo4)
  Attributes: delay=0 ipaddr=fd00:1061:37:9002:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=orana pcmk_reboot_action=off
   Meta Attrs: failure-timeout=3s
  Operations: monitor interval=5s on-fail=ignore (fence-uc-orana-monitor-interval-5s)
              start interval=0s on-fail=restart (fence-uc-orana-start-interval-0s)
  Resource: fence-uc-tigana (class=stonith type=fence_ilo4)
  Attributes: delay=10 ipaddr=fd00:1061:37:9001:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=tigana pcmk_reboot_action=off
   Meta Attrs: failure-timeout=3s
  Operations: monitor interval=5s on-fail=ignore (fence-uc-tigana-monitor-interval-5s)
              start interval=0s on-fail=restart (fence-uc-tigana-start-interval-0s)

Fencing Levels:

Location Constraints:
Ordering Constraints:
   start fence-uc-orana then promote unicloud-master (kind:Mandatory)
   start fence-uc-tigana then promote unicloud-master (kind:Mandatory)
Colocation Constraints:
  fence-uc-orana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
  fence-uc-tigana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)


2. This is seen randomly. Since I am using colocation, the stonith resources are stopped and started on the new master. At that time, starting the stonith resource takes a variable amount of time.
No other IPv6 issues are seen in the cluster nodes.

3. fence_agent version

[root@orana ~]#  rpm -qa|grep  fence-agents-ipmilan
fence-agents-ipmilan-4.0.11-66.el7.x86_64


*NOTE:*
Both IPv4 and IPv6 are configured on my ILO, with "iLO Client Applications use IPv6 first" turned on.
Attaching corosync logs also.

Thanks, increasing the timeout to 60 worked. But that is not exactly what I am looking for. I need to know the exact reason behind the delay in starting these IPv6 stonith resources.

Regards,
Rohit

Hi Rohit,

Thank you for the response.

From the configuration it is clear that IP addresses are used directly, so a DNS resolution issue can be ruled out. There are no messages from fence_ilo4 that would indicate the reason why it timed out, so we cannot tell yet what caused the issue. I see that you have most probably enabled PCMK_debug=stonith-ng (or PCMK_debug=yes).
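For reference, on RHEL/CentOS 7 that is normally set in /etc/sysconfig/pacemaker, so you can confirm it with something like (the shown value is just what I expect to be there):

# grep PCMK_debug /etc/sysconfig/pacemaker
PCMK_debug=stonith-ng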

It is nice that increasing the timeout worked, but as said in the previous email, it may just mask the real reason why the monitor/start operation takes longer.
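(I assume the timeout was raised on the operations of the stonith resources, so roughly something like the following, using the resource names from your configuration - the exact command you used may of course differ:

# pcs resource update fence-uc-orana op monitor interval=5s timeout=60s
# pcs resource update fence-uc-tigana op monitor interval=5s timeout=60s
)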

> Both IPv4 and IPv6 are configured on my ILO, with "iLO Client
> Applications use IPv6 first" turned on.
This seems to me to be more related to SNMP communication, which as far as I know we don't use with fence_ilo4. We use ipmitool on port 623/udp.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2
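You can also time the same ipmitool call that the agent runs, using the address from your configuration (credentials here are just placeholders), to see how long it takes outside of the cluster:

# time ipmitool -I lanplus -H fd00:1061:37:9002:: -p 623 -U xyz -P xyz -L ADMINISTRATOR chassis power status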

> 2. This is seen randomly. Since I am using colocation, the stonith
> resources are stopped and started on the new master. At that time,
> starting the stonith resource takes a variable amount of time.
This is a good observation, which leads me to the question of whether the iLO has any kind of session limit set for the user that is used here. If there is a session limit, it may be worth increasing it and testing whether the same delay can still be observed. One situation where this can happen is when one node is communicating with the iLO and, during that time, the other node needs to communicate as well while the limit is 1 connection. The relocation of the resource from one node to another might fit this, but this is just speculation, and the fastest way to prove or reject it would be to increase the limit, if there is one, and test it.
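A quick way to test this outside of the cluster would be to run the agent manually from both nodes at roughly the same time and watch whether the second call stalls - just a sketch, adjust the address and credentials to your setup:

# time fence_ilo4 --ip=fd00:1061:37:9001:: --username=xyz --password=xyz --lanplus --action=status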

# What more can be done to figure out what is causing the delay?

1. The fence_ilo4 agent can be configured with the attribute 'verbose=1' to print additional information when it is run. The data looks similar to the output below and provides timestamps, which is great, as we should be able to see when each command was run. I don't have a testing machine on which to run fence_ilo4, so the example below just shows how it looks when it fails with a timeout while connecting.

Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: notice: stonith_action_async_done: Child process 4252 performing action 'monitor' timed out with signal 15
Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning: log_action: fence_ilo4[4252] stderr: [ 2019-04-03 12:33:51,193 INFO: Executing: /usr/bin/ipmitool -I lanplus -H fe80::f6bd:8a67:7eb5:214f -p 623 -U xyz -P [set] -L ADMINISTRATOR chassis power status ]
Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning: log_action: fence_ilo4[4252] stderr: [ ]

# pcs stonith update fence-uc-orana verbose=1
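The extra output should then show up in the pacemaker/corosync log on the node running the stonith resource, so after the next monitor/start you can pull it out with, for example (assuming the default location of the corosync log you attached):

# grep fence_ilo4 /var/log/cluster/corosync.log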

Note: The above shows that some private data is included in the logs, so in case you have something there worth sharing, make sure to strip out the sensitive parts first.

2. The version of fence-agents-ipmilan is not the latest when compared to my CentOS 7.6 system (fence-agents-ipmilan-4.2.1-11.el7_6.7.x86_64), so you may consider upgrading the package, if that is possible, and see whether the latest version provided in your distribution helps in any way.
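On CentOS/RHEL 7 that would simply be:

# yum update fence-agents-ipmilan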

3. You may check if there is any update for the iLO devices and see if the updated version exhibits the same behavior with timeouts. From the logs I cannot tell what version or device fence_ilo4 is communicating with.
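If you want to include that information, the BMC/iLO firmware version can usually be read over the same IPMI channel, for example (again, credentials are just placeholders):

# ipmitool -I lanplus -H fd00:1061:37:9002:: -U xyz -P xyz mc info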

4. If there is a more reliable way of triggering the situation where the timeout with the default 20s is observed, you can set up a network packet capture with tcpdump to see what kind of communication is happening during that time. This can help establish whether there is any response from the iLO device while we wait, or whether the data arrives quickly and fence_ilo4 does nothing with it - see the example capture command after the two points below.
- In the first case that would point more to a network or iLO communication issue
- In the second case that would more likely be an issue with fence_ilo4 or with the ipmitool it uses for communication
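A minimal capture limited to the IPMI traffic would look something like this, run on the node where the stonith resource is starting (the output file name is just an example):

# tcpdump -i any -w /tmp/ipmi-delay.pcap udp port 623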

NOTE: In case you happen to have a subscription for your systems, you can also try reaching technical support to look deeper into the collected data. That way you can save the time of figuring out how to strip the private parts from the data before sharing them here.

========================================================================

--
Ondrej