Hi Ondrej,

Yes, you are right. This issue was specific to floating IPs, not to local IPs.

After becoming master, I was sending a "Neighbor Advertisement" (NA) message for my floating IPs. This was a raw message created by me, so I was the one setting the flags in it. Please find attached "image1", which shows the NA message format, and "image2", which is a packet capture; as you can see there, both the "Override" and "Solicited" flags were set. As part of the solution, only "Override" is set now.
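For reference, an unsolicited NA with only the Override flag set can be built with scapy along the lines below. This is only a sketch of the same idea, not my actual implementation; the interface and MAC address are placeholders, and the use of scapy itself is illustrative.

    from scapy.all import IPv6, ICMPv6ND_NA, ICMPv6NDOptDstLLAddr, send

    FLOATING_IP = "fd00:1061:37:9021::"  # the floating IPv6 address
    NEW_MAC = "52:54:00:aa:bb:cc"        # placeholder: MAC of the new master

    # Unsolicited NA (RFC 4861, section 7.2.6): Solicited (S) must be 0,
    # Override (O) is set so receivers update an existing neighbor cache
    # entry. The hop limit must be 255 or receivers discard ND messages.
    na = (IPv6(src=FLOATING_IP, dst="ff02::1", hlim=255) /
          ICMPv6ND_NA(R=0, S=0, O=1, tgt=FLOATING_IP) /
          ICMPv6NDOptDstLLAddr(lladdr=NEW_MAC))
    send(na, iface="eth0")  # placeholder interface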
Hope this answers your questions. Please let me know if you have any queries.

Thanks,
Rohit

On Mon, Apr 8, 2019 at 6:13 PM Ondrej <[email protected]> wrote:

> On 4/5/19 8:18 PM, Rohit Saini wrote:
> > *Further update on this:*
> > This issue is resolved now. ILO was discarding the "Neighbor Advertisement" (NA) because the Solicited flag was set in the NA message, hence it was not updating its local neighbor table.
> > As per the RFC, the Solicited flag should be set in an NA message only when it is a response to a Neighbor Solicitation.
> > After disabling the Solicited flag in the NA message, ILO started updating the local neighbor cache.
>
> Hi Rohit,
>
> Sounds great that after the change you get consistent behaviour. As I have not worked with IPv6 for quite some time, I wonder how you disabled the 'Solicited flag'. Was this done on the OS (cluster node) or on the iLO? My guess is the OS, but I have no idea how that can be accomplished. Can you share which setting you changed to accomplish this? :)
>
> One additional note: the observation here is that you are using a "floating IP" that relocates to the other machine, while the cluster configuration does not seem to contain any IPaddr2 resource that would represent this address. I would guess that a cluster without the floating address would not have this issue, as it would use the addresses assigned to the nodes, so the mapping between IP address and MAC address would not change even when the fence_ilo4 resources move between nodes. If the intention is to use the floating address in this cluster, I would suggest checking whether the issue also disappears when "not using the floating address", or when it is disabled, to see how fence_ilo4 communicates. I think there might be a way in the routing tables to set which IPv6 address should communicate with the iLO IPv6 address, so that you get consistent behaviour instead of using the floating IP address.
>
> Anyway, I'm glad that the mystery is resolved.
>
> --
> Ondrej
>
> > On Fri, Apr 5, 2019 at 2:23 PM Rohit Saini <[email protected] <mailto:[email protected]>> wrote:
> >
> > Hi Ondrej,
> > Finally found some lead on this. We started tcpdump on my machine to understand the IPMI traffic. Attaching the capture for your reference.
> > fd00:1061:37:9021:: is my floating IP and fd00:1061:37:9002:: is my ILO IP.
> > When resource movement happens, we initiate the "Neighbor Advertisement" for fd00:1061:37:9021:: (which is on the new machine now) so that peers can update their neighbor tables and start communicating with the new MAC address.
> > It looks like ILO is not updating its neighbor table, as it is still responding to the older MAC.
> > After some time, a "Neighbor Solicitation" happens and ILO updates its neighbor table. The ILO then becomes reachable and starts responding towards the new MAC address.
> >
> > My ILO firmware is 2.60. We will retry the issue after upgrading my firmware.
> >
> > To verify this theory, after resource movement I flushed the local neighbor table, due to which the "Neighbor Solicitation" was initiated early, and this delay in getting the ILO response was not seen. This fixed the issue.
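> > For the record, the flush is just an iproute2 command, along these lines (eth0 stands in for the actual interface):
> >
> >     # ip -6 neigh flush dev eth0
> >     # ip -6 neigh show dev eth0
> >
> > The second command lets you watch the neighbor cache being repopulated after the flush.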
> > We are now more interested in understanding why ILO could not update its neighbor table on receiving the "Neighbor Advertisement". FYI, the Override flag in the "Neighbor Advertisement" is already set.
> >
> > Thanks,
> > Rohit
> >
> > On Thu, Apr 4, 2019 at 8:37 AM Ondrej <[email protected] <mailto:[email protected]>> wrote:
> >
> > On 4/3/19 6:10 PM, Rohit Saini wrote:
> > > Hi Ondrej,
> > > Please find my reply below:
> > >
> > > 1. *Stonith configuration:*
> > > [root@orana ~]# pcs config
> > > Resource: fence-uc-orana (class=stonith type=fence_ilo4)
> > >   Attributes: delay=0 ipaddr=fd00:1061:37:9002:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=orana pcmk_reboot_action=off
> > >   Meta Attrs: failure-timeout=3s
> > >   Operations: monitor interval=5s on-fail=ignore (fence-uc-orana-monitor-interval-5s)
> > >               start interval=0s on-fail=restart (fence-uc-orana-start-interval-0s)
> > > Resource: fence-uc-tigana (class=stonith type=fence_ilo4)
> > >   Attributes: delay=10 ipaddr=fd00:1061:37:9001:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=tigana pcmk_reboot_action=off
> > >   Meta Attrs: failure-timeout=3s
> > >   Operations: monitor interval=5s on-fail=ignore (fence-uc-tigana-monitor-interval-5s)
> > >               start interval=0s on-fail=restart (fence-uc-tigana-start-interval-0s)
> > >
> > > Fencing Levels:
> > >
> > > Location Constraints:
> > > Ordering Constraints:
> > >   start fence-uc-orana then promote unicloud-master (kind:Mandatory)
> > >   start fence-uc-tigana then promote unicloud-master (kind:Mandatory)
> > > Colocation Constraints:
> > >   fence-uc-orana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
> > >   fence-uc-tigana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
> > >
> > > 2. This is seen randomly. Since I am using colocation, the stonith resources are stopped and started on the new master. At that time, starting the stonith resources takes a variable amount of time. No other IPv6 issues are seen on the cluster nodes.
> > >
> > > 3. fence_agents version:
> > > [root@orana ~]# rpm -qa | grep fence-agents-ipmilan
> > > fence-agents-ipmilan-4.0.11-66.el7.x86_64
> > >
> > > *NOTE:* Both IPv4 and IPv6 are configured on my ILO, with "iLO Client Applications use IPv6 first" turned on. Attaching corosync logs also.
> > >
> > > Thanks, increasing the timeout to 60 worked. But that is not exactly what I am looking for. I need to know the exact reason behind the delay in starting these IPv6 stonith resources.
> > >
> > > Regards,
> > > Rohit
> >
> > Hi Rohit,
> >
> > Thank you for the response.
> >
> > From the configuration it is clear that we are using IP addresses directly, so a DNS resolution issue can be ruled out. There are no messages from fence_ilo4 that would indicate why it timed out, so we cannot tell yet what caused the issue. I see that you have most probably enabled PCMK_debug=stonith-ng (or PCMK_debug=yes).
> >
> > It is nice that the increased timeout worked, but as said in the previous email, it may just mask the real reason why the monitor/start operation takes longer.
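> > For completeness, I assume the timeout change you applied was an operation timeout roughly like the following (the resource name is taken from your config; the exact syntax may differ between pcs versions):
> >
> >     # pcs stonith update fence-uc-orana op monitor interval=5s timeout=60s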
> > > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client Applications use IPv6 first" turned on.
> >
> > This seems to me to be more related to SNMP communication, which we don't use with fence_ilo4 as far as I know. We use ipmitool on port 623/udp.
> > https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2
> >
> > > 2. This is seen randomly. Since I am using colocation, the stonith resources are stopped and started on the new master. At that time, starting the stonith resources takes a variable amount of time.
> >
> > This is a good observation, and it leads me to the question of whether the iLO has set any kind of session limit for the user that is used here. If there is a session limit, it may be worth increasing it and testing whether the same delay can still be observed. One situation where this can happen is when one node communicates with the iLO and, during that time, communication from the other node needs to happen while the limit is 1 connection. The relocation of the resource from one node to another might fit this, but this is just speculation, and the fastest way to prove or reject it would be to increase the limit, if there is one, and test it.
> >
> > # What more can be done to figure out what is causing the delay?
> >
> > 1. fence_ilo4 can be configured with the attribute 'verbose=1' to print additional information when it runs. These data look similar to the example below and seem to provide timestamps, which is great, as we should be able to see when which command was run. I don't have a testing machine on which to run fence_ilo4, so the example below just shows how it looks when it fails on a connection timeout:
> >
> > Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: notice: stonith_action_async_done: Child process 4252 performing action 'monitor' timed out with signal 15
> > Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning: log_action: fence_ilo4[4252] stderr: [ 2019-04-03 12:33:51,193 INFO: Executing: /usr/bin/ipmitool -I lanplus -H fe80::f6bd:8a67:7eb5:214f -p 623 -U xyz -P [set] -L ADMINISTRATOR chassis power status ]
> > Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning: log_action: fence_ilo4[4252] stderr: [ ]
> >
> > # pcs stonith update fence-uc-orana verbose=1
> >
> > Note: The above shows that some private data are included in the logs, so in case you have something interesting to share, make sure to strip out the sensitive parts first.
> >
> > 2. The version of fence-agents-ipmilan is not the latest when compared to my CentOS 7.6 system (fence-agents-ipmilan-4.2.1-11.el7_6.7.x86_64), so you may consider upgrading the package, if that is possible, to see whether the latest version provided in your distribution helps in any way.
> >
> > 3. You may check whether there is any update for the iLO devices and see whether the updated version exhibits the same timeout behavior. From the logs I cannot tell which version or device fence_ilo4 is communicating with.
> >
> > 4. If there is a more reliable way of triggering the situation in which the timeout with the default 20s is observed, you can set up a network packet capture with tcpdump to see what kind of communication happens during that time.
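> > For example, capturing both the IPMI traffic and the IPv6 neighbor discovery exchange could look like this (the interface and file names are placeholders):
> >
> >     # tcpdump -ni eth0 -w fence.pcap 'udp port 623 or icmp6'
> >
> > Afterwards, the NA flags in the capture can be checked, e.g. with a small scapy script (again just a sketch):
> >
> >     from scapy.all import rdpcap, ICMPv6ND_NA
> >
> >     for pkt in rdpcap("fence.pcap"):
> >         if ICMPv6ND_NA in pkt:
> >             na = pkt[ICMPv6ND_NA]
> >             # Print time, target address and the Solicited/Override flags
> >             print(pkt.time, na.tgt, "S=%d O=%d" % (na.S, na.O))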
> > This can help establish whether there is any response from the iLO device while we wait, which would indicate the iLO or the network to be the issue, or whether the data arrives fast and fence_ilo4 doesn't do anything with it.
> > - In the first case, that would point more to a network or iLO communication issue.
> > - In the second case, that would more likely be an issue with fence_ilo4 or the ipmitool that is used for the communication.
> >
> > NOTE: In case you happen to have a subscription for your systems, you can also try reaching technical support to look deeper into the collected data. That way you can save the time of figuring out how to strip the private parts from the data before sharing them here.
> >
> > --
> > Ondrej
